I consider John W. Tukey to be the King of Little Data. Give him a couple of colored pencils, the back of a used envelope, and some data and he could bring insight to what you were looking at by using graphic displays, eliminating “bad” data, weighting the findings, and providing charts that would allow you to explain what you were seeing to those who had never been trained in technical fields.
Tukey’s approach to “bad data” (outliers, miscodings, logical inconsistency) and downweighting data points which probably make little sense is what will save the Big Data Scientists from themselves by eliminating the likelihood that a few stupid datapoints (like those I enter into online survey databases when I want to screw them up to protect privacy) will strongly bias group findings. Medians are preferable to means most of the time; unit weighting is often to be preferred over seeing too much in the data and then using optimal (maximum likelihood, generalized least squares) data-fit weighting to further distort it.
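To make the medians-versus-means point concrete, here is a minimal sketch. The ages and the two junk entries of 999 are invented for illustration and are not data from any real survey.

```python
from statistics import mean, median

# Hypothetical survey answers (ages); all values are invented for illustration.
clean = [34, 36, 38, 40, 41, 43, 45]

# The same answers plus two deliberately absurd entries of the kind
# a respondent might type in to protect privacy.
contaminated = clean + [999, 999]


def summarize(values):
    """Return (mean, median) so the two summaries can be compared."""
    return mean(values), median(values)


clean_mean, clean_median = summarize(clean)
dirty_mean, dirty_median = summarize(contaminated)

print(f"clean:        mean={clean_mean:.1f}  median={clean_median}")
print(f"contaminated: mean={dirty_mean:.1f}  median={dirty_median}")
```

Two bad points out of nine drag the mean far away from anything a real respondent reported, while the median moves by a single year.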
Few remember that Tukey was also the King of Big Data. In the mid-1960s, Tukey and James Cooley published the algorithm now known as the Fast Fourier Transform or FFT, which permitted fairly slow computing equipment to extract key information from very complex analog data and then compress the information into a smaller digital form that would retain much of the information but not unnecessary detail. The ability to compress the data and then move it over a fairly primitive data transmission system (copper wires) made long distance telephone communications feasible. And later, the same method made cellular communications possible.
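For the curious, here is a rough sketch of the divide-and-conquer idea behind the FFT, in its textbook recursive radix-2 form. It is not how a telephone system implements it. The test signal, its length (64 samples), and its two frequencies are assumptions made up for the demonstration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Invented signal: a strong 3-cycle wave plus a weaker 10-cycle wave.
n = 64
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]

spectrum = fft(signal)

# Keep only the two strongest frequency bins (first half of the spectrum).
peaks = sorted(range(n // 2), key=lambda k: abs(spectrum[k]), reverse=True)[:2]
print("strongest frequency bins:", peaks)
```

Sixty-four numbers go in and two frequency bins carry essentially everything that matters about this particular signal, which is the distill-then-compress idea in miniature.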
Hhmm. More than 50 years ago, Tukey pioneered the view that the way to use “sloppy” big data was to distill the necessary information from it in an imprecise but robust way, rather than pretending the data were better because they were bigger and then using that pretense to justify over-fitted statistical models.
Hopefully it will not take another 50 years for the Big Data folks to recognize that trillions of data points may hide the truth and that the solution is to pass out some red and blue pencils and used envelopes. Tukey knew that 50 years ago.
All it “costs” to adopt Tukey’s methods is a little common sense.
Hhmm, maybe the Tukey approach is not so feasible. Big Data proponents at the current time seem to lack, in aggregate, the amount of common sense necessary to implement Tukey’s methods.
Turn off the computers in third grade, pass out the pencils, and let’s teach the next generation not to worship Big Data or develop statistical models seemingly far more precise than the data.
The double-blind design has historically been considered the best way to “prove” that new medical interventions work, especially if the experiment is replicated a number of times by different research teams. By using the double-blind design (neither the treating medical team nor the patient knows whether the patient is taking a placebo or active medication), investigators expect to negate the placebo effects caused by patient or medical staff beliefs that the “blue pill” is working.
A key part of virtually all double-blind research designs is the assumption that all patient expectations and reports are independent. This assumption is made because of the statistical requirements necessary to determine whether a drug has had a “significantly larger effect” as compared to a placebo. Making this assumption has been a “standard research design” feature since long before I was born more than 60 years ago.
Google the name of a new drug in clinical trials. You will find many posts (hundreds, thousands) on blogs, on bulletin boards for people with the conditions being treated with the experimental drug, and on social media, especially Twitter and Facebook. Early in most clinical trials participants start to post and question one another about their presumed active treatment or placebo status and whether those who guess they are in the experimental condition think the drug is working or not. Since the treatments are of interest to many people world-wide who are not being treated with effective pharmaceuticals, the interest is much greater than just among those in the study.
Google the name of a new drug being suggested for the treatment of a rare or orphan disease that has had no effective treatments to date and you will find this phenomenon particularly prevalent for both patients and caregivers. Hope springs eternal (which it SHOULD) but it also can affect the research design. Obviously data that are “self reported” on patient or caregiver questionnaires can be affected by what “the guy in Wyoming” or the caregiver of “the woman in Florida” says on the Internet.
OK you say, but medical laboratory tests and clinical observations will not be affected because these indices cannot be changed by patients’ beliefs that they are in the experimental or placebo conditions. Hhmmm, Sam in Seattle just posted that he thinks that he is in the experimental condition and that his “saved my life” treatment works especially well if you walk 90 minutes a day or take a specific diet supplement or have a berry-and-cream diet. Mary in Maine blogs the observation that her treatment is not working so she must be in the placebo condition and becomes very depressed and subsequently makes a lot of changes in her lifestyle, often forgetting to take the other medications she reported using daily before the placebo or experimental assignment was made.
Do we have research designs for the amount of research participant visible (blogs, tweets, bulletin boards) and invisible (email, phone) communication going on during a clinical trial? No. Does this communication make a difference in what the statistical tests of efficacy will report? Probably. And can we ever track the invisible communications going on by email? Note that patients who do not wish to disclose their medical status will be more likely to use “private” email than the public blog and bulletin board methods.
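One rough way to see why the independence assumption matters is the textbook design effect from survey sampling, 1 + (m - 1) * rho, which shrinks the effective sample size when responses within a group are correlated. This is only a sketch under strong assumptions: that communicating participants form equal-size clusters of m people with a common intraclass correlation rho. The trial size and the (m, rho) pairs below are hypothetical. It speaks only to lost precision, not to any shift in the placebo response, and it is in no way a solution to the design problem.

```python
def design_effect(cluster_size, icc):
    """Design effect for equal-size clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations that n correlated ones are worth."""
    return n / design_effect(cluster_size, icc)


# Hypothetical trial size; not taken from any real study.
n_patients = 300

# Hypothetical (cluster size, intraclass correlation) scenarios.
scenarios = [(1, 0.0), (10, 0.05), (30, 0.10)]

for m, rho in scenarios:
    n_eff = effective_sample_size(n_patients, m, rho)
    print(f"cluster size {m:>2}, rho {rho:.2f}: effective n = {n_eff:.1f}")
```

With no communication (clusters of one) all 300 patients count. Under the invented scenarios, modest correlation within chat groups leaves far fewer independent observations than the significance test assumes.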
Want an example? Google davunetide. This was supposed to be a miracle drug for the very rare neurodegenerative condition PSP. The company (Allon) that developed the drug received huge tax incentives in the USA to potentially market an effective drug for a neglected condition. The company, of course, was well aware that after getting huge tax incentives to develop the pharmaceutical, if the drug were to prove effective in reducing cognitive problems (as was thought), it would then be used with the much more common (and lucrative from the standpoint of Big Pharma) neurodegenerative disorders (Alzheimer’s, Parkinson’s) and schizophrenia.
Patients scrambled to get into the trial because an experimental medication was better than no medication (as was assumed, although not necessarily true) and the odds were 50/50 of getting the active pills.
Patients and caregivers communicated for more than a year, with the conversations involving patients from around the world. In my opinion, the communications probably increased the placebo effect, although I have no data or statistical tests to “prove” this and it is pure conjecture on my part.
The trial failed miserably. Interestingly, within a few weeks after announcing the results, the senior investigators who developed and tested the treatment had left the employ of Allon. Immediately after the release of the results, clinical trial participants (the caregivers more than the patients) started trading stories on the Internet.
Time for getting our thinking hats on. I worked on methodological problems like this for 30+ years, and I have no solution, nor do I think this problem is going to be solved by any individual. Teams of #medical, #behavioral, #communication, and #statistical professionals need to be formed if we want to be able to accurately assess the effects of a new medication.
Can Big Data/Data Science avoid the train wreck of Big Pharma? I believe that the Big Data disaster will make the Big Pharma issues seem small in comparison.
But the issues will be about the same. A lot of the Big Pharma execs have become quite skilled at “beating the system” using “undocumented science” and many will move to Big Data and employ all of their very “best” moves and tricks. Big Data/Data Science has the potential to hurt the average individual even more than the greediness of Big Pharma.
Structural equation models, popularized by Jöreskog and Sörbom within their LISREL computer program that solved long-standing mathematical estimation issues, have been recognized by many as the most powerful (or one of the few most powerful) statistical model(s) available within the social and health sciences. A combination of concept mapping and mind mapping can quite effectively be used as a direct visual (or theoretical) analog of complex mathematical structural equation models. Such maps can be a good way of communicating the results of the structural equation modeling or of developing the theoretical models that will be tested. And yes, you can test the fit (appropriateness) of the models (diagrams) in a statistically rigorous way.
Yup, you heard it here first (although path-like or combined mind map/concept models have been around for at least 45 years and I published hundreds of these in the late 70s through the 1990s in peer-reviewed social science and methodology journals and others simultaneously and subsequently have done the same). I am going to blog a LOT on this in the next few months. Mind and concept maps can be directly and formally simultaneously assessed if there are appropriate data available. Of course, such data are hard to come by, but not as hard as many believe.
The best way to draw concurrent concept and mind map models is #iMindMap (using the flow chart option along with a mind map for a hybrid model). In structural equation modeling, a mind map would represent the estimation of measurement model parameters while the concurrent concept map would represent the structural model.
Mind Maps + Concept Maps can be statistically and concurrently estimated from appropriate data by Structural Equation Models.
Needs a lot of data, though. Current Macs and PCs are finally fast enough to do the statistical calculations in a notebook computer.
This is Mind Mapping 3.0 at its very best.
Much more on this topic is coming, including a lot of demonstrations that mind maps and concept maps can be easily developed as a consequence of a very rigorous mathematical model. And tested for goodness-of-fit.
[aaaahhh what the hell … While I am tempted to add a lot of equations, numerical analysis, and map pictures to this post, I will not do so and not mix up the media and message. This is the pilot. The series is coming this summer.]
The following generic sample illustrates what the hybrid models look like.
Constructs are measured by indicators. Indicators are usually imperfect and can be corrected in the statistical modeling. The relationship between constructs and indicators is shown in the mind map.
The constructs themselves may be related in a causal or noncausal way. The concept map shows the relationships that are present as determined by the statistical modeling.
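For readers who want the algebra behind the two kinds of map, here is the general model in the conventional LISREL notation. The symbols are the standard textbook ones and are not taken from the sample map itself. The first two equations are the measurement model (the mind-map part: indicators x and y load on constructs ξ and η, with errors δ and ε), and the third is the structural model (the concept-map part: relations among the constructs, with disturbance ζ).

```latex
\begin{aligned}
x    &= \Lambda_x \, \xi + \delta       && \text{(measurement model, exogenous constructs)} \\
y    &= \Lambda_y \, \eta + \varepsilon && \text{(measurement model, endogenous constructs)} \\
\eta &= B \, \eta + \Gamma \, \xi + \zeta && \text{(structural model)}
\end{aligned}
```

Each nonzero loading in Λ corresponds to a branch from a construct to an indicator, and each nonzero entry of B or Γ corresponds to an arrow between constructs.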
A few months ago I made a post about comparing three web sites in terms of the usefulness of each for collating links of related materials. This is a slight expansion of my prior concept and map. Branches can be used not only to show positions (rank, size, weight) but also the “reasons” for the position.
As before, in reference to the original title “Pinterest Pins Scoop.it and PearlTrees,” I am referring to “pin” in the context of wrestling.
Pinterest, Scoop.it, and PearlTrees compete in the same space to be the best web-based way to refer readers of your blog, tweet stream, or web site to alternate sources of information.
In the past, I thought it was quite ironic that the “pad” apps on the iPad were kind of junky. In the most recent updates that has changed. I now find that there are three great choices. Each is inexpensive. Here’s what I think.
I confess. In 1979 Pete Bentler and I published an article entitled “Simple Minitheories of Love” in the highest prestige journal on personality and social psychology.
Blame it on the exploits of the greatest psychometrician of his generation and a 28 year-old wanna-be psychometrician, both active personality researchers, trying to convince the field that the new statistical modeling methods (Structural Equation Models; LISREL) they were testing would revolutionize the field (I was wrong on that one, too).
Now ask yourself why neither of these guys — nor any of the other main figures in the fields of psychometrics, sociometrics, personality, social psychology, attraction research — ever went on to start a web site to match individuals on the basis of personality and life style questionnaires (I won’t dignify them by calling them tests); such sites became quite lucrative. This was in spite of the fact that at least one (Huba) had the opportunity to do so during the years when he was the Vice President of R&D for a major psychological testing company and later when most of the other competing testing companies hired him as consultant. Or why did the major personality test developer of his generation and the owner of a psychological testing company (the late Doug Jackson) never consider developing such a product?
See a pattern here? Even the folks who made the most $$$ from psychological instruments and had the most influence in the psychological assessment journals and industry did not develop a Love Site.
I concede that a Love Site may be a good place to find people you might never meet otherwise through your social and work friends and these might be good mates or sex partners. Or they might be psychopaths, perpetrators of sexual or domestic violence, dependent individuals, or alcoholics.
So far as I can tell from the undisclosed algorithms of the dating sites and their unpublished outcomes, I have no way of knowing for sure if the sites have a good chance of producing a good outcome and avoiding a terrible (and life-threatening) one. I suspect that if there were strong scientific evidence that the sites “work” in both cases, there would be a lot of scientific research published that supports this notion. Where is the incontrovertible evidence? Can I read it or hear it at professional conventions? Claims on TV that a lot of people got married mean little or nothing without information about comparison groups or negative outcomes.
I would have no problem concluding that the Love Sites are effective if there were psychometric and other scientific evidence that the algorithms used are valid. Without such evidence, I worry that they are more voodoo and “smoke and mirrors” than places where you can find a mate and your date will not result in a rape. Of course I cannot prove my position is right, but neither can the Love Sites. My stance is safer for individuals.
There is that old fashioned system of “meet and greet and respect the people you meet” that did produce so many humans that we now have a problem with world-wide population growth. Sometimes older methods work better if you are patient.
BIG Data is coming (or has already come) to healthcare. [It is supposed to usher in new eras of research, economic responsibility, quality and access to healthcare, and better patient outcomes, but that is a subject for another post because it is putting the carriage before the horse to discuss it here.]
What is a data scientist? A new form of bug, a content expert who also knows data issues, an active researcher, someone trained in data analysis and statistics, someone who is acutely aware of relevant laws and ethical concerns in mining health data, a blind empiricist?
This is a tough one because it also touches on how many $$$$$ (€€€€€, ¥¥¥¥¥, £££££, ﷼﷼﷼﷼﷼, ₩₩₩₩₩, ₱₱₱₱₱) individuals and corporations can make off the carcass of a dying healthcare system.
Never one to back away from a big issue and in search of those who value good healthcare for all over the almighty $ € ¥ £ ₨ ﷼ ₩ ₱, here are some of my thoughts on this issue.
Content knowledge by a well-trained, ethical individual who respects privacy concerns is Queen. Now and forever.
I wouldn’t go on a bus trip with a driver who is unlicensed. Would you?
Who is driving the Big Data bus? Data scientists? Mindless algorithms? Content experts and their teams of data scientist support staff? Marketing? Security firms (including those run by governments)? Terrorists?
I say this once, I will say this a million times … Content is Queen.
Algorithms that are primarily empirical without an understanding of the validity of the data being analyzed and the theoretical issues are dangerous.
An algorithm can predict — and I have no doubt several are doing so at this minute — how happy I will be on a global question (how happy are you?) or a behavioral index (at a sporting event, at the bank cashing a check, four days after the death of a parent) or the perceptions of others (just got tagged in somebody’s photo, got mentioned in a tweet, had a happy blog entry, had birthday, just had a child born, got back a favorable medical test result, used a smiley face).
I have observed and analyzed and proposed new ways of measuring “happiness” and “anxiety” and “grieving” and “intelligence” for 40 years. I don’t really know what “happiness” or “anxiety” or “grieving” or “intelligence” is although I do know a lot about how experts have tried to define these constructs. I do know that a blind algorithm is not going to answer the question of what “happiness” is.
Do you want an algorithm driving the bus or someone who knows the limits of current data? I don’t want a blind algorithm predicting whether I am “happy” (and happy enough to buy something). I don’t want a blind algorithm predicting the economy. I don’t want a blind algorithm predicting how many healthcare visits I should receive under health insurance.
Content is Queen. The algorithms that drive the organization of Big Data need to be guided by content specialists (psychologists, sociologists, physicians, nurses, economists, physicists, chemists, bioelectrical engineers, etc.) not data scientists without expertise in one or more of the relevant content fields.
If the Queen rules, all will probably be well in the kingdom. If blind algorithms rule we probably will end up as batteries in The Matrix.
I vote (before it is too late) for the monarchy of content. I am not a battery.
Aaahhh… GiGo (garbage in/garbage out). The GiGo phenomenon haunts data analysts, statisticians, researchers, theorists, and anyone who loses their identity.
So these huge [health] datasets we keep hearing about … who controls them? what is their validity? reliability? utility? who else gets to see them?
And the data mining algorithms… proprietary or public? based on which tests and algorithms? who developed? who validated? are the methods valid? reliable? have utility?
And the results coming out of big data and proprietary data mining algorithms… reliable? valid? useful? clearly interpreted? limitations stated? misinterpreted?
Are big data and data mining about using world-wide data to find solutions to some of the world’s problems or to sell more books, videos, and cola?
I don’t think anyone really understands the big data sets and their limitations. I doubt that more than a small percentage of the data mining algorithms are valid. I sure as hell do not want somebody blindly using these algorithms on data they do not understand and then helping the government limit healthcare visits for high need, low resource individuals (sound familiar to anyone?).
An experienced statistician-data analyst-methodologist knows that when analyzing a large data set you must spend 98% of your time looking at (and fixing if possible) bad data points. The final 2% of your work is then much more likely to show something that is reliable, valid, and useful.
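A minimal sketch of what the first pass of that 98% can look like, using Tukey's own boxplot fences (1.5 times the interquartile range beyond the quartiles). The readings are invented, and the quartile method (`inclusive`) is an arbitrary choice. The code only flags points for a human to look at and does not delete anything.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (low, high) fences: quartiles extended by k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_suspects(values, k=1.5):
    """List the values falling outside the fences, in their original order."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical systolic blood pressure readings with two likely typos.
readings = [118, 121, 119, 124, 120, 122, 117, 123, 12, 1200]

low, high = tukey_fences(readings)
suspects = flag_suspects(readings)
print(f"fences: {low} to {high}")
print("look at these by hand:", suspects)
```

Whether a flagged point is a typo, a miscoding, or a genuinely unusual patient is a content question, which is why a person who knows the data has to do the looking.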
Big Data may save us, or it might kill us first. Or it might make us Borg or batteries.
Right now the analysts are reticulating splines.
No mo …. GiGo. [Is Nicki Minaj available to record this mantra?]
For two years I have argued that mind maps can be (are) good ways to summarize complicated research into easily-understood theoretical models. The mind map below has gone through a few iterations since 2010. This is my version of November 25 2012. All pictures are of the same map.
Yup. Logistic regression and its extension into Cox regression (survival analysis).
Wanna know whether one population mean is significantly higher or lower than that of another population? OK. Go read a different blog post.
Wanna know how much your life expectancy decreases if you smoke cigarettes or use alcohol to excess or are 15% over the medically acceptable weight or some combination of these? Use logistic or Cox regression depending on the type of data you have (time invariant predictors or not, time-censored or not). Don’t tell me that smokers have significantly shorter lifespans than nonsmokers. That is not going to surprise me or shock me into behavior change. Do tell me how much my chances of reaching the age of 70 decrease if I smoke two humps of Camels a day.
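Here is a sketch of how a fitted logistic regression turns into that kind of statement. The intercept and coefficients are HYPOTHETICAL numbers invented for illustration. They are not estimates from any study, and the outcome ("reaches age 70") is assumed to be coded 1/0.

```python
import math


def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL fitted values (log-odds scale), invented for this sketch.
INTERCEPT = 1.5
COEF = {"smoker": -0.9, "heavy_drinker": -0.6, "overweight": -0.4}


def prob_reach_70(**risks):
    """Predicted probability of reaching 70 for the risk factors set to True."""
    z = INTERCEPT + sum(COEF[name] for name, present in risks.items() if present)
    return logistic(z)


baseline = prob_reach_70()
smoker = prob_reach_70(smoker=True)
all_three = prob_reach_70(smoker=True, heavy_drinker=True, overweight=True)
odds_ratio = math.exp(COEF["smoker"])  # odds multiplier for smoking alone

print(f"no risk factors:   {baseline:.0%}")
print(f"smoker only:       {smoker:.0%}")
print(f"all three factors: {all_three:.0%}")
print(f"smoking multiplies the odds by {odds_ratio:.2f}")
```

The point is the form of the output. A statement like "your chance of reaching 70 drops from X% to Y%" is something a reader can act on, where "significantly shorter lifespans" is not.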
Shock me. Make me want to change. Let me see how my behavior affects the odds I will live long, or be happy, or have a well-adjusted family.
Do a logistic regression or survival analysis. I strongly believe that the average member of the general public, press, AND EVEN US Congress Members, can understand these analyses easily if the information is presented in a straightforward way. Of course, prepare for US politicians to call you an idiot on the daily cable news shows that air during prime time. Personally I do not give a damn what Bill O’Reilly or Chris Matthews thinks (if indeed they do think before shouting at a “guest” on their shows). Or what the scientific “hatchet” professional talking heads say to them.
Oh yeah, my (intuitive) logistic regression tells me that after conducting this statistical research only a small percent will make the necessary behavior changes and live longer, more happily, and have better adjusted families. That’s OK if it has to be that way. Every life is priceless and every small gain is huge.
Plug in the cattle prod and shock me with those results.