Aaahhh… GiGo (garbage in/garbage out). The GiGo phenomenon haunts data analysts, statisticians, researchers, theorists, and anyone who has ever lost their identity.
So these huge [health] datasets we keep hearing about … who controls them? what is their validity? reliability? utility? who else gets to see them?
And the data mining algorithms… proprietary or public? based on which tests and algorithms? who developed them? who validated them? are the methods valid? reliable? useful?
And the results coming out of big data and proprietary data mining algorithms… reliable? valid? useful? clearly interpreted? limitations stated? misinterpreted?
Are big data and data mining about using world-wide data to find solutions to some of the world’s problems, or about selling more books, videos, and cola?
I don’t think anyone really understands the big data sets and their limitations. I doubt that more than a small percentage of the data mining algorithms are valid. I sure as hell do not want somebody blindly using these algorithms on data they do not understand and then helping the government limit healthcare visits for high-need, low-resource individuals (sound familiar to anyone?).
An experienced statistician/data analyst/methodologist knows that when analyzing a large data set, you must spend 98% of your time looking at (and, if possible, fixing) bad data points. The final 2% of your work is then much more likely to show something that is reliable, valid, and useful.
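For the curious, here is a rough sketch of what "looking at bad data points" can mean in practice. Everything in it is made up for illustration: the records, the field names (`id`, `age`, `visits`), and the plausibility limits are assumptions, not anything from a real health dataset. It only flags suspicious rows; deciding whether and how to fix them still takes a human who understands the data.

```python
# Hypothetical records and plausibility limits, invented for illustration only.
RECORDS = [
    {"id": 1, "age": 34, "visits": 2},
    {"id": 2, "age": -5, "visits": 1},     # impossible age
    {"id": 3, "age": 61, "visits": None},  # missing value
    {"id": 3, "age": 61, "visits": 4},     # repeated id
    {"id": 4, "age": 47, "visits": 900},   # implausible visit count
]
LIMITS = {"age": (0, 120), "visits": (0, 365)}


def screen(records, limits):
    """Return (row_index, problem) pairs; the data itself is left untouched."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(records):
        # Flag any id that has already appeared in an earlier row.
        if row["id"] in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(row["id"])
        # Flag missing values and values outside the assumed plausible range.
        for field, (low, high) in limits.items():
            value = row.get(field)
            if value is None:
                problems.append((i, f"missing {field}"))
            elif not low <= value <= high:
                problems.append((i, f"{field} out of range"))
    return problems


for index, problem in screen(RECORDS, LIMITS):
    print(index, problem)
```

Four of the five invented rows get flagged, which is roughly the point: the screening pass is cheap, and the slow part is figuring out what each flagged row really means.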
Big Data may save us, or it might kill us first. Or it might make us Borg or batteries.
Right now the analysts are reticulating splines.
No mo … GiGo. [Is Nicki Minaj available to record this mantra?]