
social, health, political imagery through the lens of G J Huba PhD © 2012-2021

Posts tagged clustering

Big data this, big data that. Wow. At the end we will have better ways to sell underwear, automobiles, and “next day” pills (although in the latter case politics and religion might actually trump Amazon and Google). Blind empiricism. Every time you click a key on the Internet it goes into some big database.

“Little data,” lovingly crafted to test theories and collected and analyzed with great care by highly trained professionals, has built our theories of personality, social interactions, the cosmos, and the behavioral economics of buying or saving.

Big data drives marketing. Little data drives the future through generalizable theory.


in praise of little data

Aaahhh… GiGo (garbage in/garbage out). The GiGo phenomenon haunts data analysts, statisticians, researchers, theorists, and anyone who has lost their identity.

So these huge [health] datasets we keep hearing about … who controls them? what is their validity? reliability? utility? who else gets to see them?

And the data mining algorithms… proprietary or public? based on which tests and algorithms? who developed? who validated? are the methods valid? reliable? have utility?

And the results coming out of big data and proprietary data mining algorithms… reliable? valid? useful? clearly interpreted? limitations stated? misinterpreted?

Are big data and data mining about using world-wide data to find solutions to some of the world’s problems or to sell more books, videos, and cola?

I don’t think anyone really understands the big data sets and their limitations. I doubt that more than a small percentage of the data mining algorithms are valid. I sure as hell do not want somebody blindly using these algorithms on data they do not understand and then helping the government limit healthcare visits for high-need, low-resource individuals (sound familiar to anyone?).

An experienced statistician-data analyst-methodologist knows that when analyzing a large data set you must spend 98% of your time looking at (and fixing if possible) bad data points. The final 2% of your work is then much more likely to show something that is reliable, valid, and useful.
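For the curious, here is a minimal sketch of what "looking at bad data points" might mean in its very simplest form. Everything in it is an assumption made up for illustration: the function name `flag_bad_points`, the field names, the plausible ranges, and the four toy records. Real screening takes far more judgment than a range check.

```python
import math

def flag_bad_points(records, valid_ranges):
    """Return (row_index, field, value) for each missing, non-numeric,
    or out-of-range value. Hypothetical helper, for illustration only."""
    problems = []
    for i, row in enumerate(records):
        for field, (low, high) in valid_ranges.items():
            value = row.get(field)
            # Missing values and non-numbers get flagged, never silently dropped.
            if value is None or isinstance(value, bool) \
                    or not isinstance(value, (int, float)):
                problems.append((i, field, value))
                continue
            # NaN and values outside the plausible range get flagged too.
            if math.isnan(value) or not (low <= value <= high):
                problems.append((i, field, value))
    return problems

# Toy records and ranges, invented for this example.
records = [
    {"age": 34, "visits": 2},
    {"age": -5, "visits": 1},
    {"age": 41, "visits": None},
    {"age": 290, "visits": 3},
]
valid_ranges = {"age": (0, 120), "visits": (0, 365)}

problems = flag_bad_points(records, valid_ranges)
for row_index, field, value in problems:
    print(f"row {row_index}: suspicious {field} = {value!r}")
```

The sketch only flags suspicious values. Deciding whether each one can be fixed or has to be thrown out is the part that needs a human who understands the data.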

Big Data may save us, or it might kill us first. Or it might make us Borg or batteries.

Right now the analysts are reticulating splines.

No mo …. GiGo. [Is Nicki Minaj available to record this mantra?]

splines