social, health, political imagery through the lens of George J Huba PhD © 2012-2017

Posts tagged big data

I consider John W. Tukey to be the King of Little Data. Give him a couple of colored pencils, the back of a used envelope, and some data and he could bring insight to what you were looking at by using graphic displays, eliminating “bad” data, weighting the findings, and providing charts that would allow you to explain what you were seeing to those who had never been trained in technical fields.

Tukey’s approach to “bad data” (outliers, miscodings, logical inconsistencies), and to downweighting data points that probably make little sense, is what will save the Big Data scientists from themselves: it reduces the likelihood that a few stupid data points (like those I enter into online survey databases when I want to corrupt them to protect my privacy) will strongly bias group findings. Medians are preferable to means most of the time, and unit weighting is often preferable to reading too much into the data and then using optimal (maximum likelihood, generalized least squares) fit weighting to distort it further.
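To make the median-versus-mean point concrete, here is a tiny sketch (mine, not Tukey's) showing how a single sabotaged survey response wrecks the mean while the median barely moves:

```python
# One absurd data point (like a deliberately falsified survey response)
# drags the mean far from the bulk of the data; the median shrugs it off.
from statistics import mean, median

honest = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2]    # well-behaved responses
contaminated = honest + [999.0]                  # one deliberately bogus entry

print(mean(honest), median(honest))              # both sit near 5.0
print(mean(contaminated), median(contaminated))  # mean explodes; median holds
```

The numbers here are invented for illustration; the lesson is Tukey's.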

Few remember that Tukey was also the King of Big Data. At the beginning of his career, Tukey developed a technique called the Fast Fourier Transform or FFT that permitted fairly slow computing equipment to extract key information from very complex analog data and then compress the information into a smaller digital form that would retain much of the information but not unnecessary detail. The ability to compress the data and then move it over a fairly primitive data transmission system (copper wires) made long distance telephone communications feasible. And later, the same method made cellular communications possible.
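For the curious, a toy version of the idea fits in a few lines. This is a bare-bones recursive radix-2 Cooley-Tukey FFT plus a naive "keep the biggest coefficients" compression; the signal and the top-4 cutoff are my own illustrative choices, not anything from telephony practice:

```python
# Toy radix-2 Cooley-Tukey FFT (stdlib only) plus a naive compression:
# keep only the largest-magnitude frequency coefficients.
import cmath, math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddled = [cmath.exp(-2j * math.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

# A 64-sample signal dominated by two tones: nearly all of its energy
# lands in just 4 of the 64 frequency coefficients.
n = 64
signal = [3 * math.cos(2 * math.pi * 5 * t / n) +
          math.cos(2 * math.pi * 12 * t / n) for t in range(n)]
coeffs = fft(signal)
energy = sorted((abs(c) ** 2 for c in coeffs), reverse=True)
kept = sum(energy[:4]) / sum(energy)   # energy captured by the top 4 bins
print(f"top 4 of {n} coefficients hold {kept:.1%} of the signal energy")
```

Keep a handful of coefficients, throw the rest away, and you have retained the information while shedding the detail: that is the compression intuition in miniature.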

Hhmm. More than 50 years ago, Tukey pioneered the view that the way to use “sloppy” big data was to distill the necessary information from it in an imprecise but robust way rather than pretending the data were better because they were bigger and erroneously supported over-fitting statistical models.

Hopefully it will not take another 50 years for the Big Data folks to recognize that trillions of data points may hide the truth and that the solution is to pass out some red and blue pencils and used envelopes. Tukey knew that 50 years ago.

All it “costs” to adopt Tukey’s methods is a little common sense.

Hhmm, maybe the Tukey approach is not so feasible. Big Data proponents at the current time seem to lack, in aggregate, the amount of common sense necessary to implement Tukey’s methods.

Turn off the computers in third grade, pass out the pencils, and let’s teach the next generation not to worship Big Data or statistical models that seem far more precise than the data.

John W Tukey

Big Data (in service to the NSA) wants to be able to document what you do and when and where and with whom. All of the current databases that companies and public agencies maintain can now be tightly linked to get a pretty good profile of any individual.

But these models of what people will do when you ask them to buy a DVD of Thor 2 or a suit from Brooks Brothers are actually fairly dumb brute-force computer algorithms that break down when certain types of problematic data are fed into them.

Hhhhmmm. Some thoughts below in the mind map. Click the image twice for a full expansion.


Remember the “gold standard” research paradigm for determining if a medical treatment works: the DOUBLE BLIND, RANDOM ASSIGNMENT EXPERIMENT?

The design has historically been considered the best way to “prove” that new medical interventions work, especially if the experiment is replicated a number of times by different research teams. With a double-blind design (neither the treating medical team nor the patient knows whether the patient is taking a placebo or the active medication), investigators expect to cancel out the placebo effects caused by patient or medical staff beliefs that the “blue pill” is working.

A key part of virtually all double-blind research designs is the assumption that all patient expectations and reports are independent. This assumption is made because of the statistical requirements necessary to determine whether a drug has had a “significantly larger effect” as compared to a placebo. Making this assumption has been a “standard research design” feature since long before I was born more than 60 years ago.
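How much does violating independence matter? A standard textbook result says that if the n reports in an arm are equicorrelated with correlation rho rather than independent, the variance of their mean is inflated by the design effect 1 + (n - 1) * rho, so the effective sample size shrinks. The numbers below (n = 200 and the rho values are my own picks for illustration) show how fast a trial's nominal sample size can evaporate:

```python
# If the n patient reports in a trial arm are equicorrelated with
# correlation rho instead of independent, Var(mean) is multiplied by the
# design effect (1 + (n - 1) * rho), shrinking the effective sample size.

def effective_n(n, rho):
    """Effective sample size of n equicorrelated observations."""
    return n / (1 + (n - 1) * rho)

n = 200  # nominal patients per arm (hypothetical)
for rho in (0.0, 0.05, 0.10, 0.30):
    print(f"rho = {rho:.2f}: {n} patients carry the information of "
          f"{effective_n(n, rho):.1f} independent ones")
```

Even a modest rho of 0.10 shrinks 200 patients to the information content of fewer than 10 independent ones, which is exactly why cross-talk among trial participants is not a cosmetic problem.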


Google the name of a new drug in clinical trials. You will find many (hundreds, thousands) of posts on blogs, bulletin boards for people with the conditions being treated with the experimental drug, and social media, especially Twitter and Facebook. Early in most clinical trials participants start to post and question one another about their presumed active treatment or placebo status and whether those who guess they are in the experimental condition think the drug is working or not. Since the treatments are of interest to many people world-wide who are not being treated with effective pharmaceuticals, the interest is much greater than just among those in the study.

Google the name of a new drug being suggested for the treatment of a rare or orphan disease that has had no effective treatments to date and you will find this phenomenon particularly prevalent among both patients and caregivers. Hope springs eternal (as it SHOULD), but it can also affect the research design. Obviously, data that are “self-reported” on patient or caregiver questionnaires can be affected by what circulates on the Internet: what “the guy in Wyoming says,” or what the caregiver of “the woman in Florida” reports.

OK, you say, but medical laboratory tests and clinical observations will not be affected, because these indices cannot be changed by a patient’s belief that they are in the experimental or placebo condition. Hhmmm. Sam in Seattle just posted that he thinks he is in the experimental condition and that his “saved my life” treatment works especially well if you walk 90 minutes a day, or take a specific diet supplement, or follow a berry-and-cream diet. Mary in Maine blogs that her treatment is not working, concludes she must be in the placebo condition, becomes very depressed, and subsequently makes a lot of changes in her lifestyle, often forgetting to take the other medications she reported using daily before the placebo or experimental assignment was made.

Do we have research designs that account for the visible (blogs, tweets, bulletin boards) and invisible (email, phone) communication among research participants during a clinical trial? No. Does this communication make a difference in what the statistical tests of efficacy will report? Probably. And can we ever track the invisible communications going on by email? Note that patients who do not wish to disclose their medical status will be more likely to use “private” email than the public blog and bulletin board methods.

Want an example? Google davunetide. This was supposed to be a miracle drug for the very rare neurodegenerative condition PSP (progressive supranuclear palsy). The company (Allon) that developed the drug received huge tax incentives in the USA to potentially market an effective drug for a neglected condition. The company, of course, was well aware that after getting huge tax incentives to develop the pharmaceutical, if the drug were to prove effective in reducing cognitive problems (as was thought), it would then be used with the much more common (and, from the standpoint of Big Pharma, lucrative) neurodegenerative disorders (Alzheimer’s, Parkinson’s) and schizophrenia.

Patients scrambled to get into the trial because an experimental medication was better than no medication (as was assumed, although not necessarily true) and the odds were 50/50 of getting the active pills.

Patients and caregivers communicated for more than a year, with the conversations involving patients from around the world. In my opinion, the communications probably increased the placebo effect, although I have no data nor statistical tests to “prove” this, and it is pure conjecture on my part.

The trial failed miserably. Interestingly, within a few weeks after announcing the results, the senior investigators who developed and tested the treatment had left the employ of Allon. Immediately after the release of the results, clinical trial participants (the caregivers more than the patients) started trading stories on the Internet.

Time to get our thinking hats on. I worked on methodological problems like this for 30+ years, and I have no solution; nor do I think this problem is going to be solved by any individual. Teams of #medical, #behavioral, #communication, and #statistical professionals need to be formed if we want to be able to accurately assess the effects of a new medication.

Click on the image to expand.

Clinical Trial Double-Blind Treatment Evaluation in the Era of the Internet

Trout is a program I tried to “get” for two years. Billed sometimes as a mind mapping program, it is, by its own developer’s admission, not really a mind mapping program. It produces odd diagrams that look (at best) like spider maps.

The most recent revision for iPad and Mac just came out with greatly improved usability. I finally “got” it (or have deluded myself into believing I have finally understood the intent and uses of the program).

Trout is a brilliant tool for building maps of content links between a number of snippets of information. Get it, spend an hour with it, and you will know how to manually or AUTOMATICALLY sort a large number of text snippets into a very usable visual form.


Each of the map links in this example came from automated link building using simple default rules. Colors and shapes are arbitrary in this example. Click on images to expand.


This second version shows all possible automatic links using the default definition. Not especially useful in this form.


The third version shows all of the links involving the large central (title) circle.


The fourth version shows all of the links associated with the top yellow square.


You get a fast data summary if you import text snippets from a CSV file and use the automatic link building method (which can also differentiate between types of content, and color- and shape-code automatically using your rules). I find it very useful. But you will have to spend an hour experimenting with this program to “understand” it and see how useful it is.
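I do not know Trout's actual linking rules, but the general idea of automatic link building from a CSV of snippets can be sketched like this: treat two snippets as linked when they share a significant word. Everything below (the toy CSV, the stop-word list, the one-shared-word rule) is my own invention for illustration, not Trout's algorithm:

```python
# Sketch of "automatic link building" between text snippets: connect any
# two snippets that share at least one non-trivial word. Purely a stdlib
# illustration of the general concept.
import csv, io, itertools

csv_text = """id,snippet
1,big data needs content experts
2,content experts understand limitations
3,robust statistics tame outliers
4,outliers distort big data models
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
stop = {"needs", "the", "a", "and"}               # throwaway words to ignore
words = {r["id"]: set(r["snippet"].split()) - stop for r in rows}

links = [(a, b) for a, b in itertools.combinations(words, 2)
         if words[a] & words[b]]                  # share at least one keyword
print(links)
```

A real tool would add rules for typing, coloring, and shaping the nodes, but the core "snippets that mention the same thing get wired together" step is this simple.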

Unrelated except for my play on the title …



Trout Fishing in America

Big Data/Data Science 1

Big Data/Data Science 2

Big Data/Data Science 3

Big Data/Data Science 4

Big Data/Data Science 5

Big Data/Data Science 6

Big Data/Data Science 7

Big Data/Data Science 8

Big Data/Data Science 9

Big Data/Data Science 10

Big Data/Data Science 11

Big Data/Data Science 12

Big Data/Data Science 13

Big Data/Data Science 14

A few thoughts about the importance of knowing the theories and prior studies in the content area of the modeling and data collection and data analysis and generation of conclusions.

You can’t model data without knowing what the data mean.

Click on mind map to expand.

Data Scientist

We have had many data science fields in the past 50 years. These include applied statistics, biostatistics, psychometrics, quantitative psychology, econometrics, sociometrics, epidemiology, and many others. The new emphasis on data science ignores content knowledge about the data, their limitations, and the permissible conclusions.

We do not need to replace a round wheel with a square one.

See also previous post on Big Data/Data Science adopting the mistakes of Big Pharma.

a HubaMap™ by g j huba phd

Dec 13 2013: I have been experimenting with some formatting. This is the same map content as above, but using iMindMap 7 which was recently released.

Data Scientist sketch

Can Big Data/Data Science avoid the train wreck of Big Pharma? I believe that the Big Data disaster will make the Big Pharma issues seem small in comparison.

But the issues will be about the same. A lot of the Big Pharma execs have become quite skilled at “beating the system” using “undocumented science” and many will move to Big Data and employ all of their very “best” moves and tricks. Big Data/Data Science has the potential to hurt the average individual even more than the greediness of Big Pharma.

Big Pharma

Big Pharma Train Wreck

Big Data Train Wreck


HubaMap™ by g j huba phd

This afternoon I went to the local Panera and paid by credit card. My bank declined my charge of $4.82. I figured either the magnetic strip on the card had failed or the new trainee at the cash register had made a mistake. She ran the card three more times and it was rejected each time. Then I got four text messages from the bank saying that they were rejecting my charges. To text me, they used my phone number.

I called. They had put a hold on my card because they had some questions about my charges from the prior few days. The red flag event was that I had made an earlier charge of $9.65 at Panera about eight hours before. Their computer program was not smart enough to figure out that it was not unreasonable for someone to have breakfast at 6:30am at a Panera in Durham and then, with 30 minutes to kill later in the day, walk into a Panera in Chapel Hill and have a coffee (and a Danish I probably should not have had) while playing with my iPad on their free wireless connection. The computer also questioned the $1 charge at a gas station this afternoon, which the human representative immediately recognized as the established practice of gas stations: their automated payment systems open a charge line with $1 when you swipe your card and then, the next day, put a $92 charge on the card for filling the tank. I was also asked whether the payment made on the account was one I had made (I asked the customer service rep whether she thought that, if someone had paid a bill for me, I would tell her it was an erroneous transaction, and she laughed for a long time), as well as about a $71 charge to a software company outside the US.

They had freaked out because they could not reach me by phone at three old numbers that are no longer active (I know they have my current number, because they sent me texts at it, and the same bank sometimes calls about my other accounts at the cell phone I never turn off and which has a voice mailbox). Of course, if their texts had not come from a no-reply address, I could have responded to the four texts they sent.

Predictive models have been around in banks for a decade or more as they attempt to identify fraud and protect themselves. The episodes I have with my bank every 2-3 months illustrate what happens when somebody blindly runs predictive analytic programs over big datasets without using some common sense to guide the modeling process. Just because anyone can buy a $100,000 program from IBM or others for developing predictive analytics does not mean that the model that comes out of the Big Data and the expensive program makes any sense at all.

Or that the NSA or FBI or CIA or Google or Amazon models make much sense as they probe your private information.

If a computer predictive system is going to think that somebody is committing credit card fraud because they purchase two cups of coffee at the same national restaurant chain in a day, we are in big trouble.
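The Panera episode boils down to a rule something like the one below. The rule and the transaction list are hypothetical reconstructions on my part, but they show how a brittle "same merchant twice in one day" check flags a perfectly ordinary day:

```python
# Sketch of the kind of brittle fraud rule described above: flag a card
# if the same merchant appears twice in one day. Rule and data are
# hypothetical; the point is that two coffees at one chain trip the alarm.
from collections import Counter

day_of_charges = [
    ("Panera", 9.65),       # breakfast in Durham
    ("Gas station", 1.00),  # pre-authorization hold
    ("Panera", 4.82),       # afternoon coffee in Chapel Hill
]

merchant_counts = Counter(merchant for merchant, _ in day_of_charges)
flagged = [m for m, count in merchant_counts.items() if count >= 2]
print(flagged)   # a perfectly ordinary day gets flagged
```

A model built with some common sense would condition on amount, location, and the cardholder's history before crying fraud; this one cannot tell a coffee habit from a stolen card.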

The bottom line is that Big Data models are going to have to be regulated before some idiot accidentally turns on Skynet.

Or maybe the problem is that the NSA or FBI or CIA or Google has done it already.


Irv Oii is known to many international news organizations and researchers as a star data journalist. Being a home worker (although home may be the UK, Ohio, the Middle East, Central Africa, Hong Kong, or Antarctica) and a fairly reclusive person, nobody seems to have met Irv. Some speculate that he might be a Jewish Asian-American. Others believe Irv is short for Irvelina, a Russian immigrant physician who went to Ohio (or was it Ojai, California?) when the Soviet science programs collapsed and turned into the lower-funded Russian collaborative efforts with the EU and USA; the collapse of the Soviet Union resulted in the closing of her laboratory in Minsk. Some even think Irv Oii is an acronym.

Irv is thus an enigma and no pictures of her/him seem to exist. An artist’s conception (mine) based on the writings and consultations of Irv Oii on healthcare breakthroughs is shown below. My belief is that a portrait of Irv should hang over the desk of every data journalist and researcher.

Please click the image to zoom.

Irv Oii

Click on mind map to expand.

academia and healthcare big data



Big data this, big data that. Wow. At the end we will have better ways to sell underwear, automobiles, and “next day” pills (although in the latter case politics and religion might actually trump Amazon and Google). Blind empiricism. Every time you click a key on the Internet it goes into some big database.

“Little data” — lovingly crafted to test theories and collected and analyzed with great care by highly trained professionals — has built our theories of personality, social interactions, the cosmos, and the behavioral economics of buying or saving.

Big data drives marketing. Little data drives the future through generalizable theory.

Click on the figure below to zoom.

in praise of little data




Sketchnote Example: My Predictions of Changes in the Field of Psychology Over The Next 20 Years


There are lots of different applications of mind mapping methods to such areas as brainstorming, task management, scheduling, journaling, and sharing basic information (great day to play basketball!). Other mind maps may tell us about scientific experiments and theories, political arguments, historical events, anatomical features of the human body, the quality of hotels in Barcelona, or expert rankings of the world football (soccer) teams projected to finish near the top in the World Cup tournament. How do you know a real expert has ranked your favorite football teams correctly? How do you know that the student who created the cute mind map of the human body as a subway map actually labeled the parts with the correct names? What are the professional qualifications of the “expert” who says the world is flat? Do experts believe the purported expert who drew the mind map? Is the information in the mind map you found and downloaded from the Internet really going to tell you what you need to know for your organic chemistry test in two hours?

I sure hope my doctors studied from factually correct mind maps, not just pretty ones given away by a pharmaceutical company. And (since I have a doctorate in psychology), I am really sick of seeing mind maps that say they contain psychological principles that will make you happier, thinner, less anxious, more sexy, and help you self-diagnose whether you have bipolar disorder and which drug would be best to help you and should be ordered from an Asian or Mexican pharmacy over the Internet (URL at the bottom of the map).

Mission-critical information in mind maps should be carefully reviewed by experts in the content of the maps to minimize the number of cases where misinformation hurts people. If such a review has not been done, or if the author of the mind map does not provide adequate credentials to assess professional competence, I recommend you do not use such information for making personal or business decisions. I love artistic maps that are well designed and “clean” in their appearance, and I spend a lot of time trying to emulate the best, but adherence (or not) to the mind mapping rules of Tony Buzan and the use of a wonderfully artistic program in no way make the information in the maps correct. Think about that carefully the next time you download a mind map from the Internet and try to study from it or make a business decision; that’s a fact, Jack.

It’s also a fact that these comments apply equally to infographics, concept maps, and other information visualizations.

My next post is going to have a lot to say about the importance of content and how to assess whether that pretty map you just found contains valid, reliable, and important information.

Some more of my thoughts …

Should Mind Maps Be Reviewed


Keyword Board

topics and subtopics: should mind maps and templates be reviewed? probably not audience you only internal work group intended use personal planning personal/group notes brainstorming journal diary task management scheduling type of information common knowledge 12 inches 1 foot green traffic light go usa flag red, white, blue shoes sold in pairs cover feet simple facts address size weight color presented as opinion no or minimal harm if misinterpreted inappropriately applied yes audience general internet textbook presentations heterogeneous broad background expertise experience general intended use present facts present theory learning tool group textbook as summary facts findings opinions consensus judgments type of information data-supported expert judgment best ice skater best baker best decision consensus presented as fact potential harm if misinterpreted inappropriately applied expert (peer) review best © 2013 g j huba phd some definitely yes opinion expert informed most probably not

BIG Data is coming (or has already come) to healthcare. [It is supposed to usher in new eras of research, economic responsibility, quality of and access to healthcare, and better patient outcomes, but that is a subject for another post, because it would be putting the cart before the horse to discuss it here.]

What is a data scientist? A new form of bug? A content expert who also knows data issues? An active researcher? Someone trained in data analysis and statistics? Someone who is acutely aware of the relevant laws and ethical concerns in mining health data? A blind empiricist?

This is a tough one, because it also touches on how many $$$$$ (€€€€€, ¥¥¥¥¥, £££££, ﷼﷼﷼﷼﷼, ₩₩₩₩₩, ₱₱₱₱₱) individuals and corporations can make off the carcass of a dying healthcare system.

Never one to back away from a big issue and in search of those who value good healthcare for all over the almighty $ € ¥ £ ₨ ﷼ ₩ ₱, here are some of my thoughts on this issue.

Click image to zoom.

who is a health data scientist

Content knowledge by a well-trained, ethical individual who respects privacy concerns is Queen. Now and forever.

Keyword Board

topics and subtopics: who is a “health” data scientist? trained in healthcare? methodology research databases management information systems psychology? psychometrics other public health? epidemiology other medicine? nursing? social work? education? biostatistics? medical informatics? applied mathematics? engineering? theoretical mathematics? theoretical-academic statistics? information technology? computer science? other? conclusions must know content 70% methods 30% must honor ethics 100% laws practice privacy criminal civil federal state other greatest concerns correctness of results conclusions ethical standards meaningfulness validity reliability privacy utility expert in content field data analysis data systems ethics and privacy other member? association with ethics standards licensed? physician nurse psychologist social worker other regulated? federal hipaa state other insured? professional liability errors and omissions continuing education requirements? ethics renewal of licensure regulatory standards insurer commonsense laws go away if not well trained content field data analysis not statistics committed clean data meaningfulness subject privacy peer review openness ethics ethics ethics are arrogant narrow-minded purely commercial primarily motivated $$$$$ blind number cruncher atheoretical © 2013 g j huba

I wouldn’t go on a bus trip with a driver who is unlicensed. Would you?

Who is driving the Big Data bus? Data scientists? Mindless algorithms? Content experts and their teams of data scientist support staff? Marketing? Security firms (including those run by governments)? Terrorists?

I say this once, I will say this a million times … Content is Queen.

Algorithms that are primarily empirical without an understanding of the validity of the data being analyzed and the theoretical issues are dangerous.

An algorithm can predict — and I have no doubt several are doing so at this minute — how happy I will be on a global question (how happy are you?) or a behavioral index (at a sporting event, at the bank cashing a check, four days after the death of a parent) or the perceptions of others (just got tagged in somebody’s photo, got mentioned in a tweet, had a happy blog entry, had a birthday, just had a child born, got back a favorable medical test result, used a smiley face).

I have observed and analyzed and proposed new ways of measuring “happiness” and “anxiety” and “grieving” and “intelligence” for 40 years. I don’t really know what “happiness” or “anxiety” or “grieving” or “intelligence” is although I do know a lot about how experts have tried to define these constructs. I do know that a blind algorithm is not going to answer the question of what “happiness” is.

Do you want an algorithm driving the bus or someone who knows the limits of current data? I don’t want a blind algorithm predicting whether I am “happy” (and happy enough to buy something). I don’t want a blind algorithm predicting the economy. I don’t want a blind algorithm predicting how many healthcare visits I should receive under health insurance.

Content is Queen. The algorithms that drive the organization of Big Data need to be guided by content specialists (psychologists, sociologists, physicians, nurses, economists, physicists, chemists, bioelectrical engineers, etc.) not data scientists without expertise in one or more of the relevant content fields.

If the Queen rules, all will probably be well in the kingdom. If blind algorithms rule we probably will end up as batteries in The Matrix.

I vote (before it is too late) for the monarchy of content. I am not a battery.


Evaluation 4

Aaahhh… GiGo (garbage in/garbage out). The GiGo phenomenon haunts data analysts, statisticians, researchers, theorists, and anyone who has had their identity lost or stolen.

So these huge [health] datasets we keep hearing about … who controls them? what is their validity? reliability? utility? who else gets to see them?

And the data mining algorithms… proprietary or public? based on which tests and algorithms? who developed? who validated? are the methods valid? reliable? have utility?

And the results coming out of big data and proprietary data mining algorithms… reliable? valid? useful? clearly interpreted? limitations stated? misinterpreted?

Are big data and data mining about using worldwide data to find solutions to some of the world’s problems, or about selling more books, videos, and cola?

I don’t think anyone really understands the big datasets and their limitations. I doubt that more than a small percentage of the data mining algorithms are valid. I sure as hell do not want somebody blindly using these algorithms on data they do not understand and then helping the government limit healthcare visits for high-need, low-resource individuals (sound familiar to anyone?).

An experienced statistician/data analyst/methodologist knows that when analyzing a large dataset you must spend 98% of your time looking at (and, if possible, fixing) bad data points. The final 2% of your work is then much more likely to show something that is reliable, valid, and useful.
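One workhorse for that 98% is Tukey-flavored robust screening: flag points that sit too many median absolute deviations (MAD) from the median. The 3.5 cutoff and the 0.6745 normal-consistency constant are conventional choices, not laws, and the data below are made up for illustration:

```python
# Screen for bad data points with the median absolute deviation (MAD),
# a robust yardstick in the Tukey tradition: the outliers themselves
# cannot corrupt the scale used to detect them.
from statistics import median

def mad_outliers(xs, cutoff=3.5):
    """Return the points whose robust z-score exceeds the cutoff."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    if mad == 0:
        return []   # no spread at all; nothing to flag
    # 0.6745 rescales the MAD to match a standard deviation under normality
    return [x for x in xs if abs(0.6745 * (x - med) / mad) > cutoff]

data = [12.1, 11.8, 12.4, 12.0, 11.9, 250.0, 12.2, -99.0]
print(mad_outliers(data))
```

Had we screened with the ordinary mean and standard deviation instead, the two wild points would have inflated the yardstick enough to partly hide themselves; that is exactly the failure the robust version avoids.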

Big Data may save us, or it might kill us first. Or it might make us Borg or batteries.

No mo …. GiGo. [Is Nicki Minaj available to record this mantra?]