Data is a great storyteller. Hiding in the numbers are answers to be found and secrets to be revealed. But while data rarely lies, it can be all too easy to jump to the wrong conclusions when looking for a pattern.
Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Revised and Updated Edition, Wiley, January 2016), published an interesting post about successfully finding correlations.
He starts with a simple statement: according to Harvard University, skipping breakfast is related to heart disease. He then explores how such attention-grabbing headlines can easily, yet inaccurately, be inferred from data.
Sure, skipping breakfast in itself may not lead to heart disease, despite what the data suggests. But look a little deeper, and it makes more sense. Skipping breakfast could be a sign of a certain type of lifestyle (busy, stressful or lazy), one that also makes it more likely that fast food or processed foods feature heavily in someone’s diet, possibly along with a regular alcoholic drink or two to combat the stress and help unwind.
In fact, Harvard University also reported that study participants who skipped breakfast were more likely to drink alcohol, smoke and take less exercise than those who regularly ate breakfast.
So you see, the suggestion that ‘skipping breakfast causes heart disease’, while an over-claim, does present an interesting signpost to where the truth lies. It’s something to consider when looking at data: correlation doesn’t imply causation, but it can be a good place to start looking for it.
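To make the breakfast example concrete, here is a minimal simulation sketch of a hidden lifestyle factor at work. Everything in it is an illustrative assumption rather than data from the Harvard study: the probabilities are made up, and the names `simulate` and `pearson` are hypothetical helpers. In this toy world, skipping breakfast has no effect on heart disease at all, yet the two still correlate because a stressful lifestyle drives both.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=20000, seed=0):
    """Return (stressed, skips_breakfast, heart_disease) rows as 0/1 values.

    Assumed, made-up probabilities: a stressful lifestyle raises the chance
    of skipping breakfast AND the chance of heart disease. Breakfast itself
    never enters the heart disease calculation.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stressed = rng.random() < 0.5
        skips = rng.random() < (0.7 if stressed else 0.2)
        disease = rng.random() < (0.30 if stressed else 0.05)
        rows.append((int(stressed), int(skips), int(disease)))
    return rows


rows = simulate()

# Correlation across everyone: breakfast skipping appears linked to disease.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each lifestyle group: the link should largely vanish.
within = {
    group: pearson(
        [r[1] for r in rows if r[0] == group],
        [r[2] for r in rows if r[0] == group],
    )
    for group in (0, 1)
}

print(f"overall correlation: {overall:.3f}")
for group, value in within.items():
    label = "stressed" if group else "not stressed"
    print(f"within '{label}' group: {value:.3f}")
```

Comparing like with like (stressed with stressed, relaxed with relaxed) is one simple way to check whether a headline correlation survives once an obvious lifestyle factor is accounted for.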
Tyler Vigen, author of Spurious Correlations (Hachette Books, May 2015), gives many examples that highlight the dangers of inferring causation from data.
Here are a couple of even more surprising correlations than our breakfast anecdote. Below is a chart of per capita cheese consumption in the United States from 2000 to 2009.
You’ll see there is a 94.71% correlation between per capita cheese consumption and the number of people who somehow became tangled in their bedsheets and died.
So, based on the statistics, would you say that eating cheese causes death by bedsheets? Of course not. It’s a prime example of why correlation does not imply causation.
Likewise, let’s look at the risks involved in consuming another dietary staple, margarine. According to the chart below, margarine consumption has a 99.22% correlation with the divorce rate in Maine. If you believed the correlation implied causation, then abstaining from margarine, while making for dry toast, would also make for a long and healthy marriage.
Interestingly, it can be easy to see a connection even when there isn’t necessarily one. A 98% statistical correlation between per capita corn syrup consumption in the US and pedestrian fatalities on the nation’s roads is clearly coincidence. But if the same statistic were true of vodka, or beer, would you say one caused the other?
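One common reason unrelated measurements correlate so strongly is that both simply drift in the same direction over time. The sketch below illustrates this with two made-up series that share nothing except an upward trend; the slopes, noise levels and the helper names `trending_series` and `differences` are assumptions for illustration, not Vigen's data or method. It uses 200 points rather than the ten years shown in the charts so that the result is stable. A quick sanity check is to correlate the period-to-period changes instead of the raw levels.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def trending_series(n, slope, noise_sd, rng):
    """A straight-line trend plus independent Gaussian noise (made-up data)."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(series):
    """Change from each point to the next, which removes a linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(42)

# Two series generated independently; the only thing they share is "up".
series_a = trending_series(200, slope=1.0, noise_sd=5.0, rng=rng)
series_b = trending_series(200, slope=3.0, noise_sd=15.0, rng=rng)

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of raw levels:  {r_levels:.3f}")
print(f"correlation of the changes: {r_changes:.3f}")
```

If a striking correlation largely disappears once the shared trend is removed, that is a hint the two series were only travelling in the same direction, not influencing each other.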
Finally, let’s identify some data that offers an even more tempting conclusion to jump to. If you’re an avid internet addict, you may well have come across ‘Pastafarians’, the cult-like individuals (of questionable sanity) who believe the universe was created by a “Flying Spaghetti Monster” and that global warming is a consequence of the “shrinking numbers of pirates since the 1800s”. This statement supports the Pastafarian belief that pirates are good divinities, of course.
Their founder, Bobby Henderson, presented the evidence for his statement with a graph like this:
Of course, if you believed in every correlation, the graph presents a foregone conclusion. Joking aside, these examples serve as important reminders. No matter how good your data, or how much you want something to be true (Leonardo DiCaprio achieving his sustainability goals by donning an eye-patch and training a parrot does have a certain appeal), correlation simply does not imply causation. Unless you’re a Pastafarian, in which case, good luck to you.