fbpx

P-hacking and Data Science

For those of you not familiar with the practice, there’s an idea called “p-hacking.” If you remember from your science fair days, you first have a hypothesis, you then design an experiment and collect data that either supports or refutes the data. In p-hacking, you just collect a large breadth of data (that is, many different data points from each observation) and see if any of the relationships turn our statistically significant. It’s become a larger and larger problem, but that’s not even the half of it.

Back in my home field of data science, it is not unusual to produce a model that has no statistical validity in any of the relationships, but model tends to predict well, therefore, it must be a “good” model. The most extreme example I’ve ever seen was a contest run on Kaggle where the data consisted of more than 700 columns for each observation, and they were just labeled “f1,” “f2,” and so on to whatever the max was. Lazy column naming is one thing, but we were given no guide to what the data meant. The purpose of the contest was to find relationships, regardless of what the data meant.

I’ve spent my entire career fighting this, pointing out that unless you have a causal theory, your model does not hold up, no matter how well it predicts over the test dataset. This problem has totally blown up now that we build models we don’t even understand, such as the more advanced neural networks. If we cannot explain it, it is not a valid model.

Back to p-hacking,the p-value may be widely misunderstood, but setting aside people gaming the system, it’s just not communicated well, either. It’s never made clear that p = 0.05 means we have a 1/20 shot of having gotten it wrong. I also think a lot of the blame for that falls on the insistent media reporting of scientific “results” without the ability to contextualize the findings for viewers or readers.