Learn from the Reproducibility Crisis in Science

Key highlights from Clare Gollnick’s talk, “The limits of inference: what data scientists can learn from the reproducibility crisis in science”, are covered in this Domino Data Science Field Note. The full video is available for viewing here.

Introduction

In her Strata San Jose talk, “The limits of inference: what data scientists can learn from the reproducibility crisis in science”, Clare Gollnick discussed how challenging it is to turn data into insights. Inference is a tool that data scientists use to uncover insights, but it can be broken by continuous searching, by p-hacking, and by overfitting in machine learning. These practices have potentially contributed to the reproducibility crisis in science. Gollnick, a former data scientist and now CTO at Terbium Labs, advocated that data scientists learn from the reproducibility crisis, recognize the limitations of inference, and use inference as a tool for the appropriate problem.

Reproducibility Crisis

In 2011, a VC mentioned that an “unspoken rule is that at least 50% of the studies published in top tier academic journals — Science, Nature, Cell, PNAS, etc...—can’t be repeated with the same conclusions by an industrial lab”. In her talk, Gollnick cited multiple studies that point to a reproducibility crisis in science, including the Bayer 2011 study, the Amgen study, and the Many Labs Replication Project. This crisis may have resulted in hundreds of people studying, and building careers on, supposedly significant effects that potentially do not exist. Gollnick also indicated that “a fundamental flaw in our system of logic that we use to infer from data” has contributed to the crisis. As the title of the talk states, data scientists have the opportunity to learn from it.

Limitation of Inference

Gollnick indicated in the talk that “inference is broken by searching”. For example, Gollnick noted that “p-values in hypothesis testing are quantification of a surprise”. The more people search for surprises, the more often they will see them. Likewise, running and rerunning a model on the same data until it produces the significant result someone wants breaks inference; this practice is also known as p-hacking. Overfitting in machine learning is another example of broken inference. Overfitting occurs when someone “search[es] for too many models, when [they are] willing to consider too many hypotheses at once, when [they] search for too many parameters, or give it too many degrees of freedom for [the] model”. Cross-validation is used in machine learning to guard against overfitting. Cross-validation includes “separat[ing] your training data, generat[ing] hypotheses and models on that, and then test it once”. Gollnick also suggested that data scientists, where they can, frame the problem so that they “do as little inference as physically possible…. [and] rely upon learning from the data in a minimal amount that you possibly can because deduction is a much stronger system of logic than induction”.
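To make "inference is broken by searching" concrete, here is a minimal sketch that is not from the talk. It assumes a two-sided z-test with known variance, a 0.05 significance threshold, and independent tests; the function names, sample size, and seed are hypothetical choices for illustration. Every simulated comparison draws both groups from the same distribution, so any "significant" result is a false positive.

```python
import math
import random


def two_sided_p_value(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (sigma assumed known)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    standard_error = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)
    z = diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(num_tests, sample_size=30, alpha=0.05, seed=0):
    """Fraction of null comparisons (no real effect) that look significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        # Both groups come from the same N(0, 1) distribution.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if two_sided_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits / num_tests


def chance_of_at_least_one_surprise(num_tests, alpha=0.05):
    """Probability of one or more false positives across independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    print("Share of null tests with p < 0.05:", false_positive_rate(2000))
    for k in (1, 5, 20, 100):
        print(k, "tests -> chance of a 'surprise':",
              round(chance_of_at_least_one_surprise(k), 3))
```

Under these assumptions, roughly one null comparison in twenty clears the threshold, so the chance of finding at least one spurious "surprise" climbs quickly as the number of searches grows.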

Potential Implications of False Positives

In the talk, Gollnick covered how understanding the limitations of inference leads to a “powerful understanding of data as a tool that you can use to be a better data scientist.” For example, Gollnick cited how widespread screening for cancer in the past decade led to false positives. Screening protocols have since shifted toward screening less often and considering whether the person is part of a high-risk group. While someone may think a false positive is the preferable error, Gollnick asked the audience to consider “the possibility that we may be treating people for cancers they don’t have. Think of what chemo [does]…people can die from their treatment of cancer.” Gollnick noted that the approach of “looking less and in more targeted populations” is being used today “because data and evidence works better when your search is more targeted.”
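The benefit of screening "in more targeted populations" can be illustrated with Bayes' rule. The sketch below is not from the talk: the sensitivity, specificity, and prevalence values are hypothetical numbers chosen for easy arithmetic, and the function name is an assumption. It computes the positive predictive value, meaning the probability that a person who tests positive actually has the disease.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical test characteristics, not figures from the talk.
    sensitivity = 0.9
    specificity = 0.9
    for label, prevalence in (("broad population", 0.01),
                              ("high-risk group", 0.10)):
        ppv = positive_predictive_value(prevalence, sensitivity, specificity)
        print(label, "-> share of positives that are real:", round(ppv, 3))
```

With these illustrative numbers, the same test is far more trustworthy in the high-risk group than in the broad population, because a positive result starts from a higher prior probability of disease.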

Conclusion

In the Strata talk, Gollnick covered how data and inference are tools that have the potential to be used effectively for the right problem. Yet tools also have limitations. For example, Gollnick described how scientists who didn’t consider how to “make sense of data, the limits of inference” contributed to the reproducibility crisis. Gollnick wrapped up the talk by reiterating that data is a tool and “when you’re a practitioner of data science, pick the right problem if you have the option.”

Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more, that help data scientists and data science leaders accelerate their work or careers. If you are interested in your data science work being covered in this blog series, please send us an email at writeforus(at)dominodatalab(dot)com.