The enterprise platform to build, deliver, and govern AI
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
Data science is no longer a specialization of a single person or small group. It is now a key source of competitive advantage, and as a result, the scale of projects continues to grow. Collaboration is critical because it enables teams to take on larger problems than any individual. It also allows for specialization and a shared context that reduces dependency on "unicorn" employees who don't scale and are a major source of key-man risk. The problem is that collaboration is a vague term that blurs multiple concepts and best practices. In this post, we clarify the differences between repeatability, reproducibility, and whenever possible the golden standard of replicability. By establishing best practices of frictionless in-team and cross-team collaboration, you can dramatically improve the efficiency and impact of your data science efforts.
Starting at the base, repeatability is a critical building block of a collaborative data science process. Repeatability is the idea that a given process (whether it be a data cleaning script, a feature engineering pipeline, or a modeling algorithm) will produce the same (or nearly the same) output given similar inputs. In order for data scientists to be able to collaborate, they must be able to rely on the instruments and procedures they have been consistent. This often surfaces itself as a challenge in data science collaboration with environmental and data instability.
I often talk to scientists and researchers who have the problem of “it worked on my computer.” Even worse, some underlying system library or dependency would cause an algorithm to produce one signal on one scientist’s machine, and when run on another scientist’s machine or a production environment, it would produce a different signal altogether.
One of the scariest examples we have seen of the perils of not having an environment that guarantees repeatability was shared with me at a conference by quantitative research at a company in the financial sector. They had built a model that generated a trading strategy and backtested it extensively. This model, when run on the original author's laptop, gave a signal that a particular security should not be traded as it would almost certainly lead to a significant capital loss. When the researcher handed off the code to one of the analysts who was in charge of the actual transactions, they ran it on their laptop and it generated strong buy signals for a number of assets. This ended up costing the organization a significant amount of money, as the signals were fundamentally flawed.
After a significant amount of finger-pointing, they discovered that the two data scientists had a different point release of an underlying date parsing library. For the original scientist badly formed dates were parsed as NA, whereas for the second scientist, they were parsed as 1970-01-01. This meant that the model they generated differed in behavior, and generated completely different outputs even on the same data.
Attempting to diagnose whether a computing stack is repeatable can be a significant challenge. Modern computing environments are an unimaginably large stack of oftentimes opaque layers of abstraction, and finding the one library dependency that was updated to a different point revision (even given tools such as conda which attempt to deterministically build exact environments) can be sisyphean task.
Any practice of collaboration built without a fully repeatable environment is unlikely to be able to prove itself valuable or be adopted, as these little subtle differences and changes in behavior and expectation can erode confidence in the system and provide a high barrier to adoption.
The next step is reproducibility. Victoria Stodden, Associate Professor at the School of Information Sciences at the University of Illinois, described a powerful taxonomy of reproducibility which describes its three facets: statistical reproducibility, empirical reproducibility, and computational reproducibility.
An analysis is statistically reproducible when detailed information is provided about the choice of statistical tests, model parameters, threshold values, etc. This mostly relates to pre-registration of study design to prevent p-value hacking and other manipulations. Statistical reproducibility is something that should be enforced via top-down mandates, as well as through peer review and documentation. Teams should have access to a knowledge repository of organizational best practices regarding statistical options, and be provided with guidance regarding what is appropriate. Providing a research hub with access to peers inside of an organization that can provide a “sounding board” regarding the choices made in experimental and statistical design can often shortcut hundreds of hours of wasted effort when an improver validation procedure or statistical test is chosen.
An analysis is empirically reproducible when detailed information is provided about non-computational empirical scientific experiments and observations. In practice, this is enabled by making data freely available, as well as details of how the data was collected. In data science, this is often tied to the underlying business drivers of a study, and knowledge about the true data generating process of the data sources used to build models and analysis.
At Domino, we often say that the data science process begins and ends with questions and data from business drivers, and therefore it’s important that the teams document what these original questions and ideas were, how they came about, who asked them, and provide digital provenance of the datasets that are used in an analysis in a fully deterministic and reproducible fashion. We support a lot of this functionality with our “data projects” architecture, which allows teams to define canonical datasets which are fully revisioned and componentized, allowing collaborators to understand exactly what analyses draw inspiration from what datasets, and what were the biases encoded in those datasets through the data collection process.
An analysis is computationally reproducible if there is a specific set of computational functions/analyses (in data science, almost always specified in terms of source code) that exactly reproduce all of the results in an analysis. I tend to think of reproducibility as the orchestration of a series of repeatable steps in a deterministic guaranteed fashion.
It is important to note that this isn’t just the source code however, for an analysis to be computationally reproducible, the “tuple” that is serialized is significantly greater than just the bytes-on-disk artifact of the source files. Computational reproducibility includes reproducibility of the underlying data, the software, the sequence of operations, and the underlying hardware that it was executed on. The rOpensci organization’s list of good characteristics for reproducible research is helpful for understanding what makes up good computational reproducibility.
Here are eight tenets for good reproducibility and in-team and cross-team collaboration adapted from Sandve, Nekrutenko, Taylor, & Hovig’s Rules for Reproducible Computational Research:
Reproducibility creates confidence in data science teams. Without true frictionless reproducibility, it can often be very challenging to move the state of the art of an algorithm or model in production forward. It can even be challenging to gauge whether a new iteration of a model is actually an improvement over a model in production or is simply an equivalent model trained on more recent data.
At Domino, we believe that frictionless reproducibility and compounding of knowledge is the cornerstone to a good collaborative process, and have built our platform around this opinion. For more thoughts about this, watch my talk about providing digital provenance at UseR! 2016 in Stanford.
Once you have established a practice of repeatability and reproducibility, the gold standard of a collaborative data science process is replicability. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. Replicability is the practice that allows a model in production to be independently verified by auditors, to be reimplemented by an engineering organization for use in a real-time system, and perhaps most importantly to have confidence that updating a model across time with new data as covariates shift will still provide results that are both directionally and substantially aligned with the original effort.
Replicability in data science is often misunderstood, as its role is mostly considered in the physical and biological sciences when it comes to cutting edge breakthroughs. There have been entire careers destroyed due to the non-replicability of a study, from the famous Fleischmann-Pons debacle, to the failed replication of Baumeister and Vohs' work on ego-depletion and many other studies in between.
However, in collaborative data science, replicability is oftentimes what allows data science to truly move the state of the art of a model or an insight forward. Data scientists must be able to take a pre-existing pipeline or model, and without significant friction componentize it and rerun either the whole experiment or significant portions of it with new data, new algorithms or new approaches. This must be done with the ability to do a side-by-side comparison with the original analysis (or a rerun on current data) to gain confidence about the experimental design, and how the changes in said design are impacting the behavior during testing and likely to impact it during production.
A data science team which has:
Is a team where collaboration on analysis, models, metrics, or insights becomes second nature and “the easy path,” as opposed to a hard-to-enforce top-down mandate. Data scientists, even those working alone, are collaborating with at least two parties, their past selves, and their future selves.
Building the Domino Data Lab platform for reproducible data science has given us significant interaction with customers doing data science across almost every industry. We have seen time and time again how critical collaboration is to success in data science teams, and how important frictionless building of a shared context is to engendering this collaboration. In fact, we developed the Data Science Maturity Model framework based on the research we’ve conducted with our customers. Giving data scientists a platform that enables them to do “good” collaborative data science both makes them more productive in the short term and provides the organizational benefits of compounding knowledge and more predictable transparent outcomes and ROI from their investment in quantitative research.

Eduardo Ariño de la Rubia is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. A student of negotiation, conflict resolution, and peace building, Ed is focused on building tools that help humans work with humans to create insights for humans.
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
In this article
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.