Principles of Collaboration in Data Science
By Eduardo Ariño de la Rubia | 2017-04-11 | 17 min read
Data science is no longer the specialization of a single person or small group. It is now a key source of competitive advantage, and as a result, the scale of projects continues to grow. Collaboration is critical because it enables teams to take on larger problems than any individual could. It also allows for specialization and a shared context that reduces dependency on "unicorn" employees who don't scale and are a major source of key-man risk. The problem is that collaboration is a vague term that blurs multiple concepts and best practices. In this post, we clarify the differences between repeatability, reproducibility, and, whenever possible, the gold standard of replicability. By establishing best practices of frictionless in-team and cross-team collaboration, you can dramatically improve the efficiency and impact of your data science efforts.
Starting at the base, repeatability is a critical building block of a collaborative data science process. Repeatability is the idea that a given process (whether it be a data cleaning script, a feature engineering pipeline, or a modeling algorithm) will produce the same (or nearly the same) output given similar inputs. In order for data scientists to be able to collaborate, they must be able to rely on the instruments and procedures they use behaving consistently. In data science collaboration, a lack of repeatability most often surfaces as environmental and data instability.
I often talk to scientists and researchers who have the problem of “it worked on my computer.” Even worse, some underlying system library or dependency would cause an algorithm to produce one signal on one scientist’s machine, and when run on another scientist’s machine or a production environment, it would produce a different signal altogether.
One of the scariest examples I have seen of the perils of not having an environment that guarantees repeatability was shared with me at a conference by a quantitative researcher at a company in the financial sector. They had built a model that generated a trading strategy and backtested it extensively. This model, when run on the original author's laptop, gave a signal that a particular security should not be traded as it would almost certainly lead to a significant capital loss. When the researcher handed off the code to one of the analysts who was in charge of the actual transactions, the analyst ran it on their own laptop and it generated strong buy signals for a number of assets. This ended up costing the organization a significant amount of money, as the signals were fundamentally flawed.
After a significant amount of finger-pointing, they discovered that the two data scientists had different point releases of an underlying date parsing library. For the original scientist, badly formed dates were parsed as NA, whereas for the second scientist, they were parsed as 1970-01-01. This meant that the models they generated differed in behavior and produced completely different outputs even on the same data.
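To make the failure mode concrete, here is a minimal sketch of how two parsing policies diverge on the same input. It is not the library from the story, which was never named; the function names `parse_as_missing` and `parse_as_epoch`, the date format, and the sample strings are all assumptions made for illustration.

```python
from datetime import date, datetime
from typing import List, Optional

EPOCH = date(1970, 1, 1)


def parse_as_missing(text: str) -> Optional[date]:
    """Hypothetical policy A: a badly formed date becomes a missing value."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def parse_as_epoch(text: str) -> date:
    """Hypothetical policy B: a badly formed date silently becomes 1970-01-01."""
    parsed = parse_as_missing(text)
    return EPOCH if parsed is None else parsed


# Made-up input: one well-formed date and two badly formed ones.
raw: List[str] = ["2017-04-11", "11/04/2017", "not a date"]

missing_style = [parse_as_missing(text) for text in raw]
epoch_style = [parse_as_epoch(text) for text in raw]

# Rows where the two policies disagree are exactly the rows a model
# would treat differently on the two machines.
disagreements = [
    text for text, a, b in zip(raw, missing_style, epoch_style) if a != b
]
```

Under policy A the bad rows would typically be dropped or imputed, while under policy B they become real observations dated decades in the past, which is enough to change what a model learns from the same file.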
Attempting to diagnose whether a computing stack is repeatable can be a significant challenge. Modern computing environments are an unimaginably large stack of oftentimes opaque layers of abstraction, and finding the one library dependency that was updated to a different point revision (even given tools such as conda, which attempt to deterministically build exact environments) can be a Sisyphean task.
Any practice of collaboration built without a fully repeatable environment is unlikely to be able to prove itself valuable or be adopted, as these little subtle differences and changes in behavior and expectation can erode confidence in the system and provide a high barrier to adoption.
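One low-tech way to hunt for a drifted dependency is to snapshot the installed package versions on each machine and diff the two snapshots. The sketch below uses the standard library's `importlib.metadata`; the helper names and the package names in the usage example are hypothetical, and it only sees Python distributions, not system libraries further down the stack.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def snapshot_environment() -> Dict[str, str]:
    """Map each installed Python distribution name to its exact version."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_environments(
    mine: Dict[str, str], theirs: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {package: (my_version, their_version)} wherever the two differ.

    A version of None means the package is absent on that side.
    """
    names = sorted(set(mine) | set(theirs))
    return {
        name: (mine.get(name), theirs.get(name))
        for name in names
        if mine.get(name) != theirs.get(name)
    }
```

In practice each collaborator would save the output of `snapshot_environment()` alongside their results. With two made-up snapshots, the diff isolates the point release that changed:

```python
author = {"hypothetical-dateparser": "1.4.1", "numpy": "1.12.1"}
analyst = {"hypothetical-dateparser": "1.4.2", "numpy": "1.12.1"}
drift = diff_environments(author, analyst)
```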
The next step is reproducibility. Victoria Stodden, Associate Professor at the School of Information Sciences at the University of Illinois, described a powerful taxonomy of reproducibility which describes its three facets: statistical reproducibility, empirical reproducibility, and computational reproducibility.
An analysis is statistically reproducible when detailed information is provided about the choice of statistical tests, model parameters, threshold values, etc. This mostly relates to pre-registration of study design to prevent p-value hacking and other manipulations. Statistical reproducibility is something that should be enforced via top-down mandates, as well as through peer review and documentation. Teams should have access to a knowledge repository of organizational best practices regarding statistical options, and be provided with guidance regarding what is appropriate. Providing a research hub with access to peers inside of an organization who can act as a “sounding board” for the choices made in experimental and statistical design can often save the hundreds of hours of effort that are wasted when an improper validation procedure or statistical test is chosen.
An analysis is empirically reproducible when detailed information is provided about non-computational empirical scientific experiments and observations. In practice, this is enabled by making data freely available, as well as details of how the data was collected. In data science, this is often tied to the underlying business drivers of a study, and knowledge about the true data generating process of the data sources used to build models and analysis.
At Domino, we often say that the data science process begins and ends with questions and data from business drivers, and therefore it’s important that the teams document what these original questions and ideas were, how they came about, who asked them, and provide digital provenance of the datasets that are used in an analysis in a fully deterministic and reproducible fashion. We support a lot of this functionality with our “data projects” architecture, which allows teams to define canonical datasets which are fully revisioned and componentized, allowing collaborators to understand exactly what analyses draw inspiration from what datasets, and what were the biases encoded in those datasets through the data collection process.
An analysis is computationally reproducible if there is a specific set of computational functions/analyses (in data science, almost always specified in terms of source code) that exactly reproduce all of the results in an analysis. I tend to think of reproducibility as the orchestration of a series of repeatable steps in a deterministic guaranteed fashion.
It is important to note that this isn’t just the source code, however. For an analysis to be computationally reproducible, the “tuple” that is serialized is significantly greater than just the bytes-on-disk artifact of the source files. Computational reproducibility includes reproducibility of the underlying data, the software, the sequence of operations, and the underlying hardware that it was executed on. The rOpenSci organization’s list of good characteristics for reproducible research is helpful for understanding what makes up good computational reproducibility.
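One way to picture that “tuple” is as a single record that fingerprints every ingredient of a run, not only the code. The following is a rough sketch under stated assumptions: the function name `run_fingerprint`, the field names, and the choice of SHA-256 are mine, and `platform.machine()` is only a coarse stand-in for a real description of the hardware.

```python
import hashlib
import json
import platform
from typing import Dict, List


def sha256_hex(payload: bytes) -> str:
    """Hex digest used to fingerprint raw bytes."""
    return hashlib.sha256(payload).hexdigest()


def run_fingerprint(
    source_code: bytes,
    data: bytes,
    packages: Dict[str, str],
    steps: List[str],
) -> dict:
    """Serialize the pieces of a run into one comparable record.

    Covers the data, the software (source plus package versions), the
    ordered sequence of operations, and a coarse hardware description.
    """
    record = {
        "source_sha256": sha256_hex(source_code),
        "data_sha256": sha256_hex(data),
        "packages": dict(sorted(packages.items())),
        "steps": list(steps),  # order matters, so keep it as a list
        "hardware": platform.machine(),
        "python": platform.python_version(),
    }
    # Hash a canonical JSON form so equal records give equal fingerprints.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["fingerprint"] = sha256_hex(canonical)
    return record
```

Two runs with the same fingerprint were given the same code, data, package versions, step order, and machine type; any change to one of those ingredients produces a different fingerprint, which tells you where to start looking when results disagree.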
Here are eight tenets for good reproducibility and in-team and cross-team collaboration adapted from Sandve, Nekrutenko, Taylor, & Hovig’s Rules for Reproducible Computational Research:
- Track Results - Whenever a result may be of potential interest, keep track of how it was produced. As a minimum, you should at least record sufficient details on programs, parameters, and manual procedures to allow yourself, in a year or so, to approximately reproduce the results.
- Script Everything - Whenever possible, rely on the execution of programs instead of manual procedures to modify data. If manual operations cannot be avoided, you should as a minimum note down which data files were modified or moved, and for what purpose.
- Create Reproducible Environments - In order to exactly reproduce a given result, it may be necessary to use programs in the exact versions used originally. Leverage tools like Docker and configuration management systems to guarantee reproducibility. As a minimum, you should note the exact names and versions of the main programs you use.
- Use Version Control - Even the slightest change to a computer program can have large intended or unintended consequences. As a minimum, you should archive copies of your scripts from time to time, so that you keep a rough record of the various states the code has taken during development. However, we find that the minimum is often not enough. The use of an automated version control system removes a lot of the friction around version control best practices. You should strive to have a system that guarantees that any program execution has a full reproducible snapshot. There is a special frustration when you have a plot you want to reproduce, but you happened to not commit your code when you generated that visualization.
- Store Data and Intermediate Results - In principle, as long as the full process used to produce a given result is tracked, all intermediate data can also be regenerated. In practice, having easily accessible intermediate results may be of great value. Systems such as make and drake can manage a complex dependency graph of results. As a minimum, archive any intermediate result files that are produced when running an analysis. Preferably, however, use a system that can easily record all intermediate results and expose them to you for analysis without creating significant friction.
- Set a Random Number Seed - Many analyses and predictions include some element of randomness, meaning the same program will typically give slightly different results every time it is executed. For example, clustering algorithms can often find different clusters and are sensitive to initial conditions. As a minimum, note which analysis steps involve randomness, so that a level of discrepancy can be anticipated when reproducing the results.
- Store Data Visualization Inputs - From the time a figure is first generated to it being part of an analysis, it is critical to store the data and process that generated it. As data visualizations become more complex, not simply a chart but an entire application or interactive dashboard, it’s important to manage the visualization pipeline as a fully reproducible artifact. As a minimum, one should note which data formed the basis of a given plot and how this data could be reconstructed.
- Allow Levels of Analysis - In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying the summaries. Make those values fluid and explorable. As a minimum, you should at least once generate, inspect, and validate the detailed values underlying the summaries. Data science is a team sport, and oftentimes your team will include very talented individuals with subject matter expertise which does not necessarily overlap with coding or the mathematics required to build models. In that case, provide approachable interfaces for those non-technical users. Ideally, these interfaces and levels of analysis let less technical members of your team contribute to the analysis and help course-correct based on their subject matter expertise.
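Two of the tenets above, tracking results and setting a random number seed, are easy to illustrate together. This is a minimal sketch rather than a recommended framework: `run_analysis` is a hypothetical stand-in for a real analysis step, and the seed values and sample size are arbitrary.

```python
import json
import random


def run_analysis(seed: int, n_samples: int = 1000) -> dict:
    """Toy analysis step: estimate the mean of simulated Gaussian noise.

    The seed is an explicit parameter, and the returned record keeps the
    parameters next to the result so the run can be repeated later.
    """
    rng = random.Random(seed)  # local generator, unaffected by other code
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    estimate = sum(draws) / n_samples
    return {
        "parameters": {"seed": seed, "n_samples": n_samples},
        "result": {"mean": estimate},
    }


first = run_analysis(seed=42)
second = run_analysis(seed=42)  # same seed: identical record
other = run_analysis(seed=7)    # different seed: a different estimate

# One JSON line per run is enough for a simple, append-only results log.
record_line = json.dumps(first, sort_keys=True)
```

Using a local `random.Random(seed)` instead of the module-level generator keeps the step repeatable even when other code in the same process also draws random numbers.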
Reproducibility creates confidence in data science teams. Without true frictionless reproducibility, it can often be very challenging to move the state of the art of an algorithm or model in production forward. It can even be challenging to gauge whether a new iteration of a model is actually an improvement over a model in production or is simply an equivalent model trained on more recent data.
At Domino, we believe that frictionless reproducibility and the compounding of knowledge are the cornerstones of a good collaborative process, and we have built our platform around this opinion. For more thoughts about this, watch my talk about providing digital provenance at useR! 2016 at Stanford.
Once you have established a practice of repeatability and reproducibility, the gold standard of a collaborative data science process is replicability. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. Replicability is the practice that allows a model in production to be independently verified by auditors, to be reimplemented by an engineering organization for use in a real-time system, and perhaps most importantly to have confidence that updating a model across time with new data as covariates shift will still provide results that are both directionally and substantially aligned with the original effort.
Replicability in data science is often misunderstood, as its role is mostly considered in the physical and biological sciences when it comes to cutting edge breakthroughs. There have been entire careers destroyed due to the non-replicability of a study, from the famous Fleischmann-Pons debacle, to the failed replication of Baumeister and Vohs' work on ego-depletion and many other studies in between.
However, in collaborative data science, replicability is oftentimes what allows data science to truly move the state of the art of a model or an insight forward. Data scientists must be able to take a pre-existing pipeline or model, and without significant friction componentize it and rerun either the whole experiment or significant portions of it with new data, new algorithms or new approaches. This must be done with the ability to do a side-by-side comparison with the original analysis (or a rerun on current data) to gain confidence about the experimental design, and how the changes in said design are impacting the behavior during testing and likely to impact it during production.
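A side-by-side comparison can start out very simply: line up the metrics from the original analysis and from the rerun, and flag the ones that moved by more than an agreed tolerance. The sketch below is one possible shape for that check; the function name `compare_runs`, the 5% relative tolerance, and the metric values are all made up for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[float], Optional[float], str]


def compare_runs(
    original: Dict[str, float],
    rerun: Dict[str, float],
    rel_tol: float = 0.05,
) -> List[Row]:
    """Return one (metric, original, rerun, status) row per metric.

    status is "aligned" when the values agree within rel_tol, "diverged"
    when they do not, and "missing" when only one run reports the metric.
    """
    rows: List[Row] = []
    for name in sorted(set(original) | set(rerun)):
        before = original.get(name)
        after = rerun.get(name)
        if before is None or after is None:
            status = "missing"
        elif math.isclose(before, after, rel_tol=rel_tol):
            status = "aligned"
        else:
            status = "diverged"
        rows.append((name, before, after, status))
    return rows


# Made-up metrics for an original analysis and a rerun on new data.
original_metrics = {"auc": 0.81, "precision": 0.70}
rerun_metrics = {"auc": 0.80, "precision": 0.55, "recall": 0.60}
report = compare_runs(original_metrics, rerun_metrics)
```

The tolerance is a judgment call that depends on the metric and the business context, so treat the default here as a placeholder, not a recommendation.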
A data science team which has:
- People trained in the practices of repeatability, reproducibility, and replicability
- Processes in place which encourage the use of these best practices
- Tooling which allows these best practices to be leveraged without friction
is a team where collaboration on analysis, models, metrics, or insights becomes second nature and “the easy path,” as opposed to a hard-to-enforce top-down mandate. Data scientists, even those working alone, are collaborating with at least two other parties: their past selves and their future selves.
Building the Domino Data Lab platform for reproducible data science has given us significant interaction with customers doing data science across almost every industry. We have seen time and time again how critical collaboration is to success in data science teams, and how important the frictionless building of a shared context is to engendering this collaboration. In fact, we developed the Data Science Maturity Model framework based on the research we’ve conducted with our customers. Giving data scientists a platform that enables them to do “good” collaborative data science both makes them more productive in the short term and provides the organizational benefits of compounding knowledge and more predictable, transparent outcomes and ROI from their investment in quantitative research.
Eduardo Ariño de la Rubia is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. A student of negotiation, conflict resolution, and peace building, Ed is focused on building tools that help humans work with humans to create insights for humans.