The "Joel Test" for Data Science
Nick Elprin | 2016-08-10 | 7 min read
It's the sixteenth anniversary of Joel Spolsky's "Joel Test," which he described as a "highly irresponsible, sloppy test to rate the quality of a software team."
Back then (around 2000), software development was:
- Being recognized across industries as an invaluable capability for improving business outcomes;
- Undergoing a change from solo practitioners and small teams to large collaborative teams;
- Undergoing a rapid evolution of best practices and tooling to support its practitioners;
- Heavily in demand, creating lucrative job opportunities for competent practitioners;
- Eating the world.
We think data science is going through a similar phase of evolution and maturation, so we thought it would be helpful to write something like the Joel Test for assessing the maturity of your data science program. It's our "highly irresponsible, sloppy test to rate the quality of a data science team."
Here's our first draft; let us know what you think:
The "Joel Test" for Data Science
- Can new hires get set up in the environment to run analyses on their first day?
- Can data scientists utilize the latest tools/packages without help from IT?
- Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
- Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
- Does collaboration happen through a system other than email?
- Can predictive models be deployed to production without custom engineering or infrastructure work?
- Is there a single place to search for past research and reusable data sets, code, etc?
- Do your data scientists use the best tools money can buy?
These are not the only factors that will determine the success of your data science program. For example, the questions above don't cover anything related to the connection between data science work and business drivers ("do all your data science projects have a clear business goal and engaged business stakeholders?"). And you still need great people on your team.
However, if you answer "yes" to all or most of the questions above, then you're working in a way that makes good outcomes much more likely.
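As a rough illustration, the checklist can be tallied in a few lines of Python. This is only a sketch: the `score` helper and the sample answers are hypothetical, and no pass/fail threshold is implied beyond "all or most."

```python
# The eight questions from the checklist above.
QUESTIONS = [
    "Can new hires get set up in the environment to run analyses on their first day?",
    "Can data scientists utilize the latest tools/packages without help from IT?",
    "Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?",
    "Can data scientists find and reproduce past experiments and results?",
    "Does collaboration happen through a system other than email?",
    "Can predictive models be deployed to production without custom engineering or infrastructure work?",
    "Is there a single place to search for past research and reusable data sets, code, etc?",
    "Do your data scientists use the best tools money can buy?",
]


def score(answers):
    """Return (number of 'yes' answers, list of questions answered 'no')."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    gaps = [q for q, yes in zip(QUESTIONS, answers) if not yes]
    return len(QUESTIONS) - len(gaps), gaps


# Hypothetical answers for an imaginary team, in question order.
answers = [True, True, False, True, False, True, True, True]
total, gaps = score(answers)

print(f"{total}/{len(QUESTIONS)} answered yes")
for question in gaps:
    print("Gap:", question)
```

The list of gaps is arguably more useful than the number itself, since it points at what to fix next.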
1. Can new hires get set up in the environment to run analyses on their first day?
We've seen organizations where it takes over a month for a new data scientist to even begin contributing. Onboarding can be delayed because new hires spend time getting the right software installed on their computer; finding and getting access to the right versions of internal resources (code, data sets) to use; and learning how to follow internal processes.
2. Can data scientists utilize the latest tools/packages without help from IT?
There is a flourishing ecosystem of open-source tools for data science. No single tool will be a panacea—rather, organizations will be most effective when they are agile enough to experiment with new tools and techniques. To that end, trying a new package should be possible at the speed of your natural research process, rather than becoming a bureaucratic IT approval process.
3. Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
As data volumes grow and data science algorithms become more computationally intensive, it's critical to have access to scalable compute resources. As with the point about packages above, research will progress faster if IT or dev ops processes aren't a bottleneck for data scientists.
4. Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
The first question of the original Joel Test is "do you use source control?" In our experience, source control is necessary but insufficient for robust data science, because source code alone is not enough to reproduce past work. Rather, we think it's important to keep a record of experiments, including the results, the parameters, the data sources, and the code used to produce them. The most mature organizations will also be able to re-instantiate the underlying software environment (e.g., which versions of the language and packages were used) to reproduce a past result.
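To make the idea of an experiment record concrete, here is a minimal sketch in Python using only the standard library. The function names, the record layout, and the sample values (commit identifier, parameters, data) are all assumptions for illustration, not a description of any particular platform. A real setup would likely also store the record somewhere searchable.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_experiment_record(code_version, params, data, results, packages=()):
    """Bundle code version, parameters, a data fingerprint, results, and environment."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a source-control commit id
        "params": params,
        "data_sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the input data
        "results": results,
        "environment": {
            "python": platform.python_version(),
            "packages": installed_versions(packages),
        },
    }


# Hypothetical experiment: every value below is made up for illustration.
record = build_experiment_record(
    code_version="abc1234",
    params={"alpha": 0.1, "max_depth": 4},
    data=b"id,feature,label\n1,0.5,1\n2,0.7,0\n",
    results={"auc": 0.91},
    packages=["pip"],
)

print(json.dumps(record, indent=2, sort_keys=True))
```

Hashing the input data, rather than storing only its path, makes it possible to detect later that a file with the same name has changed.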
5. Does collaboration happen through a system other than email?
Data science is a team sport. During the course of a project, you'll likely get feedback both from technical colleagues and non-technical stakeholders. How are you sharing results and recording feedback and conversations? If it's happening over email, there's a good chance that those conversations and the organizational knowledge you're accumulating will be lost. It won't be available to new people who look back at the work later; it will be lost if the project members leave the organization; it's not searchable or discoverable later.
A good data science collaboration platform will keep work and discussion centralized, make it searchable, etc. There are plenty of ways to do it, and email is a convenient way to get work into such a platform, but email should not be the primary way that collaboration happens.
6. Can predictive models be deployed to production without custom engineering or infrastructure work?
If engineers must be involved to integrate data science output into business processes, you are delaying your time-to-market and thus reducing the value of your data science work. Infrastructure and platforms can empower data scientists to quickly "productionize" their work without an extra (and sometimes very long) step.
7. Is there a single place to search for past research and reusable data sets, code, etc?
Many data scientists believe they make their biggest impact when they answer a question, produce a model, or create a report. Actually, the longer-lasting, more leveraged impact comes when their work contributes to the collective knowledge of the organization in a way that can be built upon in the future. It is therefore important that research be persisted as it progresses, in a form that can be discovered and reused later, and that people have an easy way to find and reuse that past work.
Searching across dozens of network folders, SharePoint sites, and repositories is not an effective way to preserve organizational knowledge. There should be a single system of record, even if its search results link out to auxiliary systems.
8. Do your data scientists use the best tools money can buy?
We took this one straight from Joel's list. Data scientists are expensive, value-adding people—equipping them well is a great investment.
Banner image titled "Graffiti & Street Art At Portobello (Dublin)" by William Murphy. Licensed under CC BY-SA 2.0
Nick Elprin is the CEO and co-founder of Domino Data Lab, provider of the open data science platform that powers model-driven enterprises such as Allstate, Bristol Myers Squibb, Dell and Lockheed Martin. Before starting Domino, Nick built tools for quantitative researchers at Bridgewater, one of the world's largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. He holds a BA and MS in computer science from Harvard.