Data science infrastructure

As data science becomes a critical capability within more organizations, engineering leaders are finding themselves responsible — implicitly or explicitly — for enabling data science teams with infrastructure and tooling. Because data science looks similar to software development (they both involve writing code!), many engineering leaders with the best intentions approach this problem with misguided assumptions, and ultimately hamstring or undermine the data science teams they are trying to support.

Two common patterns hurt data science teams:

Engineering leaders try to support and/or manage data science teams as though they were engineering teams. This approach fails because data science is more experimental than engineering, so it requires more flexibility and agility around infrastructure and tooling, and it involves tracking and collaborating on different types of artifacts (results and experiments, not code and binaries).
Recognizing a need for more specialized tools, engineering leaders give data scientists raw access to infrastructure: “Here’s an RStudio server, a JupyterHub server, and AWS access. Go do some data science!” This approach fails because it doesn’t give data science teams any capabilities for managing their workflows or collaborating with their business stakeholders, so work becomes siloed and chaotic. The business loses confidence, data scientists are unhappy, and everyone loses.

This guide explores the differences between data science and engineering, and the implications of those differences. Key takeaways:

Data science is more like research than engineering: it’s experimental and harder to estimate and forecast.
Data science workloads need burst compute and specialized hardware (e.g., GPUs) more than software engineering workloads do.
Without input from and close collaboration with business stakeholders, data science projects are likely to stall or meander.
Data scientists need agility to experiment with new tools quickly, to stay on the cutting edge of research techniques. Data scientists are more likely to leave a job if they are constrained by IT or technical barriers.

As a result of these differences, data science teams perform best, and maximize their impact on the business, when they have tooling and infrastructure that facilitate their workflows and are tailored to the unique requirements of data science.

In this field guide:

Data Science Is Not Software Engineering
Key Requirements for Data Science Infrastructure
Aligning Data Science with IT & Business Partners

Data Science Is Not Software Engineering

Data science involves a lot of writing code, but don’t be deceived: there are fundamental differences between data science and software engineering. Understanding these differences is important, because the first step to providing great technology for data scientists is to understand what they do and how they work.

Research (and development)

Engineering involves building things where what you’re building (as distinct from how to build it) is fairly well understood. Much of data science, however, is fundamentally a research process. Data science projects have goals (e.g., to answer some question or to build a model that predicts something) but the desired end state isn’t known up front. Imagine a chemistry lab or a genomics lab: researchers pursue some goal, form hypotheses, run experiments, and document and discuss their results. Data science is more like that.

This distinction has three important implications:

Process Predictability and Control

Engineering has well-established methodologies for tracking progress. Whether you’re using agile points and burndown charts or an alternative method, there are clearly defined metrics one can use to predict and control the process.

Data science is different, because research is more exploratory. It would be hard for a research lab to predict the timing of a breakthrough drug discovery. In the same way, the inherent uncertainty of research makes it hard to track progress and predict completion of data science projects.

The world hasn’t settled on KPIs for data science teams, but we’ve observed that practices like stakeholder feedback sessions and negative result documentation are more important than traditional engineering metrics.

Key Artifacts

Because of the challenges of instrumenting progress in data science work, the best way to gain insight into projects is to look directly at the artifacts and work product. Engineering teams ultimately deliver binaries; those binaries, and the source code, are the key artifacts from engineering. For data science, looking at results is critical for understanding progress.

The “results” of data science work are some human-digestible explanation of how effective an idea is or how a model performs. They can and should take many forms depending on the needs of the stakeholders. For example, if a team were building a model to predict customer churn, they might experiment with dozens of ideas. The result of each experiment would be charts and diagnostic statistics that show how accurately a model predicts customer churn. Collaborators and managers would look at these artifacts to sense whether the project was making progress.

Visual inspection of results plays a role for data scientists similar to how engineers use unit tests. With software, there is a notion of a correct answer, so it’s possible to write tests that verify intended behavior. This doesn’t hold for data science work, because there is no “right” answer, only better or worse answers as evaluated by some target outcome. Rather than writing unit tests, data scientists inspect outputs, then obtain feedback from technical and subject matter experts to gauge the performance of their models.
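To make the contrast concrete, here is a minimal Python sketch. The `add` function, the churn labels, and the two candidate models' predictions are all made up for illustration, and plain accuracy stands in for the richer charts and diagnostics a real project would use.

```python
# A software unit test: there is exactly one correct answer.
def add(a, b):
    return a + b

assert add(2, 3) == 5

# Model evaluation: no single "right" output, only better or worse scores.
def accuracy(predicted, actual):
    """Fraction of predictions that match the observed outcomes."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical churn labels (1 = churned) and predictions from two candidate models.
actual = [1, 0, 0, 1, 1, 0, 1, 0]
model_a = [1, 0, 1, 1, 0, 0, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 0, 0]

scores = {
    "model_a": accuracy(model_a, actual),
    "model_b": accuracy(model_b, actual),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Neither model passes or fails here. The scores only suggest which idea looks more promising, which is why people still need to inspect and discuss the results.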

Agility vs Safety

There is a tremendous amount of innovation in the data science open source ecosystem, including vibrant communities around R and Python, commercial packages like H2O, and new techniques leveraging cutting-edge hardware like GPUs. The ability to experiment with new techniques and tools can be the critical ingredient leading to a breakthrough on a project. With engineering projects, it’s often prudent to use tried-and-true components and incorporate new technology slowly. Data scientists, in contrast, need a high degree of agility around the tools they use.

Variable computational demands

The infrastructure demands of data science teams are very different from those of engineering teams.

Engineers build software that may ultimately run on high-performance infrastructure. But software engineers themselves usually work on a single machine with 16-32GB of RAM and 4-8 cores. Engineering teams also use infrastructure for test, QA, and build systems, needs which are largely static and predictable.

Data science is entirely different. For a data scientist, memory and CPU can be a bottleneck on their progress, because much of their work involves computationally intensive experiments. It could take 30 minutes to write code for an experiment that would take 8 hours to run on a laptop. Many data science techniques can utilize large machines by parallelizing work across cores or loading more data into memory. Similarly, data scientists can easily utilize many machines concurrently by spreading work across them.

These capacity needs aren’t constant: they ebb and flow over the course of a project. The need for burst compute is one reason that the cloud is so compelling for data science work. At the same time, this fluctuating demand creates a different set of requirements and challenges for providing infrastructure to data scientists compared to software engineers. We saw a DevOps team scale their shared data science server down from 16GB to 8GB of RAM after they looked at average utilization, without realizing that this made certain critical data science workloads impossible to run. Only after they focused on peak utilization did they realize the need to scale back up.
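The gap between average and peak utilization is easy to see in a small Python sketch. The hourly memory readings below are invented for illustration (a mostly idle shared server with one short burst), and the function names are our own.

```python
from statistics import mean

# Hypothetical hourly memory readings (GB) for a shared server over one day.
# Mostly idle, with a three-hour burst while a large experiment runs.
memory_gb = [3, 3, 4, 3, 2, 3, 4, 3, 3, 14, 15, 14,
             3, 3, 4, 3, 2, 3, 3, 4, 3, 3, 3, 2]

def summarize(samples):
    """Return the average and the peak of a list of utilization samples."""
    return {"average": mean(samples), "peak": max(samples)}

def fits(samples, capacity_gb):
    """True only if every sample, including the peak, fits in the capacity."""
    return max(samples) <= capacity_gb

stats = summarize(memory_gb)
print(f"average: {stats['average']:.1f} GB, peak: {stats['peak']} GB")
print("fits in 8 GB: ", fits(memory_gb, 8))
print("fits in 16 GB:", fits(memory_gb, 16))
```

With these made-up numbers the average sits well under 8GB, so sizing by the average looks safe, yet the burst workload cannot run on the smaller machine.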

Integration with other parts of the organization

Engineering is usually able to operate mostly independently from other parts of the business: engineering’s priorities are certainly aligned with other organizations, but engineering doesn’t need to interact day to day with, say, marketing or finance or HR. In fact, the entire discipline of product management exists to help mediate these conversations.

In contrast, a data science team is most effective when it works closely with its “clients,” i.e., the parts of the business that will use the models or analyses that data scientists build. For example, a data science team building models for employee retention must work closely with HR; or if a team is building models for customer churn, they must work closely with customer success and/or marketing. Without close alignment, the team may settle on a promising feature for the churn model only to realize it doesn’t actually get collected until the customer is already gone, eliminating any real-world effectiveness.

This need for more frequent and varied cross-organization communication has implications for how you organize and equip data scientists. For example, stakeholders in other parts of the business may also want to see and discuss the same “results” artifacts described above for assessing progress. An analogous notion in engineering would never make sense (HR would never want to look at binaries or build logs), but this can be critical to fostering a healthy data science culture.

Key Requirements for Data Science Infrastructure

Data science is a specialized discipline that has unique workflow requirements. Infrastructure to support data scientists should provide the following features and benefits:

For Data Scientists

Scalable compute — vertically and horizontally

Data scientists should be able to access large machines and specialized hardware (e.g., GPUs) for running experiments or doing exploratory analysis. They should also be able to easily use burst/elastic compute on demand. They should be able to do this with minimal, if any, DevOps work or help from IT.

Environment agility

Data scientists should be able to easily test new packages and techniques without IT bottlenecks and without risking destabilizing the systems that their colleagues rely on. Similarly, they should be able to use different languages (R, Python, Scala, etc.) so they can choose the right tool for the job. And they shouldn’t have to use different environments or silos when they switch languages.

Collaboration and tracking oriented around experiments

Data scientists need a way to track, organize, and discuss their results, not just their source code: something like a lab notebook for computational experiments. Results should be tracked so they can be linked or attributed to their inputs (i.e., the original code and data that were used to produce a result), in the same way that engineers can link a build number to a code revision or a commit ID.
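One way to picture such a lab notebook entry is a small record that ties a result to its inputs. The Python sketch below is only an illustration under our own assumptions: the `ExperimentRecord` fields, the commit ID, the sample data, and the metric value are all hypothetical, and a real platform would capture them automatically.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

def fingerprint(data: bytes) -> str:
    """Short content hash that identifies the exact input data."""
    return hashlib.sha256(data).hexdigest()[:12]

@dataclass
class ExperimentRecord:
    name: str
    code_revision: str      # e.g., a commit ID
    data_fingerprint: str   # hash of the input data
    params: dict
    metrics: dict
    notes: str = ""
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Hypothetical inputs and results, for illustration only.
raw_data = b"customer_id,tenure,churned\n1,12,0\n2,3,1\n"
record = ExperimentRecord(
    name="churn-baseline",
    code_revision="a1b2c3d",
    data_fingerprint=fingerprint(raw_data),
    params={"model": "logistic_regression", "regularization": 0.1},
    metrics={"auc": 0.81},
    notes="Baseline; tenure looks like the strongest feature.",
)

print(json.dumps(asdict(record), indent=2))
```

Because the record stores both the code revision and a hash of the data, anyone reviewing the result can tell which inputs produced it, much as a build number points back to a commit.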

For Management and IT

Cost controls and attribution

As data scientists utilize powerful machines and elastic compute, managers and IT need visibility into costs and utilization, and may also want the ability to put certain limits in place to ensure responsible use. We’ve seen sophisticated organizations stunned by their cloud resource bill after interns accidentally left expensive machines running for weeks.
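As a rough sketch of what cost attribution and a simple guardrail could look like, the Python below totals spend per owner or project and flags machines that sat mostly idle. The usage records, hourly rates, and the 50% idle threshold are invented for illustration; real numbers would come from your cloud provider's billing and monitoring data.

```python
from collections import defaultdict
from typing import NamedTuple

class MachineUsage(NamedTuple):
    owner: str
    project: str
    hourly_rate_usd: float
    hours_running: float
    hours_active: float

# Hypothetical usage records for one billing period.
usage = [
    MachineUsage("alice", "churn-model", 3.00, 40, 36),
    MachineUsage("bob", "churn-model", 0.50, 120, 110),
    MachineUsage("intern", "sandbox", 12.00, 336, 6),  # two weeks, barely used
]

def cost_by(records, attribute):
    """Total cost grouped by a record attribute such as 'owner' or 'project'."""
    totals = defaultdict(float)
    for rec in records:
        totals[getattr(rec, attribute)] += rec.hourly_rate_usd * rec.hours_running
    return dict(totals)

def mostly_idle(records, max_idle_fraction=0.5):
    """Machines idle for more than the given fraction of their running time."""
    flagged = []
    for rec in records:
        idle_hours = rec.hours_running - rec.hours_active
        if idle_hours / rec.hours_running > max_idle_fraction:
            flagged.append((rec.owner, rec.project, rec.hourly_rate_usd * idle_hours))
    return flagged

print("cost by project:", cost_by(usage, "project"))
print("cost by owner:  ", cost_by(usage, "owner"))
for owner, project, wasted in mostly_idle(usage):
    print(f"idle machine: {owner}/{project}, about ${wasted:,.0f} spent while idle")
```

Grouping by project supports attribution, while the idle check is the kind of limit that would catch a forgotten machine before it runs for weeks.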

Visibility into activity and progress

Data science managers, as well as business stakeholders who consume data scientists’ work, need visibility into in-flight project work. This is the only way to know how things are going, to spot mistakes, and to course correct early. Working without this visibility would be like an engineering manager running a team without being able to see source code or test builds, or without ever doing a wireframe design review with business stakeholders. Great data science teams have systems that let managers or stakeholders see the charts from the latest experiments, leave comments, and even open up and run interactive notebooks to play with ideas themselves.

Transparency into dependencies

The flip side of giving data scientists agility is the need to see what’s actually being used, and how. It should be easy to see which software packages data science projects depend on, and even which data sources or systems within your environment they use. For example, can you see which data science projects use your MySQL database and which use your HDFS store? That knowledge can help you allocate new hires, optimize training, or eliminate overhead on unused infrastructure.
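A dependency inventory can be as simple as an index from each package or data source back to the projects that use it. The Python sketch below assumes a hand-written inventory; the project names, packages, and the `legacy-warehouse` source are hypothetical, and in practice this information would be collected from environment definitions and connection logs.

```python
from collections import defaultdict

# Hypothetical inventory of what each project depends on.
projects = {
    "churn-model": {"packages": {"pandas", "scikit-learn"}, "data_sources": {"mysql"}},
    "retention": {"packages": {"pandas", "xgboost"}, "data_sources": {"hdfs"}},
    "forecasting": {"packages": {"pandas"}, "data_sources": {"mysql", "hdfs"}},
}

def projects_by(inventory, kind):
    """Map each dependency of the given kind to the projects that use it."""
    index = defaultdict(list)
    for project, deps in sorted(inventory.items()):
        for dep in sorted(deps[kind]):
            index[dep].append(project)
    return dict(index)

def unused(known, index):
    """Known systems that no project depends on."""
    return sorted(set(known) - set(index))

by_source = projects_by(projects, "data_sources")
by_package = projects_by(projects, "packages")

print("MySQL users:", by_source["mysql"])
print("HDFS users: ", by_source["hdfs"])
print("unused sources:", unused(["mysql", "hdfs", "legacy-warehouse"], by_source))
```

The same index answers both questions from the text: who depends on a given system, and which systems nobody depends on anymore.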

Aligning Data Science with IT & Business Partners

Watch this recorded talk to learn how New York Life aligned data science with IT and business leaders to put the necessary infrastructure in place to succeed. The talk explains how to help business partners shift their mindset from spectator to participant, coming together to create business value. New York Life also introduces the concept of the life cycle funnel, a way to think about how many ideas flow through each development stage until a single, final model is deployed.

Next Steps

Now that you’re familiar with the infrastructure necessary to support data science work in the enterprise, continue to the next chapter to learn about data science platforms.