Data Science Platforms
Data science represents the next era of analytics driving the enterprise. Enterprises that capitalize on its potential will outcompete their rivals, increase efficiency, and generate new revenue streams. Today’s IT teams are challenged to centralize data science infrastructure in a way that will increase governance without constraining data scientists’ freedom and flexibility.
Failure to act will result in a “wild west” of siloed, inconsistent technologies sprinkled across the enterprise, operating beyond IT’s purview and hindering the business’s opportunity to drive value from its data science investment.
Successful CIOs and IT leaders move data science from the business’s periphery to its core with structure and discipline that provide unbridled access to the latest technologies, visibility and auditability, and close alignment with the business.
Implementing the right data science platform will deliver a win-win-win: IT achieves better governance while enabling innovation that unlocks new business value. Data scientists gain self-service and agility. The business earns a bigger return from its investment in data science.
What is Data Science?
Data science at large blends statistics with computer science to find patterns in big data and use those patterns to predict outcomes or to recommend actions or decisions.
Data science represents the next frontier for the data-driven business, which has been evolving for decades:
- The 1980s and 1990s saw the dominance of data storage, data management and data warehousing technologies, teaching companies the value of capturing and storing data to improve business operations.
- In the late 1990s, business intelligence (BI) technologies became prevalent, making the insights captured by data management technologies more consumable by the business.
- The 2000s saw the “big data” boom with the rise of NoSQL technologies like Hadoop, presenting an open-source, low-cost approach to data processing and storage that made it feasible to keep full-fidelity data indefinitely.
This evolution of data management and analytics paved the way for data science, a term popularized around 2010, sometimes also called “quantitative research” or “decision science.” Data science encompasses machine learning (ML), the computational process of making predictions based on data inputs and continually improving those predictions as data changes. ML is just one type of weapon in the broad arsenal of data science.
Data Science Will Separate Winners from Losers
For decades, organizations have aspired to become data-driven. It took years to develop technologies that make it possible to efficiently capture, store and manage data from the systems that are instrumenting today’s world. Now that the data is available, it can benefit every person and every department across the enterprise, which is driving fast and furious adoption of analytics and data science.
Data science is widely recognized as a discipline that should become a core organizational capability, with the potential to drive new revenue streams, automate decisions, improve products and enhance customer experiences to increase a firm’s competitive advantage. This potential is driving significant investment from executives.
IT organizations have an opportunity to help companies realize the full potential of this investment by providing the infrastructure that helps make data science a core organizational capability, rather than a collection of siloed people and tools.
“Based on the simple fact that there’s just a huge amount more data than ever before, our greatest challenge is making sense of that data,” Salesforce.com CEO Marc Benioff said in a 2015 interview with Fortune. “And we need a new generation of tools to be able to organize and view the data. We need a new generation of executives who understand how to manage and lead through data. And we also need a new generation of employees who are able to help us organize and structure our businesses around that data… We need more data science.”
What's Different About Data Science
Previous generations of data technologies have involved centralized, monolithic components: a BI server, a database server, a data lake platform, for example. Data science work, in contrast, involves dozens of smaller tools and technologies, many of which are designed to be used locally on data scientists’ workstations.
According to a 2017 study by KDnuggets, the most popular languages for data science are Python and R, both of which rely on IDEs and development tools that run on end users’ machines. On top of that, these languages have rich ecosystems of “packages,” which provide supplemental functionality for more specialized purposes. Many of these packages and tools are open source and available for download online, and data scientists regularly download dozens or hundreds of packages to use in their day-to-day work.
In the last several years, the open source ecosystem around these tools and packages has flourished, driving rapid innovation, frequent updates, and availability of entirely new packages every month.
In other words, modern data science work lives across dozens or hundreds of clients, not in a centralized server.
“Wild West” of Data Science
Data scientists, eager to stay on the cutting edge and utilize the latest techniques, experiment liberally with a variety of tools and packages. That pace of experimentation is increasing as the open source ecosystem innovates more rapidly. The combination of client-based work, a large number of easily accessible technologies, and a desire for rapid experimentation has created a “wild west” of data science tooling in most organizations. Inconsistent technologies are spread across disparate parts of the organization without governance or transparency around any of them.
Worse, in many organizations, “shadow IT” is cropping up to support these systems. For example, a small team might install RStudio or Jupyter (both free downloads) on a shared server to use for their group, without considering support requirements or consistency with other parts of the organization.
Beyond the obvious problems, this “wild west” of siloed data science work creates several other issues:
- Important business processes become dependent on unreliable infrastructure. Data scientists will often set up scheduled jobs to run on their own local machines, or operate shared servers as “lab” or “dev” machines. One Fortune 10 bank had a critical business process that depended on a model a data scientist had been running nightly on his laptop, a dependency discovered only when he left and the laptop was decommissioned.
- High-value intellectual property is improperly secured. Predictive models and analyses can encapsulate insights key to competitive advantage, and that work is often scattered throughout network drives, wikis, or SharePoint sites.
- Compute costs can become excessive and uncontrolled. Unlike BI, data science involves computationally intensive techniques, which demand high-powered machines and specialized resources like GPUs. Especially in a cloud environment, data scientists in the wild west can unintentionally burn thousands of dollars a month by leaving expensive machines running unnecessarily.
- Data scientists waste time on DevOps work. Data scientists are precious, highly paid people, yet they often must spend 25% of their time dealing with DevOps tasks like installing packages and moving files between machines.
- Data scientists waste time duplicating effort and reinventing the wheel. Beyond individual data scientists wasting time on DevOps, entire teams can waste time pursuing projects that reinvent the wheel or don’t build upon past organizational knowledge, because that past work was siloed and undiscoverable.
The Central Tension
Data scientists will err on the side of innovation, driven by a desire to use the latest technology and largest machines to develop better models faster than competitors. They are unlikely to perceive the medium- and long-term consequences of a lack of standardization and governance. Like water flowing around rocks in a river, they will find the path of least resistance: if IT isn’t offering them what they need, they will find workarounds, install tools locally, and unintentionally put the organization at risk over the long run.
It’s natural, but overly simplistic, to view the situation as a trade-off between innovation and safety/security. That framing forces the CIO or IT leader to choose between stifling business progress and competitiveness on one hand and endorsing chaos and risk on the other. But it is a false dichotomy that misses an opportunity to align the goals and incentives of stakeholders across the business.
Within the challenges above lies a tremendous opportunity to bring order to chaos while enabling a critical business transformation. It’s a pivotal point in many organizations’ journey toward becoming truly data driven, and if built correctly, an effective data science function will transform every business.
What is a Data Science Platform?
A data science platform is where all data science work takes place. It acts as the system of record for predictive models. If databases and data lakes were the central architectural components of incumbent generations, the foundational technology for the data science era is the data science platform.
Unlike a database, a data science platform doesn’t house your data—instead, it houses the artifacts and work product associated with data science workflows. Just as sales organizations use a CRM to create maturity and scalability, and engineering organizations use version control, enterprises are deploying data science platforms to create more maturity and discipline around data science work.
Data science platforms allow IT organizations to rein in the wild west of data science tools, assets and infrastructure spread across the organization. Instead of working in disparate local environments, data scientists do their work in one central place. In order to support the range of use cases involved in data science work, an effective data science platform will provide:
- Self-service infrastructure, so data scientists can do exploratory data analysis and model development without configuring and using their own compute resources. The data science platform encompasses compute resources—as well as the languages, packages and tools necessary for modern data science work—with controls and reporting around resource usage to administer or attribute costs.
- Ways to deploy, productionize or operationalize finished models, instead of driving data scientists to set up shadow systems. This includes deploying models to power scheduled jobs, reports, APIs or dashboards in one place. The data science platform also provides a consistent baseline of non-functional requirements (security, HA, etc.) and a catalog that offers transparency into assets and utilization across the enterprise.
- Governance, collaboration and knowledge management around all the artifacts created in the process of the research and deployment work described above.
Moving data science work onto a centralized platform will ensure that:
- Any model or analysis involved in a business process is centrally persisted and monitored, even if the original creator leaves the organization.
- Data scientists work from consistent, standardized tools, reducing support burden and operational risk.
- All data science assets are permissioned, and those permissions are auditable.
Aligning With Stakeholders Across the Business
Implementing a data science platform to centralize data science work will reduce risk and support burden for IT organizations. But getting buy-in from other parts of the organization will be critical, especially from data scientists, who are likely to balk at talk of “governance.” A key part of the CIO’s and IT leader’s challenge is delivering effective, tailored communications to different stakeholders, rallying the troops behind a shared goal for successful data science. Doing so requires empathy to understand the unique motivations and perspectives of different constituents. Fortunately, there is a wide variety of benefits that can be communicated to align interests.
To data scientists, whose priority is to innovate as quickly as possible by taking advantage of the best and newest tools in a self-service environment:
- Promote the benefits of self-service environments for data science that will allow them to independently provision infrastructure, spin up workspaces with their tools of choice (e.g. Jupyter, RStudio) and safely experiment with new packages and tools. They won’t waste time doing their own DevOps work and they won’t need IT support.
- They can run experiments faster and collaborate with others in the same place they’re doing development work, saving time that would otherwise be wasted reinventing the wheel.
To executives, whose priority is to derive ROI from investments in data science by quickly integrating insights to improve business processes:
- Promote the concept of a data science “system of record”, akin to the function a CRM fulfills for a sales organization. It centralizes all workstreams and communications between data scientists and other business stakeholders in Engineering, IT and Compliance, facilitating a more mature, predictable, scalable way for data science teams to deliver value.
- Faster experimentation will lead to more data science projects and research breakthroughs completed faster.
- Easier ways to operationalize or deploy models will reduce the time from insight to impact, turning data science work into realized business value at a faster pace.
- The flexibility to offer data scientists modern tools and technology will help recruit top talent in a competitive field.
- Automatically maintaining a complete audit log of every model’s development will reduce operational and regulatory risk of algorithmic decision-making.
To the rest of the IT organization, whose priority is to control infrastructure costs and maintain a single, integrated environment:
- Promote the idea of an infrastructure orchestration platform that integrates with existing systems and tools, offering real-time scoring, batch scoring and app hosting options.
- Risks and issues can be proactively identified by tracking hardware, tools usage and changes to production models.
- Usage of expensive compute resources (especially in a cloud environment) can be more easily monitored, limited and attributed.
By successfully navigating each internal stakeholder’s concerns and deploying a data science platform, everyone wins: IT management successfully mitigates risk through governance and centralization, while delivering productivity gains for data scientists. Establishing a data science platform leaves IT poised for success, and the business is equipped to drive faster innovation.
Build vs Buy
The “build vs. buy” decision can be a difficult one. Companies that set out to build their own typically do so for two reasons:
- Cost: “If we build our own, we won’t need to invest in a third-party software platform which will cost the company money.”
- Customization: “We can develop a data science platform from the ground up that’s purpose-built for our organization’s unique needs.”
Before heading down this path, consider several costs associated with a homegrown solution:
- Opportunity cost and comparative advantage. Chances are that your models are your core competency and your differentiation, not the platform you use to develop them. Instead of dedicating engineering resources to building a platform, what could you have those resources do instead?
- It’s harder than people think. A data science platform combines infrastructure orchestration, sophisticated workflow and UX, and capabilities for production-grade deployment. That’s a diverse set of engineering challenges and a broad surface area. Many companies have spent a year trying to build a platform and ultimately failed to deliver anything.
- You’ll be making a permanent commitment of resources to ongoing support and maintenance. It’s not just the upfront cost of your engineering resources, it’s the ongoing support and solution enhancements.
You haven’t built the CRM system that your sales team uses, or the version control system that your engineers use—a data science platform is no different.
As organizations increasingly strive to become model-driven, they recognize the necessity of a data science platform. According to a recent survey report, 86% of model-driven companies differentiate themselves by using a data science platform. And yet the question of whether to build or buy still remains.
For most organizations, purchasing a data science platform is the right choice from both a business strategy and project cost efficiency perspective. However, many organizations confuse the criticality of models to their long-term success with the need to build the underlying platform themselves. In a few select situations, the platform itself is the differentiator.
These organizations have highly specialized workflows (e.g., Uber), a stellar track record of internal software development (e.g., Airbnb), and deep data science expertise that recognizes the unique traits of models (e.g., Google).
For the vast majority of organizations, the competitive differentiator is not the platform, but the entire organizational capability — what we call Model Management — encompassing many different technologies, stakeholders, and business processes. Buying the platform is the logical choice for most.
You’re probably thinking, “Of course Domino, the data science platform vendor, believes everyone should buy a data science platform.” We do have our opinion on the topic, but this opinion stems from thousands of interactions with organizations of all shapes and sizes around the world. Most that have opted to build their own platform have stalled or failed. Those who have purchased a platform are operationalizing data science at scale.
These interactions and experiences working with organizations trying to decide whether they should build or buy led us to develop an objective framework to facilitate the decision process. It includes three major factors:
Total Cost of Ownership
The scope of building, managing and operating a data science platform needs to be carefully examined. Many organizations underestimate the total cost of ownership in the build approach.
In a four-year scenario where an organization builds a data science platform supporting 30 data scientists at first (and growing at 20% annual rate in subsequent years), we estimated the TCO of building to be over $30 million while the TCO of buying is only a fraction of that. See Figure 1 below for a yearly side-by-side comparison of the TCOs of the two approaches.
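The headcount assumption behind this scenario can be worked through directly. The sketch below only projects team size from the stated starting point (30 data scientists) and growth rate (20% per year); it does not attempt to reproduce the TCO figures, since their cost inputs are not given here. The function name `team_size_by_year` is illustrative.

```python
def team_size_by_year(initial: int, annual_growth: float, years: int) -> list[float]:
    """Return the projected team size for each year, compounding annually."""
    return [initial * (1 + annual_growth) ** year for year in range(years)]


# Inputs taken from the scenario: 30 data scientists, 20% annual growth, four years.
sizes = team_size_by_year(initial=30, annual_growth=0.20, years=4)
for year, size in enumerate(sizes, start=1):
    print(f"Year {year}: {size:.1f} data scientists")
```

Under these assumptions the team grows from 30 to roughly 52 data scientists by year four, so any cost that scales per user grows along with it.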
Opportunity Cost
By devoting resources to building a data science platform, an organization is inevitably choosing to divest from other projects. This choice can be unwise especially if the organization sacrifices its core competency, which will eventually hurt the organization’s revenue.
Risk
Data science is not an easy endeavor to take on, and it is wise to de-risk as much as possible. Risk factors such as talent acquisition and retention, skill requirement changes, and platform feature requirement changes need to be considered carefully before deciding to build. On the flip side, an organization that decides to buy should be just as careful in choosing a vendor.
Ultimately, organizations need to decide where their differentiation lies with data science: in the models they build and overall organizational capability, or in the underlying infrastructure? For most, it is the former, so a “buy” approach likely offers the lowest TCO and most aligned strategic choice.
Types of Data Science Platforms
The data science platform landscape can be overwhelming. There are dozens of products describing themselves using similar language despite addressing different problems for different types of users.
The three segments that have crystallized are:
- Automation tools
- Proprietary (often GUI-driven) data science platforms
- Code-first data science platforms
The table below summarizes these segments:
Automation Tools
These solutions help data analysts build models by automating tasks in data science, including training models, selecting algorithms, and creating features. These solutions are targeted primarily at non-expert data scientists or data scientists interested in shortcutting tedious steps in their process to build quick baseline models.
These “automated machine learning” solutions help spread data science work by bringing non-expert data scientists into the model building process, offering drag-and-drop interfaces. They often include functionality for deploying the models that have been automatically trained, and they are increasingly integrating interpretability and explainability features for those models as well. They work best when the data is cleanly prepped and consistently structured.
Proprietary (Often GUI-driven) Data Science Platforms
These tools support a breadth of use cases including data science, data engineering, and model operations. They provide both drag-and-drop and code interfaces and have strong footholds in a number of enterprises, and may even offer unique capabilities or algorithms for specific micro-verticals.
While these solutions offer great breadth of functionality, users must leverage the proprietary user interfaces or programming languages to express their logic.
Code-first Data Science Platforms
This group of solutions targets code-first data scientists who use statistical programming languages and spend their days in computational notebooks (e.g., Jupyter) or IDEs (e.g., RStudio), leveraging a mix of open-source and commercial packages and tools to develop sophisticated models. These data scientists require the flexibility to use a constantly evolving software and hardware stack to optimize each step of their model lifecycle.
These code-first data science platforms orchestrate the necessary infrastructure to accelerate power users’ workflows and create a system of record for organizations with hundreds or thousands of models.
Enterprises with teams of data scientists select these solutions to enable accelerated experimentation for individuals while simultaneously driving collaboration and governance for the organization. Key features include scalable compute, environment management, auditability, knowledge management, and reproducibility.
Questions to Ask Before Buying a Data Science Platform
Data science is unlike other technical disciplines, and models are not like software or data. Therefore, data science requires a different type of technology platform.
Below are the top 10 questions IT leaders should ask of data science platforms to ensure the platform handles the uniqueness of data science work.
1. Where/how is the platform hosted?
An ideal data science platform should work with existing infrastructure. It provides the flexibility to be hosted in the Cloud (e.g. a VPC—a vendor-managed private cloud), on-premise, or perhaps hybrid. Either way, the platform should be based on a single code-base, regardless of where it is hosted. If business requirements call for changes in infrastructure, the ideal platform provides the flexibility to adapt to those changes.
2. How can the platform help me ensure that data scientists use tools and packages (open-source or proprietary) that have been approved?
Data science requires free-form experimentation and access to the latest advances in open-source tooling to achieve breakthroughs. However, enterprises need to provide guardrails on experimentation and tools to guard against breaches and protect company IP. So, a data science platform must support various native data science tools (JupyterLab, RStudio, SAS, etc.) through an open and flexible approach, while providing IT teams the capabilities to govern the data science environments and provision pre-approved environments.
This approach will remove the data science shadow IT challenge and ensure IT infrastructure is not exposed to unnecessary risks.
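As a rough illustration of what a pre-approved environment check could look like, the sketch below compares the packages a data scientist requests against an allowlist. The allowlist contents and the function name `unapproved_packages` are hypothetical and not taken from any particular platform.

```python
# Hypothetical allowlist maintained by IT; the package names are examples only.
APPROVED_PACKAGES = {"numpy", "pandas", "scikit-learn"}


def unapproved_packages(requested: list[str], approved: set[str]) -> list[str]:
    """Return the requested packages that are not on the approved list, sorted."""
    normalized = {name.strip().lower() for name in requested}
    return sorted(normalized - approved)


requested = ["pandas", "NumPy", "some-new-package"]
print(unapproved_packages(requested, APPROVED_PACKAGES))
```

A platform could use a check of this kind to route unapproved requests to a review step rather than blocking experimentation outright.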
3. How does the platform handle the dynamic nature of data science work?
Data scientists’ work requires somewhat unpredictable access to different sizes of hardware, including GPUs, when doing intense work like deep learning. Reserving large hardware instances that sit idle is too expensive, so a data science platform should provide elastic access to different types of machines and software packages. These environments should be available with a single-click, removing DevOps tasks from data scientists’ daily work.
IT teams should be able to control which users have access to which environments, and also have complete visibility into the costs, time, and usage of each of these environments. Ultimately, resource provisioning should support parallel execution, that is, running multiple experiments at the same time.
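Parallel execution of experiments can be sketched on a single machine with Python's standard library. The example below is a minimal illustration, assuming a placeholder `run_experiment` function with a made-up score; a real platform would dispatch each run to its own elastically provisioned machine rather than to local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(learning_rate: float) -> dict:
    """Stand-in for a training run; the score is a placeholder, not a real metric."""
    score = 1.0 - abs(learning_rate - 0.1)
    return {"learning_rate": learning_rate, "score": score}


learning_rates = [0.01, 0.1, 0.5]

# Run all experiments concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_experiment, learning_rates))

best = max(results, key=lambda result: result["score"])
print(best["learning_rate"])
```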
4. How does the platform handle user security and increasingly complex governance requirements where data scientists have access to highly sensitive data?
An ideal platform for data science should work with existing user security practices such as Single Sign-On (SSO). However, in data science, providing authorization and authentication security isn’t enough. Data science is different, and a complete platform also provides an audit trail of all data science work (code, data, packages, environments, comments) for an individual user that ensures reproducibility and auditability of the users’ work.
Along with this visibility and auditability, IT should have access to a flexible permission model to govern access to models, projects, data, experiments, hardware, and software packages that scales to support growth to hundreds of users.
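One way to picture a permission model that is also auditable is a small grant table whose every access check is logged. The sketch below is illustrative only; the class name `PermissionModel`, the resource naming scheme, and the user names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionModel:
    """Maps (user, resource) pairs to the set of actions that user may take."""
    grants: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def grant(self, user: str, resource: str, action: str) -> None:
        self.grants.setdefault((user, resource), set()).add(action)

    def is_allowed(self, user: str, resource: str, action: str) -> bool:
        allowed = action in self.grants.get((user, resource), set())
        # Record every check, allowed or not, so access decisions can be audited.
        self.audit_log.append((user, resource, action, allowed))
        return allowed


perms = PermissionModel()
perms.grant("alice", "project:churn-model", "read")
print(perms.is_allowed("alice", "project:churn-model", "read"))
print(perms.is_allowed("bob", "project:churn-model", "read"))
```

Logging denied requests as well as granted ones keeps the audit trail complete even when access is refused.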
5. How does the platform help reduce regulatory and operational risks and help future-proof me from upcoming regulatory hurdles?
Keeping a comprehensive and thorough system of record in the data science lifecycle can significantly reduce regulatory and operational risks. An ideal data science platform preserves the entire lifecycle of a model for a system of record. All revisions of a project should be tracked to enable easy retrieval of any experiment for audits, risk governance, and compliance checks. For example, a model developed to predict insurance policy holder risk may need to be audited and adjusted based on new personal privacy laws.
A full model provenance log would enable one to trace back every step of model creation, understand how specific sensitive personal data impacts the model, and how that sensitive data was used in development of the model. Additionally, a data scientist could start from any point in that model creation process to fork off and develop an updated model without starting from scratch, accelerating new model development while reducing compliance risk.
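The idea of a provenance log that can be traced back and forked from any point can be sketched as a chain of immutable revisions. This is a minimal illustration, not any vendor's implementation; the `Revision` fields and all values (commit labels, snapshot names, package pins) are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    """One immutable step in a model's history."""
    parent_id: Optional[str]
    code_version: str
    data_snapshot: str
    packages: tuple
    comment: str

    @property
    def revision_id(self) -> str:
        # Hash the full contents so that any change yields a new identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def lineage(revision_id: str, store: dict) -> list:
    """Walk parent links from a revision back to the first one."""
    chain = []
    current = revision_id
    while current is not None:
        chain.append(current)
        current = store[current].parent_id
    return chain


store = {}
first = Revision(None, "commit-a1", "data-v1", ("scikit-learn==1.4.0",), "baseline")
store[first.revision_id] = first

# Fork from the baseline instead of starting from scratch.
fork = Revision(first.revision_id, "commit-b2", "data-v2",
                ("scikit-learn==1.4.0",), "drop sensitive field")
store[fork.revision_id] = fork

print(lineage(fork.revision_id, store))
```

Because each revision records its parent, an auditor can walk the chain backwards, and a data scientist can branch from any earlier step.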
6. Why do existing tools like Git, JIRA, and Jenkins fail to meet the needs of a data science platform?
Data science is different than software development; models require re-training, are developed in an experimental fashion, and are made using lots of different software tools. There is no need to “retrain” software code, but production models do need to be retrained frequently. A data science platform provides a single and comprehensive system-of-record (SOR) for models, which is much more than keeping track of code versions and issues.
Data science assets include code, data, discussion threads, hardware tiers, software package versions, parameters, results, and more. Git and JIRA are not built for an experimental process. Furthermore, data scientists will reject systems built on Git, JIRA, and Jenkins, since they hinder their work instead of accelerating it.
A data science platform accelerates model development and deployment, with access to elastic compute, automatic experiment tracking, full reproducibility, model-based collaboration, streamlined model-deployment, and a knowledge base of building blocks to enable rapid model development.
7. What data does the platform provide access to? And how does it handle the data versioning requirements of data science?
A data science platform needs to provide simple, fast, and secure access to ALL types of data including Hadoop, Spark, flat files, and databases. These connections must be encrypted in transit, be able to handle failover, and set up to transfer large amounts of data for model training and experimentation.
Data science also involves lots of data manipulation and the creation of new “features” derived from other data. Since the data and features often change in each experiment, a snapshot of that data needs to be captured and versioned so that the model and data are auditable and reproducible and meet compliance requirements in regulated industries.
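One simple way to make a data snapshot identifiable is to derive an identifier from its contents, so that adding a feature or changing a value produces a new version. The sketch below assumes small in-memory rows and a hypothetical `snapshot_id` helper; real platforms would typically snapshot files or tables rather than Python lists.

```python
import hashlib


def snapshot_id(rows: list) -> str:
    """Return a content hash that changes whenever the data changes."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


original = [(1, 0.5), (2, 0.7)]
# Same rows with a derived feature (the square of the second column) appended.
with_feature = [(1, 0.5, 0.25), (2, 0.7, 0.49)]

print(snapshot_id(original) == snapshot_id(list(original)))
print(snapshot_id(original) == snapshot_id(with_feature))
```

Recording the snapshot identifier alongside each experiment ties a model to the exact data it was trained on.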
8. How does the platform enable user-friendly, enterprise-ready model operations (ModelOps)?
Model operations involves deploying models to production and then monitoring, retraining, and updating them there. Model deployment is the process of making a model usable in production, for example as a simple visual such as a chart or graph, as an interactive application, or as an API, so that it can serve either interactive human consumption or machine-based consumption.
An ideal data science platform should allow data scientists to self-serve and directly deploy models in any of these modes, with IT approval and oversight. Once the model is deployed, the platform should monitor its performance and provide the ability to retrain and version it in production, capturing full model provenance for audit records.
Lastly, the platform should ensure that end-users have a direct feedback path, from the model to the data scientists, to ensure rapid iteration on the model.
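The monitor-then-retrain loop described in this answer can be reduced to a simple rule for illustration. The sketch below flags a model when its recent average accuracy drops too far below a baseline; the metric, the 0.05 tolerance, and the function name `needs_retraining` are assumptions chosen for the example.

```python
def needs_retraining(recent_accuracy: list, baseline: float,
                     tolerance: float = 0.05) -> bool:
    """Flag a model whose average recent accuracy falls below baseline - tolerance."""
    if not recent_accuracy:
        return False  # no production observations yet, nothing to compare
    average = sum(recent_accuracy) / len(recent_accuracy)
    return average < baseline - tolerance


print(needs_retraining([0.80, 0.78, 0.79], baseline=0.90))
print(needs_retraining([0.89, 0.91], baseline=0.90))
```

In practice the threshold and metric would be agreed between data scientists and the business owners of the model.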
9. How does the platform help govern cloud infrastructure costs and plan for future technology needs?
A data science platform should provide an elastic and flexible compute infrastructure to meet the dynamic resource requirements of data science projects. Poor resource provisioning can lead to unexpectedly high hardware-usage bills or unrealistic requests for additional hardware.
The platform should also provide visibility and controls to ensure compute resources are properly allocated and consumed by the correct users on data science teams. Visibility and controls of hardware are important, but the platform should also expose the usage of different software tools by users, for specific projects too. This level of detail helps IT leaders plan for future projects and adjust spend and tooling to be commensurate with the projects that drive the most value. It also enables IT leaders to have collaborative discussions with data science leaders on project ROI.
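Attributing compute spend to users and projects amounts to grouping usage records and summing hours times rate. The sketch below uses hypothetical usage records and hourly rates purely for illustration; the record layout and the helper name `cost_by` are assumptions.

```python
from collections import defaultdict

# Hypothetical usage records: (user, project, hours, hourly_rate_usd).
usage = [
    ("alice", "churn-model", 10.0, 3.00),
    ("bob", "churn-model", 4.0, 0.50),
    ("alice", "fraud-detection", 2.0, 3.00),
]


def cost_by(records: list, key_index: int) -> dict:
    """Sum hours * rate, grouped by the field at key_index (0 = user, 1 = project)."""
    totals = defaultdict(float)
    for record in records:
        hours, rate = record[2], record[3]
        totals[record[key_index]] += hours * rate
    return dict(totals)


print(cost_by(usage, 0))  # spend per user
print(cost_by(usage, 1))  # spend per project
```

The same records support both views, which is what allows spend to be discussed per project when weighing ROI.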
10. How does the platform work with traditional software development processes?
Even though data science platforms are built to support the unique model development lifecycle, they should integrate with current software development processes. The platform should provide a workflow to enable a Dev-Test-Production schedule for the unique aspects of model development. This workflow should ensure the process captures all model assets, including code, data, comments, tools, packages, and even the development environments. Capturing all model asset information ensures that one can revert to previous model versions and promote to the latest model version in a seamless and auditable manner.
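A Dev-Test-Production workflow with promotion and revert can be sketched as a small registry that keeps the deployment history for each stage. This is a simplified illustration under assumed names (`ModelRegistry`, the three stage labels, version strings), not a description of how any specific platform implements it.

```python
STAGES = ("dev", "test", "production")


class ModelRegistry:
    """Tracks which model versions have been deployed to each stage, oldest first."""

    def __init__(self) -> None:
        self.history = {stage: [] for stage in STAGES}

    def register(self, version: str) -> None:
        self.history["dev"].append(version)

    def promote(self, version: str, to_stage: str) -> None:
        index = STAGES.index(to_stage)
        if index == 0:
            raise ValueError("use register() to add a version to dev")
        previous_stage = STAGES[index - 1]
        # A version must pass through the earlier stage before moving on.
        if version not in self.history[previous_stage]:
            raise ValueError(f"{version} has not been through {previous_stage}")
        self.history[to_stage].append(version)

    def current(self, stage: str):
        return self.history[stage][-1] if self.history[stage] else None

    def revert(self, stage: str) -> str:
        if len(self.history[stage]) < 2:
            raise ValueError(f"no earlier version to revert to in {stage}")
        self.history[stage].pop()
        return self.history[stage][-1]


registry = ModelRegistry()
for version in ("v1", "v2"):
    registry.register(version)
    registry.promote(version, "test")
    registry.promote(version, "production")

print(registry.current("production"))
print(registry.revert("production"))
```

Keeping the full history per stage is what makes both promotion and rollback auditable.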
Designing Sustainable Data Science Platforms
If you choose to design and build your own data science platform, the video below shares recommendations and lessons on designing them to be sustainable and scalable.