How data science can fail faster to leap ahead
Nikolay Manchev | 2022-04-29 | 8 min read
One of the biggest challenges in data science today is finding the right tool to get the job done. The rapid change in best-in-class options makes this especially challenging - just look at how quickly R has fallen out of favor while new languages pop up. If data science is to advance as rapidly as possible in the enterprise, scientists need the tools to run multiple experiments quickly, discard approaches that aren’t working, and iterate on the best remaining options. Data scientists need a workspace where they can easily experiment, fail quickly, and determine the best data solution before they run a model through certification and deployment.
One thing that amazes me is how huge the data science ecosystem is. There are about 13,000 packages in R, and hundreds of thousands in Python. But if you look at the Kaggle survey, over four years the percentage of data scientists using R dropped more than 40 percentage points - from 64% to 23%. At the same time, the fraction of data and business analysts using Python increased dramatically from 61% to 87%. Last year, Python jumped to the top of programming languages, passing Java and C on the TIOBE index, while Swift moved a little, and Julia and Dart lost ground. With so much change, it’s hard to know what approach to take. An effective data scientist shouldn’t have to choose - they should be able to try multiple approaches.
One thing that doesn’t surprise me is how long it can take an IT department to roll out a new update. I’ll give them credit: they have a big task, with validation, licensing and usage-rights checks, server scheduling, cluster deployment, and so on. But I’ve heard many customers say that it takes six months to get a new Python package deployed by the IT team, so some just sneak their personal laptops into the office to run their work. People are creative, and they will find ways, but a data scientist shouldn’t have to go around their IT department to get their work done.
The last big challenge here is that if data scientists are limited in the tools they can use by infrastructure or IT requirements, then they’re going to frame their research, experiments and results to fit within that software framework - which adds an artificial limit to the type of creative thinking that can lead to the biggest advancements. As they say, if all you have is a hammer, everything looks like a nail.
The solution is to expand your tool set by building a sandbox where you can quickly try multiple approaches, so that you only need to ask IT for help with the final deployment. If provisioning new tools takes too long, the IT roadblock will limit results and creativity, and eliminate options that might provide a better solution. Without better access to software tools and new frameworks, data scientists have to choose expediency over insight.
When researchers have a well-defined sandbox (or MLOps platform), they can minimize sunk costs, because trying a new approach becomes faster and cheaper than grinding through minor refinements. Instead of spending a month tuning hyperparameters for a 0.25 percent improvement, they can try four different approaches, and one of those may yield a far more dramatic boost in performance.
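To make that concrete, here is a minimal sketch of what "try four approaches first" can look like in practice. It uses scikit-learn on a bundled toy dataset; the four model families are illustrative choices, not a recommendation - the point is one cheap, comparable number per approach, with deep tuning reserved for the winner.

```python
# A minimal sketch of "try several approaches cheaply": score a few model
# families with cross-validation before investing a month in tuning any one.
# Dataset and candidate models are illustrative, not a recommendation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# One quick, comparable score per approach; only the winner earns deep tuning.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:>20}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```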
Agile development, iterative programming, minimum viable product - this is how software is made today, and how smart businesses work. It’s all about rapid prototyping, iterating, and failing fast. But most data scientists are more worried about IT confiscating their laptop than about how they could do more with better tools.
With the right platform, researchers will have a better option than their laptop: a way to spin up clusters, deploy multiple models, and capitalize on GPU acceleration. And IT will know that data scientists have their own sandbox that’s safe, secure, and governed. I can’t emphasize the need enough - in Kaggle's State of Machine Learning and Data Science 2021 report, 55% of data scientists said they have no enterprise machine learning tools. Without more structure, we’re flying by the seat of our pants.
Why don’t more companies take a more structured approach? There are several reasons, mostly rooted in legacy and governance concerns - along with some ways to address them:
- Heavily regulated industries like banking, insurance, and healthcare worry that deploying something less than perfect will cause issues with regulators. But the time to experiment is long before deployment, and certifying an effective model once, at the end of the process, makes more sense than certifying every interim update.
- The sunk-cost fallacy leads companies to stick with the older tools they have and keep pursuing the same way of getting things done. But that denies data scientists the freedom to abandon an approach quickly and use the latest languages.
- Ensure that the platform you embrace lets scientists deploy test projects on their own time (not on their own laptops). They need to be able to spin clusters up, deprecate them, and roll back to an earlier version, all in a governed environment that can address IT concerns.
- Don’t let perfection be the enemy of progress - make sure that your stakeholders, group leaders, and business executives understand how iterative development works. They need to understand that multiple waves of improving results are better than waiting a year for a perfect model, because by next year that model, or even that language, may no longer be relevant.
By utilizing a workbench that can support rapid experimentation, data science teams can deliver better results faster, because they are able to fail faster and find a better path.
Some of the potential benefits of an MLOps platform include:
- Improved access to the latest tools - Not just the ability to use the latest tools, but the freedom to pick the right tool for each project and deliver meaningful results faster.
- Improved recruiting - Data scientists love freedom and innovation, and don’t want to be held back in their research. If a company wants to attract the best talent, job applicants need to know they can use the latest tools with the freedom to make mistakes on the path to success.
- Improved governance - Data scientists do want a safety net to catch failed experiments: the ability to step back through code changes and package versions de-risks failing fast by making the path easy to trace (see the sketch after this list). Most teams can’t do this today, so they need a better approach to governance.
- Improved results - Along with finding more original solutions using the latest software, failing faster also makes it easier to create new projects that build on previous work, because experiments and results are stored and remain searchable.
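As a hedged illustration of what "stored and searchable" can look like, here is a minimal sketch using MLflow, one common open-source experiment tracker (not the only option - any comparable tracker works the same way). The experiment name, parameters, and metric values are hypothetical.

```python
# Minimal experiment-tracking sketch with MLflow: every run - failed or not -
# records its parameters and metrics, so the path stays traceable and later
# projects can search and build on earlier work.
# Experiment name, parameters, and metric values below are hypothetical.
import mlflow

mlflow.set_experiment("churn-model-exploration")

with mlflow.start_run(run_name="gradient_boosting_v1"):
    mlflow.log_params({"model": "gradient_boosting",
                       "n_estimators": 200,
                       "learning_rate": 0.1})
    # ... train and evaluate the model here ...
    mlflow.log_metric("cv_accuracy", 0.91)  # placeholder value

# Later - even months later - runs can be searched programmatically
# (or browsed in the MLflow UI) to find what worked and why:
best = mlflow.search_runs(order_by=["metrics.cv_accuracy DESC"], max_results=1)
print(best[["run_id", "metrics.cv_accuracy"]])
```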
It’s really exciting to be in data science today - we’re seeing the value of this discipline be recognized by companies around the world, and data science teams are more important to the bottom line than ever before. But if companies can create a data science infrastructure that supports the team’s efforts to fail faster and secure better results, they can develop a world-class data science organization. Their teams will deliver the most relevant results. Their data scientists will be able to use leading edge technologies. And companies can encourage innovation, while still maintaining governance.
I think these are all important benefits, and together they’re the next step in advancing data science - one that can be achieved by failing faster.
* This article was originally written for and published by TDS.
Nikolay Manchev is a former Principal Data Scientist for EMEA at Domino Data Lab. In this role, Nikolay helped clients from a wide range of industries tackle challenging machine learning use-cases and successfully integrate predictive analytics in their domain-specific workflows. He holds an MSc in Software Technologies, an MSc in Data Science, and is currently undertaking postgraduate research at King's College London. His area of expertise is Machine Learning and Data Science, and his research interests are in neural networks and computational neurobiology.