The enterprise platform to build, deliver, and govern AI
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
I was recently chatting with a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.
When I started out programming professionally in the mid-90’s, the standard for keeping track and collaborating on source code was Microsoft’s Visual SourceSafe. To give you a flavor of the experience, it didn’t have atomic check-ins, so multiple people couldn’t work on the same file, the network copy required nightly scans to avoid mysterious corruption, and even that was no guarantee the database would be intact in the morning. I felt lucky though, one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files, and return them when they were done!
This is all to say, I’m no shrinking violet when it comes to version control. I’ve toughed my way through some terrible systems, and I can still monkey together a solution using rsync and chicken wire if I have to. Even with all that behind me, I can say with my hand on my heart, that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.
To explain why, here’s a typical life cycle of a machine learning model:
This is an optimistic scenario with a conscientious researcher, but you can already see how hard it would be for somebody else to come in and reproduce all of these steps and come out with the same result. Every one of these bullet points is an opportunity to for inconsistencies to creep in. To make things even more confusing, ML frameworks trade off exact numeric determinism for performance. So, if by a miracle, somebody did manage to copy the steps exactly, there would still be tiny differences in the end results!
In many real-world cases, the researcher won’t have made notes or remember exactly what she did, so even she won’t be able to reproduce the model. Even if she can, the frameworks that the model code depends on can change over time, sometimes radically, so she’d need to also snapshot the whole system she was using to ensure that things work. I’ve found ML researchers to be incredibly generous with their time when I’ve contacted them for help reproducing model results, but it’s often months-long task even with assistance from the original author.
Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.
It’s not all doom and gloom, there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also ensures that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even datasets, can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.
I also know of other teams who are serious about using models in production who put similar amounts of time and effort into ensuring their training can be reproduced, but the problem is that it’s still a very manual process. There’s no equivalent to source control or even agreed best-practices about how to archive a training process so that it can be successfully re-run in the future. I don’t have a solution in mind either, but to start the discussion here are some principles I think any approach would need to follow to be successful:
I’ve been seeing some interesting stirrings in the open source and startup world around solutions to these challenges, and personally I can’t wait to spend less of my time dealing with all the related issues, but I’m not expecting to see a complete fix in the short term. Whatever we come up with will require a change in the way we all work with models, in the same way that source control meant a big change in all of our personal coding processes. It will be as much about getting consensus on the best practices and educating the community as it will be about the tools we come up with. I can’t wait to see what emerges!
Editorial Note: This post is provided courtesy of Pete and originally appeared on Pete’s blog.

Pete Warden is the technical lead of the TensorFlow Micro team at Google, and was previously founder of Jetpac, a deep learning technology startup acquired by Google in 2014. He's one of the original creators of the TensorFlow framework, wrote the leading textbook for embedded machine learning, and created a standard data set for short speech recognition.
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.