Domino Unlocks the Power of Data Science with Ray 2 Clusters
Thomas Dinsmore and Yuval Zukerman | 2023-03-29 | 5 min read
OpenAI has demonstrated the profound impact generative AI can have. Such techniques turn datasets into transformative tools and products: tangible AI projects that inspire and save your company time and money. Better yet, you are in a good position to aim high.
Your team uses advanced machine learning to solve the most challenging problems in your business. You want to use complex ensembles, deep learning, reinforcement learning, transformers, and other cutting-edge techniques. You build highly complex models, and you want to work with massive datasets.
An Opportunity
Distributed processing is the best way to scale machine learning. Apache Spark is the easy default: a popular distributed framework that works well for data processing and "embarrassingly parallel" tasks. For machine learning, however, Ray is a better option.
Ray is an open-source, unified computing framework that simplifies scaling AI and Python workloads. Ray is great for machine learning as it can leverage GPUs and handle distributed data. It includes a set of scalable libraries for distributed training, hyperparameter optimization, and reinforcement learning. In addition, its fine-grained controls let you adjust processing to the workload using distributed actors and in-memory data processing.
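To make that concrete, Ray's core API boils down to two Python decorators: stateless remote tasks and stateful distributed actors. Here is a minimal, illustrative sketch (the function and class names are made up for the example):

```python
import ray

ray.init()  # local run for illustration; inside a cluster you would pass address="auto"

# A remote task: Ray schedules it on any worker in the cluster.
# Resource hints such as @ray.remote(num_gpus=1) reserve a GPU for the task.
@ray.remote
def square(x):
    return x * x

# A distributed actor: a stateful worker process you can call repeatedly.
@ray.remote
class RunningSum:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

# Tasks run in parallel; ray.get collects the results.
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

# Actor calls execute in order against the same in-memory state.
acc = RunningSum.remote()
ray.get([acc.add.remote(i) for i in range(1, 4)])
print(ray.get(acc.add.remote(0)))  # 6
```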
A Challenge
However, getting from point A to a Ray cluster may not be so simple:
- Setup: As with any cluster, configuration is likely to be complex, and the more nodes you want to use, the more demanding the setup skills become.
- Hardware: Ray needs access to robust infrastructure, often with GPUs, to run efficiently and deliver its benefits.
- Data access: Data must flow seamlessly and rapidly between your storage systems and the Ray cluster. What looks like a single line between boxes on the whiteboard can become a significant data-connectivity undertaking.
- Security and governance: Ray clusters must meet access control requirements. The clusters must also comply with internal data encryption and auditing guidelines.
- Scalability: Careful planning is necessary to ensure clusters scale smoothly with larger workloads. Conversely, cost controls must be in place to prevent unnecessary infrastructure expenditures.
To use Ray, many companies provision and manage dedicated clusters just for Ray jobs. Your team doesn't have many spare cycles for DevOps right now, and neither does IT. And even once the cluster is built, you pay for it while it sits idle between Ray jobs.
Alternatively, you can subscribe to a Ray service provider. That eliminates the DevOps problem, but you'll have to copy your data into the provider's datastore or go through the treacherous process of connecting the service to your data sources. It also means more logins and yet another collaboration platform to manage. And you only want to use Ray for some of your projects, not all of them.
A Solution
Domino has a solution. Domino provisions and orchestrates a Ray cluster directly on the infrastructure backing the Domino instance. With Domino, your users spin up Ray clusters when needed. Domino automates the DevOps away; your team can focus on delivering quality work.
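In practice, attaching to a provisioned cluster from a workspace is a few lines of standard Ray Client code. The sketch below assumes the platform exposes the cluster head's host and port as environment variables; the variable names and defaults are placeholders, not a documented Domino API, so check your deployment's documentation for the exact connection details.

```python
import os
import ray

# Hypothetical environment variables pointing at the cluster head;
# the names below are illustrative placeholders.
head_host = os.environ.get("RAY_HEAD_SERVICE_HOST", "127.0.0.1")
head_port = os.environ.get("RAY_HEAD_SERVICE_PORT", "10001")

# Standard Ray Client connection (the ray:// scheme) to a remote cluster.
ray.init(f"ray://{head_host}:{head_port}")

# Confirm the CPUs/GPUs the cluster exposes before submitting work.
print(ray.cluster_resources())
```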
Domino supports the data science that matters with Ray 2 clusters on demand. Ray 2 also includes numerous performance and stability improvements over Ray 1.x, and they matter for several reasons:
- Ray Datasets can now shuffle more than 100 TB of data (a short sketch follows this list).
- Faster task submission and improved memory management mean you get results quicker.
- Improved APIs for distributed stateful objects (actors) help improve your applications' scalability, efficiency, availability, and manageability.
- Revised fault detection and recovery mechanisms translate to more jobs completed and higher overall reliability.
- And, most important to your team: better support for PyTorch and TensorFlow.
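To make the Datasets point concrete, here is a minimal sketch of a distributed shuffle followed by a batch transform standing in for model scoring. The tiny in-memory dataset, column names, and scoring function are illustrative only; a real job would read Parquet or CSV from shared storage (for example with ray.data.read_parquet).

```python
import ray
import pandas as pd

ray.init(address="auto")  # attach to an existing cluster; plain ray.init() runs locally

# Toy dataset standing in for tens of terabytes of real records.
ds = ray.data.from_items([{"feature": float(i)} for i in range(10_000)])

# Distributed shuffle: the operation Ray 2 scales to 100 TB and beyond.
ds = ds.random_shuffle()

# Distributed batch transform; in practice this would call a trained model.
def score(batch: pd.DataFrame) -> pd.DataFrame:
    batch["prediction"] = batch["feature"] * 0.5  # placeholder for model(batch)
    return batch

predictions = ds.map_batches(score, batch_format="pandas")
print(predictions.take(3))
```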
Domino offers Ray 2 right now. That means your Ray clusters can be on-prem or on any of the major cloud providers without waiting for IT, DevOps, or the cloud provider to catch up with industry innovation. As always with Domino, your data is connected centrally, and access controls and audit trails are built-in. Best of all, you get a head start on the competition.
A leading auto insurer is a Domino customer. Data scientists at this company routinely work with datasets measuring tens of terabytes. They tried different distributed frameworks and chose Ray because it is easier to use, Python-centric, and has built-in optimizations. Today, they use Ray in Domino for distributed training, hyperparameter optimization, and fast batch inference.
What now?
Discover all of the powerful capabilities available in Domino 5.5, including the hybrid and multi-cloud capabilities of Domino Nexus. Follow us on LinkedIn for updates about new Domino 5.5 blogs and details about our upcoming event Rev 4, which you won't want to miss.
Thomas Dinsmore is Domino's Head of Competitor Intelligence. He is passionate about driving product vision and enabling the field. Yuval Zukerman brings technology to life. His career has spanned a variety of roles across the technology lifecycle, from developer and engineer to project manager and technology consultant to technical creative and sales teams.