Fine-Tuning for Mortals: Ray and DeepSpeed ZeRO on Domino
Yuval Zukerman | 2023-12-18 | 5 min read
In the world of LLMs, big tech loves big models. They can afford to build a model that can recite the Finnish constitution in Hindi. Or one that can write a Latin haiku about 1980s rap. Something like this from ChatGPT:
Octoginta mira rap,
But do you need a model like that?
You need a smaller LLM on your terms.
You need a model you can adapt to your use cases. So, instead of a model with 60B, 70B, 150B, or more parameters, you need a smaller, flexible LLM that you can feasibly adapt to meet your requirements. The LLM should also run on your infrastructure. That gives you control, helps contain costs, limits risk, and protects your intellectual property.
Yet fine-tuning even a smallish LLM (7B to 20B parameters) to address your use cases and run in your environment is far from straightforward. LLMs need lots of memory and the processing power of top-tier CPUs and GPUs. That assumes you can lock in access to pricey CPU and GPU capacity with a cloud provider. Even small LLMs are big enough to require running them on a compute cluster. Another challenge comes in the form of LLM latency. Your LLM must respond quickly to user requests. Finally, you must deal with ongoing infrastructure costs and system reliability.
Solving LLM Fine-Tuning Challenges with Open Source: Domino Reference Project
One of our previous blog posts mentioned a Domino reference project demonstrating fine-tuning of the GPT-J-6B LLM from EleutherAI. The project demonstrates how to fine-tune an LLM without compromising output precision.
In an attempt to simplify LLM fine-tuning and inference, companies often resort to quantizing. Quantizing swaps the data type that holds model parameters from one that requires lots of memory (e.g., FP32) to one that takes less (e.g., FP8). Quantizing allows the model to use less memory and enables fine-tuning on older, cheaper hardware. This improvement comes, however, at the cost of precision. With fewer bits per value, the LLM's parameters have less room to hold granular data.
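To make the precision trade-off concrete, here is a minimal sketch using only Python's standard library. It round-trips a single weight value through 32-bit and 16-bit floating-point encodings and compares the storage size and rounding error. Python's `struct` module has no FP8 format, so FP16 stands in for the smaller type; the loss at FP8 would be larger. The `weight` value and the `round_trip` helper are hypothetical and not part of the reference project.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Encode a Python float in the given binary format and decode it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Hypothetical model weight, chosen only for illustration.
weight = 0.123456789

# "<f" is IEEE 754 single precision (FP32); "<e" is half precision (FP16).
fp32_value = round_trip(weight, "<f")
fp16_value = round_trip(weight, "<e")

fp32_error = abs(fp32_value - weight)
fp16_error = abs(fp16_value - weight)

print(f"FP32: {struct.calcsize('<f')} bytes, stored as {fp32_value!r}, error {fp32_error:.2e}")
print(f"FP16: {struct.calcsize('<e')} bytes, stored as {fp16_value!r}, error {fp16_error:.2e}")
```

The smaller type halves the storage per parameter, but the stored value drifts further from the original. Across billions of parameters, that rounding is the precision cost described above.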
Instead, the project uses two alternative tools: Ray and DeepSpeed ZeRO Stage 3. Ray is a leading distributed computing framework, enabling workloads to run concurrently on multiple machines. Ray offers fault tolerance, allowing fine-tuning processes to overcome cluster node failures. DeepSpeed ZeRO partitions LLM state data (like parameters, gradients, and optimizer states) across multiple GPUs. Partitioning reduces the memory required per GPU and enables fine-tuning to run on commonly available lower-cost hardware. Leveraging Ray and DeepSpeed ZeRO is relatively simple: Ray integrates with both PyTorch and TensorFlow, and DeepSpeed builds on PyTorch.
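For a sense of what enabling ZeRO Stage 3 involves, here is a sketch of a DeepSpeed configuration built as a Python dictionary and serialized to JSON. The keys are standard DeepSpeed configuration options, but every value is an illustrative assumption, not the setting used in the reference project.

```python
import json

# Illustrative DeepSpeed settings; the values are assumptions for this sketch.
ds_config = {
    # Samples processed per GPU in each forward/backward pass.
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    # Train in half precision to reduce memory use.
    "fp16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients, and optimizer states.
        "stage": 3,
        # Overlap gradient communication with the backward pass.
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the partitioned weights into one 16-bit copy when saving.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A configuration like this is typically saved to a file and handed to DeepSpeed or to a training library that wraps it. The partitioning itself needs no changes to the model code.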
The fine-tuning effort aims to adapt GPT-J-6B to generate text in Isaac Newton's style. For our dataset, we use Newton's 'The Mathematical Principles of Natural Philosophy'. The hardware setup will consist of an 8-worker Ray cluster on Domino. Each worker will run on an AWS g4dn.12xlarge instance. Due to the model's size, the demo uses Domino Datasets to store the fine-tuned model binary and other checkpoints.
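A back-of-the-envelope calculation shows why partitioning matters on this hardware. The sketch below rests on several assumptions not stated in this post: roughly 6 billion parameters, mixed-precision training with the Adam optimizer at about 16 bytes of model state per parameter (the accounting used in the ZeRO paper), and all 32 GPUs in play (a g4dn.12xlarge carries four NVIDIA T4 GPUs with 16 GB each). It ignores activations and temporary buffers, so real usage will be higher. The function name is hypothetical.

```python
def model_state_gb(num_params: int, num_gpus: int, partitioned: bool) -> float:
    """Estimate per-GPU memory for model states in GB (1 GB = 1e9 bytes)."""
    # Assumed accounting for mixed-precision Adam, per parameter:
    # 2 bytes FP16 weights + 2 bytes FP16 gradients + 12 bytes FP32 optimizer state.
    bytes_per_param = 2 + 2 + 12
    total_bytes = num_params * bytes_per_param
    # ZeRO Stage 3 splits all three kinds of state evenly across the GPUs.
    per_gpu_bytes = total_bytes / num_gpus if partitioned else total_bytes
    return per_gpu_bytes / 1e9


NUM_PARAMS = 6_000_000_000  # approximate size of a 6B-parameter model
NUM_GPUS = 8 * 4            # 8 workers, assuming 4 GPUs per g4dn.12xlarge

unpartitioned_gb = model_state_gb(NUM_PARAMS, NUM_GPUS, partitioned=False)
partitioned_gb = model_state_gb(NUM_PARAMS, NUM_GPUS, partitioned=True)

print(f"Model states without partitioning: {unpartitioned_gb:.1f} GB per GPU")
print(f"Model states with ZeRO Stage 3:    {partitioned_gb:.1f} GB per GPU")
```

Under these assumptions, the unpartitioned model states alone far exceed a single T4's 16 GB, while the partitioned share fits with room left over for activations.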
Solving LLM Fine-Tuning Challenges: Reinventing the Wheel vs. the LLMOps/MLOps Platform
Let's turn our attention to LLM infrastructure. As mentioned, LLM training and inference require various technologies working in concert to function correctly. One way to solve those challenges is to piece together individual solutions from cloud platform components. Cobbling components together for LLMs quickly becomes an integration project. That takes time and money. Worse, you will likely need to repeat that effort for every new LLM project and its unique demands. You will need to do all that and still maintain your existing AI projects. Quite painful, indeed.
Alternatively, you can get an open, flexible MLOps and LLMOps platform like Domino.
Domino provides you with the critical foundations to overcome many LLM hurdles. Its open architecture makes incorporating the latest open-source code packages and model innovations a matter of a single download. Domino helps you set up infrastructure that runs on demand instead of racking up costs running expensive clusters nonstop. Domino automates compute cluster setup without requiring an IT superhero degree. More on clusters below. Mature organizations have to live with budgets, even for AI projects. Domino will help you monitor your cloud consumption and track spending. While there is much more to Domino, it is easy to understand how the platform fits the GenAI bill. Now, let's look at solving the rest of the challenges.
Ready to get started? Watch the webinar, download the code, or both!
While the project is available as a repo on GitHub, we recommend starting with a webinar we created just for you. Join Domino's lead data scientist, Subir Mansukhani, for a deep technical overview and code discussion.
What are you waiting for?
As Domino's content lead, Yuval makes AI technology concepts more human-friendly. Throughout his career, Yuval worked with companies of all sizes and across industries. His unique perspective comes from holding roles ranging from software engineer and project manager to technology consultant, sales leader, and partner manager.