The enterprise platform to build, deliver, and govern AI
Watch the 15-minute on-demand demo to get an overview of the Domino Enterprise AI Platform.
Generative AI and Large Language Models (LLMs) like OpenAI's GPT series or Meta's Llama 2 are revolutionizing our economy. Airbnb and JetBlue already use generative AI (GenAI) to improve customer experience through chatbots. Biopharma company Eli Lilly uses GenAI to discover novel drug compounds. And JP Morgan creates synthetic data with the technology, securing customer privacy for R&D efforts. However, deploying these models and serving inference from them presents a unique set of challenges.
LLMs and their accompanying data are typically large and require considerable amounts of memory. This requirement limits the hardware capable of running LLMs to the highest-end GPUs, which are in high demand, especially in cloud environments, and priced accordingly. Costs can skyrocket if an LLM needs to be online and inferencing constantly. LLM inference is also energy-intensive, whether it runs on CPUs or GPUs. Optimizations such as knowledge distillation and quantization can reduce model size, but may trade away some accuracy.
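A back-of-the-envelope sketch shows why memory is the binding constraint and what quantization buys you. This counts weights only; real footprints also include activations, the KV cache, and framework overhead:

```python
def model_memory_gb(num_params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

# A 7B-parameter model such as Falcon-7b:
fp16 = model_memory_gb(7, 2)    # 16-bit floats: ~14 GB
int8 = model_memory_gb(7, 1)    # 8-bit quantized: ~7 GB
int4 = model_memory_gb(7, 0.5)  # 4-bit quantized: ~3.5 GB
```

At 16-bit precision, the weights alone already exceed the VRAM of most mid-range GPUs, which is why quantization is often the difference between needing flagship hardware and not.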
Enterprises looking to develop AI quickly, responsibly, and cost-effectively have turned to Domino for years. Businesses are rapidly learning that GenAI is different. It requires entirely new skill sets, data at a different scale drawn from multiple business units, and robust, high-performance infrastructure. It also requires new risk assessment and mitigation strategies. And operationalizing GenAI models becomes a long-term investment and commitment. As before, Domino now leads the charge, enabling enterprises to harness generative AI without compromising on risk, responsibility, or security.
Practitioners use Domino to fine-tune LLMs using the latest optimization techniques. Fine-tuning is an iterative process, and Domino simplifies it by managing experimentation with its reproducibility engine. Reproducibility ensures that your model is valid and that those great results were not a fluke. It builds trust in the model and mitigates concerns over an LLM's impact. And if a model audit is necessary, you're always ready. With Domino's help, enterprises can release LLMs responsibly.
Moving troves of data between cloud locations or on-prem is expensive, slow, and risky. Domino customers can fine-tune and deploy LLMs co-located with data wherever it is using Domino Nexus. Workload co-location also helps enterprises comply with sovereignty and privacy regulations.
Companies have trouble hosting and scaling traditional AI models, let alone LLMs. Domino can host your models - including LLMs - for inference. You can share your model as a user-friendly web application or set up a model API to help it connect with other systems. And by automating infrastructure provisioning, Domino empowers practitioners to create and deploy models quickly; no DevOps knowledge is needed.
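Once a model is hosted as a Model API, connecting another system to it is typically a single authenticated HTTP request. The sketch below assumes a JSON body with a `data` object and a reply under `result`; the endpoint URL, token placeholder, and parameter names are illustrative, so copy the real values from your Model API's overview page:

```python
def build_request(prompt: str, max_new_tokens: int = 256) -> dict:
    """Assemble a JSON body for the Model API: a 'data' object whose
    keys match the parameters of the deployed predict function."""
    return {"data": {"prompt": prompt, "max_new_tokens": max_new_tokens}}

# Example call (requires the 'requests' package and a live endpoint):
# import requests
# resp = requests.post(
#     "https://<your-domino-host>/models/<model-id>/latest/model",
#     headers={"Authorization": "Bearer <your-api-token>"},
#     json=build_request("Summarize our Q3 results."),
# )
# print(resp.json()["result"])
```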
Given the high-end infrastructure LLMs need when put into production, you must keep an eye on operational costs. Domino includes extensive cost monitoring and budgeting tools. It can help you track spending on projects, models, APIs, and more. You can even set spending alerts and limits to ensure budgets are not exceeded.
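Domino's tooling tracks spending for you, but the arithmetic behind a budget alert is worth seeing. A minimal sketch, with a purely illustrative hourly GPU rate (check your cloud provider's actual pricing):

```python
def gpu_cost_usd(hourly_rate: float, hours_online: float, num_gpus: int = 1) -> float:
    """Estimated spend for keeping GPU-backed inference online."""
    return hourly_rate * hours_online * num_gpus

# Hypothetical $3/hour single-GPU instance, online around the clock:
monthly = gpu_cost_usd(hourly_rate=3.0, hours_online=24 * 30)  # $2,160/month
BUDGET = 2000.0
if monthly > BUDGET:
    print(f"Alert: projected ${monthly:,.0f} exceeds ${BUDGET:,.0f} budget")
```

Always-on inference compounds quickly, which is why spending limits and alerts matter as much as the hourly rate itself.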
Enterprises will want to keep an eye on leveraging their LLM investment for future projects. With one successful generative AI model in production, most companies will develop an appetite. To make that easier, Domino offers broad knowledge-sharing and collaboration tools. Other teams across multiple business units can recreate all code, metadata, IDEs, packages, and development environments. This helps accelerate future development and limits the need for expensive reinvention.
Let's look specifically at how Domino makes LLM inference happen.
In a previous post, we shared reference projects highlighting efficient LLM fine-tuning techniques in Domino. Now that the models are fine-tuned, we turn to a reference project covering model inference.
This project demonstrates the generation of text output from a fine-tuned Falcon-7b LLM using multiple inference frameworks. Beyond execution, it also provides guidance on Model API and web app deployment in Domino.
From the get-go, we sought to showcase realistic, cost-effective execution. Execution times will vary with your choice of hardware and LLM. For benchmarking, we ran the reference project on an NVIDIA V100 GPU with 24GB of VRAM: robust, but no longer the peak of innovation.
An increasingly important element in hosting LLMs is the choice of inference framework: a set of tools and libraries developed to efficiently deploy, run, and manage LLMs in production environments. These frameworks aim to reduce memory footprint, accelerate execution, and simplify scaling models to serve a high request load. They offer other significant benefits as well: they minimize the complexity of integrating models with applications or systems and assist in managing different model versions.
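To see why batched execution is central to these frameworks, consider a toy throughput calculation. The step times below are invented for illustration; real numbers depend on the model, hardware, and framework:

```python
def throughput_tokens_per_s(tokens_per_step: int, step_time_s: float) -> float:
    """Tokens generated per second when each decode step emits one
    token for every sequence in the batch."""
    return tokens_per_step / step_time_s

# One sequence at a time: 1 token per 50 ms decode step.
sequential = throughput_tokens_per_s(1, 0.05)   # 20 tokens/s
# Batch of 8 sequences sharing a (slightly slower) 80 ms step.
batched = throughput_tokens_per_s(8, 0.08)      # 100 tokens/s
```

Because a decode step over a batch costs only a little more than a step over one sequence, batching multiplies throughput from the same GPU, which is exactly the kind of optimization these frameworks automate.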

The project's goal is to set up and run a chatbot. The LLM will run as a Domino model API. We will also set up a Streamlit app on Domino as the model's user interface.
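A minimal sketch of such a Streamlit front end, assuming the Model API accepts a JSON body with a `prompt` field and returns its reply under `result`; the endpoint URL and token are placeholders to fill in from your deployment:

```python
def ask_model(prompt: str, url: str, token: str) -> str:
    """POST the prompt to the hosted Model API and return its reply."""
    import requests  # imported lazily so the module loads without it
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},  # auth scheme: an assumption
        json={"data": {"prompt": prompt}},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["result"]

def history_as_text(history) -> str:
    """Flatten the chat history into a plain-text transcript."""
    return "\n".join(f"{role}: {text}" for role, text in history)

def main():
    import streamlit as st  # imported lazily: only needed when the app runs
    st.title("Falcon-7b chatbot")
    if "history" not in st.session_state:
        st.session_state.history = []
    if prompt := st.chat_input("Ask something"):
        reply = ask_model(
            prompt,
            url="https://<your-domino-host>/models/<model-id>/latest/model",
            token="<your-api-token>",
        )
        st.session_state.history += [("user", prompt), ("assistant", reply)]
    for role, text in st.session_state.history:
        st.chat_message(role).write(text)
    if st.session_state.history:
        st.download_button("Download transcript",
                           history_as_text(st.session_state.history))

if __name__ == "__main__":
    main()
```

Separating the API call (`ask_model`) from the UI keeps the same endpoint reusable from notebooks, scheduled jobs, or other applications.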
We tested three inference frameworks on Domino:

To help practitioners looking to tailor the reference project to their requirements, here's a rundown of its critical files:

Setting up LLMs for inference is a bit of a dark art.
This reference project shows how practitioners can now use Domino to host, scale, and share LLMs. And as before, they do this without slowing down or taxing IT resources. Domino again sheds light on the forefront of AI.
As we have demonstrated, Domino is the best solution for enterprises looking to harness the power of generative AI, and of LLMs in particular.
Interested in learning more about how Domino can help on your generative AI journey, or want to see Domino in action? Please reach out to us here.

Subir Mansukhani is Staff Software Engineer - 2 at Domino Data Lab. Previously he was the Co-Founder and Chief Data Scientist at Intuition.AI.