Use Domino to deploy custom LLMs across any infrastructure with NVIDIA NIM

Eda Johnson 2024-08-23 | 5 min read


Co-authored by Josh Mineroff, Director of SA for Tech Alliances, Domino and David Schulman, Director of Partner Marketing, Domino

The day has come – it’s time for you to start moving those generative AI (GenAI) proof-of-concepts into production. However, the deployment of large language models (LLMs) poses considerable challenges for modern enterprises, including high resource demands, complex infrastructure requirements, and stringent security needs.

These challenges are exacerbated in complex hybrid and multicloud environments, and by growing demand in regulated industries, where data and AI compliance requirements often mean delivering production LLMs self-hosted or scaled across multiple deployment instances.

Why large language model production flexibility is critical

While commercial LLM APIs offer convenience, they lack the customization, security, and cost-effectiveness that come with hosting your own LLM. Whether the driver is regulatory compliance or price-performance, enterprises need flexibility in LLM production, with portability and control across environments. They also demand governance with full reproducibility and traceability.

At Domino, we’ve seen many companies challenged by LLM production deployments. That’s why we are announcing an integration with NVIDIA NIM microservices, part of the NVIDIA AI Enterprise software platform. In this blog, we’ll provide a high-level introduction to NIM and show you how to use it with Domino to fine-tune an LLM and deploy it as a NIM across NVIDIA accelerated infrastructure.

What is NVIDIA NIM?

NVIDIA NIM is a cloud-native microservices suite designed to streamline the deployment of generative AI models across various environments, including clouds, data centers, and workstations. With NIMs, IT and DevOps teams can more easily self-host pre-built open-source and commercial large language models while providing application developers with industry-standard APIs. NIM delivers prebuilt containers powered by Triton Inference Server, TensorRT, and TensorRT-LLM to accelerate the time-to-market of performance-optimized production AI applications.
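Because each NIM exposes an OpenAI-compatible API, application code can talk to a self-hosted model the same way it would a commercial endpoint. Here is a minimal sketch, assuming a NIM container serving Llama 3.1 on its default port 8000; the host, port, and model name will depend on your deployment:

```python
# Minimal sketch: query a self-hosted NIM through its OpenAI-compatible API.
# Assumes a NIM serving meta/llama-3.1-8b-instruct on localhost:8000;
# adjust base_url and model to match your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # local NIMs don't validate the key

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize what a NIM microservice is in one sentence."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```

Because the API surface matches OpenAI's, swapping between a hosted endpoint and a self-hosted NIM is largely a matter of changing the base URL.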

Why use Domino with NVIDIA NIM?

Domino provides a unified platform for managing AI projects, ensuring consistent governance and compliance across all environments. Domino’s Kubernetes-native architecture and interoperability with the NVIDIA NGC catalog allow for seamless deployment of state-of-the-art models, as well as LoRA adapters with shared infrastructure for optimized price and performance – with full reproducibility and traceability. Furthermore, Domino AI Gateway offers additional governance by securing connections with external LLMs in NVIDIA NIM through secure API key storage, LLM endpoint management, controlled user access, and detailed activity logs.
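To make the gateway pattern concrete, here is a sketch of reaching a gateway-registered LLM from inside a Domino workspace. It assumes the endpoint is accessible through the MLflow Deployments client (an assumption based on Domino's MLflow-based tooling); the endpoint name and target URI below are illustrative, so verify both against your Domino deployment's documentation:

```python
# Sketch only: querying a Domino AI Gateway endpoint via the MLflow
# Deployments client. The target URI and the endpoint name "nim-llama"
# are assumptions, not actual values; check your Domino docs.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("https://your-domino-host/api")  # assumed gateway URI
response = client.predict(
    endpoint="nim-llama",  # hypothetical registered endpoint name
    inputs={"messages": [{"role": "user", "content": "What is PubMedQA?"}]},
)
print(response)
```

The point of routing through the gateway rather than hitting the NIM directly is that API keys, access control, and activity logging stay centralized in Domino.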

Getting started

In our upcoming webinar, we will demonstrate the integration of NVIDIA NeMo and NIM with the Domino Enterprise AI Platform, enabling efficient, secure fine-tuning and deployment of an LLM at enterprise scale. We’ll fine-tune Meta’s Llama 3.1 on the PubMedQA dataset for biomedical research question answering.

Step-by-step guide:

  1. Getting started: Clone the NVIDIA/nim-deploy repo to your local machine and follow the instructions in its README.md. You will need an API key, available for free by signing up for the NVIDIA Developer Program in the NVIDIA API catalog, or through a 90-day NVIDIA AI Enterprise license.
  2. Installing NIM in a Domino cluster: Use a Helm command to install NIM in the same Kubernetes cluster as Domino so the two can share resources. This automatically spins up a container running NIM with a model inside it and connects Domino to it.
  3. Connect Domino to NIM: Use Domino’s AI Gateway to provide federated access to the NIM LLM API, as sketched above.
  4. Fine-tune a NIM model using NeMo: Use the NVIDIA NeMo framework for parameter-efficient fine-tuning (PEFT) in Domino to fine-tune Llama 3.1 on the PubMedQA dataset so it can accurately answer clinical questions (see the data-preparation sketch after this list).
  5. Running inference: Use the provided commands to create and manage endpoints, enabling interaction with the LLM directly through Domino (see the inference sketch after this list).
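To give a feel for step 4, here is a small data-preparation sketch. NeMo's SFT/PEFT tooling consumes JSONL records with "input" and "output" fields, so PubMedQA's labeled split can be flattened into that shape. The dataset field names follow the "pubmed_qa" dataset on Hugging Face and may differ in the copy you pull:

```python
# Sketch: flatten PubMedQA's labeled split into the {"input": ..., "output": ...}
# JSONL format that NeMo's SFT/PEFT tooling consumes. Field names are taken
# from the Hugging Face "pubmed_qa" dataset; verify against your version.
import json
from datasets import load_dataset

ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

with open("pubmedqa_train.jsonl", "w") as f:
    for row in ds:
        context = " ".join(row["context"]["contexts"])  # concatenate the abstract sections
        record = {
            "input": f"Context: {context}\nQuestion: {row['question']}\nAnswer (yes/no/maybe):",
            "output": row["final_decision"],
        }
        f.write(json.dumps(record) + "\n")
```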
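And for step 5: once the NIM runs in the same Kubernetes cluster as Domino, a workspace can reach it through the cluster-internal service address rather than a public endpoint. A sketch, assuming a service named nim-llm in a nim namespace; substitute the service name, namespace, and model from your own Helm install:

```python
# Sketch: call a NIM running in the same Kubernetes cluster as Domino through
# its cluster-internal service DNS name. The service name, namespace, and
# model below are assumptions; use the values from your Helm install.
import requests

NIM_URL = "http://nim-llm.nim.svc.cluster.local:8000/v1/chat/completions"

payload = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {
            "role": "user",
            "content": "Does metformin reduce cardiovascular risk in type 2 diabetes? Answer yes, no, or maybe.",
        }
    ],
    "max_tokens": 64,
}

resp = requests.post(NIM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Keeping traffic inside the cluster avoids exposing the model publicly and lets the NIM share the accelerated nodes Domino already manages.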

Join us for the webinar

This webinar is an excellent opportunity for data science leaders and practitioners to learn about the solutions Domino delivers with NVIDIA NIM. By attending, you'll gain valuable insights into optimizing your AI development lifecycle, reducing costs, and ensuring robust governance.

Register now to secure your spot and take the first step towards revolutionizing your AI deployment strategies with NVIDIA NIM and Domino.

