Efficient Inference: From Always-On to On-Demand
Stop wasting GPUs on idle models. It's time to make inference smarter.
Sign Up for our Virtual Customer Tech Hour
July 30th, 2025 | 1:00 PM ET / 10:00 AM PT
Join us on July 30th at 1:00 PM ET for a deep dive into how to deploy LLMs in Domino.
We’ll kick things off with a baseline deployment using Domino Model Endpoints, then level it up by integrating the NVIDIA Triton Inference Server for scalable, cost-efficient inference. Rather than keeping endpoints always-on—and GPUs sitting idle—you’ll see how to convert them into lightweight, demand-triggered proxies that dynamically load models on Triton servers. We’ll also showcase how to maximize GPU utilization by serving multiple models concurrently through Triton’s advanced multi-model orchestration.
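To give a flavor of the demand-triggered pattern before the session, here is a minimal Python sketch using the open-source tritonclient library. It assumes a Triton server running in explicit model-control mode at a hypothetical address (triton.internal:8000) and a model named "summarizer" already present in the model repository; the input and output tensor names are placeholders that depend on your model configuration, and this is not the exact code we'll walk through live.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical Triton endpoint and model name; adjust to your deployment.
TRITON_URL = "triton.internal:8000"
MODEL_NAME = "summarizer"

client = httpclient.InferenceServerClient(url=TRITON_URL)

def predict(text: str) -> np.ndarray:
    # Load the model only when a request arrives (requires the server to be
    # started with --model-control-mode=explicit). Repeated calls are cheap:
    # Triton keeps the model in GPU memory until it is explicitly unloaded.
    if not client.is_model_ready(MODEL_NAME):
        client.load_model(MODEL_NAME)

    # Build the request payload; the input name, shape, and dtype below are
    # illustrative and depend on the model's config.pbtxt.
    payload = np.array([[text.encode("utf-8")]], dtype=np.object_)
    infer_input = httpclient.InferInput("TEXT", payload.shape, "BYTES")
    infer_input.set_data_from_numpy(payload)

    result = client.infer(MODEL_NAME, inputs=[infer_input])
    return result.as_numpy("OUTPUT")

# When traffic goes quiet, the proxy can release the GPU for other models:
# client.unload_model(MODEL_NAME)
```

In this pattern, the Domino Model Endpoint stays small because it only brokers requests; the heavyweight model lives in Triton's repository and occupies a GPU only while it is loaded.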
You’ll learn how to:
- Serve more models per GPU with on-demand loading
- Cut endpoint size and startup time for large models
- Separate dev/prod workflows using secure dataset mounts
- Streamline LLMOps using MLflow nested runs and shared datasets (see the sketch after this list)
- Deploy LLMs efficiently without compromising model governance
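For the MLflow point above, here is a small, generic sketch (not Domino-specific code) of how nested runs can group related LLMOps steps, such as packaging variants of one base model, under a single parent run; the experiment name, model names, and metric values are placeholders for illustration only.

```python
import mlflow

# Hypothetical experiment name used purely for illustration.
mlflow.set_experiment("llm-deployment")

with mlflow.start_run(run_name="llama-base-deployment"):
    mlflow.log_param("base_model", "llama-example")

    # One nested child run per packaging/quantization variant keeps the
    # lineage of every served artifact attached to the same parent run.
    for precision in ["fp16", "int8"]:
        with mlflow.start_run(run_name=f"package-{precision}", nested=True):
            mlflow.log_param("precision", precision)
            # Placeholder metric values.
            mlflow.log_metric("artifact_size_gb", 13.0 if precision == "fp16" else 7.5)
```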
By the end, you’ll know how to scale inference across all types of models (CV, NLP, and LLM workloads) without overprovisioning or wasting GPU hours.
Eager to dive in early? Explore our in-depth Domino Blueprint article to get hands-on with the concepts before the live session.