Efficient Inference: From Always-On to On-Demand
Stop wasting GPUs on idle models. It's time to make inference smarter.
Sign Up for our Virtual Customer Tech Hour
July 30th, 2025 | 1:00 PM ET / 10:00 AM PT
Join us on July 30th at 1:00 PM ET for a deep dive into how to deploy LLMs in Domino.
We’ll kick things off with a baseline deployment using Domino Model Endpoints, then level it up by integrating the NVIDIA Triton Inference Server for scalable, cost-efficient inference. Rather than keeping endpoints always-on—and GPUs sitting idle—you’ll see how to convert them into lightweight, demand-triggered proxies that dynamically load models on Triton servers. We’ll also showcase how to maximize GPU utilization by serving multiple models concurrently through Triton’s advanced multi-model orchestration.
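To give a flavor of the demand-triggered pattern before the session, here is a minimal Python sketch using the open-source tritonclient library. It assumes a Triton server running in explicit model-control mode at a hypothetical address (triton.internal:8000) and a model named "summarizer" already present in the model repository; the input and output tensor names are placeholders that depend on your model configuration, and this is not the exact code we'll walk through live.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical Triton endpoint and model name; adjust to your deployment.
TRITON_URL = "triton.internal:8000"
MODEL_NAME = "summarizer"

client = httpclient.InferenceServerClient(url=TRITON_URL)

def predict(text: str) -> np.ndarray:
    # Load the model only when a request arrives (requires the server to be
    # started with --model-control-mode=explicit). Repeated calls are cheap:
    # Triton keeps the model in GPU memory until it is explicitly unloaded.
    if not client.is_model_ready(MODEL_NAME):
        client.load_model(MODEL_NAME)

    # Build the request payload; the input name, shape, and dtype below are
    # illustrative and depend on the model's config.pbtxt.
    payload = np.array([[text.encode("utf-8")]], dtype=np.object_)
    infer_input = httpclient.InferInput("TEXT", payload.shape, "BYTES")
    infer_input.set_data_from_numpy(payload)

    result = client.infer(MODEL_NAME, inputs=[infer_input])
    return result.as_numpy("OUTPUT")

# When traffic goes quiet, the proxy can release the GPU for other models:
# client.unload_model(MODEL_NAME)
```

In this pattern, the Domino Model Endpoint stays small because it only brokers requests; the heavyweight model lives in Triton's repository and occupies a GPU only while it is loaded.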
You’ll learn how to:
- Serve more models per GPU with on-demand loading
- Cut endpoint size and startup time for large models
- Separate dev/prod workflows using secure dataset mounts
- Streamline LLMOps using MLflow nested runs and shared datasets (see the sketch after this list)
- Deploy LLMs efficiently without compromising model governance
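For the MLflow point above, here is a small, generic sketch (not Domino-specific code) of how nested runs can group related LLMOps steps, such as packaging variants of one base model, under a single parent run; the experiment name, model names, and metric values are placeholders for illustration only.

```python
import mlflow

# Hypothetical experiment name used purely for illustration.
mlflow.set_experiment("llm-deployment")

with mlflow.start_run(run_name="llama-base-deployment"):
    mlflow.log_param("base_model", "llama-example")

    # One nested child run per packaging/quantization variant keeps the
    # lineage of every served artifact attached to the same parent run.
    for precision in ["fp16", "int8"]:
        with mlflow.start_run(run_name=f"package-{precision}", nested=True):
            mlflow.log_param("precision", precision)
            # Placeholder metric values.
            mlflow.log_metric("artifact_size_gb", 13.0 if precision == "fp16" else 7.5)
```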
By the end, you’ll know how to scale inference across all types of models (CV, NLP, and LLM workloads) without overprovisioning or wasting GPU hours.
Eager to dive in early? Explore our in-depth Domino Blueprint article to get hands-on with the concepts before the live session.