Customer Tech Hour

Efficient Inference: From Always-On to On-Demand

Stop wasting GPUs on idle models. It's time to make inference smarter.


July 30th, 2025 | 1:00 PM ET | 10:00 AM PT

Join Us for Customer Tech Hour!

Tune in on July 30th at 1:00 PM ET for a deep dive into deploying LLMs in Domino.

We’ll kick things off with a baseline deployment using Domino Model Endpoints, then level it up by integrating the NVIDIA Triton Inference Server for scalable, cost-efficient inference. Rather than keeping endpoints always-on—and GPUs sitting idle—you’ll see how to convert them into lightweight, demand-triggered proxies that dynamically load models on Triton servers. We’ll also showcase how to maximize GPU utilization by serving multiple models concurrently through Triton’s advanced multi-model orchestration.
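If you want a feel for the demand-triggered pattern before the session, here's a minimal sketch using Triton's Python HTTP client. It assumes a Triton server started with `--model-control-mode=explicit` (so models can be loaded and unloaded at runtime); the host, model name, and tensor names below are placeholders, not the exact setup we'll demo.

```python
# Minimal sketch of a demand-triggered proxy in front of a Triton server.
# Assumes Triton was launched with --model-control-mode=explicit;
# TRITON_URL, "INPUT__0", and "OUTPUT__0" are hypothetical placeholders.
import numpy as np
import tritonclient.http as triton_http

TRITON_URL = "triton.internal:8000"  # assumption: your Triton host
client = triton_http.InferenceServerClient(url=TRITON_URL)

def predict(model_name: str, batch: np.ndarray) -> np.ndarray:
    # Load the model only when a request actually arrives, instead of
    # keeping it resident (and a GPU committed) around the clock.
    if not client.is_model_ready(model_name):
        client.load_model(model_name)

    infer_input = triton_http.InferInput("INPUT__0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch.astype(np.float32))
    result = client.infer(model_name=model_name, inputs=[infer_input])
    return result.as_numpy("OUTPUT__0")
```

Because loading is lazy (and a cold model can later be evicted with `client.unload_model`), GPU memory is only claimed while requests are actually flowing, which is what lets one server host many models concurrently.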

You’ll learn how to:

  • Serve more models per GPU with on-demand loading
  • Cut endpoint size and startup time for large models
  • Separate dev/prod workflows using secure dataset mounts
  • Streamline LLMOps using MLflow nested runs and shared datasets
  • Deploy LLMs efficiently without compromising model governance

By the end, you’ll know how to scale inference across CV, NLP, and LLM workloads without overprovisioning or wasting GPU hours.

Eager to dive in early? Explore our in-depth Domino Blueprint article to get hands-on with the concepts before the live session.