The enterprise AI event for data science & IT leaders
Join us at Rev, where innovators from leading organizations share how they're driving results across industries.
What if I told you that you could cut your LLM API spending by 60% or more without compromising accuracy? Surprisingly, now you can.
Large Language Models (LLMs) are now part of our everyday lives. Companies use the technology to automate processes, improve customer experiences, build better products, save money, and more.
Hosting your own LLMs is complex. The models offer broad capabilities, but they are expensive to run and demand heavyweight infrastructure and massive amounts of data. That cost and complexity are why most teams rely on prompt engineering instead, often adding retrieval-augmented generation (RAG) to improve context and reduce hallucinations. With both techniques, you offload running the LLMs to providers like OpenAI, Cohere, or Google. But scaling LLM adoption to new use cases, especially with the latest, most powerful models, drives up an API cost that was previously unaccounted for. Weaker models are cheaper, but can you trust them with complex questions? Now, new research shows how to save money and get LLM results that are just as good, and sometimes better.
In the search for lower LLM costs, researchers turned to the concept of LLM cascades. In the dark ages before the launch of ChatGPT, a team from Google and the University of Toronto defined the term as programs that use probability calculations to get the best results from multiple LLMs.
More recently, the FrugalGPT paper defined cascades as sending a user query to a list of LLMs, one after another, from weaker to stronger models, until the answer is good enough. FrugalGPT's cascade uses a dedicated model to determine whether an answer clears a quality threshold.
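The cascade loop itself is simple. Here is a minimal sketch, assuming an ordered list of model callables and an illustrative stand-in for FrugalGPT's learned quality scorer (the paper trains a dedicated model for this; `score_fn` here is a hypothetical placeholder):

```python
# FrugalGPT-style cascade: try models from cheapest to most expensive
# and stop at the first answer that clears the quality threshold.
# `models` is an ordered list of callables (weakest first); `score_fn`
# stands in for the paper's learned answer-quality model.

def cascade(query, models, score_fn, threshold=0.8):
    """Return the first answer whose quality score clears the threshold."""
    answer = None
    for model in models:                      # weakest/cheapest first
        answer = model(query)
        if score_fn(query, answer) >= threshold:
            return answer                     # good enough; skip pricier models
    return answer                             # fall back to the strongest model
```

Because the loop exits early, the expensive models are only ever invoked on the queries the cheap models cannot handle confidently.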
A recent paper titled 'Large Language Model Cascades With Mixture of Thought Representations for Cost-Efficient Reasoning' from George Mason University, Microsoft, and Virginia Tech offers an alternative: a function that can determine whether the answer is good enough without fine-tuning another model.
Instead of using several LLMs, 'Mixture of Thought' (MoT) reasoning uses just two: GPT-3.5 Turbo and GPT-4. The former serves as the 'weaker' LLM, the latter as the 'strong' one. The authors harnessed LLM 'answer consistency' to flag whether a response is good enough: LLMs produce consistent answers to similar prompts when they are confident those answers are correct. So when the weaker LLM's answers are consistent, there is no need to call the stronger LLM. Conversely, inconsistent answers signal low confidence, and that's when you need the stronger LLM to answer the prompt. (Note: you can substitute a weaker/stronger LLM pair of your choice.)
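The escalation rule can be sketched in a few lines. This is an illustrative outline, not the paper's implementation: `weak_llm` and `strong_llm` are hypothetical callables returning answer strings, and repeated sampling of the weak model approximates its confidence:

```python
from collections import Counter

# Consistency-based cascade: sample the weak model several times; if
# enough samples agree, keep the majority answer, otherwise escalate
# the prompt to the strong model.

def cascade_by_consistency(prompt, weak_llm, strong_llm,
                           n_samples=5, agreement=0.8):
    answers = [weak_llm(prompt) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= agreement:   # consistent -> weak model is confident
        return top_answer
    return strong_llm(prompt)            # inconsistent -> escalate to strong LLM
```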
The prompts themselves use few-shot in-context prompting to improve LLM answer quality. Such prompts guide the LLM's response by giving examples of similar questions and answers.
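A few-shot prompt is just worked examples prepended to the question so the model imitates their format and reasoning. A minimal builder, with illustrative example questions of my own rather than the paper's:

```python
# Few-shot in-context prompting: prepend solved examples so the model
# answers the final question in the same style.

FEW_SHOT_EXAMPLES = [
    ("Q: A pen costs $2 and a notebook costs $3. What do 2 pens and 1 notebook cost?",
     "A: 2 * 2 + 3 = 7. The answer is 7."),
    ("Q: There are 12 eggs and 4 are used. How many remain?",
     "A: 12 - 4 = 8. The answer is 8."),
]

def build_few_shot_prompt(question, examples=FEW_SHOT_EXAMPLES):
    shots = "\n\n".join(f"{q}\n{a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"
```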
To improve model reasoning and simplify consistency measurement, the researchers introduce a new prompting technique for reasoning tasks by 'mixing' two prompting techniques:
- Chain of Thought (CoT) prompting, where the model writes out its reasoning step by step in natural language before answering
- Program of Thought (PoT) prompting, where the model expresses its reasoning as executable code
To determine answer consistency, the paper introduces two methods:
- Voting: sample multiple answers from the LLM and check whether enough of them agree
- Verification: compare the answers produced by the two thought representations (CoT and PoT) and accept when they match
Since voting requires multiple prompts, it may be more suitable when a budget exists to guide how many samples to draw.
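Both checks reduce to a few lines. This is a hedged sketch of the two consistency tests, vote-based sampling and cross-representation verification, operating on already-extracted answer strings (the paper extracts these from the CoT and PoT reasoning chains):

```python
from collections import Counter

def vote_consistent(answers, threshold=0.6):
    """Vote-based: accept if the majority answer wins enough of the samples."""
    top, count = Counter(answers).most_common(1)[0]
    return count / len(answers) >= threshold

def verify_consistent(cot_answer, pot_answer):
    """Verification-based: accept if the two thought representations agree."""
    return cot_answer == pot_answer
```

Verification needs only two prompts per question, which is why it fits a fixed per-query cost, while voting lets you trade more samples for a more reliable confidence signal.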
Let's look at how much money the MoT technique saves and its impact on answer accuracy.
The researchers used the following sum to calculate prompt cost:
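The paper reports its own exact accounting; as a hedged sketch, standard token-based API pricing (the scheme OpenAI uses) sums input and output tokens across every call made while answering a question:

```latex
\mathrm{Cost} \;=\; \sum_{i=1}^{N}\left( n_i^{\mathrm{in}}\, p_{\mathrm{in}} \;+\; n_i^{\mathrm{out}}\, p_{\mathrm{out}} \right)
```

where $N$ is the number of API calls, $n_i^{\mathrm{in}}$ and $n_i^{\mathrm{out}}$ are the input and output token counts of call $i$, and $p_{\mathrm{in}}$, $p_{\mathrm{out}}$ are the per-token prices of the model handling that call.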
The results were dramatic:
Domino’s implementation of the paper is available in a new repository, with some adaptations to the implementation described in the paper.
The Domino project template enables you to get started with Mixture of Thought techniques right away. Domino spares you from the need to cobble together Python packages and find the proper hardware or GPU setup. And with Domino FinOps, you can track the ongoing costs of your AI work. You can even chargeback costs to individual business units and departments.
Domino’s enhanced Generative AI and Responsible AI capabilities expand its appeal and empower more enterprises to adopt transformative AI solutions. If you already have access to Domino, download the project to get started. Otherwise, sign up for our weekly demo and see how Domino can help you deliver AI value on time and budget, responsibly.

Subir Mansukhani is Staff Data Scientist at Domino Data Lab. Previously, he was the Co-Founder and Chief Data Scientist at Intuition.AI. He holds a Master’s Degree in Complex Adaptive Systems from Chalmers University. Yuval Zukerman is Domino's head of content. Throughout his technology career, he has helped translate technology into human terms. He holds an ALM in Information Technology from Harvard University.