Full LLM power and 60% cheaper: Model cascade with mixture of thought on Domino
Subir Mansukhani and Yuval Zukerman 2024-02-29 | 7 min read
What if I told you that you could cut your LLM API spending by 60% or more without compromising on accuracy? Surprisingly, now you can.
Large Language Models (LLMs) are now part of our everyday lives. Companies use the technology to automate processes, improve customer experiences, build better products, save money, and more.
LLMs offer broad capabilities, but hosting your own is complex and often expensive: it can require elaborate infrastructure and massive amounts of data. That cost and complexity are why many teams turn to prompt engineering instead, sometimes adding retrieval-augmented generation (RAG) to improve context and reduce hallucinations. With both techniques, you offload running the LLM to the likes of OpenAI, Cohere, or Google. But scaling LLM adoption to new use cases, especially with the latest and most powerful models, can drive up API costs that were previously unaccounted for. Weaker models may be cheaper, but can you trust them with complex questions? New research now shows how to save money and still get results that are as good, and sometimes better.
Get to know LLM Cascades
In the search for lower LLM costs, researchers turned to the concept of LLM Cascades. In the dark ages, before the launch of ChatGPT, a team from Google and The University of Toronto defined this term as programs that use probability calculations to get the best results using multiple LLMs.
More recently, the FrugalGPT paper defined a cascade as sending a user query to a list of LLMs, one after the other, from weaker to stronger, until the answer is good enough. FrugalGPT cascades use a dedicated scoring model to judge whether an answer meets a quality threshold.
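To make the idea concrete, here is a minimal sketch of a weak-to-strong cascade loop. It is not FrugalGPT's code: the `run_cascade` function, the `Model` and `Scorer` types, and the stub models are all hypothetical names made up for illustration, and a real scorer would be a trained model rather than a one-line function.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a "model" is any function that maps a prompt to an answer,
# and a "scorer" maps (prompt, answer) to a quality score between 0 and 1.
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def run_cascade(prompt: str, models: Sequence[Model], scorer: Scorer,
                threshold: float) -> Tuple[str, int]:
    """Query models from weakest to strongest; stop at the first good answer.

    Returns the accepted answer and the index of the model that produced it.
    The strongest model's answer is always accepted, because nothing
    stronger is left to ask.
    """
    if not models:
        raise ValueError("the cascade needs at least one model")
    for index, model in enumerate(models[:-1]):
        answer = model(prompt)
        if scorer(prompt, answer) >= threshold:
            return answer, index
    return models[-1](prompt), len(models) - 1


# Stub models and scorer, standing in for real API calls.
def weak_model(prompt: str) -> str:
    return "maybe 4"


def strong_model(prompt: str) -> str:
    return "4"


def stub_scorer(prompt: str, answer: str) -> float:
    return 0.9 if answer == "4" else 0.2


answer, used = run_cascade("What is 2 + 2?", [weak_model, strong_model],
                           stub_scorer, threshold=0.5)
```

In this toy run the weak model's answer scores below the threshold, so the query falls through to the strong model.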
A recent paper titled 'Large Language Model Cascades With Mixture of Thought Representations for Cost-Efficient Reasoning' from George Mason University, Microsoft, and Virginia Tech offers an alternative: a function that can determine whether the answer is good enough without fine-tuning another model.
Mixture of Thought LLM Cascades
Instead of using several LLMs, 'Mixture of thought' (MoT) reasoning uses just two: GPT-3.5 Turbo and GPT-4. The former is regarded as the 'weaker' LLM, the latter as the 'stronger' LLM. The authors harnessed LLM 'answer consistency' to flag whether the weaker LLM's response is good enough. The premise is that LLMs produce consistent answers to similar prompts when they are confident the answers are correct. Therefore, when the weaker LLM's answers are consistent, there is no need to call the stronger LLM. Conversely, LLMs produce inconsistent answers when they lack confidence. That's when you need the stronger LLM to answer the prompt. (Note: you can also use a weaker/stronger LLM pair of your choice.)
The prompts themselves use few-shot in-context prompting to improve LLM answer quality. Such prompts guide the LLM's response by giving examples of similar questions and answers.
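As a rough illustration of few-shot in-context prompting, the sketch below assembles a prompt from worked examples followed by the new question. The `build_few_shot_prompt` helper and the "Q:/A:" layout are assumptions chosen for this example, not the format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join worked (question, answer) examples, then append the new question.

    The trailing "A:" leaves the answer open for the model to complete.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Two made-up examples that show the model the expected answer style.
examples = [
    ("What is 3 + 4?", "7"),
    ("What is 10 - 6?", "4"),
]
prompt = build_few_shot_prompt(examples, "What is 8 + 5?")
```

The resulting string is what you would send to the LLM in place of the bare question.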
To improve model reasoning and simplify consistency measurement, the researchers introduce a new prompting technique for reasoning tasks by 'mixing' two prompting techniques:
- Chain of Thought (CoT) Prompting encourages LLMs to generate intermediate reasoning steps before arriving at a final answer. Generating these steps helps the model handle complicated tasks and increases answer accuracy.
- Program of Thought (PoT) Prompting builds on Chain of Thought, but asks the model to express its reasoning as code instead of human language. An interpreter then runs that code to compute the final answer.
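When a prompt asks the model to answer with code, something has to run that code to get the final answer. The sketch below shows one way to do that. The instruction strings, the `run_program_of_thought` helper, and the convention of storing the result in a variable called `ans` are all assumptions for illustration, and `model_output` is a hand-written stand-in for what an LLM might return.

```python
# Hypothetical instruction suffixes for the two thought representations.
COT_INSTRUCTION = "Let's think step by step, then state the final answer."
POT_INSTRUCTION = ("Write Python code that solves the problem and stores "
                   "the result in a variable named ans.")


def run_program_of_thought(code: str):
    """Execute model-written code and return the value it stored in `ans`.

    Caution: emptying the builtins only blocks the most obvious misuse.
    It is not a security sandbox, so run untrusted code in real isolation.
    """
    namespace = {}
    exec(code, {"__builtins__": {}}, namespace)
    if "ans" not in namespace:
        raise ValueError("the generated code did not define `ans`")
    return namespace["ans"]


# Hand-written stand-in for a PoT response to a word problem about eggs.
model_output = """
eggs_laid = 16
eggs_eaten = 3
eggs_baked = 4
price_per_egg = 2
ans = (eggs_laid - eggs_eaten - eggs_baked) * price_per_egg
"""
result = run_program_of_thought(model_output)
```

Because the answer comes out as a value rather than free text, it is easy to compare with other answers, which is what the consistency checks rely on.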
To determine answer consistency, the paper introduces two methods:
- Voting: This method samples multiple answers from the LLM, either by sending similar prompts or by varying the response temperature option. It then measures how much the LLM's answers agree with each other. The answer that agrees the most with all the other answers is assumed to be correct. The team also defined an adjustable 'threshold' on the level of agreement, which lets you balance answer consistency against budget constraints.
- Verification: This approach compares the LLM's most consistent answers across two distinct thought representations (e.g., CoT and PoT). The algorithm accepts the weaker LLM’s answer if the two prompt responses are identical.
Since voting requires multiple prompts, it may be more suitable when a budget exists to guide the threshold number.
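The two consistency checks can be sketched in a few lines. This is a simplified reading of the description above, not the paper's code: the `vote` and `verify` helpers are hypothetical, answers are compared as exact strings, and the sample answers are made up.

```python
from collections import Counter
from typing import List, Tuple


def vote(answers: List[str], threshold: float) -> Tuple[str, bool]:
    """Return the most common answer and whether its share of the samples
    reaches the threshold (accept the weaker LLM) or not (escalate)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return top_answer, agreement >= threshold


def verify(cot_answers: List[str], pot_answers: List[str]) -> Tuple[str, bool]:
    """Take the most common answer under each thought representation and
    accept the weaker LLM only if the two agree."""
    if not cot_answers or not pot_answers:
        raise ValueError("need answers from both representations")
    cot_top = Counter(cot_answers).most_common(1)[0][0]
    pot_top = Counter(pot_answers).most_common(1)[0][0]
    return cot_top, cot_top == pot_top


# Made-up samples from the weaker model.
confident = vote(["18", "18", "18", "20", "18"], threshold=0.6)
uncertain = vote(["18", "20", "14", "18", "16"], threshold=0.6)
agreeing = verify(["18", "18", "20"], ["18", "18", "18"])
conflicting = verify(["18", "18", "20"], ["14", "14", "18"])
```

When the second element of the result is `False`, the cascade would send the question to the stronger LLM. Raising the voting threshold makes the check stricter, so more questions get escalated and costs rise.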
The bottom line: Mixture of thought saves you money
Let's look at how much money the MoT technique saves and its impact on answer accuracy.
The researchers used the following sum to calculate prompt cost:
- The cost of prompting the weaker model (because we may prompt it several times)
- The cost of the answer evaluation process
- If the evaluation process rejects the answer, we add the cost of prompting the strong model
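The sum above translates directly into a small function. The `query_cost` helper is hypothetical, and the cost figures are arbitrary units picked to keep the arithmetic easy to follow. They are not real API prices and not results from the paper.

```python
def query_cost(weak_call_cost: int, num_weak_calls: int, evaluation_cost: int,
               strong_call_cost: int, accepted: bool) -> int:
    """Cost of answering one question with the cascade.

    The weaker model and the evaluation are always paid for. The stronger
    model is paid for only when the evaluation rejects the weak answer.
    """
    cost = num_weak_calls * weak_call_cost + evaluation_cost
    if not accepted:
        cost += strong_call_cost
    return cost


# Arbitrary illustrative units: one weak call costs 1, one strong call 30.
WEAK_COST, STRONG_COST = 1, 30

# Four questions; the evaluation accepts the weak answer for three of them.
decisions = [True, True, True, False]
cascade_total = sum(
    query_cost(WEAK_COST, 5, 0, STRONG_COST, accepted) for accepted in decisions
)
strong_only_total = STRONG_COST * len(decisions)
```

The rejected question costs more than a direct call to the strong model, since the weak calls are already spent. The cascade therefore only pays off when the weaker model's answers are accepted often enough.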
The results were dramatic:
- Using MoT variants - combining voting and verification with CoT and PoT - can lead to comparable performance at 40% of the cost of solely using GPT-4.
- In testing against the CREPE Q&A dataset, MoT outperformed GPT-4 at 47% of its cost.
- Mixing PoT and CoT improves decision-making compared to using one of the techniques alone.
- Increasing the threshold when using the voting method did not significantly improve quality despite the additional cost.
- The consistency check reliably identified correct LLM answers. It successfully predicted when to fall back on the strong model to obtain the best results.
Eager to try LLM Cascades with Mixture of Thought?
Domino’s implementation of the paper is available in a new repository. We made some adaptations from the implementation mentioned in the paper.
- The repository code uses Langchain to implement the thought representations.
- The implementation includes Tree of Thought (ToT) as a thought representation for complex reasoning tasks. You can use this thought representation on its own or with MoT.
- To measure answer consistency, the code computes cosine similarity between embeddings of the LLM's responses.
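For readers unfamiliar with embedding-based consistency, here is a minimal sketch of the idea in plain Python. It is not the repository's code. Averaging over all pairs is an assumption, as are the helper names, and the tiny two-dimensional vectors stand in for real embeddings, which would come from an embedding model.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings: List[Sequence[float]]) -> float:
    """Average cosine similarity over every pair of response embeddings.

    Values near 1 mean the responses point the same way (consistent);
    lower values mean they disagree.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings to compare")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy vectors standing in for embeddings of three sampled responses.
consistent = mean_pairwise_similarity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
mixed = mean_pairwise_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Unlike exact string matching, this approach can treat two differently worded responses as agreeing, which helps when answers are free text rather than a single number.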
The Domino project template enables you to get started with Mixture of Thought techniques right away. Domino spares you from the need to cobble together Python packages and find the proper hardware or GPU setup. And with Domino FinOps, you can track the ongoing costs of your AI work. You can even charge costs back to individual business units and departments.
Domino’s enhanced Generative AI and Responsible AI capabilities expand its appeal and empower more enterprises to adopt transformative AI solutions. If you already have access to Domino, download the project to get started. Otherwise, sign up for our weekly demo and see how Domino can help you deliver AI value on time and budget, responsibly.
Subir Mansukhani is Staff Data Scientist at Domino Data Lab. Previously he was the Co-Founder and Chief Data Scientist at Intuition.AI. He holds a Master’s Degree in Complex Adaptive Systems from Chalmers University. Yuval Zukerman is Domino's head of content. Throughout his technology career, he helped translate technology into human. He holds an ALM in Information Technology from Harvard University.