5 steps to lower generative AI (GenAI) costs and maximize impact
Leila Nouri | 2024-03-28 | 13 min read
You've likely experienced the power of generative AI (GenAI) firsthand by using ChatGPT for basic writing and research tasks. ChatGPT has whet your Board's appetite for all things GenAI. Now, your executives want GenAI embedded in every aspect of business, a costly and scary proposition for those paying the bills. For example, according to a SemiAnalysis report, running ChatGPT costs approximately $700,000 per day, or roughly 36 cents per question asked.
Despite its costs, GenAI also has the potential to transform internal processes, employee productivity, and operational efficiencies like no other technology. For example, GenAI is ideal for summarizing vast amounts of written text or automating code generation. GenAI can also work well for cases where hallucinations (producing made-up answers or false content) are acceptable and human reviews are involved.
Ultimately, for the right use cases, the costs of GenAI don't have to be a deal-breaker or throttle experimentation. Now is the perfect time to pause and take a measured approach to prioritization and FinOps for GenAI. Here is how.
Step 1: Start with a strategy and iterate
First, analyze your current state (where GenAI is used and by whom) through a FinOps and optimization lens. Knowing the costs of existing use cases can inform your strategy and determine priorities. An enterprise AI platform like Domino can help you find where GenAI is used today. Domino can also introduce automation and orchestrate the build, operations, and management of these models.
Domino can also help you identify the top cost-driving projects. You can view spending by user, project, and business unit and take action to control project costs. Speak to each team to understand why only GenAI will work. Does the GenAI model save time, boost productivity, or improve operations? Once you know the costs and requirements, create a simple scorecard or matrix. Chart Impact (high, low) on one axis and Costs (high, low) on another and map the current state of your GenAI project portfolio.
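As a minimal sketch of such a scorecard, the Python below sorts projects into the four Impact/Cost quadrants. The project names, impact scores, costs, and cutoffs are all hypothetical placeholders. In practice, the costs would come from your platform's spend reports and the impact scores from your conversations with each team.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    impact: float        # hypothetical 0-10 score agreed with the business
    monthly_cost: float  # observed monthly spend in USD


def quadrant(p, impact_cutoff=5.0, cost_cutoff=10_000.0):
    """Label a project as high/low impact and high/low cost (cutoffs are assumptions)."""
    impact = "high" if p.impact >= impact_cutoff else "low"
    cost = "high" if p.monthly_cost >= cost_cutoff else "low"
    return f"{impact}-impact/{cost}-cost"


def scorecard(projects):
    """Group project names by quadrant."""
    grid = {}
    for p in projects:
        grid.setdefault(quadrant(p), []).append(p.name)
    return grid


# Hypothetical portfolio used only to illustrate the mapping.
portfolio = [
    Project("contract-summarizer", 8.0, 4_000.0),
    Project("marketing-chatbot", 3.0, 25_000.0),
    Project("code-assistant", 9.0, 18_000.0),
    Project("faq-rewriter", 2.0, 1_500.0),
]

if __name__ == "__main__":
    for cell, names in sorted(scorecard(portfolio).items()):
        print(cell, names)
```

Projects that land in the high-impact/low-cost cell are the natural first candidates; those in the low-impact/high-cost cell are the ones to question first.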
Your use cases should leverage GenAI's strengths. For example, GenAI tends to be strong at learning patterns from data and summarizing vast amounts of text while capturing the nuance of language. However, GenAI is weaker at making predictions and in use cases where accuracy matters (e.g., driving executive decisions). In other instances, GenAI is strong, but the costs outweigh the benefits; for example, shopping recommendations based on past customer behaviors could best be delivered with predictive AI, for a fraction of the cost of GenAI.
After you've mapped what's happening today, consider where future investments should go. Adopt a phased approach by prioritizing your use cases. Table low-impact projects, even low-cost ones, while prioritizing high-impact, low-cost projects.
Projects that should get high priority include:
- Use cases that provide long-term value, deliver a competitive advantage, or produce valuable, new intellectual property.
- Use cases touching or transforming customer experience.
- Use cases that significantly drive productivity and allow the reallocation of human capital to generate significant savings.
- Use cases that are most transformative and industry-specific, according to experts and research.
Projects that should get low priority include:
- Use cases that provide short-term value and take relatively little time to execute with predictive AI.
- Use cases where GenAI results and costs match human results and expenses.
- Use cases where predictive AI is already working and delivering cost-effective results.
- Use cases where GenAI answers are costly to obtain but may not be accurate enough for the business to act on.
- Use cases where the question is worth asking, but the answer will not be acted upon. For example, asking a question that costs $10K in compute, whose answer will not necessarily be heeded or result in changed behaviors or actions.
With that priority project list in hand, focus on where only GenAI can boost profitability, user experience, or productivity. With a map of a possible future investment landscape, it's time to consider different models for using GenAI and the costs associated with each.
Step 2: Analyze impact vs. costs across your GenAI toolbox
First, consider the entire spectrum of tools and technologies at your disposal. Assess each tool for its ability to maximize business impact for the lowest level of effort and cost. For example, is GenAI even necessary? Would standard, predictive AI solve the same problem? GenAI is unnecessary for many use cases and, in most cases, is quite costly.
Next, consider how you use GenAI. Project teams will likely use one or more of the three common GenAI techniques. Each has its own strengths, but all can benefit from common oversight and strategy.
Prompt engineering uses a block of text (also known as a “prompt”) to send a request to the large language model (LLM). A prompt is simply structured natural language text that describes a task (instructions) that a GenAI model can understand and execute. This technique can go pretty far and offer good value for your money. Better yet, you can start quickly, compared to fine-tuning a foundation model. You can host your own LLM for prompt engineering, but most companies rely on hosted offerings like GPT from OpenAI.
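To make "structured natural language text" concrete, here is a small sketch of a prompt template in Python. The function name `build_prompt` and the layout of the prompt are illustrative assumptions, not a required format. Sending the prompt to a model is left out because that call depends on the provider you choose.

```python
def build_prompt(task, context, constraints):
    """Assemble a structured prompt: instructions first, then constraints, then the text."""
    lines = [f"Task: {task}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Text:", context.strip()]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        task="Summarize the text in three bullet points.",
        context="(paste the document to summarize here)",
        constraints=["Use plain language.", "Do not add facts that are not in the text."],
    )
    print(prompt)
```

Keeping the template in code makes it easy to review, version, and reuse the same instructions across teams.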
Next, consider retrieval augmented generation (RAG). RAG allows you to mix corporate information with an LLM’s generic knowledge. Developers load corporate information into a vector database. Developers also create an application to handle end-user prompts. On every prompt, the application fetches relevant information from the database and sends it with the prompt to the LLM. RAG improves the accuracy and reliability of GenAI models with facts fetched from your corporate resources. The technique gives you much of the flexibility and accuracy of fully adapted, fine-tuned LLMs for a fraction of the cost.
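The RAG flow described above (load documents, fetch the most relevant ones for each prompt, send them along with the prompt) can be sketched in a few lines of Python. This is a toy under stated assumptions: the "embedding" is a plain word count standing in for a real embedding model, `ToyVectorStore` stands in for a real vector database, the sample documents are invented, and the final call to an LLM is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self._docs = []

    def add(self, text):
        self._docs.append((text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_rag_prompt(question, store, k=2):
    """Fetch the k most similar passages and prepend them to the user's question."""
    context = "\n".join(f"- {p}" for p in store.search(question, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical corporate snippets.
store = ToyVectorStore()
store.add("Refunds are issued within 14 days of a return request.")
store.add("The Lisbon office is closed on public holidays.")
store.add("Enterprise support tickets are answered within 4 hours.")

if __name__ == "__main__":
    print(build_rag_prompt("How long do refunds take?", store, k=1))
```

The cost lever is visible in `k`: every retrieved passage adds tokens to each request, so retrieving only what is needed keeps per-prompt spend down.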
Another option to consider is fine-tuning. Fine-tuning adjusts a foundation model's knowledge to meet corporate needs, running like a “private” ChatGPT. This is ideal for high-value datasets or sensitive data. However, it takes time and often requires you to host your own model, which can be costly. Fine-tuning also requires large data volumes to work properly.
Fine-tuning an entire model can be very expensive. To overcome budgetary barriers, consider using model compression techniques to reduce the model's memory footprint and inference run time. You can also leverage advances in open source: use efficient libraries to reduce the coding effort, and clean and preprocess your data to improve training efficiency.
Finally, you can invest in fine-tuning smaller LLMs. Consider using smaller models with fewer parameters (~20B). Unlike larger models (70B+) that require the most sought-after and pricey infrastructure, these models have more modest requirements. They also have a more limited impact on your cloud and compute costs in the long term.
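A rough back-of-the-envelope calculation shows why parameter count and compression matter. The sketch below estimates the memory needed just to hold a model's weights, as parameters multiplied by bytes per parameter. It is a lower bound under simplifying assumptions: it ignores activations, the inference cache, and the optimizer state that fine-tuning adds, and real quantization schemes carry some overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision="fp16"):
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (20, 70):
        for precision in ("fp16", "int8", "int4"):
            print(f"{size}B @ {precision}: {weight_memory_gb(size, precision):.0f} GB")
```

By this estimate, a 20B model at 16-bit precision needs about 40 GB for weights versus about 140 GB for a 70B model, and 4-bit compression brings the 20B model down to roughly 10 GB.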
Step 3: Analyze and optimize infrastructure for GenAI
Cloud optimization: Leverage cloud services with pay-as-you-go models and use Kubernetes and containerization for more efficient resource allocation. Domino's architecture is based on Kubernetes, allowing limitless scaleup capabilities, portability, full automation, and scheduling.
Flexible hybrid and multi-cloud infrastructure: Domino Nexus offers flexible infrastructure options for GenAI. Nexus allows you to run GenAI workloads across multiple cloud vendors and locations. You can use the best or lowest cost service and avoid cloud vendor lock-in. Nexus enables you to run your model and data at the same locale. You avoid the time and cost of transferring datasets and comply with data privacy and sovereignty laws. Nexus enables a hybrid infrastructure for GenAI workloads. That way, you can use on-premises GPUs and hardware investments when available instead of relying exclusively on costly cloud resources.
In addition, Domino Nexus prevents data movement out of a cloud region by bringing compute resources to where your data resides, thereby reducing the chance of unnecessary data transfer fees and preserving data locality for compliance.
Step 4: Improve resource management
Create hardware tiers: Domino allows you to keep low-priority GenAI work off expensive, powerful GPUs and to release those GPUs when high-priority use cases or GenAI results must be fast-tracked. This way, you can reserve specialized hardware capacity (e.g., GPUs, TPUs, or FPGAs) for workloads and projects prioritized by the business.
Use on-demand workspaces with auto-shutdown: Domino offers on-demand workspaces with auto-shutdown and auto-pausing, so idle workspaces do not run up costs.
Set storage quotas and prevent data movement fees in the cloud: Domino lets you set storage quotas to minimize the risk of duplicate or redundant datasets that can increase cloud storage costs.
Harness compute clusters: Compute clusters are multiple machines that perform computations in parallel. While they can be tricky to set up and expensive to manage, Domino automates the complexity and minimizes costs. With Domino, clusters can autoscale based on the need to deliver high-priority work faster. Once the workload execution ends, Domino automatically shuts the cluster down. Cluster frameworks differ in their strengths, and Domino offers you the choice of Ray, Spark, Dask, and MPI. All clusters are available in Domino at the click of a button. The result: faster model development and higher productivity for your team.
Tap Domino apps and model APIs: Domino apps and model APIs can reduce the need to stand up serving infrastructure for your models. These apps and model APIs run in Domino with zero involvement from DevOps and sufficient power to support enterprise needs.
Offload computation with serverless functions: Tap serverless functions or cloud-based services for specific tasks to save time and money.
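To illustrate the auto-shutdown idea from this step, the sketch below shows the kind of idle-time rule such a policy applies, plus the cost it avoids. It is not Domino's implementation or API: the workspace names, timestamps, one-hour idle limit, and hourly rate are all hypothetical.

```python
from datetime import datetime, timedelta


def idle_workspaces(last_activity, now, idle_limit=timedelta(hours=1)):
    """Return names of workspaces with no activity for at least idle_limit."""
    return sorted(name for name, seen in last_activity.items() if now - seen >= idle_limit)


def avoided_cost(idle_hours, hourly_rate_usd):
    """Spend avoided by stopping a workspace instead of leaving it running idle."""
    return idle_hours * hourly_rate_usd


# Hypothetical snapshot of when each workspace was last used.
now = datetime(2024, 3, 28, 18, 0)
last_activity = {
    "alice-gpu-notebook": datetime(2024, 3, 28, 17, 45),
    "bob-finetune-session": datetime(2024, 3, 28, 9, 0),
    "carol-rag-demo": datetime(2024, 3, 28, 16, 30),
}

if __name__ == "__main__":
    print("Stop:", idle_workspaces(last_activity, now))
    # A GPU workspace left on for 14 hours overnight at an assumed $3.00/hour.
    print("Avoided per night (USD):", avoided_cost(14, 3.0))
```

Even at modest hourly rates, workspaces left running overnight and on weekends add up quickly across a large team, which is why an automatic rule beats relying on people to remember.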
Step 5: Track and control all costs with FinOps software
Once your models — GenAI or predictive — run on Domino, you can tap into these platform capabilities:
Granular attribution of infrastructure spend: Domino aggregates compute and storage spend by user, project, org, clusters, and more, so you can see real-time cost drivers and take action.
Budgets & alerts: Domino lets you be proactive by creating budgets and spending limits for teams and projects so you can quickly catch cost overruns with alerts.
Support chargebacks: Many GenAI use cases require chargebacks, so recovering the budget without much manual work is a top priority for larger organizations. Domino allows you to tie infrastructure utilization directly to teams, groups, and users for simpler chargebacks without error-prone manual work.
Organize cloud bills: Managing multiple cloud provider bills and discounts can be challenging in large organizations embarking on GenAI projects. Domino gives you a single pane of glass to view and reconcile all cloud provider bills (including specialized hardware tiers used and special discounts).
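The attribution, budget, and chargeback capabilities above all rest on the same basic operation: grouping usage records by an owner and comparing the totals to a limit. The sketch below shows that logic in plain Python. The record fields, names, amounts, and the 80% warning threshold are hypothetical, and this is not Domino's data model or API.

```python
from collections import defaultdict


def spend_by(records, key):
    """Total cost grouped by any attribution field (user, project, team, ...)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(totals)


def budget_alerts(totals, budgets, warn_at=0.8):
    """Flag owners that are over budget or have used at least warn_at of it."""
    alerts = []
    for name, spent in sorted(totals.items()):
        budget = budgets.get(name)
        if budget is None:
            continue  # no budget defined for this owner
        if spent >= budget:
            alerts.append((name, "over budget"))
        elif spent >= warn_at * budget:
            alerts.append((name, "approaching budget"))
    return alerts


# Hypothetical usage records and monthly project budgets in USD.
usage = [
    {"user": "alice", "project": "contract-summarizer", "team": "legal", "cost_usd": 600.0},
    {"user": "bob", "project": "contract-summarizer", "team": "legal", "cost_usd": 300.0},
    {"user": "carol", "project": "code-assistant", "team": "engineering", "cost_usd": 1200.0},
    {"user": "alice", "project": "faq-rewriter", "team": "support", "cost_usd": 100.0},
]
budgets = {"contract-summarizer": 1000.0, "code-assistant": 1000.0, "faq-rewriter": 500.0}

if __name__ == "__main__":
    print("Alerts:", budget_alerts(spend_by(usage, "project"), budgets))
    print("Chargeback by team:", spend_by(usage, "team"))
```

Grouping the same records by `team` instead of `project` yields the per-team totals a chargeback needs, which is why consistent tagging of every workload matters more than any single report.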
Conclusion
Ultimately, the most cost-effective GenAI strategy weighs its business impact and costs and leverages a FinOps approach to cost management. With the right strategy, GenAI experiments can take off and scale without dire cost implications and with transformative business impact. Learn more about Domino FinOps and orchestrating and governing GenAI on Domino.
Leila Nouri, Director of Product Marketing at Domino Data Lab, is an innovative and data-driven product marketing leader with 15+ years of experience building high-performing teams, go-to-market campaigns, and new revenue streams for startups and Fortune 500 companies.