In the world of large language models (LLMs), the cost of computation can be a significant barrier, especially for extensive projects. I recently embarked on a project that required running 4,000,000 prompts with an average input length of 1,000 tokens and an average output length of 200 tokens. That’s nearly 5 billion tokens! The traditional approach of paying per token, as is common with models like GPT-3.5 and GPT-4, would have resulted in a hefty bill. However, I discovered that by leveraging open-source LLMs, I could shift the pricing model to paying per hour of compute time, leading to substantial savings. This article will detail the approaches I took and compare and contrast each of them. Please note that while I share my experience with pricing, prices are subject to change and may vary depending on your region and specific circumstances. The key takeaway here is the potential cost savings of leveraging open-source LLMs and renting a GPU per hour, rather than the specific prices quoted. If you plan on utilizing my recommended solutions for your project, I’ve left a couple of affiliate links at the end of this article.
I conducted an initial test using GPT-3.5 and GPT-4 on a small subset of my prompt input data. Both models demonstrated commendable performance, but GPT-4 consistently outperformed GPT-3.5 in the majority of cases. To give you a sense of the cost, running all 4 million prompts through the OpenAI API would look something like this:
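As a rough sketch of that estimate, the snippet below multiplies the workload (4M prompts, 1,000 input and 200 output tokens each) by per-1K-token rates. The rates themselves are my assumption, based on OpenAI's published pricing at the time; they are not quoted in this article, though the GPT-3.5 Turbo figure they produce lines up with the $7,600 mentioned below.

```python
# Estimate total API cost for the workload described above.
# NOTE: the per-1K-token prices are assumed from OpenAI's pricing
# page at the time of writing, not taken from this article.

NUM_PROMPTS = 4_000_000
AVG_INPUT_TOKENS = 1_000
AVG_OUTPUT_TOKENS = 200

# model -> (input $/1K tokens, output $/1K tokens), assumed rates
PRICING = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
}

for model, (in_rate, out_rate) in PRICING.items():
    input_cost = NUM_PROMPTS * AVG_INPUT_TOKENS / 1000 * in_rate
    output_cost = NUM_PROMPTS * AVG_OUTPUT_TOKENS / 1000 * out_rate
    print(f"{model}: ${input_cost + output_cost:,.0f}")
```

Under these assumed rates, GPT-3.5 Turbo comes out to roughly $7,600 and GPT-4 to roughly $168,000 for the full run, which is what makes the per-hour GPU alternative so attractive.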
While GPT-4 did offer some performance benefits, the cost was disproportionately high compared to the incremental performance it added to my outputs. Conversely, GPT-3.5 Turbo, although more affordable, fell short in terms of performance, making noticeable errors on 2–3% of my prompt inputs. Given these factors, I wasn’t prepared to invest $7,600 on a project that was…