Integrating large language models (LLMs) into real-world enterprise applications is a pressing need. However, generative AI is evolving so quickly that most teams can’t keep up with the advancements.
One solution is to use managed services like the ones provided by OpenAI. They offer a streamlined path, but for those who lack access to such services or who prioritize security and privacy, there is an alternative: open-source tools.
Open-source generative AI tools are extremely popular right now, and companies are scrambling to get their AI-powered apps out the door. While trying to build quickly, companies often forget that to truly gain value from generative AI they need to build production-ready apps, not just prototypes.
In this article, I want to show you the performance difference for Llama 2 using two different inference methods. The first method is a containerized Llama 2 model served via FastAPI, a popular choice among developers for serving models as REST API endpoints. The second method is the same containerized model served via Text Generation Inference (TGI), an open-source library developed by Hugging Face to easily deploy LLMs.
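To make the two setups concrete from a client's point of view, here is a minimal sketch of how a request to each server might be built. It is an illustration under stated assumptions, not the setup used in this article: the hosts and ports are placeholders, and the FastAPI route and its JSON schema (`prompt`, `max_new_tokens`) are hypothetical, since a custom FastAPI app can define any shape it likes. The TGI payload follows the `inputs` / `parameters` layout of its `/generate` route.

```python
import json
import urllib.request

# Assumed addresses; the article does not specify hosts or ports.
FASTAPI_URL = "http://localhost:8000/generate"  # hypothetical route on a custom FastAPI app
TGI_URL = "http://localhost:8080/generate"      # TGI's /generate route; host and port are assumed


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fastapi_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    # Hypothetical schema: a hand-written FastAPI endpoint decides its own fields.
    payload = {"prompt": prompt, "max_new_tokens": max_new_tokens}
    return build_request(FASTAPI_URL, payload)


def tgi_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    # TGI expects the prompt under "inputs" and generation settings under "parameters".
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return build_request(TGI_URL, payload)


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With both containers running, `send(tgi_request("Hello"))` and `send(fastapi_request("Hello"))` would exercise the same model through the two serving paths, which is the comparison this article is about.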
Both methods are meant for real-world use in business applications, but it’s important to realize that they don’t scale the same way. We’ll dive into this comparison to see how each performs and to understand the differences better.
What powers LLM inference at OpenAI and Cohere
Have you ever wondered why ChatGPT is so fast?
Large language models require a ton of computing power, and due to their sheer size they often need multiple GPUs. When working with large GPU clusters, companies have to be very mindful of how their compute is being utilized.
LLM providers like OpenAI run large GPU clusters to power inferencing for their models. In order to squeeze as much…