Three challenges in deploying generative models in production | by Aliaksei Mikhailiuk | Aug, 2023


How to deploy Large Language and Diffusion models for your product without scaring the users away.

Image generated by the Author in SDXL 1.0.

OpenAI, Google, Microsoft, Midjourney, StabilityAI, CharacterAI and many more — everyone is racing to bring the best solution for text-to-text, text-to-image, image-to-image and image-to-text models.

The reason is simple: the space offers a vast field of opportunities. After all, it unlocks not only entertainment but also utility that was previously out of reach, from better search engines to more impressive and personalized ad campaigns and friendly chatbots, like Snap's MyAI.

And while the space is very fluid, with lots of moving parts and model checkpoints released every few days, there are challenges that every company working with Generative AI is looking to address.

Here, I will talk about the major challenges in deploying generative models in production and how to address them. While there are many kinds of generative models, in this article I will focus on recent advancements in diffusion and GPT-based models. However, many of the topics discussed here apply to other models as well.

Generative AI broadly describes a set of models that can generate new content. The widely known Generative Adversarial Networks do so by learning the distribution of real data and generating new samples by mapping random noise through that learned distribution.

The recent boom in Generative AI comes from models attaining human-level quality at scale. The reason this transformation is possible is simple: we only now have enough compute power (hence NVIDIA's skyrocketing stock price) for training and serving models with enough capacity to achieve high-quality results. The current advancement is fuelled by two base architectures: transformers and diffusion models.

Perhaps the most significant breakthrough of the past year was OpenAI's ChatGPT, a text-based generative model, with 175 billion parameters for one of the latest GPT-3.5 versions, and a knowledge base sufficient to maintain conversations on various topics. While ChatGPT is a single-modality model, as it only supports text, multimodal models can take and produce several kinds of data, e.g. text and images.

Image-to-text and text-to-image multimodal architectures operate in a latent space shared by textual and image concepts. The latent space is obtained by training on a task requiring both concepts (for example, image captioning) by penalizing the distance in the latent space between the same concept in two different modalities. Once this latent space is obtained, it can be re-used for other tasks.
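As a toy illustration of the training signal described above (a generic sketch, not the code of any particular model), the objective pulls the image embedding and the text embedding of the same concept together in the shared latent space. The `alignment_loss` function and the two-dimensional embeddings below are hypothetical stand-ins:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def alignment_loss(image_emb, text_emb):
    """Penalize the distance between the two modalities' embeddings
    of the same concept: the loss is 0 when they point the same way."""
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Perfectly aligned embeddings incur no penalty ...
print(alignment_loss([1.0, 0.0], [2.0, 0.0]))  # 0.0
# ... while orthogonal ones are maximally penalized here.
print(alignment_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In practice this penalty is computed over batches of matched and mismatched pairs (as in contrastive training), but the core idea is the same: minimizing the distance in the shared space for matching concepts.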

Example of an Image-to-Text model. Image by the Author.

Notable generative models released this year include DALL·E and Stable Diffusion (text-to-image / image-to-image) and BLIP (image-to-text). DALL·E-style models take a prompt, or an image together with a prompt, and generate an image in response, while BLIP-based models can answer questions about the contents of a picture.

Unfortunately, there is no free lunch when it comes to machine learning, and large-scale generative models stumble upon a few challenges when it comes to their deployment in production — size and latency, bias and fairness, and the quality of the generated results.

Model size and latency

Model size trends. Data from P. Villalobos. Image by the Author

State-of-the-art GenAI models are huge. For example, Meta's text-to-text LLaMA models range between 7 and 65 billion parameters, and GPT-3.5 has 175B parameters. These numbers are justified: in a simplified world, the rule of thumb is that the larger the model and the more data used for training, the better the quality.
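To make the "bigger model, more data" intuition concrete, a commonly used back-of-the-envelope estimate (a rule of thumb from the scaling-laws literature, not from the models' authors) puts training compute at roughly 6 FLOPs per parameter per training token:

```python
def training_flops(n_params, n_tokens):
    """Rough rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# A hypothetical 7B-parameter model trained on 1 trillion tokens:
flops = training_flops(7e9, 1e12)
print(f"{flops:.1e} FLOPs")  # 4.2e+22 FLOPs
```

Scaling either factor by 10x scales the compute bill by 10x, which is why only a handful of companies can afford to train these models from scratch.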

Text-to-image models, while smaller, are still significantly bigger than their Generative Adversarial Network predecessors: Stable Diffusion 1.5 checkpoints are just under 1B parameters (taking over three gigabytes of space), and DALL·E 2 has 3.5B parameters. Few GPUs have enough memory to hold these models, and you would typically need a fleet of them to serve a single large model, which can become very costly very quickly, not to mention deploying these models on mobile devices.
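A quick, simplified way to see why serving is hard: the raw weight memory is roughly the parameter count times the bytes per parameter (ignoring activations, caches and optimizer state, which add more on top). The numbers below are illustrative:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate weight storage: parameters x bytes per parameter.
    bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8."""
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter model in fp16 needs ~13 GB just for weights,
# already close to the capacity of a single 16 GB GPU.
print(f"{weight_memory_gb(7e9, 2):.1f} GB")
# The same model quantized to int8 roughly halves that.
print(f"{weight_memory_gb(7e9, 1):.1f} GB")
```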

Generative models take time to produce their output. For some, the latency is due to their size: propagating the signal through several billions of parameters takes time, even on a fleet of GPUs. For others, it is due to the iterative nature of producing high-quality results. Diffusion models, in their default configuration, take 50 denoising steps to generate an image, and reducing the number of steps degrades the quality of the output image.
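The iterative cost can be seen in a stripped-down sampling loop (a toy stand-in, not an actual diffusion sampler): each step is one full forward pass through the denoiser, so latency scales roughly linearly with the step count. The `fake_denoiser` below is a hypothetical placeholder for that model pass:

```python
def sample(denoise_step, num_steps=50):
    """Toy diffusion-style sampling loop: start from 'noise' and
    refine it once per step. Each call to denoise_step stands in
    for a full forward pass through the model."""
    x = 1.0  # pretend this scalar is pure noise
    calls = 0
    for _ in range(num_steps):
        x = denoise_step(x)
        calls += 1
    return x, calls

# A fake denoiser that moves the sample halfway toward the 'image' at 0.
fake_denoiser = lambda x: x * 0.5
_, calls_default = sample(fake_denoiser, num_steps=50)
_, calls_fast = sample(fake_denoiser, num_steps=8)
print(calls_default, calls_fast)  # 50 8
```

Cutting from 50 to 8 steps cuts the denoiser work by more than 6x, which is exactly why step reduction (with schedulers and training tricks that preserve quality) is such an attractive optimization.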

Solutions: Making the model smaller often helps make it faster; distilling, compressing and quantizing the model also reduce the latency. Qualcomm has paved the way by compressing the Stable Diffusion model enough to be deployed on mobile. Recently, smaller, distilled and much quicker versions of Stable Diffusion (tiny and small) have been released.
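As a minimal illustration of the quantization idea (a generic symmetric int8 scheme, not the exact recipe any particular toolkit uses), weights are mapped to 8-bit integers plus a single scale, quartering storage relative to fp32 at the cost of rounding error:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.812, -0.334, 0.05, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Values come back only approximately; the rounding error
# is the price of 4x smaller storage versus fp32.
print(max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored)))
```

Real deployments quantize per-channel or per-group and may keep sensitive layers in higher precision, but the storage/accuracy trade-off is the same.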

Model-specific optimization can also speed up inference. For diffusion models, one might generate low-resolution output and then upscale it, or use fewer steps with a different scheduler, as some schedulers work best with a low number of steps while others generate superior quality over more iterations. For example, Snap recently showed that eight steps are enough to create high-quality results with Stable Diffusion 1.5, employing various optimizations at training time.

Compiling the model with, for example, NVIDIA's TensorRT or torch.compile can substantially reduce the latency with minimal engineering effort.

Bias, fairness and safety

Have you ever tried to break ChatGPT? Many have succeeded in uncovering bias and fairness issues, and kudos to OpenAI for doing a great job addressing them. Without fixes at scale, chatbots can create real-world problems by propagating harmful and unsafe ideas and behaviours.

Examples of people breaking the model cluster around politics (for instance, ChatGPT would refuse to create a poem about Trump but would create one about Biden), gender and jobs (implying that some professions are for men and some for women), and race.

Like text-to-text models, text-to-image and image-to-text models also exhibit bias and fairness issues. The Stable Diffusion 2.1 model, when asked to generate images of a doctor and a nurse, produces a white male for the former and a white female for the latter. Interestingly, the bias depends on the country specified in the prompt, e.g. a Japanese doctor or a Brazilian nurse.


