The CLIP Foundation Model. Paper Summary— Learning Transferable… | by Sascha Kirch | Aug, 2023


  1. Context & Background
  2. Method
  3. Experiments
  4. Further Readings & Resources

CLIP (Contrastive Language-Image Pre-Training) is a multi-modal model that learns the correspondence between natural language and images. It is trained on 400 million text-image pairs collected from the internet. As we will discover later in this article, CLIP has strong zero-shot performance, meaning it performs well on downstream tasks different from those it was trained on, without any fine-tuning.

CLIP aims to:

  1. Apply the success of large-scale pre-training techniques known from natural language processing (e.g. GPT family, T5 and BERT) to computer vision.
  2. Enable flexible zero-shot capabilities by using natural language instead of a fixed set of class labels.

Why is this a big deal, you might ask? First, many computer vision models are trained on crowd-sourced, labeled datasets. These datasets often contain hundreds of thousands of samples; some exceptions reach single- or double-digit millions of samples. As you can imagine, labeling is a very time-consuming and costly process. Datasets for natural language models, on the other hand, are usually several orders of magnitude larger and are scraped from the internet. Second, if an object detection model has been trained on a certain set of classes and you want to add an extra class, you need to label this new class in your data and retrain the model.

CLIP’s ability to combine natural language and image features in combination with its zero-shot performance has led to a wide adoption in many other popular foundation models such as UnCLIP, EVA, SAM, Stable Diffusion, GLIDE or VQGAN-CLIP, to name a few.

Now let’s dive into CLIP’s method. Fig. 1 below shows the architecture of CLIP and the process of how it is trained.

Fig. 1 — CLIP’s Architecture and training process. Image Source + annotations by author

The model architecture consists of two encoder models, one for each modality. For the text encoder a transformer was used while the image encoder uses either a version of ResNet or ViT (Vision Transformer). A learned linear transformation, one for each modality, transforms the features into embeddings of matching size. Finally, the cosine similarity is calculated between each of the embeddings of opposing modality and is scaled by a learned temperature scalar. During training, the cosine similarity between matching pairs is maximized while it is minimized for incorrect pairs, hence the term “contrastive” in the framework’s name.
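The symmetric contrastive objective described above can be sketched in a few lines of NumPy, following the pseudocode published in the CLIP paper. In the real model the temperature is a learned parameter and training runs over very large batches; the fixed temperature value here is just for illustration.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss as sketched in the CLIP paper.

    image_emb, text_emb: (N, d) arrays of unnormalized features for N
    matching image-text pairs (row i of each is a positive pair).
    """
    # L2-normalize so dot products equal cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) matrix of scaled pairwise cosine similarities (logits).
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Numerically stable log-softmax over each row; the correct
        # pairings sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_softmax).mean()

    loss_images = cross_entropy(logits)    # image -> text direction
    loss_texts = cross_entropy(logits.T)   # text -> image direction
    return (loss_images + loss_texts) / 2
```

Maximizing the diagonal similarities while minimizing the off-diagonal ones is exactly what makes the matching pairs attract and the incorrect pairs repel.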

There are some subtleties, beyond the large dataset of course, that are crucial to CLIP’s success. First, the contrastive learning approach strongly depends on the batch size N: the more negative samples are provided alongside the correct ones, the stronger the learning signal. CLIP was trained with a batch size of 32,768, which is quite large. Second, CLIP does not learn to match the exact wording; instead it solves an easier proxy task of treating the text as a whole, also called a bag of words (BoW).

Fun Fact: The version of CLIP using a ResNet50x64 as image encoder was trained for 18 days on 592 V100 GPUs, while the version with the ViT model was trained for 12 days on 256 V100 GPUs. In other words, over 29 years and over 8 years on a single GPU, respectively (ignoring the fact that a different batch size would be used).
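Those single-GPU equivalents follow from a quick back-of-the-envelope calculation:

```python
# Rough single-GPU equivalents for the quoted training runs.
resnet_gpu_days = 592 * 18   # ResNet50x64 encoder: 592 V100s for 18 days
vit_gpu_days = 256 * 12      # ViT encoder: 256 V100s for 12 days

print(resnet_gpu_days / 365)  # ≈ 29.2 GPU-years
print(vit_gpu_days / 365)     # ≈ 8.4 GPU-years
```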

Once the model is trained, it can be used to classify images. The question is: how do you perform classification with a model that has neither been trained to classify images nor takes class labels as input, but text prompts? Fig. 2 shows how:

Fig. 2 — CLIP’s Architecture for image classification. Image Source + annotations by author

A class label can be seen as a text prompt formed by a single word. To tell the model which classes are available for the classification task, a set of N classes is input into the model. This is a huge advantage over classification models trained on a fixed set of labels: we can now input 3 classes or 100; it’s our choice. As we will see later, to improve CLIP’s performance, each class label is transformed into a prompt that provides further context to the model. Each prompt is fed to the text encoder and transformed into an embedding vector.

The input image is fed into the image encoder to obtain the embedding vector.

Then the cosine similarity is calculated between the image embedding and each text embedding. A softmax is applied to the obtained similarity values to form a probability distribution, and the class with the highest probability is selected as the final prediction.
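The classification procedure from Fig. 2 can be sketched as follows. The toy vectors stand in for real encoder outputs, and the fixed temperature value is an assumption for illustration:

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def zero_shot_classify(image_emb, class_text_embs, class_names, temperature=0.07):
    """Zero-shot classification as in Fig. 2: cosine similarity between
    the image embedding and one text embedding per class, then a softmax.

    image_emb: (d,) image feature; class_text_embs: (N, d) prompt features.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb / temperature
    probs = softmax(sims)
    return class_names[int(np.argmax(probs))], probs
```

Because the class set is just a list of prompts, swapping in 3 classes or 100 requires no retraining, only new text embeddings.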

The CLIP paper presents a vast number of experiments and ablations. Here we will cover five that I think are important for understanding CLIP’s success. Up front, the takeaways (as formulated by the authors of CLIP), and then the details:

  1. Training Efficiency: CLIP is much more efficient at zero-shot transfer than our image caption baseline
  2. Text Input Format: Prompt engineering and ensembling improve zero-shot performance
  3. Zero-Shot Performance: Zero-shot CLIP is competitive with fully supervised baseline
  4. Few-Shot Performance: Zero-shot CLIP outperforms few-shot linear probes
  5. Distribution Shift: Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models

Training Efficiency

During training, the image encoder and the text encoder are trained jointly, meaning with a single training objective and at the same time. Not only does CLIP use a contrastive learning scheme, it also compares each text prompt as a whole against a given image, so the exact order of words does not need to be predicted. The caption is effectively treated as a “bag of words”: under this view, the phrase “my name is Sascha” provides essentially the same training signal as “Sascha name is my”.
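A toy check of the bag-of-words view using Python’s `Counter`: two phrases with the same words in different orders reduce to the same multiset.

```python
from collections import Counter

# Under a bag-of-words view, only the multiset of words matters,
# so these two phrases are indistinguishable:
a = Counter("my name is Sascha".split())
b = Counter("Sascha name is my".split())
print(a == b)  # True
```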

Predicting a bag of words instead of the exact words and their positions in a phrase is a much easier proxy objective. Fig. 3 below shows the zero-shot ImageNet accuracy over the number of training samples for three models: an initial transformer trained to predict exact words, the same transformer trained to predict a bag of words, and the CLIP model, which performs contrastive learning using a bag of words.

“CLIP is much more efficient at zero-shot transfer than our image caption baseline” — CLIP Authors

Fig. 3 — Zero-shot efficiency. Image Source + annotations by author

Text Input Format

As we have seen in Fig. 2, to perform object classification the class label is converted into a text prompt. This was not by chance: although CLIP would accept a single word, prompts were used to leverage the descriptiveness of language and to provide context that resolves possible ambiguities. Take the word “boxer”, for example: it could be a breed of dog or a type of athlete. The authors of CLIP showed that the format of the text prompt matters a lot and can boost performance as well as increase efficiency.

“Prompt engineering and ensembling improve zero-shot performance” — CLIP Authors

Fig. 4— Prompt engineering and ensembling vs. contextless class names. Image Source + annotations by author
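Prompt ensembling can be sketched as follows. Here `encode_text` is a hypothetical stand-in for the CLIP text encoder, and the templates are illustrative examples, not the paper’s exact set:

```python
import numpy as np

def ensemble_class_embedding(label, encode_text):
    """Prompt ensembling: embed several templated prompts for one class
    and average the normalized embeddings into a single class embedding.

    encode_text: callable mapping a string to a (d,) feature vector
    (a stand-in for the CLIP text encoder).
    """
    # Illustrative templates; the paper uses a larger, curated set.
    templates = [
        "a photo of a {}.",
        "a blurry photo of a {}.",
        "a drawing of a {}.",
    ]
    embs = np.stack([encode_text(t.format(label)) for t in templates])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)  # re-normalize the average
```

The averaged embedding is then used in place of a single prompt embedding during zero-shot classification, at no extra cost at inference time since it is computed once per class.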

Zero-Shot Performance

In another experiment, the authors compared the zero-shot image classification performance of CLIP against a model that was trained specifically on the dataset under comparison.

“Zero-shot CLIP is competitive with fully supervised baseline” — CLIP Authors

Fig. 5— Zero-Shot CLIP vs. Supervised baseline. Image Source + annotations by author

Few-Shot Performance

While zero-shot predictors are not fine-tuned on the downstream task, few-shot predictors are. The authors took multiple publicly available pre-trained models and compared their few-shot performance on 20 different datasets against zero-shot and few-shot CLIP. The few-shot models were fine-tuned on 1, 2, 4, 8 and 16 examples per class.

Interestingly, zero-shot CLIP performs roughly as well as 4-shot CLIP.

When comparing CLIP to other models, one must consider that the publicly available models under comparison (i.e. BiT, SimCLR and ResNet) were pre-trained on different and smaller datasets than the CLIP model.

“Zero-shot CLIP outperforms few-shot linear probes” — CLIP Authors

Fig. 6— Few-shot performance. Image Source + annotations by author

Distribution Shift

Generally speaking, a model’s robustness to distribution shift refers to its ability to perform as well on data from a different distribution as on data from the distribution it was trained on. Ideally, it would perform equally well; in reality, its performance drops.

The robustness of zero-shot CLIP has been compared to a ResNet101 ImageNet model. Both models are evaluated on natural distribution shifts of ImageNet, as depicted in Fig. 7.

“Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models” — CLIP Authors

Fig. 7 — Distribution shift. Image Source + annotations by author

As mentioned at the beginning of this article, CLIP has been widely adopted by a vast number of projects.

Following is a list of papers using CLIP:

  1. [UnCLIP] Hierarchical Text-Conditional Image Generation with CLIP Latents
  2. [EVA] Exploring the Limits of Masked Visual Representation Learning at Scale
  3. [SAM] Segment Anything
  4. [Stable Diffusion] High-Resolution Image Synthesis with Latent Diffusion Models
  5. [GLIDE] Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  6. [VQGAN-CLIP] Open Domain Image Generation and Editing with Natural Language Guidance
