A Deep Dive into the Code of the Visual Transformer (ViT) Model | by Alexey Kravets | Aug, 2023

Breaking down the HuggingFace ViT Implementation

Vision Transformer (ViT) stands as a remarkable milestone in the evolution of computer vision. ViT challenges the conventional wisdom that images are best processed through convolutional layers, proving that sequence-based attention mechanisms can effectively capture the intricate patterns, context, and semantics present in images. By breaking down images into manageable patches and leveraging self-attention, ViT captures both local and global relationships, enabling it to excel in diverse vision tasks, from image classification to object detection and beyond. In this article, we are going to break down how ViT for classification works under the hood.


The core idea of ViT is to treat an image as a sequence of fixed-size patches, which are then flattened and converted into 1D vectors. These patches are subsequently processed by a transformer encoder, which enables the model to capture global context and dependencies across the entire image. By dividing the image into patches, ViT effectively reduces the computational complexity of handling large images while retaining the ability to model complex spatial interactions.
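The patch-extraction step described above can be sketched in a few lines of plain PyTorch. This is an illustrative reconstruction, not the HuggingFace code itself (which, as we will see below, uses a convolution instead): a 224×224 RGB image is cut into non-overlapping 16×16 patches, and each patch is flattened into a 1D vector.

```python
import torch

# Illustrative sketch of ViT patchification (variable names are ours, not HF's).
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16

# unfold extracts non-overlapping 16x16 windows along height and width:
# (1, 3, 14, 14, 16, 16) -> 14 x 14 = 196 patches per image
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten each 3x16x16 patch into a 768-dimensional vector
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])
```

Note that a flattened patch happens to have 3 × 16 × 16 = 768 dimensions, the same as the model's hidden size, but the two are independent choices: the patches are still linearly projected into the embedding space.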

First of all, we import the ViT model for classification from the Hugging Face transformers library:

from transformers import ViTForImageClassification
import torch
import numpy as np

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

patch16-224 indicates that the model accepts images of size 224×224 and each patch has a width and height of 16 pixels.
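The patch arithmetic can be checked directly: a 224×224 image cut into 16×16 patches yields a 14×14 grid, so the transformer sees 196 patch tokens plus one [CLS] token.

```python
image_size, patch_size = 224, 16

num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
seq_len = num_patches + 1                      # plus the [CLS] token
print(num_patches, seq_len)  # 196 197
```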

This is what the model architecture looks like:

(vit): ViTModel(
  (embeddings): ViTEmbeddings(
    (patch_embeddings): PatchEmbeddings(
      (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    )
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (encoder): ViTEncoder(
    (layer): ModuleList(
      (0): ViTLayer(
        (attention): ViTAttention(
          (attention): ViTSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            ...
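The projection layer in the printout is worth a closer look: a Conv2d whose kernel size and stride both equal the patch size is mathematically the same as slicing the image into non-overlapping patches and applying one shared linear layer to each. A standalone sketch (using a fresh, randomly initialized layer rather than the pretrained weights):

```python
import torch
from torch import nn

# Same shape as the projection in the architecture printout above
projection = nn.Conv2d(3, 768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)
embeddings = projection(image)                      # (1, 768, 14, 14)
embeddings = embeddings.flatten(2).transpose(1, 2)  # (1, 196, 768)
print(embeddings.shape)  # torch.Size([1, 196, 768])
```

Each of the 196 positions in the output corresponds to one 16×16 patch, already projected into the 768-dimensional embedding space the encoder operates on.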
