Meta AI’s Another Revolutionary Large Scale Model — DINOv2 for Image Feature Extraction | by Gurami Keretchashvili | Jun, 2023

In this part, I will demonstrate how DINOv2 works in a real-world scenario by building a fine-grained image classification task.

Classification workflow:

  • Download the Food101 dataset from the torchvision datasets.
  • Extract features from the train and test datasets using the small DINOv2 model.
  • Train ML classifiers (SVM, XGBoost, and KNN) on the features extracted from the training dataset.
  • Make predictions on the features extracted from the test dataset.
  • Evaluate each ML model’s accuracy and F1 score.

Data: Food101 is a challenging dataset of 101 food categories with 101,000 images. Each class comes with 750 training images and 250 manually reviewed test images.
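
As a quick sanity check on these numbers, the split sizes and the chance-level accuracy can be worked out directly. This is only arithmetic on the figures quoted above; the variable names are illustrative. Because every class has the same number of images, random guessing would be right about 1% of the time, which is a useful baseline for the accuracies reported later.

```python
num_classes = 101
train_per_class = 750
test_per_class = 250

train_size = num_classes * train_per_class
test_size = num_classes * test_per_class
total = train_size + test_size

print(train_size, test_size, total)  # 75750 25250 101000

# With balanced classes, uniform random guessing is correct 1 time in 101
chance_accuracy = 1 / num_classes
print(f"{chance_accuracy:.4f}")  # 0.0099
```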

Model: small DINOv2 model (ViT-S/14 distilled)

ML models: SVM, XGBoost, KNN.

Step 1 — Set up (You can use Google Colab to run the code and turn GPU on)

import torch
import numpy as np
import torchvision
from torchvision import transforms
from torch.utils.data import Subset, DataLoader
import matplotlib.pyplot as plt
import time
import os
import random
from tqdm import tqdm

from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd

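def set_seed(no):
    # Seed every random number generator the notebook relies on
    torch.manual_seed(no)
    random.seed(no)
    np.random.seed(no)
    os.environ['PYTHONHASHSEED'] = str(no)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(0)  # the seed value is an arbitrary choice; any fixed integer works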
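
To see why a fixed seed matters, here is a minimal sketch that uses only Python's random module. Reseeding with the same value reproduces the same sample of indices, which is what makes a randomly drawn subset of a dataset repeatable across runs. The helper name sample_indices is hypothetical, and it uses a dedicated random.Random instance rather than the global generator that set_seed configures.

```python
import random

def sample_indices(n_items, k, seed):
    """Draw k distinct indices from range(n_items) with a dedicated, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)

first = sample_indices(75750, 5, seed=0)
second = sample_indices(75750, 5, seed=0)

# The same seed yields exactly the same indices on every run
print(first == second)
```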


Step 2 — Create the transformation, download the Food101 PyTorch datasets, and create the train and test DataLoader objects.

batch_size = 8

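transformation = transforms.Compose([
    # The resize and crop sizes are an assumption (standard ImageNet
    # preprocessing); 224 is divisible by the ViT patch size of 14.
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])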

trainset = torchvision.datasets.Food101(root='./data', split='train',
download=True, transform=transformation)

testset = torchvision.datasets.Food101(root='./data', split='test',
download=True, transform=transformation)

# train_indices = random.sample(range(len(trainset)), 20000)
# test_indices = random.sample(range(len(testset)), 5000)

# trainset = Subset(trainset, train_indices)
# testset = Subset(testset, test_indices)

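trainloader = DataLoader(trainset, batch_size=batch_size,
                         shuffle=True)   # shuffling is an assumed setting

testloader = DataLoader(testset, batch_size=batch_size,
                        shuffle=False)  # fixed test order (assumed setting)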

classes = trainset.classes

print(len(trainset), len(testset))
print(len(trainloader), len(testloader))

[out] 75750 25250

[out] 9469 3157
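
The second line of output follows from the first: a DataLoader keeps the final partial batch by default, so the number of batches is the dataset size divided by the batch size, rounded up. A minimal sketch of that arithmetic, with num_batches as an illustrative helper name:

```python
import math

def num_batches(num_samples, batch_size):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(num_samples / batch_size)

batch_size = 8
train_batches = num_batches(75750, batch_size)
test_batches = num_batches(25250, batch_size)
print(train_batches, test_batches)  # 9469 3157

# Size of the final, partial batch in each loader
print(75750 % batch_size, 25250 % batch_size)  # 6 2
```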

Step 3 (Optional) — Visualize training dataloader batch

# Get a batch of images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# Plot the images
fig, axes = plt.subplots(1, len(images), figsize=(12, 12))
for i, ax in enumerate(axes):
    # Convert the tensor image to numpy format
    image = images[i].numpy()
    image = image.transpose((1, 2, 0))  # Transpose to (height, width, channels)

    # Undo the normalization so the colors display correctly
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    denormalized_image = (image * std) + mean
    # Display the image (clip to the valid [0, 1] range for imshow)
    ax.imshow(np.clip(denormalized_image, 0, 1))
    ax.axis('off')
    ax.set_title(f'Label: {labels[i]}')

# Show the plot
plt.show()

Figure: a batch of training images with their labels.
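
The plotting code has to reverse the Normalize transform before displaying an image. As a small illustration, assuming pixel values in [0, 1] and the same mean and standard deviation as in the transformation, the round trip on a single RGB pixel looks like this (normalize and denormalize are illustrative helper names):

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize(pixel):
    """Apply (x - mean) / std to each RGB channel of a pixel in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(pixel, MEAN, STD)]

def denormalize(pixel):
    """Invert the normalization: x * std + mean for each channel."""
    return [x * s + m for x, m, s in zip(pixel, MEAN, STD)]

pixel = [0.8, 0.4, 0.1]
restored = denormalize(normalize(pixel))
print([round(v, 6) for v in restored])  # [0.8, 0.4, 0.1]
```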

Step 4 — Load the small DINOv2 model and extract features from the training and test dataloaders.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').to(device)
dinov2_vits14.eval()  # inference mode; the model is only used as a feature extractor

train_embeddings = []
train_labels = []

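with torch.no_grad():
    for data, labels in tqdm(trainloader):
        image_embeddings_batch = dinov2_vits14(data.to(device))

        # Move the batch results to the CPU and store them as NumPy arrays
        train_embeddings.append(image_embeddings_batch.detach().cpu().numpy())
        train_labels.append(labels.detach().cpu().numpy())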


test_embeddings = []
test_labels = []

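with torch.no_grad():
    for data, labels in tqdm(testloader):
        image_embeddings_batch = dinov2_vits14(data.to(device))

        # Move the batch results to the CPU and store them as NumPy arrays
        test_embeddings.append(image_embeddings_batch.detach().cpu().numpy())
        test_labels.append(labels.detach().cpu().numpy())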


# Concatenate the per-batch results
train_embeddings_f = np.vstack(train_embeddings)
train_labels_f = np.concatenate(train_labels).flatten()

test_embeddings_f = np.vstack(test_embeddings)
test_labels_f = np.concatenate(test_labels).flatten()

train_embeddings_f.shape, train_labels_f.shape, test_embeddings_f.shape, test_labels_f.shape

[out] ((75750, 384), (75750,), (25250, 384), (25250,))
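
The accumulate-then-stack pattern used here can be tried without a GPU or the real model. The sketch below replaces DINOv2 with a hypothetical stand-in, fake_encoder, that returns random 384-dimensional vectors; only the batching and stacking logic mirrors the step above, and the sample counts are made up for illustration.

```python
import numpy as np

EMBED_DIM = 384  # output size of the small DINOv2 model, as in the shapes above

def fake_encoder(batch, rng):
    """Hypothetical stand-in for the model: one random vector per input."""
    return rng.standard_normal((len(batch), EMBED_DIM)).astype(np.float32)

def extract(samples, labels, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    embeddings, collected_labels = [], []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        embeddings.append(fake_encoder(batch, rng))
        collected_labels.append(np.asarray(labels[start:start + batch_size]))
    # Stack per-batch arrays into one (N, D) matrix and one (N,) label vector
    return np.vstack(embeddings), np.concatenate(collected_labels).flatten()

samples = list(range(20))          # 20 dummy inputs
labels = [i % 3 for i in samples]  # 3 dummy classes
X, y = extract(samples, labels, batch_size=8)
print(X.shape, y.shape)  # (20, 384) (20,)
```

The last batch holds only 4 samples, yet np.vstack still produces a single matrix because every batch shares the same number of columns.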

Step 5 — Build a function for SVM, XGBoost and KNN classifiers.

def evaluate_classifiers(X_train, y_train, X_test, y_test):
    # Support Vector Machine (SVM)
    svm_classifier = SVC()
    svm_classifier.fit(X_train, y_train)
    svm_predictions = svm_classifier.predict(X_test)

    # XGBoost Classifier (tree_method='gpu_hist' requires a GPU; XGBoost 2.0+
    # expects tree_method='hist' together with device='cuda' instead)
    xgb_classifier = XGBClassifier(tree_method='gpu_hist')
    xgb_classifier.fit(X_train, y_train)
    xgb_predictions = xgb_classifier.predict(X_test)

    # K-Nearest Neighbors (KNN) Classifier
    knn_classifier = KNeighborsClassifier()
    knn_classifier.fit(X_train, y_train)
    knn_predictions = knn_classifier.predict(X_test)

    # Calculating Top-1 accuracy
    top1_svm = accuracy_score(y_test, svm_predictions)
    top1_xgb = accuracy_score(y_test, xgb_predictions)
    top1_knn = accuracy_score(y_test, knn_predictions)

    # Calculating F1 score (per-class F1 weighted by class support)
    f1_svm = f1_score(y_test, svm_predictions, average='weighted')
    f1_xgb = f1_score(y_test, xgb_predictions, average='weighted')
    f1_knn = f1_score(y_test, knn_predictions, average='weighted')

    return pd.DataFrame({
        'SVM': {'Top-1 Accuracy': top1_svm, 'F1 Score': f1_svm},
        'XGBoost': {'Top-1 Accuracy': top1_xgb, 'F1 Score': f1_xgb},
        'KNN': {'Top-1 Accuracy': top1_knn, 'F1 Score': f1_knn}
    })

X_train = train_embeddings_f  # Training data features
y_train = train_labels_f      # Training data labels
X_test = test_embeddings_f    # Test data features
y_test = test_labels_f        # Test data labels

results = evaluate_classifiers(X_train, y_train, X_test, y_test)
print(results)


Result of small DINOv2 + SVM/XGBoost/KNN (image by the author)
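
To make the two metrics concrete, here is a small hand-rolled sketch on toy labels, written in plain Python rather than with scikit-learn. It assumes the usual definitions: top-1 accuracy is the fraction of exact matches, and the weighted F1 score averages the per-class F1 values, each weighted by the number of true samples in that class. The toy labels are invented for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(round(acc, 4), round(f1, 4))  # 0.75 0.7417
```

On a balanced test set such as this one, the two numbers usually land close to each other, so a large gap between them would be worth investigating.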

Wow, the results are great! As shown, the SVM model trained on features extracted by the small DINOv2 model outperformed the other ML models and achieved almost 90% accuracy.

Even though we used the small DINOv2 model to extract features, the ML models (especially SVM) trained on those features performed very well on the fine-grained classification task. The SVM classifies images into 101 different classes with almost 90% accuracy.

The accuracy would likely improve with the base, large, or giant DINOv2 models. You just need to replace dinov2_vits14 in Step 4 with dinov2_vitb14, dinov2_vitl14, or dinov2_vitg14. Give it a try and feel free to share your accuracy results in the comment section 🙂
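
One practical consequence of swapping models is that the feature dimension changes, so the embedding matrices from Step 4 get wider. The lookup below is a sketch: the hub names come from the text above, while the embedding sizes other than 384 are taken from the standard ViT configurations rather than from this article, and pick_variant is a hypothetical helper.

```python
# Hub entry-point names mapped to their embedding sizes
DINOV2_VARIANTS = {
    "small": ("dinov2_vits14", 384),
    "base": ("dinov2_vitb14", 768),
    "large": ("dinov2_vitl14", 1024),
    "giant": ("dinov2_vitg14", 1536),
}

def pick_variant(size):
    """Return (hub_name, embedding_dim) for a model size keyword."""
    if size not in DINOV2_VARIANTS:
        raise ValueError(f"unknown size {size!r}; choose from {sorted(DINOV2_VARIANTS)}")
    return DINOV2_VARIANTS[size]

name, dim = pick_variant("base")
print(name, dim)  # dinov2_vitb14 768

# In Step 4 the model would then be loaded with:
#   torch.hub.load('facebookresearch/dinov2', name)
```

Larger variants also need more GPU memory and time per batch, so the batch size may have to be reduced.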
