How to Send SLURM Jobs to a Cluster | by François Porcher | Aug, 2023


A tutorial on how to send SLURM jobs to a cluster, especially for deep learning and data science


So you are used to training deep learning models on the free GPUs of Google Colab, but you are ready to level up and harness the power of a cluster, and you have no idea how to do that? You’re in the right place! 🚀

During my research internship in neuroscience at Cambridge University, I was training large models for computer vision tasks, and the free GPUs provided by Google were not enough, so I decided to use the local cluster.

However, very little documentation was available. I had to ask other people for their scripts, work out what they did, and piece together the parts that were useful to me. I have now compiled everything necessary to run basic Python scripts. This guide is the one I wish I had during my time there.

Let’s say you want to train a bird classifier with 500 different classes and high-resolution pictures, something that would never run on Google Colab.

The very first thing you need to do is make sure your deep learning training script is ready. This script should contain the code for loading your dataset, defining your neural network architecture, and setting up the training loop.

You should be able to run this script from your terminal.

For example, if you have a script called train_bird_classifier.py, you should be able to run it with:

python train_bird_classifier.py
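For that command to do anything, the file needs an entry point. Here is a minimal sketch of one, using only the standard library. The `--data-dir` and `--epochs` flags and their defaults are hypothetical and only there for illustration; they are not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (hypothetical flags, for illustration only)."""
    parser = argparse.ArgumentParser(description="Train a bird classifier.")
    parser.add_argument("--data-dir", default="data/train/",
                        help="folder containing the training images")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of passes over the training set")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The real training code would go here; this sketch only reports its settings.
    message = f"Training for {args.epochs} epochs on {args.data_dir}"
    print(message)
    return message


# This guard is what makes `python train_bird_classifier.py` call main().
if __name__ == "__main__":
    main()
```

With flags like these you could change a setting from the command line without editing the file, for example `python train_bird_classifier.py --epochs 3`.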

The training script itself could look something like this:

# train_bird_classifier.py

import torch
from torch.utils.data import DataLoader

# Assuming necessary functions, models, and transformations are defined in various files.
from utils import build_model, BirdDataset, collate_fn, train_model
from transformations import train_transforms, test_transforms

def main():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Dataset and DataLoader setup
    train_dataset = BirdDataset('data/train/', transform=train_transforms)
    train_loader =…
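To make the three ingredients (dataset, model, training loop) concrete, here is a self-contained sketch that runs without any image files. It is not the author's script: random tensors stand in for the bird pictures, a single linear layer stands in for the real network, and every name and hyperparameter below is an assumption chosen to keep the example small.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Run one pass over the loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)


def main(num_classes=5, epochs=2):
    # Use the GPU when one is visible, otherwise fall back to the CPU.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    torch.manual_seed(0)

    # Synthetic stand-in for the dataset: 64 random 3x32x32 "pictures".
    images = torch.randn(64, 3, 32, 32)
    labels = torch.randint(0, num_classes, (64,))
    loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

    # Toy model (an assumption): flatten each image and apply one linear layer.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    losses = []
    for epoch in range(epochs):
        loss = train_one_epoch(model, loader, optimizer, loss_fn, device)
        print(f"epoch {epoch + 1}: mean loss {loss:.4f}")
        losses.append(loss)
    return losses


if __name__ == "__main__":
    main()
```

Because the device is picked at runtime, the same file runs on a laptop CPU and on a cluster GPU node, which makes it easy to check the script locally before sending it anywhere.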


