Managing Deep Learning Models Easily With TOML Configurations | by Shubham Panchal | Jun, 2023


You may never need those long CLI args for your train.py


Managing deep learning models can be difficult due to the huge number of parameters and settings needed across modules. The training module might need parameters like batch_size or num_epochs, or parameters for the learning rate scheduler. Similarly, the data preprocessing module might need train_test_split or parameters for image augmentation.

A naive approach to managing these parameters is to pass them as CLI arguments while running the scripts. Command-line arguments can be tedious to enter, and keeping all parameters in one place becomes impossible. TOML files provide a cleaner way to manage configurations, and scripts can load the necessary parts of the configuration as a Python dict without any boilerplate code to read/parse command-line arguments.

In this blog, we’ll explore the use of TOML in configuration files and how we can efficiently use them across training/deployment scripts.

TOML, short for Tom's Obvious, Minimal Language, is a file format designed specifically for configuration files. The concept of a TOML file is quite similar to YAML/YML files, which can store key-value pairs in a tree-like hierarchy. An advantage of TOML over YAML is its readability, which becomes important when there are multiple nested levels.

Fig.1. The same model configurations written in TOML (left) and YAML (right). TOML allows us to write key-value pairs at the same indentation level regardless of the hierarchy.
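In case the figure does not render, here is a small hand-written illustration of the same idea (the setting names are my own example, not from the figure): TOML keeps nested keys at one indentation level via table headers, while YAML expresses nesting through indentation.

```toml
[train]
num_epochs = 10

[train.scheduler]
factor = 0.1
```

```yaml
train:
  num_epochs: 10
  scheduler:
    factor: 0.1
```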

Personally, apart from the enhanced readability, I find no practical reason to prefer TOML over YAML. Using YAML is absolutely fine; PyYAML is a Python package for parsing YAML.

There are clear advantages to using TOML for storing model/data/deployment configurations for ML models:

Managing all configurations in a single file: With TOML files, we can create multiple groups of settings that are required by different modules. For instance, in figure 1, the settings related to the model's training procedure are nested under the [train] table; similarly, the port and host required for deploying the model are stored under [deploy]. We need not jump between train.py and deploy.py to change their parameters; instead, we can centralize all settings in a single TOML configuration file.

This could be super helpful if we’re training the model on a virtual machine, where code-editors or IDEs are not available for editing files. A single config file is easy to edit with vim or nano available on most VMs.

To read the configuration from a TOML file, we can use two Python packages, toml and munch. toml reads the TOML file and returns its contents as a Python dict. munch converts the dict to enable attribute-style access of its elements. For instance, instead of writing config["train"]["num_epochs"], we can just write config.train.num_epochs, which enhances readability.

Consider the following file structure,

- config.py
- train.py
- project_config.toml

project_config.toml contains the configuration for our ML project, like,

[data]
vocab_size = 5589
seq_length = 10
test_split = 0.3
data_path = "dataset/"
data_tensors_path = "data_tensors/"

[model]
embedding_dim = 256
num_blocks = 5
num_heads_in_block = 3

[train]
num_epochs = 10
batch_size = 32
learning_rate = 0.001
checkpoint_path = "auto"

In config.py, we create a function which returns the munchified version of this configuration, using toml and munch. First, install the packages,

$> pip install toml munch

Then, in config.py,

import toml
import munch

def load_global_config(filepath: str = "project_config.toml"):
    return munch.munchify(toml.load(filepath))

def save_global_config(new_config, filepath: str = "project_config.toml"):
    with open(filepath, "w") as file:
        toml.dump(new_config, file)

Now, in any of our project files, like train.py or predict.py, we can load this configuration,

from config import load_global_config

config = load_global_config()

batch_size = config.train.batch_size
lr = config.train.learning_rate

if config.train.checkpoint_path == "auto":
    # Make a directory named with the current timestamp
    pass
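The "auto" branch above could be filled in like this. This is only a sketch; the make_checkpoint_dir helper and the timestamp format are my own choices, not from the original project:

```python
import os
from datetime import datetime

def make_checkpoint_dir(base: str = "checkpoints") -> str:
    # Name the run directory with the current timestamp,
    # e.g. checkpoints/2023-06-01_12-30-05
    run_dir = os.path.join(base, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
    os.makedirs(run_dir, exist_ok=True)
    return run_dir
```

Each training run then gets a fresh, sortable directory without any manual bookkeeping.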

The output of print( toml.load( filepath ) ) is,

{'data': {'data_path': 'dataset/',
          'data_tensors_path': 'data_tensors/',
          'seq_length': 10,
          'test_split': 0.3,
          'vocab_size': 5589},
 'model': {'embedding_dim': 256, 'num_blocks': 5, 'num_heads_in_block': 3},
 'train': {'batch_size': 32,
           'checkpoint_path': 'auto',
           'learning_rate': 0.001,
           'num_epochs': 10}}

If you’re using MLOps tools like W&B Tracking or MLflow, maintaining the configuration as a dict is helpful, as we can pass it directly as an argument.
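Some trackers expect a flat parameter dict rather than a nested one. A small helper can flatten the nested configuration before logging; this function is my own sketch, not part of any tracking library:

```python
def flatten_config(config: dict, parent_key: str = "") -> dict:
    # Turn {'train': {'batch_size': 32}} into {'train.batch_size': 32}
    flat = {}
    for key, value in config.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_config(value, full_key))
        else:
            flat[full_key] = value
    return flat
```

The dotted keys keep the grouping from the TOML tables visible in the tracker's UI.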

Hope you will consider using TOML configurations in your next ML project! It’s a clean way of managing settings that are global or local to your training, deployment, or inference scripts.

Instead of writing long CLI arguments, the scripts can load the configuration directly from the TOML file. If we wish to train two versions of a model with different hyperparameters, we just need to point config.py at a different TOML file. I have started using TOML files in my recent projects and experimentation has become faster. MLOps tools can also manage versions of a model along with their configurations, but the simplicity of the approach discussed above is unique and requires minimal changes to existing projects.

Hope you’ve enjoyed reading. Have a nice day ahead!


