Stop Hard Coding in a Data Science Project — Use Config Files Instead | by Khuyen Tran | May, 2023


This straightforward approach allows you to effortlessly retrieve the desired parameters.

Let’s say you are experimenting with different test_size values. Repeatedly opening your configuration file to modify test_size is time-consuming.


Luckily, Hydra makes it easy to override configuration values directly from the command line. This allows for quick adjustments and fine-tuning without modifying the underlying configuration files.

python src/process_data.py process.test_size=0.3
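
The dotted path in the override mirrors the nesting of the config file, so the key has to be spelled exactly as it appears there. As a rough sketch, assuming main.yaml holds the processing parameters directly (the values are taken from the sample output shown later in this article; the file layout itself is an assumption):

```yaml
# conf/main.yaml (assumed layout, before any config groups are introduced)
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.2   # replaced by 0.3 for a single run when process.test_size=0.3 is passed
```

The file on disk is left unchanged; the override applies only to that run.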

Imagine you want to experiment with various combinations of data processing methods and model hyperparameters. While you could manually edit the configuration file each time you run a new experiment, this approach can be time-consuming.


Hydra enables the composition of configurations from multiple sources with config groups. To create a config group for data processing, create a directory called process to hold a file for each processing method:

.
└── conf/
    ├── process/
    │   ├── process1.yaml
    │   └── process2.yaml
    └── main.yaml
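
Each file in the group holds one variant of the processing parameters. A minimal sketch of what process1.yaml might contain, using the values from the sample output later in this article and assuming Hydra 1.1 or later, where a group file's contents are placed under a key named after its directory:

```yaml
# conf/process/process1.yaml (sketch)
# No top-level `process:` key here: the directory name supplies it
# when the file is composed into the final config.
cols_to_drop:
  - free sulfur dioxide
feature: quality
test_size: 0.2
```

process2.yaml would use the same keys with different values.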

If you want to use the process1.yaml file by default, add it to Hydra’s defaults list in main.yaml.
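
As a sketch, the defaults list in main.yaml could look like the following (the `_self_` entry is optional here and is included only to make the merge order explicit):

```yaml
# conf/main.yaml
defaults:
  - process: process1   # load conf/process/process1.yaml under the `process` key
  - _self_              # then merge any keys defined in main.yaml itself
```

Passing process=process2 on the command line swaps in the other file for a single run.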


Follow the same procedures to create a config group for training hyperparameters:

.
└── conf/
    ├── process/
    │   ├── process1.yaml
    │   └── process2.yaml
    ├── train/
    │   ├── train1.yaml
    │   └── train2.yaml
    └── main.yaml
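
For illustration, train1.yaml might look like this, with values taken from the sample output later in this article (the exact file contents are an assumption):

```yaml
# conf/train/train1.yaml (sketch)
hyperparameters:
  svm__kernel:
    - rbf
  svm__C:
    - 0.1
    - 1
```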

Set train1 as the default config file by adding it to the defaults list as well.
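
With both groups in place, main.yaml could be sketched as:

```yaml
# conf/main.yaml
defaults:
  - process: process1   # conf/process/process1.yaml
  - train: train1       # conf/train/train1.yaml
  - _self_              # optional: keys defined in main.yaml merge last
```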


Running the application now uses the parameters in process1.yaml and train1.yaml by default:

$ python src/process.py --help

process:
  cols_to_drop:
  - free sulfur dioxide
  feature: quality
  test_size: 0.2
train:
  hyperparameters:
    svm__kernel:
    - rbf
    svm__C:
    - 0.1
    - 1

This capability is particularly useful when different configuration files need to be combined seamlessly.

Suppose you want to conduct experiments with multiple processing methods. Applying each configuration one by one can be a time-consuming task.

$ python src/process.py process=process1 # wait for this to finish

$ python src/process.py process=process2 # then run the application with another config

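Luckily, Hydra’s multirun mode lets you launch the same application with several configurations from a single command. With the default launcher the runs execute one after another, so you no longer have to wait for each run and start the next by hand.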

$ python src/process.py --multirun process=process1,process2
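
If you run the same sweep often, it can also be stored in the config instead of typed on the command line. The following is a sketch that assumes Hydra 1.2 or later, where the `hydra.mode` and `hydra.sweeper.params` keys are available; check the documentation for your installed version before relying on it:

```yaml
# conf/main.yaml (sketch; assumes Hydra 1.2 or later)
defaults:
  - process: process1
  - train: train1
  - _self_

hydra:
  mode: MULTIRUN                     # same effect as passing --multirun
  sweeper:
    params:
      process: process1,process2     # one run per listed option
```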

This approach streamlines the process of running an application with various parameters, ultimately saving valuable time and effort.

Congratulations! You have just learned why configuration files matter and how to create them using Hydra. I hope this article gives you the knowledge needed to create your own configuration files.


