Okay, You’ve Trained the Best Machine Learning Model. What’s Next? | by Albers Uzila | Jun, 2023


Table of Contents
· Initialize a Repository
· Migrate Your Codebase
config/config.py
config/args.json
tagolym/utils.py
tagolym/data.py
tagolym/train.py
tagolym/predict.py
tagolym/evaluate.py
tagolym/main.py
· Package Your Codebase
· Setup Data Source Credential
· Run Your Pipeline
· Miscellaneous
· Push Your Project to GitHub
· Wrapping Up

Let’s say you’re building a data science project, maybe for work, college, a portfolio, or a hobby. You’ve spent your days solving a problem statement and experimenting with Jupyter notebooks. Now, you’re wondering, “How do I deploy my work as a useful product?”

To be concrete, assume you have a website that hosts forums. Users can add tags to a thread in a forum to make it easier to navigate between threads on different topics. You want to improve the user experience by suggesting predefined tags, which give context to what the discussion is about.

The forum can be anything, so let’s be more specific; it always starts with a post explaining a math problem, followed by thoughts, questions, hints, or answers around it. Below is what a thread looks like and its 3 tags, i.e. induction, combinatorics unsolved, and combinatorics.

An example of a thread in a forum | Image by author

At this point, you’ve done everything in your notebooks, from understanding the problem statement, defining metrics, querying data, cleaning it, preprocessing, EDA, building a model, to evaluating and optimizing the model.

You notice there are many posts and a huge number of distinct tags overall. To simplify, you keep only 10 of them. The models you develop are simple linear classifiers (SVM, logistic regression, etc.) preceded by TF-IDF vectorization, and you train them with Stochastic Gradient Descent (SGD).

First 30 frequent tag count | Image by author
Final tag distribution. Notice that geometry is the most common one | Image by author

While notebooks are great and can help you conduct experiments very fast, they’re not production-ready and are sometimes hard to maintain. So, you need to migrate your code into individual Python files, and then add other utilities step by step while collaborating with your team members.

This story will guide you through exactly that in simple, bite-sized steps. Before that, you might want to refresh your memory of linear models, TF-IDF, and SGD.

First, let’s create a new repo on GitHub called tagolym-ml, complete with README.md, .gitignore, and LICENSE.

Creating a new GitHub repository | Image by author

To work with the repo, do these steps:

  1. Clone the repo and a folder named tagolym-ml will be created.
  2. Change the working directory into this folder.
  3. Build a virtual environment called venv.
  4. Activate the environment.
  5. Upgrade pip.
  6. Optionally, check the packages currently installed in your environment using pip list. There will be only pip and setuptools.
  7. Create a new git branch called code_migration and switch to it.
  8. Create a file setup.py.
  9. Make some new folders named config, tagolym, and credentials.
  10. Create files config.py and args.json inside the config folder.
  11. Create files main.py, utils.py, data.py, train.py, evaluate.py, and predict.py inside the tagolym folder.

If you don’t understand how to do those, don’t worry. Here are all the commands you need that you can run on your favorite terminal:

$ git clone https://github.com/dwiuzila/tagolym-ml.git
$ cd tagolym-ml
$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install --upgrade pip
$ pip list
Package Version
---------- -------
pip 23.1.2
setuptools 58.0.4
$ git checkout -b code_migration
$ touch setup.py
$ mkdir config tagolym credentials
$ touch config/config.py config/args.json
$ cd tagolym
$ touch main.py utils.py data.py train.py evaluate.py predict.py
$ cd ..

You now have a local git repo connected to the remote repo on GitHub. The local repo directories will look like this.

config/
├── args.json - preprocessing/training parameters
└── config.py - configuration setup
credentials/ - keys and passwords
tagolym/
├── data.py - data processing components
├── evaluate.py - evaluation components
├── main.py - training/optimization pipelines
├── predict.py - inference components
├── train.py - training components
└── utils.py - supplementary utilities
venv/ - virtual environment
.gitignore - files/folders that git will ignore
LICENSE - project license
README.md - longform description of the project
setup.py - code packaging

Almost all of these files are empty for now. You’ll fill them in one by one, starting with the folder config.

There are two main folders within your project, i.e. config and tagolym. You want to copy the necessary code from your notebooks into files inside these folders. Let’s do it.

config/config.py

Here, you define global variables related to seeds, directories, experiment tracking, preprocessing, and label names.

When this file is imported somewhere in your code, it will make two new folders if not yet created:

  1. data, in which you store labeled data for the project,
  2. stores/model, in which you store the model registry,

and then connect stores/model to a tracking URI for experiment tracking with MLflow.
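
As a rough sketch of the directory part, something like the following would do. The helper name make_project_dirs and the temporary base folder are assumptions made for this demo; in a real config.py the base would be the repo root, and the resulting URI is the kind of value you would pass to mlflow.set_tracking_uri.

```python
from pathlib import Path
import tempfile


def make_project_dirs(base_dir: Path) -> dict:
    """Create data/ and stores/model/ under base_dir if they are missing."""
    dirs = {
        "config": Path(base_dir, "config"),
        "data": Path(base_dir, "data"),
        "model": Path(base_dir, "stores", "model"),
    }
    for name in ("data", "model"):
        # parents=True also builds stores/; exist_ok makes repeated imports safe
        dirs[name].mkdir(parents=True, exist_ok=True)
    return dirs


# Assumption: a temp folder stands in for the repo root to keep the demo tidy.
base = Path(tempfile.mkdtemp())
dirs = make_project_dirs(base)

# A file:// URI like this is what you would hand to mlflow.set_tracking_uri(...)
tracking_uri = dirs["model"].absolute().as_uri()
print(tracking_uri)
```

Creating the folders at import time means every other module can rely on them existing without checking first.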

You also define stopwords and additional command words here. The stopwords will be the default from the nltk package, while the command words are ["prove", "let", "find", "show", "given"], which often come up in posts but don’t give any useful signal.

Regex patterns are used for preprocessing. They look intimidating, but you don’t need to understand them. What they basically do is catch any mathematical expressions and asymptote syntax in LaTeX, which are the bread and butter in a math problem post.
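
To give a feel for what this preprocessing does, here is a deliberately simplified sketch. The tiny stopword set, the single regex, and the function name clean_post are all assumptions for illustration; the real patterns in config.py are more thorough and the real stopwords come from nltk.

```python
import re

# Illustrative values only: the article takes its stopwords from nltk.
STOPWORDS = {"a", "an", "the", "that", "of", "for", "be", "is", "all", "such"}
COMMAND_WORDS = {"prove", "let", "find", "show", "given"}

# Simplified assumption: match $$...$$ display math or $...$ inline math.
MATH_PATTERN = re.compile(r"\$\$.*?\$\$|\$.*?\$", flags=re.DOTALL)


def clean_post(text: str, nocommand: bool = True) -> str:
    """Lowercase, then drop LaTeX math, punctuation, stopwords, command words."""
    text = MATH_PATTERN.sub(" ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    drop = STOPWORDS | COMMAND_WORDS if nocommand else STOPWORDS
    return " ".join(word for word in text.split() if word not in drop)


print(clean_post("Prove that $a^2 + b^2 \\geq 2ab$ for all real numbers $a$, $b$."))
# real numbers
```

Notice how little text survives once the math and the command words are gone, which is exactly why short posts deserve their own look during evaluation.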

Lastly, remember you selected only 10 shortlisted labels to work with? You list all of them in this file. Some tags have a similar meaning to your labels (e.g. “inequalities” → “inequality”), so you also have 10 partial labels to catch these tags and replace them with appropriate labels. See tagolym/data.py below for how you do this.
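
Here is a minimal sketch of the idea, assuming a partial label is simply a substring that a raw tag must contain. The three pairs and the name map_tags are hypothetical; the project itself works with 10 labels and 10 partial labels.

```python
# Hypothetical subset for illustration; the project defines 10 such pairs.
PARTIAL_TO_LABEL = {
    "inequalit": "inequality",
    "combinatori": "combinatorics",
    "geometr": "geometry",
}


def map_tags(tags: list[str]) -> list[str]:
    """Replace each raw tag by the label whose partial form it contains."""
    labels = []
    for tag in tags:
        for partial, label in PARTIAL_TO_LABEL.items():
            if partial in tag.lower() and label not in labels:
                labels.append(label)
    return labels


print(map_tags(["induction", "combinatorics unsolved", "combinatorics"]))
# ['combinatorics']
print(map_tags(["Inequalities", "geometric inequality"]))
# ['inequality', 'geometry']
```

Tags that match no partial label, like induction in this reduced mapping, are simply dropped.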

config/args.json

This is where you store all initial arguments for the entire process. They are coming from different parts of the pipeline.

What do they mean?

  1. nocommand and stem — booleans for preprocessing the posts, whether to exclude command words and to implement word stemming.
  2. ngram_max_range — the upper boundary of the range of n values for different n-grams to be extracted during TF-IDF vectorization.
  3. loss, l1_ratio, alpha, learning_rate, eta0, power_t — hyperparameters for models using the SGD classifier.
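
For orientation, the file could look roughly like what the snippet below prints. Only the key names come from the list above; every value is a placeholder I made up, not the project’s actual starting point. Also note that the run log later in this story prints the n-gram key as ngram_max.

```python
import json

# Placeholder starting values; only the key names come from the article.
args = {
    "nocommand": True,
    "stem": True,
    "ngram_max_range": 2,
    "loss": "log_loss",
    "l1_ratio": 0.15,
    "alpha": 0.0001,
    "learning_rate": "optimal",
    "eta0": 0.1,
    "power_t": 0.5,
}

# This is the text you would store in config/args.json.
print(json.dumps(args, indent=4))
```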

tagolym/utils.py

The pipeline is a bit convoluted, so you need some utility functions and Python classes to streamline your code. This file contains those:

  1. load_dict and save_dict — to load a dictionary from a JSON file and, the other way around, dump a dictionary into a JSON file.
  2. NumpyEncoder — to encode objects with Numpy instances into Python built-in instances, used in save_dict.
  3. IterativeStratification — when you’re dealing with multilabel classification like in this project, the vanilla train-test-split method is not ideal for the data. Instead, you need what we call iterative stratification, which aims to provide a well-balanced distribution of evidence of label relations up to a given order. In this project, the order is set to 2.
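
The first two utilities are small enough to sketch. This version is an assumption about their shape rather than the project’s exact code: the encoder checks for a tolist method, which NumPy arrays and scalars both have, so the demo runs without importing numpy. FakeArray is a stand-in invented for the demo.

```python
import json
import tempfile
from pathlib import Path


class NumpyEncoder(json.JSONEncoder):
    """Convert NumPy scalars and arrays to built-in Python types."""

    def default(self, obj):
        # NumPy arrays and scalars both expose tolist(); checking for the
        # method keeps this sketch free of a numpy import.
        if hasattr(obj, "tolist"):
            return obj.tolist()
        return super().default(obj)


def save_dict(d: dict, filepath, cls=NumpyEncoder, sortkeys: bool = False) -> None:
    """Dump a dictionary into a JSON file."""
    with open(filepath, "w") as fp:
        json.dump(d, fp, indent=2, cls=cls, sort_keys=sortkeys)


def load_dict(filepath) -> dict:
    """Load a dictionary from a JSON file."""
    with open(filepath) as fp:
        return json.load(fp)


class FakeArray:
    """Hypothetical stand-in for a NumPy array, used only in this demo."""

    def tolist(self):
        return [0.59, 0.79]


path = Path(tempfile.mkdtemp(), "args_demo.json")
save_dict({"stem": True, "threshold": FakeArray()}, path)
print(load_dict(path))  # {'stem': True, 'threshold': [0.59, 0.79]}
```

Without a custom encoder, json.dump raises a TypeError as soon as it meets an array, which is easy to hit when you save tuned thresholds or metrics.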

tagolym/data.py

All functions regarding data are written in this file, including data split, preprocessing, and transformation.

  1. preprocess — creates a mapping from tags containing partial labels to one of the 10 labels defined in config/config.py, then does text processing on all posts and labels. This function also drops all samples with an empty post after text processing.
  2. binarize — based on model requirements, you may want to binarize your labels if you’re working on a multilabel classification problem. This function converts labels into a binary matrix of size (#samples × #labels) indicating the presence of a tag in the labels. For example, the label containing two tags ["algebra", "inequality"] will be transformed into [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]. Besides returning the transformed labels, it also returns the MultiLabelBinarizer object used later, especially for converting the matrix back to labels.
  3. split_data — using IterativeStratification from tagolym/utils.py, this function splits the posts and labels into 3 parts with 70/15/15 proportions, each respectively for model training, validation, and testing.
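
To make binarize concrete, here is a pure-Python stand-in for what scikit-learn’s MultiLabelBinarizer does with fit_transform and inverse_transform. The 6-label list is a hypothetical subset, so the vectors are shorter than the 10-column example above.

```python
# Hypothetical 6-label subset in alphabetical order; the project uses 10 labels.
LABELS = ["algebra", "combinatorics", "function", "geometry",
          "inequality", "number theory"]


def binarize(labels: list[list[str]]) -> list[list[int]]:
    """One row per sample, one 0/1 column per label."""
    return [[int(label in tags) for label in LABELS] for tags in labels]


def to_tags(matrix: list[list[int]]) -> list[list[str]]:
    """Inverse step: turn a 0/1 matrix back into lists of tags."""
    return [[LABELS[j] for j, flag in enumerate(row) if flag] for row in matrix]


y = binarize([["algebra", "inequality"], ["geometry"]])
print(y)           # [[1, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0]]
print(to_tags(y))  # [['algebra', 'inequality'], ['geometry']]
```

Keeping the fitted binarizer around matters because the column order is the only link between a predicted matrix and human-readable tags.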

tagolym/train.py

It’s a best practice to have separate files for model training, validation, and testing. As the file name suggests, you do all the training here. Since you want users to use the model’s tag recommendations confidently, you want to have low false positives.

On the other hand, false negatives are not your top priority for now. To see why, let’s take an extreme example: the model predicts all 10 labels as negatives, hence no tags are recommended and you have a high number of false negatives. But then, the users can just create their own tags without hesitation. No big deal.

So, your objective will be to have a model with high precision.

Now, let’s discuss what this file has to offer:

  1. train — preprocess the data, binarize the labels, and split the data using functions from tagolym/data.py. Then, initialize a model, train it, predict the labels on all three splits using the trained model, and evaluate the predictions. This function accepts args which contains all arguments in config/args.json, to which an additional argument threshold may be added before being returned. Basically, threshold is a list of the best threshold for every label calculated by tune_threshold.
  2. objective — f1 score is a metric chosen to be optimized in hyperparameter tuning. Using an args chosen in a trial, this function trains the model and returns the f1 score of the validation split. It also sets additional attributes to the trial, including precision, recall, and the f1 score of all three splits.
  3. tune_threshold — the default decision boundary for a binary classification problem is 0.5, which may not be optimal depending on the problem. So, besides tuning args, you also tune the threshold for every label while optimizing the f1 score. What it does is try all possible values of the threshold from a grid of 0 to 1 and pick the one that has the maximum f1 score.
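
The threshold search in the last item can be sketched in a few lines of plain Python. I’m assuming a grid with step 0.01, a "probability >= threshold" rule, and that ties go to the smallest threshold; the project’s own implementation may differ on any of these.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 for one label column."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, y_prob) -> list[float]:
    """For each label column, return the grid threshold with the best F1."""
    grid = [i / 100 for i in range(101)]  # assumption: 0.00, 0.01, ..., 1.00
    best = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        scores = {}
        for th in grid:
            pred = [int(row[j] >= th) for row in y_prob]
            scores[th] = f1_score(truth, pred)
        best.append(max(scores, key=scores.get))  # first maximum wins ties
    return best


y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_prob = [[0.9, 0.2], [0.6, 0.7], [0.4, 0.8], [0.1, 0.3]]
print(tune_threshold(y_true, y_prob))  # [0.41, 0.31]
```

Since the objective here is high precision, breaking ties toward a higher threshold would be a reasonable variation to try.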

tagolym/predict.py

What next after model training? Predict! There are two functions in this file:

  1. custom_predict — if the model has a predict_proba attribute, this function will predict the probability of each label being a tag. Otherwise, it predicts the tags directly using a 0.5 threshold. In the former case, if the true labels are given, the function will use them to tune the threshold using tune_threshold from tagolym/train.py.
  2. predict — load args, the label binarizer, and the trained model. Then, preprocess given texts and do predictions on them using custom_predict. After that, transform the prediction matrix back into tags.
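
The branching in custom_predict could look like the sketch below. The two toy model classes are made up for the demo, and to keep it short this version takes ready-made thresholds instead of tuning them from true labels.

```python
class ProbModel:
    """Toy model exposing predict_proba, like a log-loss classifier."""

    def predict_proba(self, X):
        return [[0.9, 0.2], [0.4, 0.8]]


class HardModel:
    """Toy model with predict only, like a hinge-loss classifier."""

    def predict(self, X):
        return [[1, 0], [0, 1]]


def custom_predict(X, model, threshold=None):
    """Return a 0/1 matrix, using per-label thresholds when probabilities exist."""
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X)
        # Assumption: fall back to 0.5 for every label if no thresholds are given.
        threshold = threshold or [0.5] * len(y_prob[0])
        return [[int(p >= t) for p, t in zip(row, threshold)] for row in y_prob]
    return model.predict(X)


X = ["post one", "post two"]
print(custom_predict(X, ProbModel()))                         # [[1, 0], [0, 1]]
print(custom_predict(X, ProbModel(), threshold=[0.95, 0.1]))  # [[0, 1], [0, 1]]
print(custom_predict(X, HardModel()))                         # [[1, 0], [0, 1]]
```

Checking for the attribute rather than the model type keeps the function working whichever loss the hyperparameter search ends up choosing.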

tagolym/evaluate.py

Given prediction and true label matrices, the purpose of this file is to calculate the precision, recall, f1 score, and number of samples. The performance is computed on the overall samples, per-class samples, and per-slice samples. There are 8 slices you consider:

  1. short posts, i.e. those that have fewer than 5 words after preprocessing,
  2. six slices in which the posts are tagged as a subtopic but not tagged as the bigger topic covering the subtopic, and
  3. posts that don’t have frequent four-letter-or-more words.
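
Here is a small sketch of overall versus per-slice evaluation for the short-post slice only. The micro-averaging and the function names are my assumptions; the project may average differently and build its slices with a dedicated library.

```python
def prf(y_true, y_pred) -> dict:
    """Micro-averaged precision, recall and F1 over all label cells."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "num_samples": len(y_true)}


def short_post(post: str) -> bool:
    """Slice rule: fewer than 5 words after preprocessing."""
    return len(post.split()) < 5


def evaluate(posts, y_true, y_pred) -> dict:
    mask = [short_post(post) for post in posts]
    sliced_true = [y for y, m in zip(y_true, mask) if m]
    sliced_pred = [y for y, m in zip(y_pred, mask) if m]
    return {"overall": prf(y_true, y_pred),
            "slices": {"short_post": prf(sliced_true, sliced_pred)}}


posts = ["triangle circumcircle incenter midpoint lines meet",
         "functions positive reals", "prime number sequence"]
y_true = [[0, 1], [1, 0], [1, 1]]
y_pred = [[0, 1], [1, 1], [1, 0]]
performance = evaluate(posts, y_true, y_pred)
print(performance["overall"]["precision"])                 # 0.75
print(performance["slices"]["short_post"]["num_samples"])  # 2
```

A slice that scores clearly below the overall numbers tells you where the model needs more data or better features.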

tagolym/main.py

This is the main file that runs everything end-to-end. Here are 5 functions you have in this file and what instructions you should write in them:

  1. elt_data — query labeled data and save it to the data folder in JSON format.
  2. train_model — load labeled data from the data folder and train your model. Don’t forget to log the metrics, artifacts, and parameters using MLflow. Also save the MLflow run_id and metrics to the config folder.
  3. optimize — load labeled data from the data folder and optimize the given arguments. For search efficiency, the optimization is done in two steps: a) hyperparameters in preprocessing, vectorization, and modeling; and b) hyperparameters in the learning algorithm. Also save the best arguments based on the objective to the config folder as args_opt.json.
  4. load_artifacts — load the artifacts of a specific run_id into memory, including arguments, metrics, model, and label binarizer.
  5. predict_tag — given a specific run_id, predict the tags of every text it receives using preloaded artifacts.
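
The two-step structure of optimize is easier to see in code. The sketch below swaps in plain random search and a made-up scoring function (toy_objective) so that it runs with the standard library only; it is not the project’s search code, and the sampling ranges are guesses.

```python
import random


def toy_objective(args: dict) -> float:
    """Hypothetical stand-in for 'train the model, return validation f1'."""
    score = 0.70
    score += 0.03 if args["stem"] else 0.0
    score += 0.02 if args["loss"] == "log_loss" else 0.0
    score += 0.01 if args["learning_rate"] == "optimal" else 0.0
    return score


def sample_step_one(rng) -> dict:
    """Step a: preprocessing, vectorization, and model hyperparameters."""
    return {
        "nocommand": rng.choice([True, False]),
        "stem": rng.choice([True, False]),
        "ngram_max": rng.randint(2, 4),
        "loss": rng.choice(["hinge", "log_loss", "modified_huber"]),
        "l1_ratio": rng.uniform(0.0, 1.0),
        "alpha": 10 ** rng.uniform(-5, -2),
    }


def sample_step_two(rng) -> dict:
    """Step b: learning-algorithm hyperparameters, some of them conditional."""
    params = {"learning_rate": rng.choice(
        ["constant", "optimal", "invscaling", "adaptive"])}
    if params["learning_rate"] != "optimal":
        params["eta0"] = 10 ** rng.uniform(-2, 0)
    if params["learning_rate"] == "invscaling":
        params["power_t"] = rng.uniform(0.1, 0.5)
    return params


def search(sampler, fixed: dict, num_trials: int, rng):
    """Try num_trials samples on top of the fixed arguments, keep the best."""
    best_args, best_score = None, float("-inf")
    for _ in range(num_trials):
        args = {**fixed, **sampler(rng)}
        score = toy_objective(args)
        if score > best_score:
            best_args, best_score = args, score
    return best_args, best_score


def optimize(initial_args: dict, num_trials: int = 10, seed: int = 0):
    rng = random.Random(seed)
    best, _ = search(sample_step_one, initial_args, num_trials, rng)
    # Step b starts from the best arguments found in step a.
    return search(sample_step_two, best, num_trials, rng)


initial = {"learning_rate": "optimal", "eta0": 0.1, "power_t": 0.5}
best_args, best_score = optimize(initial)
print(best_score, best_args["learning_rate"])
```

Splitting the search this way keeps each step’s search space small, at the cost of assuming the best learning-rate schedule doesn’t depend much on the choices made in the first step.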

Phew! You just did all the migration needed. Now, how do you use this code?

Photo by Jason Strull on Unsplash

When you used notebooks, you had a preloaded set of packages for experimentation. To reproduce it locally and deploy it to production, you want to define your environment explicitly instead.

You import many open-source packages in this project but only have pip and setuptools in your environment. So, before running the pipeline, you need to install those packages too.

Below is a handy command to do that. Notice I added pip-chill at the end to clean up the generation of the package requirements file later.

$ pip install mlflow nltk regex scikit-learn snorkel joblib optuna pandas google-cloud-bigquery google-auth numpy scipy pip-chill

What’s cool about pip-chill is that it lets you generate a requirements file that leaves out any package that is only a dependency of another package in the file, keeping the requirements file clean and minimal. Let’s just run it.

$ pip-chill --no-chill > requirements.txt

This will create a file requirements.txt containing all packages you actually need. Note that there are no pandas, scikit-learn, regex, and several other packages since these are dependencies of packages already listed in the file.

Now you’ll use setup.py to package your codebase, wrapping up all dependencies. Inside this file, load all libraries you have in requirements.txt, and define your package using the setup function from setuptools.

Your package name will be tagolym. You can see other details like its version and description in the code below. Libraries you load from requirements.txt will be used in the install_requires parameter and become dependencies of tagolym.
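
As a hedged sketch of what setup.py does with the requirements, the helper below builds the keyword arguments you would pass to setuptools’ setup function. The helper names, the version, the description, and the packages list are placeholders of mine, not the project’s real metadata.

```python
def parse_requirements(text: str) -> list[str]:
    """Keep non-empty lines that are not comments."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line and not line.startswith("#")]


def build_setup_kwargs(requirements_text: str) -> dict:
    """Collect the arguments for setuptools.setup()."""
    return {
        "name": "tagolym",
        "version": "0.1",  # placeholder
        "description": "Tag suggestions for math forum posts.",  # placeholder
        "packages": ["tagolym"],  # assumption about the package layout
        "install_requires": parse_requirements(requirements_text),
    }


# In setup.py you would read requirements.txt and finish with:
#     from pathlib import Path
#     from setuptools import setup
#     setup(**build_setup_kwargs(Path("requirements.txt").read_text()))

demo_requirements = "mlflow\n# a comment line\n\noptuna\n"
print(build_setup_kwargs(demo_requirements)["install_requires"])
# ['mlflow', 'optuna']
```

Reading install_requires from requirements.txt keeps a single source of truth for dependencies.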

You can then install tagolym using this command below. A new folder named tagolym.egg-info will be created which contains the project’s metadata.

$ python3 -m pip install -e .

Notice that the -e or --editable flag installs a package in editable mode from a local project path. In other words, if you use some functions in the current working directory, e.g. using from tagolym import main, and then make some changes to tagolym/main.py, you will be able to use the updated version without re-installing your package using pip install.

There’s one small problem. The data used in this project is my own data which is available in my BigQuery. After creating and downloading a service account key, I rename it to bigquery-key.json, and place it in the credentials folder.

To access the data, you’d need my credential, which unfortunately is not to be shared. But worry not, I’ll provide samples for you to work with.

Creating a service account key | Image by author

What you need to do is simple: download the samples labeled_data.json here and save the file in a folder named data in the working directory.

Now you’re ready! Type the python3 command in the terminal and you’re good to run everything in Python. You’ll only use the tagolym/main.py file.

First, I query the data using my credential and the elt_data function. When I see ✅ Saved data!, I know that the process ran smoothly. As mentioned above, you can skip this step and manually put the samples I provided in the data folder.

Then, you optimize the model using the optimize function, which reads the initial arguments from config/args.json. I set the number of trials to 10, but you can try something else. A new MLflow study will be created with 20 trials in total since you have a two-step optimization process. The best validation f1 score found is 0.7730.

With a set of optimized arguments config/args_opt.json, you train the model once again using the train_model function and do inference on a list of texts using the predict_tag function. You can see below that the predictions are spot on!

$ python3
>>> from pathlib import Path
>>> from config import config
>>> from tagolym import main
>>>
>>> # query data
>>> key_path = "credentials/bigquery-key.json"
>>> main.elt_data(key_path)
✅ Saved data!
>>>
>>> # optimize model
>>> args_fp = Path(config.CONFIG_DIR, "args.json")
>>> main.optimize(args_fp, study_name="optimization", num_trials=10)
2023/06/03 17:42:12 INFO mlflow.tracking.fluent: Experiment with name 'optimization' does not exist. Creating a new experiment.
[I 2023-06-03 17:41:45,657] A new study created in memory with name: optimization
[I 2023-06-03 17:42:12,343] Trial 0 finished with value: 0.7519199358796977 and parameters: {'nocommand': False, 'stem': True, 'ngram_max': 2, 'loss': 'modified_huber', 'l1_ratio': 0.6011150117432088, 'alpha': 0.001331121608073689}. Best is trial 0 with value: 0.7519199358796977.
[I 2023-06-03 17:42:38,441] Trial 1 finished with value: 0.7629559140596291 and parameters: {'nocommand': False, 'stem': True, 'ngram_max': 2, 'loss': 'modified_huber', 'l1_ratio': 0.43194501864211576, 'alpha': 7.476312062252303e-05}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:42:57,713] Trial 2 finished with value: 0.7511576441724478 and parameters: {'nocommand': True, 'stem': False, 'ngram_max': 3, 'loss': 'hinge', 'l1_ratio': 0.5924145688620425, 'alpha': 1.3783237455007187e-05}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:43:19,108] Trial 3 finished with value: 0.7106573336158825 and parameters: {'nocommand': True, 'stem': False, 'ngram_max': 4, 'loss': 'hinge', 'l1_ratio': 0.6842330265121569, 'alpha': 0.00020914981329035596}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:43:37,349] Trial 4 finished with value: 0.741392879377292 and parameters: {'nocommand': False, 'stem': False, 'ngram_max': 2, 'loss': 'hinge', 'l1_ratio': 0.5467102793432796, 'alpha': 3.585612610345396e-05}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:44:04,235] Trial 5 finished with value: 0.7426444422157734 and parameters: {'nocommand': True, 'stem': True, 'ngram_max': 3, 'loss': 'hinge', 'l1_ratio': 0.045227288910538066, 'alpha': 9.46217535646148e-05}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:44:30,104] Trial 6 finished with value: 0.7337258988967691 and parameters: {'nocommand': True, 'stem': True, 'ngram_max': 2, 'loss': 'modified_huber', 'l1_ratio': 0.07455064367977082, 'alpha': 0.009133995846860976}. Best is trial 1 with value: 0.7629559140596291.
[I 2023-06-03 17:44:51,778] Trial 7 finished with value: 0.7700323704566581 and parameters: {'nocommand': True, 'stem': False, 'ngram_max': 4, 'loss': 'log_loss', 'l1_ratio': 0.3584657285442726, 'alpha': 2.2264204303769678e-05}. Best is trial 7 with value: 0.7700323704566581.
[I 2023-06-03 17:45:18,125] Trial 8 finished with value: 0.7559495178348377 and parameters: {'nocommand': True, 'stem': True, 'ngram_max': 2, 'loss': 'log_loss', 'l1_ratio': 0.8872127425763265, 'alpha': 0.00026100256506134784}. Best is trial 7 with value: 0.7700323704566581.
[I 2023-06-03 17:45:47,029] Trial 9 finished with value: 0.7730089901544794 and parameters: {'nocommand': False, 'stem': True, 'ngram_max': 4, 'loss': 'log_loss', 'l1_ratio': 0.02541912674409519, 'alpha': 2.1070472806578224e-05}. Best is trial 9 with value: 0.7730089901544794.
[I 2023-06-03 17:45:47,056] A new study created in memory with name: optimization
[I 2023-06-03 17:46:16,061] Trial 0 finished with value: 0.7730089901544794 and parameters: {'learning_rate': 'optimal'}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:46:48,008] Trial 1 finished with value: 0.7701884982320516 and parameters: {'learning_rate': 'adaptive', 'eta0': 0.15930522616241014}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:47:18,651] Trial 2 finished with value: 0.7331091235928242 and parameters: {'learning_rate': 'invscaling', 'eta0': 0.0265875439832727, 'power_t': 0.17272998688284025}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:47:49,429] Trial 3 finished with value: 0.7196639813595901 and parameters: {'learning_rate': 'invscaling', 'eta0': 0.038234752246751866, 'power_t': 0.34474115788895177}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:48:21,601] Trial 4 finished with value: 0.7727673901952036 and parameters: {'learning_rate': 'adaptive', 'eta0': 0.3718364180573207}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:48:51,330] Trial 5 finished with value: 0.7576010292654753 and parameters: {'learning_rate': 'invscaling', 'eta0': 0.16409286730647918, 'power_t': 0.16820964947491662}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:49:21,906] Trial 6 finished with value: 0.7428637006524251 and parameters: {'learning_rate': 'invscaling', 'eta0': 0.040665633135147955, 'power_t': 0.13906884560255356}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:49:52,034] Trial 7 finished with value: 0.746701310091385 and parameters: {'learning_rate': 'constant', 'eta0': 0.011715937392307063}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:50:21,383] Trial 8 finished with value: 0.7683160697730758 and parameters: {'learning_rate': 'constant', 'eta0': 0.10968217207529521}. Best is trial 0 with value: 0.7730089901544794.
[I 2023-06-03 17:50:51,373] Trial 9 finished with value: 0.7338062675694838 and parameters: {'learning_rate': 'invscaling', 'eta0': 0.7568292060167615, 'power_t': 0.4579309401710595}. Best is trial 0 with value: 0.7730089901544794.
Best value (f1): 0.7730089901544794
Best hyperparameters: {
"nocommand": false,
"stem": true,
"ngram_max": 4,
"loss": "log_loss",
"l1_ratio": 0.02541912674409519,
"alpha": 2.1070472806578224e-05,
"learning_rate": "invscaling",
"eta0": 0.7568292060167615,
"power_t": 0.4579309401710595,
"threshold": [
0.59,
0.79,
0.55,
0.7000000000000001,
0.5,
0.72,
0.76,
0.63,
0.7000000000000001,
0.77
]
}
>>>
>>> # train model
>>> args_fp = Path(config.CONFIG_DIR, "args_opt.json")
>>> main.train_model(args_fp, experiment_name="baselines", run_name="sgd")
2023/06/03 17:52:01 INFO mlflow.tracking.fluent: Experiment with name 'baselines' does not exist. Creating a new experiment.
Run ID: fbdba0c7cab640bc853611ba6cd75cee
>>> text = [
... "Let $c,d \geq 2$ be naturals. Let $\{a_n\}$ be the sequence satisfying $a_1 = c, a_{n+1} = a_n^d + c$ for $n = 1,2,\cdots$.Prove that for any $n \geq 2$, there exists a prime number $p$ such that $p|a_n$ and $p not | a_i$ for $i = 1,2,\cdots n-1$.",
... "Let $ABC$ be a triangle with circumcircle $\Gamma$ and incenter $I$ and let $M$ be the midpoint of $\overline{BC}$. The points $D$, $E$, $F$ are selected on sides $\overline{BC}$, $\overline{CA}$, $\overline{AB}$ such that $\overline{ID} \perp \overline{BC}$, $\overline{IE}\perp \overline{AI}$, and $\overline{IF}\perp \overline{AI}$. Suppose that the circumcircle of $triangle AEF$ intersects $\Gamma$ at a point $X$ other than $A$. Prove that lines $XD$ and $AM$ meet on $\Gamma$.",
... "Find all functions $f:(0,\infty)rightarrow (0,\infty)$ such that for any $x,y\in (0,\infty)$, $$xf(x^2)f(f(y)) + f(yf(x)) = f(xy) \left(f(f(x^2)) + f(f(y^2))right).$$",
... "Let $n$ be an even positive integer. We say that two different cells of a $n times n$ board are [b]neighboring[/b] if they have a common side. Find the minimal number of cells on the $n times n$ board that must be marked so that any cell (marked or not marked) has a marked neighboring cell."
... ]
>>> main.predict_tag(text=text)
[
{
"input_text": "Let $c,d \geq 2$ be naturals. Let $\{a_n\}$ be the sequence satisfying $a_1 = c, a_{n+1} = a_n^d + c$ for $n = 1,2,\cdots$.Prove that for any $n \geq 2$, there exists a prime number $p$ such that $p|a_n$ and $p not | a_i$ for $i = 1,2,\cdots n-1$.",
"predicted_tags": [
"number theory"
]
},
{
"input_text": "Let $ABC$ be a triangle with circumcircle $\Gamma$ and incenter $I$ and let $M$ be the midpoint of $\overline{BC}$. The points $D$, $E$, $F$ are selected on sides $\overline{BC}$, $\overline{CA}$, $\overline{AB}$ such that $\overline{ID} \perp \overline{BC}$, $\overline{IE}\perp \overline{AI}$, and $\overline{IF}\perp \overline{AI}$. Suppose that the circumcircle of $triangle AEF$ intersects $\Gamma$ at a point $X$ other than $A$. Prove that lines $XD$ and $AM$ meet on $\Gamma$.",
"predicted_tags": [
"geometry"
]
},
{
"input_text": "Find all functions $f:(0,\infty)rightarrow (0,\infty)$ such that for any $x,y\in (0,\infty)$, $$xf(x^2)f(f(y)) + f(yf(x)) = f(xy) \left(f(f(x^2)) + f(f(y^2))right).$$",
"predicted_tags": [
"algebra",
"function"
]
},
{
"input_text": "Let $n$ be an even positive integer. We say that two different cells of a $n times n$ board are [b]neighboring[/b] if they have a common side. Find the minimal number of cells on the $n times n$ board that must be marked so that any cell (marked or not marked) has a marked neighboring cell.",
"predicted_tags": [
"combinatorics"
]
}
]
>>> exit()

You can explore your experiments in a beautiful MLflow UI:

$ mlflow ui --backend-store-uri stores/model


