0. Three partitions
Almost everything we do in machine learning is in service of avoiding overfitting. And one of the greatest tools in your arsenal to fight it is splitting your data into not two but three sets!
Cassie Kozyrkov, the head of Decision Intelligence at Google, says that data splitting is the most powerful idea in machine learning and you agree with her.
You are aware that overfitting can occur not only on the training set but also on the validation set. You’ve observed that using the same set for both hyperparameter tuning and final testing introduces subtle data leakage: by continuously tweaking hyperparameters based on the model’s performance on that specific set, you risk overfitting to that particular set.
So, you train your selected model using 50% of the available data. Then, you fine-tune and evaluate the model using a separate validation set containing 25% of the data. Finally, just when your baby model is ready to be deployed into the wild, you test it one final time using a completely untouched and pristine (I mean you haven’t even looked at the first five rows) test set.
With this rule in mind, you’ve saved this code snippet on your desktop to copy/paste any time you want:
from sklearn.model_selection import train_test_split

def split_dataset(data, target, train_size=0.5, random_state=42):
    # First cut: carve out the training set (50% by default)
    X_train, X_rest, y_train, y_rest = train_test_split(
        data, target, train_size=train_size, random_state=random_state
    )
    # Second cut: split the remaining data equally into validation and test sets
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test
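As a quick sanity check, here is a minimal sketch of the helper in action on a toy dataset (the array shapes and the 100-row size are illustrative assumptions, not part of the original snippet); with the default 50% train size, 100 rows should land as 50/25/25:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(data, target, train_size=0.5, random_state=42):
    # First cut: carve out the training set (50% by default)
    X_train, X_rest, y_train, y_rest = train_test_split(
        data, target, train_size=train_size, random_state=random_state
    )
    # Second cut: split the remaining data equally into validation and test sets
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data: 100 samples with 3 features each, plus a binary target
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

X_train, X_val, X_test, y_train, y_val, y_test = split_dataset(X, y)
print(len(X_train), len(X_val), len(X_test))  # 50 25 25
```

Fixing random_state in both calls keeps the three partitions reproducible across runs, so your "pristine" test set stays the same pristine test set every time.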
SYDD! If an unexamined life is not worth living, then here are the four words to live by: Split Your Damned Data. — Cassie Kozyrkov