Sklearn Pipelines for the Modern ML Engineer: 9 Techniques You Can’t Ignore | by Bex T. | May, 2023


7. Inside other objects

Even though a pipeline contains a variety of transformers, at the end of the day, it is an estimator:

from sklearn.base import BaseEstimator

isinstance(my_pipe, BaseEstimator)
True

This means it can be used anywhere a stand-alone estimator can. For example, pipelines are often passed to cross-validation functions to guard against data leakage. Because the whole pipeline is re-fit on each training fold, the preprocessing steps never see the validation rows:

from sklearn.model_selection import cross_validate

results = cross_validate(
    full_pipeline_clf,
    X,
    y,
    cv=5,
    n_jobs=-1,
    # "neg_log_loss" is the scorer name for log loss; it needs predict_proba
    scoring=["accuracy", "neg_log_loss"],
)
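To see the pattern run end to end, here is a minimal sketch on synthetic data. The names X_demo, y_demo, demo_pipe and demo_results are hypothetical, and LogisticRegression stands in for whatever classifier full_pipeline_clf ends with; it is chosen only because it supports predict_proba, which the log loss scorer needs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical two-step pipeline: scaling followed by a classifier.
demo_pipe = Pipeline(
    [("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# The scaler is re-fit on the training portion of every fold,
# so validation rows never influence the scaling statistics.
demo_results = cross_validate(
    demo_pipe,
    X_demo,
    y_demo,
    cv=5,
    scoring=["accuracy", "neg_log_loss"],
)

# With multiple scorers, each one gets its own "test_<name>" array.
print(sorted(demo_results))
print(demo_results["test_accuracy"].mean())
```

The returned dictionary holds one array per scorer (plus fit and score times), each with one entry per fold.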

Or into hyperparameter tuners such as HalvingGridSearchCV (for the same reasons):

# HalvingGridSearchCV is experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Define the pipeline with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", num_pipe, num_cols),
        ("categorical", cat_pipe, cat_cols),
    ]
)

pipe = Pipeline(
    [("preprocessor", preprocessor), ("classifier", SVC())]
)

param_grid = {
    "preprocessor__numeric__with_mean": [True, False],
    "preprocessor__categorical__min_frequency": [2, 4, 6],
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

search = HalvingGridSearchCV(
    pipe, param_grid, cv=5, factor=2, random_state=42
)
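The search object above is only constructed, not fitted. As a minimal sketch of running one, the example below assumes synthetic NumPy data and places a bare StandardScaler and OneHotEncoder directly in the ColumnTransformer instead of num_pipe and cat_pipe. Every demo_ name, the column indices, and the reduced grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic data: two numeric columns and one integer-coded categorical column.
rng = np.random.default_rng(0)
n_rows = 400
X_demo = np.column_stack(
    [
        rng.normal(40, 10, n_rows),          # column 0: numeric
        rng.normal(50_000, 15_000, n_rows),  # column 1: numeric
        rng.integers(0, 3, n_rows),          # column 2: category code
    ]
)
y_demo = (X_demo[:, 0] + rng.normal(0, 5, n_rows) > 40).astype(int)

demo_preprocessor = ColumnTransformer(
    [
        ("numeric", StandardScaler(), [0, 1]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), [2]),
    ]
)
demo_pipe = Pipeline(
    [("preprocessor", demo_preprocessor), ("classifier", SVC())]
)

# A smaller grid than the one above, to keep the sketch quick to run.
demo_grid = {
    "preprocessor__numeric__with_mean": [True, False],
    "classifier__C": [0.1, 1, 10],
}

demo_search = HalvingGridSearchCV(
    demo_pipe, demo_grid, cv=3, factor=2, random_state=42
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

After fitting, best_params_ uses the same double-underscore keys as the grid, and best_estimator_ is the whole pipeline refit with those settings.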

At this point, I want to draw your attention to the definition of the parameter grid. Take a look at how it is defined:

param_grid = {
    "preprocessor__numeric__with_mean": [True, False],
    "preprocessor__categorical__min_frequency": [2, 4, 6],
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

The first parameter, with_mean, of StandardScaler serves as an example of a nested parameter. It is preceded by two specifiers: preprocessor and numeric, separated by double underscores.

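Nested parameters follow the <step_name>__<parameter> syntax, and the step-name prefix is repeated once for every level of nesting. In this case, with_mean is two levels deep: the ColumnTransformer is registered in the outer pipeline as preprocessor, and the numeric transformer is registered inside it as numeric, resulting in preprocessor__numeric__with_mean. Note that this key only resolves if the numeric entry is the StandardScaler itself. If num_pipe is a Pipeline that wraps the scaler, the scaler's step name has to be added as a third level, in the form preprocessor__numeric__<scaler_step>__with_mean.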

By writing nested parameters in this syntax, you can optimize not only for the parameters of the model but also for the parameters of the inner transformers themselves.
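If you are unsure of the exact key for a nested parameter, get_params() lists every valid one. The sketch below is an assumption-laden illustration: demo_num_pipe is a hypothetical inner pipeline with a step named scaler, and the column indices are placeholders. It shows how the extra level of nesting appears in the key.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical inner pipeline: the scaler sits one level deeper
# than it would as a bare transformer.
demo_num_pipe = Pipeline([("scaler", StandardScaler())])

demo_pipe = Pipeline(
    [
        (
            "preprocessor",
            ColumnTransformer(
                [
                    ("numeric", demo_num_pipe, [0, 1]),
                    ("categorical", OneHotEncoder(), [2]),
                ]
            ),
        ),
        ("classifier", SVC()),
    ]
)

# Every key returned here is a valid name for set_params or a param grid.
valid_keys = demo_pipe.get_params().keys()
with_mean_keys = [key for key in valid_keys if key.endswith("with_mean")]
print(with_mean_keys)

# The same key works with set_params, which is what the tuners call internally.
demo_pipe.set_params(preprocessor__numeric__scaler__with_mean=False)
```

Filtering the keys this way is a quick check before launching a long search, since a misspelled or too-shallow key raises an error only once the tuner tries to set it.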


