The power of `sklearn` lies in the fact that it is a consistent library. What does that mean? Well, it means that it has:

- **Easy integration with external DL/ML libraries**. It can be integrated with external ML/DL libraries such as `Keras`, `TensorFlow`, and `XGBoost`, just to name a few. This compatibility lets us combine `sklearn`’s features with the features of these libraries, making the development phase easy and painless for us Data Professionals.
- **Simplicity of usage**. `sklearn` maintains the same unified interface across the various ML algorithms it provides. This way we don’t need to learn new syntax or patterns: we just stick to the “same rules” and focus on developing our ML algorithms. Can you imagine dealing with a different syntax for each algorithm while tackling complex ML problems? Well, I can’t (and I’m glad I don’t have to!).
- **Well documented**. `sklearn` provides extensive documentation, including tutorials and usage examples. The documentation covers every aspect of the library, from data preprocessing to model evaluation. You can refer to it here.
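To make the "same rules" point concrete, here is a minimal sketch. The dataset (iris) and the three estimators are my own picks for illustration, not something prescribed by the library: the point is only that unrelated algorithms are trained and scored with identical calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, chosen only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three unrelated algorithms, one calling convention: fit, then score
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same method for every estimator
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```

Swapping in another classifier should only require adding one entry to the dictionary.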

The `sklearn` library provides a very wide range of ML algorithms, ranging from Supervised to Unsupervised Learning, and implements both regression and classification models for the supervised case.

This wide choice makes `sklearn` the perfect library for ML for us Data Scientists: one environment for all the models we need in our everyday job.

We just need to choose the most suitable algorithm for our specific problem domain, without worrying about integrations and stuff like that: a great advantage!

Data preprocessing and feature engineering are important parts of an ML problem because they involve preparing the data for the ML algorithms. Here are the features provided by `sklearn` for these tasks:

## Data preprocessing

- **Dealing with missing data**. If you have practiced ML for a while, I’m sure you find dealing with missing data somewhat painful. Or, at least, I do. `sklearn` gives us different tools for it: a) the `SimpleImputer()` class, which imputes missing values using different strategies such as the mean, the median, the mode, or a constant value. The imputation can be applied to specific columns or to the entire dataset. b) The `IterativeImputer()` class, which uses ML models to impute missing values iteratively: it fills in missing values using the predictions from a chosen model (note that it is still marked as experimental and must be enabled explicitly before importing it). c) The `MissingIndicator()` class, which creates binary indicators for missing values in the dataset. It adds new features that flag whether a value was originally missing or not, providing additional information to the model.
- **Feature scaling**. Features with different scales can hurt the performance of many ML algorithms. Luckily for us, `sklearn` offers some methods to scale the features, like: a) standardization (`StandardScaler()`), which scales the features to have zero mean and unit variance, and b) normalization (`MinMaxScaler()`), which scales the features to a specific range ([0, 1] by default).
- **Encoding categorical variables**. Many ML algorithms cannot directly handle categorical variables. For this reason, `sklearn` provides a couple of encoding techniques: a) one-hot encoding (`OneHotEncoder()`) and b) ordinal encoding (`OrdinalEncoder()`), which convert categorical variables into numerical representations. This makes it easy to take data in the form of strings and transform it into numbers that we can feed to the ML algorithms.
- **Handling outliers**. Outliers are another general pain for Data Scientists, as they affect the performance of Machine Learning models and so need particular treatment. `sklearn` offers some techniques for outlier detection, such as the Local Outlier Factor algorithm (`LocalOutlierFactor()`) and the Isolation Forest algorithm (`IsolationForest()`). Note that these are both unsupervised ML algorithms.
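As a hedged sketch of how these pieces can fit together, the snippet below imputes, scales, and one-hot encodes a tiny made-up data frame, then flags an outlier in a separate toy array. The column names, the values, and the choice of `ColumnTransformer` to route columns are my assumptions for illustration, not the only way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): two numeric columns with gaps, one categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "income": [30_000.0, 42_000.0, np.nan, 58_000.0],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

# Numeric columns: fill gaps with the median, then standardize
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Route each group of columns to its own transformer;
# sparse_threshold=0 asks for a dense NumPy array as output
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + one column per city

# Outlier detection on a separate toy array: -1 marks points flagged as outliers
points = np.array([[0.0], [0.1], [0.2], [0.1], [10.0]])
labels = LocalOutlierFactor(n_neighbors=2).fit_predict(points)
print(labels)
```

Note that the detector only labels points: dropping them (for example with `points[labels == 1]`) remains our decision.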

## Feature engineering

- **Feature extraction**. `sklearn` provides different methods for extracting meaningful features from our data. As we know, not all the features we have are meaningful for our ML problems, and as a general rule of thumb we’d better work only with the meaningful ones. For feature extraction from text, `sklearn` provides: a) the `CountVectorizer()` class, which converts text documents into a matrix of token counts, and b) the `TfidfVectorizer()` class, which calculates the Term Frequency-Inverse Document Frequency (TF-IDF) to represent text data.
- **Dimensionality reduction**. `sklearn` offers various dimensionality reduction techniques that are useful to reduce the number of features while preserving essential information. Two popular methods are: a) Principal Component Analysis (`PCA()`) and b) t-Distributed Stochastic Neighbor Embedding (`TSNE()`).
- **Feature selection**. Finally, `sklearn` also provides multiple methods for feature selection, such as: a) univariate feature selection (`SelectKBest()`), b) recursive feature elimination (`RFE()`), and c) selection based on feature importance, for example from tree-based models (`SelectFromModel()`).
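Here is a small sketch touching one tool from each of the three groups above. The three documents are made up, the iris dataset is just a convenient stand-in, and the ANOVA F-score (`f_classif`) is one possible scoring function for `SelectKBest`, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Feature extraction: text documents -> numeric matrices
docs = ["the cat sat", "the dog sat", "the cat ran"]  # made-up documents
counts = CountVectorizer().fit_transform(docs)  # token counts per document
tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF weights per document
print(counts.shape, tfidf.shape)

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Dimensionality reduction: project the 4 features onto 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 original features with the highest F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
print(X_pca.shape, X_best.shape, selector.get_support())
```

The difference between the last two steps is worth noticing: `PCA` builds new features as combinations of the old ones, while `SelectKBest` keeps a subset of the original columns untouched.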

Model evaluation and validation are crucial activities in ML. As Data Scientists, our role is to find the model that best fits the data; but what does this actually mean?

It means that we need to train several ML models and find the one that makes the best predictions (or classifications) on unseen data. So we first train and evaluate the models on the training data, and then validate them on new, unseen data. `sklearn` provides different features to help us in these fundamental tasks:

- **Cross-validation techniques**. Cross-validation is an important technique for model evaluation that helps us detect overfitting. It works by partitioning the data into multiple subsets, training the model on different combinations of these subsets, and averaging the results to estimate the model’s performance. `sklearn` provides the following cross-validation methods: a) k-fold cross-validation (`KFold`), b) stratified k-fold cross-validation (`StratifiedKFold`), and c) leave-one-out cross-validation (`LeaveOneOut`).
- **Evaluation metrics**. `sklearn` offers a wide range of evaluation metrics that allow us to quantify the performance of classification and regression models. For classification tasks we can name, among others, accuracy, precision, recall, and ROC AUC. Regression tasks can instead be evaluated using metrics such as the mean squared error (MSE), the mean absolute error (MAE), and R². If you want to go deeper into how to use them (and all the main metrics `sklearn` provides for both cases), here’s the definitive guide to classification metrics, and here’s the definitive guide to regression metrics.
- **Hyperparameter tuning**. Model evaluation and validation are closely related to hyperparameter tuning, which means selecting the combination of hyperparameters that gives the best performance of an ML algorithm. `sklearn` provides tools like `GridSearchCV` and `RandomizedSearchCV` for hyperparameter tuning using cross-validation. `GridSearchCV` exhaustively tries every combination in a grid we define, while `RandomizedSearchCV` samples a fixed number of combinations; both return the set of hyperparameters that performed best according to the chosen evaluation metric.
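A minimal sketch of the two ideas combined, assuming the iris dataset and a KNN classifier as stand-ins (the grid values are arbitrary choices of mine): first a plain cross-validated score, then a grid search that reuses the same folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Stratified 5-fold split; the fixed seed makes the folds reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated accuracy of a KNN classifier with default hyperparameters
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search over a small, arbitrary grid, evaluated on the same folds
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Since the default KNN configuration is part of the grid, the best grid score can only match or improve on the first number.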

One of the superpowers of `sklearn` is the integration with the Python ecosystem for Data Science. This allows Data Scientists to leverage the power of `sklearn` with libraries and tools for data manipulation, visualization, and analysis. These are some examples:

- **Pandas and NumPy**. Because `sklearn` integrates with `Pandas` data frames, users can pass data frames directly to `sklearn`’s functions and classes. This makes it convenient to preprocess and clean data with `Pandas` before feeding it into `sklearn` models. `sklearn` also relies heavily on `NumPy` arrays for its data representation, and it directly accepts `NumPy` arrays as input for training.
- **Matplotlib and Seaborn**. `sklearn` provides built-in support for plotting metrics and model evaluation results using `Matplotlib`. This lets users visualize model performance, feature importance, and other relevant statistics, helping us understand our ML models. Also, since `Seaborn` is a data visualization library built on top of `Matplotlib`, `sklearn` works well with it too, allowing users to combine the power of `sklearn`’s ML algorithms with `Seaborn`’s aesthetic plots for exploratory data analysis and model evaluation.

In this final section, I want to highlight some usage examples. If you have some experience with `sklearn`, I know you know how to use it. But please consider that the point here is not to show how these features work: it’s to underline how the workflow fits together, based on what we said in the previous sections. Because, again: we often use tools just because everyone does, without asking any questions.

## Example #1: integration with Pandas and NumPy

Let’s see the integration between `sklearn`, `Pandas`, and `NumPy`. For the sake of simplicity, we can use a prompt for ChatGPT.

The prompt could be the following: “*Create a Pandas data frame for regression analysis and make it with 5 features. Then, normalize the data, and split it into train and test sets. Fit the test set with a linear regression model. Print R² for both train and test sets*”.

And we get:

Certainly! I will create a sample Pandas data frame with five features, normalize the data, split it into train and test sets, fit a linear regression model, and print the R² scores for both the train and test sets. Here’s an example implementation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Generating random features with at least 50 occurrences each
np.random.seed(42)
feature1 = np.random.randint(0, 10, size=100)
feature2 = np.random.randint(0, 10, size=100)
feature3 = np.random.randint(0, 10, size=100)
feature4 = np.random.randint(0, 10, size=100)
feature5 = np.random.randint(0, 10, size=100)
target = np.random.randint(0, 100, size=100)

# Creating a Pandas data frame
data = {
    'Feature1': feature1,
    'Feature2': feature2,
    'Feature3': feature3,
    'Feature4': feature4,
    'Feature5': feature5,
    'Target': target
}
df = pd.DataFrame(data)

# Splitting the data into features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Normalizing the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Splitting the normalized data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fitting the linear regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Predicting the target variable for both train and test sets
y_train_pred = reg_model.predict(X_train)
y_test_pred = reg_model.predict(X_test)

# Calculating R² scores for train and test sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Printing the R² scores
print("R² score for train set:", r2_train)
print("R² score for test set:", r2_test)
```

In this example, I created a Pandas data frame with five features (`Feature1` to `Feature5`) and one target variable (`Target`). Then, I used the `StandardScaler` from scikit-learn to normalize the feature data. After that, I split the normalized data into train and test sets using an 80:20 ratio. Next, I trained a linear regression model on the train set and predicted the target variables for both the train and test sets. Finally, I calculated and printed the R² scores for both the train and test sets.

So, here’s what happened on a workflow level:

- Creation of the features and labels with `NumPy`.
- The features and labels became a `Pandas` data frame.
- The `Pandas` data frame became `X` and `y`: the data frame of the features and the `Pandas` series of the label, respectively.
- `X` has been scaled, and then `X` and `y` have been split into train and test sets. Finally, a linear regression model makes predictions and is evaluated using the R² metric on both the train and test sets.

So, as we can see, we have a straightforward workflow that starts from `NumPy` and ends with `sklearn`, giving us the needed results. No “strange manipulations” or integrations are needed.
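One detail of the generated code is worth a second look: the scaler is fitted on the whole dataset before the split, so statistics from the test rows leak into the preprocessing. Here is a hedged variant of the same workflow using `make_pipeline`, so that the scaler only sees the train set. The target formula is my own assumption: I made it depend on two of the features (plus some noise) so that R² has something to measure, whereas in the example above the target is purely random.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Five integer features in a Pandas data frame, as in the example above
X = pd.DataFrame(
    rng.integers(0, 10, size=(100, 5)),
    columns=[f"Feature{i}" for i in range(1, 6)],
)
# Assumption for illustration: the target depends on two features plus noise
y = 2.0 * X["Feature1"] - X["Feature3"] + rng.normal(scale=0.5, size=100)

# Split first, so the test rows never influence the preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the train set only, then the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# For regressors, score() returns R²
print("R² score for train set:", model.score(X_train, y_train))
print("R² score for test set:", model.score(X_test, y_test))
```

The workflow is the same as before (`NumPy`, then `Pandas`, then `sklearn`), with the scaling and the model wrapped in a single object.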

## Example #2: integration with Matplotlib

To show the integration between `sklearn` and `Matplotlib` we make use, again, of ChatGPT with this prompt: “*Create a binary classification problem in Python with data generated in NumPy. Standardize and split the data into train and test sets. Fit the train set with a random forest and KNN classifiers. Print ROC curve for the train set*.”

And we get:

Sure! Let’s create a binary classification problem using randomly generated data in Python, and then we’ll standardize and split the data into training and testing sets. We’ll fit the train set using random forest and K-nearest neighbors (KNN) classifiers, and finally, we’ll plot the ROC curve for the train set.

Here’s the code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# Generate random data
num_samples = 1000
num_features = 10

# Generate features (X)
X = np.random.randn(num_samples, num_features)

# Generate labels (y)
y = np.random.randint(2, size=num_samples)

# Standardize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the train set using random forest classifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

# Fit the train set using K-nearest neighbors (KNN) classifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

# Predict probabilities for the positive class (class 1)
y_train_rf_probs = rf_clf.predict_proba(X_train)[:, 1]
y_train_knn_probs = knn_clf.predict_proba(X_train)[:, 1]

# Calculate the false positive rate (FPR), true positive rate (TPR), and thresholds for the ROC curve
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_train, y_train_rf_probs)
knn_fpr, knn_tpr, knn_thresholds = roc_curve(y_train, y_train_knn_probs)

# Calculate the AUC score for the ROC curve
rf_auc = roc_auc_score(y_train, y_train_rf_probs)
knn_auc = roc_auc_score(y_train, y_train_knn_probs)

# Plot the ROC curve
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.2f})")
plt.plot(knn_fpr, knn_tpr, label=f"KNN (AUC = {knn_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line for random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Train Set')
plt.legend(loc='lower right')
plt.show()
```

This code generates random data with 10 features and 1000 samples. It then standardizes the features, splits the data into train and test sets, fits the train set using random forest and KNN classifiers, predicts the probabilities for the positive class, calculates the ROC curve and AUC score for each classifier, and finally, plots the ROC curve for the train set.

And if we run the generated code above, we get a plot with the ROC curves of the two classifiers on the train set.

So, here’s the process:

- Creation of the features and label with `NumPy`.
- Features and label arrays became `X` and `y`.
- Features and label arrays are split into the train and test sets. Then, a KNN and a Random Forest classifier are fitted on the train set.
- The `Matplotlib` library has been used to plot the ROC curve, making it easy for us to compare the performance of the two classifiers.

And again, we have a straightforward workflow that uses all of the libraries we needed, giving us the results we wanted with no “additional manipulation”.
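As a complement, `sklearn` also ships its own plotting helper for this task. The sketch below uses `RocCurveDisplay.from_estimator`, which computes the curve and draws it on a `Matplotlib` axis in one call. Two things differ from the generated code, and both are my choices: the data comes from `make_classification` (so the labels actually depend on the features, unlike the random labels above), and the curves are drawn for the test set, since a curve on the train set mostly shows how well a model memorized it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data with some real signal in it
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
knn_clf = KNeighborsClassifier().fit(X_train, y_train)

# Each call computes the ROC curve and draws it on the shared axis
fig, ax = plt.subplots()
rf_disp = RocCurveDisplay.from_estimator(rf_clf, X_test, y_test, name="Random Forest", ax=ax)
knn_disp = RocCurveDisplay.from_estimator(knn_clf, X_test, y_test, name="KNN", ax=ax)
ax.plot([0, 1], [0, 1], "k--")  # diagonal line for a random classifier
ax.set_title("ROC Curve - Test Set")
fig.savefig("roc_test.png")  # in a notebook, drop the Agg line and call plt.show() instead
```

The returned display objects keep the computed values around, so `rf_disp.roc_auc` and `knn_disp.roc_auc` can be read without recomputing anything.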