The Data Scientist’s Toolbox: Leveraging scikit-learn’s Top Features for Success


The power of sklearn lies in the fact that it is a consistent library. What does that mean? Well, it means that it offers:

  1. Easy integration with external DL/ML libraries. sklearn can be integrated with external ML/DL libraries such as Keras, TensorFlow, and XGBoost, just to name a few. This compatibility lets us combine sklearn’s features with the features of these libraries, making the development phase easier for us Data Professionals.
  2. Simplicity of usage. sklearn maintains the same unified interface across the various ML algorithms it provides. This way we don’t need to learn new syntax or patterns: we just need to stick to the “same rules” and focus on developing our ML algorithms. Can you imagine dealing with a different syntax for each algorithm while tackling complex ML problems? Well, I can’t (and I wouldn’t want to!).
  3. Well documented. sklearn provides extensive documentation, including tutorials and examples of usage. The documentation covers every aspect of the library, from data preprocessing to model evaluation. You can find it on the official scikit-learn website.
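To make the “same rules” idea concrete, here is a minimal sketch. The synthetic dataset and the three models are my own picks for illustration, not a recommendation: the point is only that every estimator is trained and evaluated with the same fit() and score() calls.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three different algorithms, one interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```

Swapping one algorithm for another only means changing the object we instantiate: the rest of the code stays the same.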

The sklearn library provides a very wide range of ML algorithms, ranging from Supervised to Unsupervised Learning, and implements both regression and classification models for the supervised case.

This wide choice that we, as Data Scientists, have makes sklearn the perfect library for ML: one environment for all the models we need in our everyday job.

We just need to choose the most suitable algorithm for our specific problem domain, without worrying about integrations and the like: a great advantage!

Data preprocessing and feature engineering are important parts of an ML problem because they involve preparing the data for the ML algorithms. Here are the features provided by sklearn for these tasks:

Data Preprocessing:

  1. Dealing with missing data. If you have practiced ML for a while, I’m sure you find dealing with missing data somewhat painful. Or, at least, I do. sklearn gives us different features for dealing with missing data, like: a) the SimpleImputer() class, which allows us to impute missing values using different strategies such as the mean, the median, the most frequent value (mode), or a constant value. The imputation can be applied to specific columns or to the entire dataset. b) The IterativeImputer() class, which utilizes ML models to impute missing values iteratively. It fills in missing values using the predictions from a chosen model. Note that it is still experimental: it has to be enabled with from sklearn.experimental import enable_iterative_imputer before importing it. c) The MissingIndicator() class, which can be used to create binary indicators for missing values in the dataset. It adds new features that flag whether a value was originally missing or not, providing additional information to the model.
  2. Feature scaling. Features with different scales can negatively affect the performance of many ML algorithms. Luckily for us, sklearn offers some methods to scale the features, like: a) standardization (StandardScaler()), which scales the features to have zero mean and unit variance, and b) normalization (MinMaxScaler()), which scales the features to a specific range ([0, 1] by default).
  3. Encoding categorical variables. Many ML algorithms cannot directly handle categorical variables. For this reason, sklearn provides encoding techniques like: a) one-hot encoding (OneHotEncoder()) and b) ordinal encoding (OrdinalEncoder()) to convert categorical variables into numerical representations. This makes it easy to take data in the form of strings and transform them into numbers that we can feed to the ML algorithms.
  4. Handling outliers. Outliers are another general pain for Data Scientists, as they affect the performance of Machine Learning models, so they need particular treatment. sklearn offers some techniques for outlier detection, such as the Local Outlier Factor algorithm (LocalOutlierFactor()) and the Isolation Forest algorithm (IsolationForest()); the samples they flag can then be inspected or removed. Note that these are both unsupervised ML algorithms.
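The preprocessing steps above can be sketched together as follows. This is only an illustration under my own assumptions: the tiny data frame and its column names (age, income, city) are made up, and the choice of median imputation is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data frame (hypothetical columns) with missing numeric values
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 95.0],
    "income": [30_000.0, 45_000.0, 52_000.0, np.nan, 38_000.0, 900_000.0],
    "city": ["Rome", "Milan", "Turin", "Rome", "Turin", "Milan"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: impute with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply each treatment only to the columns it is meant for
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)  # 2 numeric columns + one column per city

# Unsupervised outlier detection: 1 marks an inlier, -1 an outlier
detector = IsolationForest(random_state=0)
labels = detector.fit_predict(X_prepared)
print(labels)
```

With only six rows the outlier labels are not meaningful: the sketch is there to show how the pieces connect, not to suggest a threshold.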

Feature Engineering:

  1. Feature Extraction. sklearn provides different methods for extracting meaningful features from our data. As we know, in fact, not all the features we have are meaningful for our ML problems, and it’s a general rule of thumb that we’d better deal only with the meaningful ones. In the case of feature extraction, sklearn provides: a) the CountVectorizer() class, which converts text documents into a matrix of token counts, and b) the TfidfVectorizer() class, which calculates the Term Frequency-Inverse Document Frequency (TF-IDF) to represent text data.
  2. Dimensionality Reduction. sklearn offers various dimensionality reduction techniques that are useful to reduce the number of features, while preserving essential information. Two popular methods are: a) Principal Component Analysis (PCA()) and b) t-Distributed Stochastic Neighbor Embedding (TSNE()).
  3. Feature Selection. Finally, sklearn also provides multiple methods for feature selection, such as: a) univariate feature selection (SelectKBest()), b) recursive feature elimination (RFE()), and c) feature importance ranking based on tree-based models (SelectFromModel()).
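Here is a short sketch of the three families above. The toy corpus and the synthetic dataset are my own assumptions, and the numbers of components and selected features (5 in both cases) are arbitrary values chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Feature extraction: text documents -> TF-IDF matrix (made-up corpus)
docs = [
    "scikit-learn makes machine learning simple",
    "pandas makes data manipulation simple",
    "machine learning needs clean data",
]
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(docs)  # sparse matrix, one row per document
print(X_text.shape)

# Synthetic numeric data for the next two steps
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Dimensionality reduction: project the 20 features onto 5 components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())

# Feature selection: keep the 5 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the kept columns
```

Note the difference between the last two steps: PCA builds new features as combinations of the original ones, while SelectKBest keeps a subset of the original columns untouched.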

Model evaluation and validation are crucial activities in ML. As Data Scientists, in fact, our role is to find the best model that fits the data; but what does this actually mean?

It means that we need to train some ML models and find the one that makes the best predictions (or classifications) on unseen data. This means that we first need to train some ML models on the training data, and then validate them on new, unseen data. sklearn provides different features to help us in these fundamental tasks, like:

  1. Cross-Validation techniques. Cross-validation is an important technique for model evaluation that helps us detect overfitting. It works by partitioning the data into multiple subsets, training the model on different combinations of these subsets, and averaging the results to estimate the model’s performance. sklearn provides us with the following cross-validation methods: a) k-fold cross-validation (KFold()), b) stratified k-fold cross-validation (StratifiedKFold()), and c) leave-one-out cross-validation (LeaveOneOut()).
  2. Evaluation Metrics. sklearn offers a wide range of evaluation metrics that allow us to quantify the performance of classification and regression models. For classification tasks we can name, among others, accuracy, precision, recall, and ROC AUC. Regression tasks can instead be evaluated using metrics such as mean squared error (MSE), mean absolute error (MAE), and R², among others. If you want to dig deeper into how to use them (and all the main metrics sklearn provides for both cases), here’s the definitive guide on classification metrics, and here’s the definitive guide on regression metrics.
  3. Hyperparameter tuning. Model evaluation and validation are closely related to hyperparameter tuning, which involves selecting the combination of hyperparameters that gives the best performance of an ML algorithm. sklearn provides tools like GridSearchCV and RandomizedSearchCV for hyperparameter tuning using cross-validation. GridSearchCV exhaustively tries every combination in a grid we define, while RandomizedSearchCV samples a fixed number of combinations from it. Both identify the set of hyperparameters that results in the best model performance, based on the chosen evaluation metric.
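The three points above can be combined in a few lines. This is a minimal sketch, assuming a synthetic dataset, a random forest as the model, ROC AUC as the metric, and a deliberately tiny grid: all of these are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stratified 5-fold split: each fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated ROC AUC of a single model: one score per fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search over a small, arbitrary grid using the same folds and metric
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=cv, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Replacing GridSearchCV with RandomizedSearchCV follows the same pattern: we pass param_distributions instead of param_grid, plus the number of combinations to sample (n_iter).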

One of the superpowers of sklearn is the integration with the Python ecosystem for Data Science. This allows Data Scientists to leverage the power of sklearn with libraries and tools for data manipulation, visualization, and analysis. These are some examples:

  1. Pandas and NumPy. sklearn integrates with Pandas data frames, which allows users to pass data frames directly to sklearn’s functions and classes. This makes it convenient to preprocess and clean data using Pandas before feeding the data into sklearn models. sklearn also relies heavily on NumPy arrays for its data representation: it directly accepts NumPy arrays as input for training tasks.
  2. Matplotlib and Seaborn. sklearn provides built-in support for plotting metrics and model evaluation results using Matplotlib. It allows users to visualize model performance, feature importance, and other relevant statistics, helping us understand our ML models. Also, since Seaborn is a data visualization library built on top of Matplotlib, sklearn works well with it too, allowing users to combine the power of sklearn’s ML algorithms with Seaborn’s aesthetic plots for exploratory data analysis and model evaluation.

In this final section, I want to highlight some usage examples. If you have some experience with sklearn, I know you know how to use it. But please consider that the point here is not to show how these features work: it’s to underline how the workflow works, based on what we said in the previous sections. Because, again, we often use tools just because everyone does, without asking any questions.

Example #1: integration with Pandas and NumPy

Let’s see the integration between sklearn, Pandas, and NumPy. For the sake of simplicity, we can use a prompt in ChatGPT.

The prompt could be the following: “Create a Pandas data frame for regression analysis and make it with 5 features. Then, normalize the data, and split it into train and test sets. Fit the test set with a linear regression model. Print R² for both train and test sets”.

And we get:

Certainly! I will create a sample Pandas data frame with five features, normalize the data, split it into train and test sets, fit a linear regression model, and print the R² scores for both the train and test sets. Here’s an example implementation:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Generating 100 random integer values for each feature and for the target
np.random.seed(42)

feature1 = np.random.randint(0, 10, size=100)
feature2 = np.random.randint(0, 10, size=100)
feature3 = np.random.randint(0, 10, size=100)
feature4 = np.random.randint(0, 10, size=100)
feature5 = np.random.randint(0, 10, size=100)
target = np.random.randint(0, 100, size=100)

# Creating a Pandas data frame
data = {
'Feature1': feature1,
'Feature2': feature2,
'Feature3': feature3,
'Feature4': feature4,
'Feature5': feature5,
'Target': target
}

df = pd.DataFrame(data)

# Splitting the data into features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Normalizing the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Splitting the normalized data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fitting the linear regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Predicting the target variable for both train and test sets
y_train_pred = reg_model.predict(X_train)
y_test_pred = reg_model.predict(X_test)

# Calculating R² scores for train and test sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Printing the R² scores
print("R² score for train set:", r2_train)
print("R² score for test set:", r2_test)

In this example, I created a Pandas data frame with five features (Feature1 to Feature5) and one target variable (Target). Then, I used the StandardScaler from scikit-learn to normalize the feature data. After that, I split the normalized data into train and test sets using a 80:20 ratio. Next, I trained a linear regression model on the train set and predicted the target variables for both the train and test sets. Finally, I calculated and printed the R² scores for both the train and test sets.

So, here’s what happened on a workflow level:

  1. Creation of the features and labels with NumPy.
  2. The features and labels became a Pandas data frame.
  3. The Pandas data frame became X and y: the data frame of the features and the Pandas series of the label, respectively.
  4. X has been scaled, then X and y have been split into train and test sets. Finally, a linear regression model has been fitted on the train set and evaluated with the R² metric on both the train and test sets.

So, as we can see, we have a straightforward workflow that starts from NumPy and ends with sklearn, giving us the needed results. No “strange manipulations” or integrations are needed.
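One detail worth noting: in the generated code the scaler is fitted on the whole dataset before the split, so the test set contributes to the mean and variance used for scaling. For plain linear regression this does not change the predictions, but as a habit it is safer to split first and let a pipeline fit the scaler on the train set only. Here is a minimal sketch of that variant, where the random data mirrors the example above and is an assumption of mine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random features and target, as in the example above (illustration only)
rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(100, 5)).astype(float)
y = rng.integers(0, 100, size=100)

# Split first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the train set only,
# then reuses the same mean and variance on the test set
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2_train = model.score(X_train, y_train)  # score() returns R² for regressors
r2_test = model.score(X_test, y_test)
print("R² score for train set:", r2_train)
print("R² score for test set:", r2_test)
```

Since the target here is pure noise, both R² values are expected to be close to zero (the test one can even be negative): the workflow is what matters, not the score.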

Example #2: integration with Matplotlib

To show the integration between sklearn and Matplotlib we use ChatGPT again, with this prompt: “Create a binary classification problem in Python with data generated in NumPy. Standardize and split the data into train and test sets. Fit the train set with a random forest and KNN classifiers. Print ROC curve for the train set.”

And we get:

Sure! Let’s create a binary classification problem using randomly generated data in Python, and then we’ll standardize and split the data into training and testing sets. We’ll fit the train set using random forest and K-nearest neighbors (KNN) classifiers, and finally, we’ll plot the ROC curve for the train set.

Here’s the code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# Generate random data
num_samples = 1000
num_features = 10

# Generate features (X)
X = np.random.randn(num_samples, num_features)

# Generate labels (y)
y = np.random.randint(2, size=num_samples)

# Standardize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the train set using random forest classifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

# Fit the train set using K-nearest neighbors (KNN) classifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

# Predict probabilities for the positive class (class 1)
y_train_rf_probs = rf_clf.predict_proba(X_train)[:, 1]
y_train_knn_probs = knn_clf.predict_proba(X_train)[:, 1]

# Calculate the false positive rate (FPR), true positive rate (TPR), and thresholds for the ROC curve
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_train, y_train_rf_probs)
knn_fpr, knn_tpr, knn_thresholds = roc_curve(y_train, y_train_knn_probs)

# Calculate the AUC score for the ROC curve
rf_auc = roc_auc_score(y_train, y_train_rf_probs)
knn_auc = roc_auc_score(y_train, y_train_knn_probs)

# Plot the ROC curve
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.2f})")
plt.plot(knn_fpr, knn_tpr, label=f"KNN (AUC = {knn_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--') # Diagonal line for random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Train Set')
plt.legend(loc='lower right')
plt.show()

This code generates random data with 10 features and 1000 samples. It then standardizes the features, splits the data into train and test sets, fits the train set using random forest and KNN classifiers, predicts the probabilities for the positive class, calculates the ROC curve and AUC score for each classifier, and finally, plots the ROC curve for the train set.

And if we run the above generated code we get:

The ROC curve resulting from the above-generated code. Image by Author.

So, here’s the process:

  1. Creation of the features and label with NumPy.
  2. Features and label arrays became X and y.
  3. Features and label arrays have been split into the train and test sets. Then, a KNN and a Random Forest classifier have been fitted on the train set.
  4. The Matplotlib library has been used to plot the ROC curve, making it easy for us to compare the performance of the two classifiers.

And again, we have a straightforward workflow that uses all of the libraries we needed, plotting the results we wanted with no “additional manipulation”.


