Exposing sklearn machine learning models in Power BI | by Mark Graus | May, 2023


In some cases we want a supervised learning model to play around with. While any data scientist can quite easily build an sklearn model and experiment with it in a Jupyter notebook, as soon as you want other stakeholders to interact with your model you will have to create a bit of a front-end. This can be done in a simple Flask webapp that provides a web interface for people to feed data into an sklearn model or pipeline and see the predicted output. This post, however, will focus on how to use Python visuals in Power BI to interact with a model.

The post will consist of two main parts:

  • Building the SKLearn Model / Building a Pipeline
  • Building the Power BI Interface

The code is really straightforward and you can copy-paste whatever you need from this post, but it is also available on my Github. To use it, you have to do two things: run the code in the Python Notebook to serialize the pipeline, and alter the path to that pipeline in the Power BI file.

For this example we’ll use the Titanic dataset and build a simple predictive model. The model will be a classification model, using one categorical (‘sex’) and one numeric feature (‘age’) as predictors. To demonstrate the approach we will use the RandomForestClassifier as the classification model, because a Random Forest Classifier is a bit harder to implement in Power BI than, for example, a logistic regression that could be coded in M (Power Query) or DAX. In addition, since this post is not aimed at building the best possible model, I am relying on parts of the scikit-learn documentation quite a bit and will not be looking at performance that much.

The code we create does a couple of things. First of all, it loads and preprocesses the Titanic dataset. As mentioned before, we’re only using the ‘sex’ and ‘age’ features, but those still need to be processed. The categorical variable ‘sex’ has to be transformed into dummy variables, or One Hot Encoded (i.e. the one column has to be recoded into a set of columns), for any sklearn model to be able to handle it. For the numerical feature ‘age’ we apply a standard MinMaxScaling, since ‘age’ ranges from about 0 to 80, while the encoded ‘sex’ only goes from 0 to 1. Once all of that is done, we drop all observations with missing values, do a Train/Test split, and build and serialize the pipeline.

#Imports
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#Load the dataset
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

#Create the OneHotEncoding for the categorical variable 'sex'
categorical_feature = ["sex"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(drop="first"))
    ]
)

#Create the MinMaxScaling for the numeric variable 'age'
numeric_feature = ["age"]
numeric_transformer = Pipeline(
    steps=[
        ("scaler", MinMaxScaler())
    ]
)

#Combine both preprocessing steps in a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_feature),
        ("numeric", numeric_transformer, numeric_feature),
    ]
)

#Create the Pipeline, with preprocessing and the Random Forest Classifier
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier())
    ]
)

#Select only age and sex as predictors
X = X[["age","sex"]]

#Drop rows with missing values
X = X.dropna()

#Keep only observations corresponding to rows without missing values
y = y[X.index]

#Create Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#Fit the Pipeline
clf.fit(X_train, y_train)

#Score the Pipeline
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

The code above creates a model whose scores are not great, but good enough for the purpose of this post.

              precision    recall  f1-score   support

           0       0.85      0.84      0.85       159
           1       0.76      0.77      0.76       103

    accuracy                           0.81       262
   macro avg       0.80      0.80      0.80       262
weighted avg       0.81      0.81      0.81       262

What will help us later is to check how the model predicts. To do that, we create a DataFrame with the Cartesian product of age and sex (i.e. all possible ‘age’/’sex’ combinations). We use that DataFrame to calculate predictions from the pipeline and subsequently plot those predictions as a heatmap. The code to do that looks as follows.

import pandas as pd
import seaborn as sns
from pandas import DataFrame

# Create a DataFrame with all possible ages
ages = DataFrame({'age': range(1, 80, 1)})

# Create a DataFrame with all possible sexes
sexes = DataFrame({'sex': ["male", "female"]})

# Create a DataFrame with all possible combinations
combinations = ages.merge(sexes, how='cross')

# Predict survival for all combinations (the labels come back as strings, so cast to int for plotting)
combinations["predicted_survival"] = clf.predict(combinations).astype(int)

# Plot the heatmap
sns.heatmap(pd.pivot_table(combinations, values="predicted_survival", index=["age"], columns=["sex"]), annot=True)

The corresponding heatmap shows, for example, that for females from 13 to 33 years old the prediction is survival (1), while a female aged exactly 37 is predicted not to survive. For males the predictions are mostly no survival, except for age 12 and some younger ages. This information will be useful when debugging the Power BI report.

Now that this is done, we can serialize the model to start embedding it into a Power BI report.

#Serialize the fitted pipeline to disk
from joblib import dump
dump(clf, "randomforest.joblib")
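
Before wiring the model into Power BI, it is worth checking that the serialized pipeline round-trips correctly. A minimal sketch that loads it back and predicts for a single (hypothetical) passenger:

from joblib import load
from pandas import DataFrame

#Load the serialized pipeline back from disk
clf_loaded = load("randomforest.joblib")

#Predict for one passenger to verify the deserialized pipeline behaves like the original
passenger = DataFrame({"age": [25], "sex": ["female"]})
print(clf_loaded.predict(passenger))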

Creating the Power BI Interface consists of two steps. The first step is creating the controls to feed data into the model. The second is creating the visualization that takes the inputs from the controls, feeds them into the model and shows the prediction.

2a. Controls

A couple of concepts are important to be aware of when using Power BI. First of all, there are Parameters: variables that contain values in Power BI. These Parameters can be controlled through slicers, and the values they contain can be accessed from visualization elements in Power BI, which in our case will be a Python visual.
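
To give an idea of where this is heading: inside a Python visual, Power BI hands the fields you drag onto the visual to your script as a pandas DataFrame named dataset. A minimal sketch of what the visual’s script could look like is below. The column names ‘Age Value’ and ‘Sex Value’ are illustrative (they depend on how you name your parameters), and the path to randomforest.joblib has to match wherever you serialized the pipeline.

import matplotlib.pyplot as plt
from joblib import load
from pandas import DataFrame

# 'dataset' is provided by Power BI and holds the fields dragged onto the visual;
# we assume it contains a single row with the current slicer selections
age = dataset["Age Value"].iloc[0]   # illustrative column name
sex = dataset["Sex Value"].iloc[0]   # illustrative column name

# Load the serialized pipeline (adjust the path to where you stored it)
clf = load("C:/models/randomforest.joblib")

# Build a single-row DataFrame matching the pipeline's expected input
prediction = clf.predict(DataFrame({"age": [age], "sex": [sex]}))[0]

# A Python visual has to render a matplotlib figure, so show the prediction as text
fig, ax = plt.subplots()
ax.text(0.5, 0.5, f"Predicted survival: {prediction}", ha="center", va="center", fontsize=24)
ax.axis("off")
plt.show()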


