Creating Incredible Decision Tree Visualizations with dtreeviz


How to visualize decision tree models with this useful library

Image by the Author, created using dtreeviz.

When it comes to model explainability, decision trees are some of the most intuitive and explainable models. Every decision tree model can be explained as a set of human-interpretable rules. Being able to visualize decision tree models is important for model explainability and can help stakeholders and business managers gain trust in these models.

Luckily, we can easily visualize and interpret decision trees with the dtreeviz library. In this article, I will demonstrate how you can use dtreeviz to visualize tree-based models for regression and classification.

You can easily install dtreeviz with pip using the following command:

pip install dtreeviz

For a detailed list of dependencies and additional libraries that may need to be installed depending on your operating system, please refer to this GitHub repository.

Visualizing Regression Trees

In this section, we will train a decision tree regressor on the diabetes dataset. Keep in mind that I am using Jupyter as my environment for running the Python code. You can find all of the code I have written for this tutorial in this GitHub repository.

Import Libraries

In the code block below, I import a few common libraries, including the scikit-learn decision tree modules and dtreeviz.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import dtreeviz

Read Data

The diabetes dataset is available in scikit-learn, so we can use the code below to import it, storing the features in a pandas DataFrame named X and the target values in a numpy array named y.

from sklearn.datasets import load_diabetes

diabetes_data = load_diabetes()
X = pd.DataFrame(data = diabetes_data['data'], columns=diabetes_data['feature_names'])
y = diabetes_data['target']
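As a quick sanity check (not part of the original tutorial), we can confirm what was loaded: the diabetes dataset contains 442 samples with 10 standardized features.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes dataset into a DataFrame of features and a target array
diabetes_data = load_diabetes()
X = pd.DataFrame(data=diabetes_data['data'], columns=diabetes_data['feature_names'])
y = diabetes_data['target']

print(X.shape)   # (442, 10): 442 patients, 10 standardized features
print(list(X.columns))
```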

Training the Decision Tree Model

For the purpose of making the tree easy to visualize, we can limit the max depth of the decision tree and train it on the data as follows.

dtree_reg = DecisionTreeRegressor(max_depth=3)
dtree_reg.fit(X, y)
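Before rendering anything graphical, it can help to confirm the fitted tree's rules in plain text. This uses scikit-learn's export_text rather than dtreeviz, but it prints the same set of human-interpretable if/else splits that the visualization will depict.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

diabetes_data = load_diabetes()
X = pd.DataFrame(data=diabetes_data['data'], columns=diabetes_data['feature_names'])
y = diabetes_data['target']

dtree_reg = DecisionTreeRegressor(max_depth=3)
dtree_reg.fit(X, y)

# Print the tree as nested if/else rules, one line per node
rules = export_text(dtree_reg, feature_names=list(X.columns))
print(rules)
```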

Visualizing the Tree

One of the key features of dtreeviz is the ability to visualize decision tree models. Using the code below we can create a cool decision tree visualization that also visually depicts the decision boundaries at each node.

viz_model = dtreeviz.model(dtree_reg,
                           X_train=X, y_train=y,
                           feature_names=list(X.columns),
                           target_name='diabetes')
viz_model.view()
Diabetes regression tree visualization. Image created with dtreeviz by the author.

Notice how the visualization above also gives us the decision boundaries and feature space at each node as well as the regression outputs and sample size at each leaf.

Visualizing the Leaf Distributions

Another useful function that dtreeviz provides is the ability to visualize leaf distributions. The leaf nodes of a decision tree contain the actual values that a decision tree will predict depending on each set of conditions. Using the rtree_leaf_distributions function, we can create this visualization for our regression tree.

%matplotlib inline

viz_model.rtree_leaf_distributions()

Based on the visualization above, we can see that the decision tree can either predict 268.9, 208.6, 176.9, 137.7, 154.7, 274.0, 83.4, or 108.8 for the target variable named diabetes. The horizontally scattered points represent the distribution of values for the diabetes target variable and the small black line represents the average value which is used for predictions at that leaf node. Ideally, the leaf distributions should have low variance so that we can have more confidence in the average values used for predictions.
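Those leaf averages can be cross-checked directly from the fitted estimator. The snippet below is a sketch using scikit-learn internals rather than dtreeviz: in the underlying tree_ structure, leaves are the nodes whose children_left entry is -1, and tree_.value stores the mean target value at each node.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes_data = load_diabetes()
X = pd.DataFrame(data=diabetes_data['data'], columns=diabetes_data['feature_names'])
y = diabetes_data['target']

dtree_reg = DecisionTreeRegressor(max_depth=3)
dtree_reg.fit(X, y)

# Leaf nodes have no children (children_left == -1);
# tree_.value holds the mean target value stored at each node
is_leaf = dtree_reg.tree_.children_left == -1
leaf_means = dtree_reg.tree_.value[is_leaf].ravel()
print(np.round(np.sort(leaf_means), 1))
```

Every prediction the tree makes is exactly one of these leaf means.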

Visualizing the Leaf Sizes

We can also visualize the leaf sizes, or the number of samples at each leaf node, as demonstrated with the function below.

viz_model.leaf_sizes()

Based on the plot above, we can see the number of samples at each leaf. This visualization is a good tool for evaluating how confident we can be in the regression tree predictions.
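The same counts can be recovered without dtreeviz: DecisionTreeRegressor.apply maps each training sample to the id of the leaf it lands in, so a value_counts over those ids gives the leaf sizes shown in the plot.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes_data = load_diabetes()
X = pd.DataFrame(data=diabetes_data['data'], columns=diabetes_data['feature_names'])
y = diabetes_data['target']

dtree_reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

# apply() returns the id of the leaf each sample lands in
leaf_ids = dtree_reg.apply(X)
leaf_sizes = pd.Series(leaf_ids).value_counts().sort_index()
print(leaf_sizes)  # one row per leaf: leaf id -> number of training samples
```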

Visualizing Classification Trees

We can also visualize classification trees with dtreeviz, and the visualizations look slightly different from those created for regression trees. For this section, we will train and visualize a decision tree model using the Breast Cancer Wisconsin dataset.

Read Data

The Breast Cancer Wisconsin dataset is available in scikit-learn so we can just load it using the code below.

from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
X = pd.DataFrame(data = cancer_data['data'], columns=cancer_data['feature_names'])
y = cancer_data['target']

Training a Decision Tree Model

As usual, training a decision tree model with scikit-learn is straightforward. We can also place a constraint on the maximum tree depth to make it easier to visualize the decision tree.

dtree_clf = DecisionTreeClassifier(max_depth=4)
dtree_clf.fit(X, y)
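As a quick sanity check before visualizing, we can compute the training accuracy with score(). Note that with the depth capped at 4 the tree cannot memorize the training data perfectly, so this value will typically be high but below 100%.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer_data = load_breast_cancer()
X = pd.DataFrame(data=cancer_data['data'], columns=cancer_data['feature_names'])
y = cancer_data['target']

dtree_clf = DecisionTreeClassifier(max_depth=4)
dtree_clf.fit(X, y)

# score() on the training data returns the training accuracy
acc = dtree_clf.score(X, y)
print(f"Training accuracy: {acc:.3f}")
```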

Visualizing the Decision Tree

We can use the exact same function from the regression tree section to visualize the classification tree. However, the visualization will look slightly different.

viz_model = dtreeviz.model(dtree_clf,
                           X_train=X, y_train=y,
                           feature_names=list(X.columns),
                           target_name='cancer')
viz_model.view()
Cancer classification decision tree.

Notice how the classification tree visualization above is different from the regression tree visualization in the previous section. Instead of seeing a scatter plot at each node with the selected feature and the target, we see colored histograms that show the class distribution at each node.
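To trace how a single sample moves through those histograms from root to leaf, scikit-learn's decision_path method (a plain-text complement to the visualization, not part of dtreeviz) returns the exact sequence of nodes a sample visits.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer_data = load_breast_cancer()
X = pd.DataFrame(data=cancer_data['data'], columns=cancer_data['feature_names'])
y = cancer_data['target']

dtree_clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# decision_path() returns a sparse indicator matrix with a 1 for every
# node the sample visits on its way from the root to a leaf
sample = X.iloc[[0]]
path = dtree_clf.decision_path(sample)
visited_nodes = path.indices
print("Nodes visited:", visited_nodes)
print("Predicted class:", dtree_clf.predict(sample)[0])
```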

Visualizing the Leaf Distributions

We can also visualize the class distributions for the leaves using the classification counterpart of the function we used for the regression tree.


viz_model.ctree_leaf_distributions()
Leaf distribution plot for the classification tree. Image created by the author using dtreeviz.

Each leaf has a stacked bar graph associated with it that presents the distribution of class labels for the samples at that leaf. Most of the leaves have samples that overwhelmingly belong to one class, which is a good sign and helps us gain some confidence in the model’s predictions.

Visualizing the Feature Space

We can also visualize the feature space of the classifier using the function below.

viz_model.ctree_feature_space()

The feature space plot above gives us the training accuracy of the classification tree, as well as a scatterplot of two features overlaid with the axis-aligned decision boundaries the tree uses to separate the two classes.
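The feature space plot can only show a couple of features at a time. To see which of the 30 features the fitted tree actually relies on for its splits, we can inspect scikit-learn's feature_importances_ (features never used in a split get an importance of 0, and the importances sum to 1).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer_data = load_breast_cancer()
X = pd.DataFrame(data=cancer_data['data'], columns=cancer_data['feature_names'])
y = cancer_data['target']

dtree_clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# feature_importances_ sums to 1.0 across all features
importances = pd.Series(dtree_clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```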

When it comes to visualizing tree-based models, dtreeviz is a powerful library that provides several useful visualization functions. I have only covered a few of the functions provided in this library, and there are many additional features that you can read about in the dtreeviz GitHub repository. As usual, you can find all of the code for this article on my GitHub.

Join my mailing list to get updates on my data science content. You’ll also get my free Step-By-Step Guide to Solving Machine Learning Problems when you sign up! You can also follow me on Twitter for content updates.

And while you’re at it, consider joining the Medium community to read articles from thousands of other writers as well.

  1. Terence Parr, dtreeviz: Decision Tree Visualization, (2023), GitHub.




