The Good, The Bad, and the Ugly of Pd.Get_Dummies | by Adam Ross Nelson | Jul, 2023


This is for the pd.get_dummies diehards

Howdy folks 🤠

Okay, I get it. One of the easiest ways to convert a categorical to an array of dummies in Python is with Pandas' pd.get_dummies(). Why would you take the time to import OneHotEncoder from sklearn, execute a .fit_transform(), etc., etc., etc.? Talk about tedious!

This article first introduces a simple data set for demonstration purposes, with a testing set that contains categories not found in the training set. Then it demonstrates how using pd.get_dummies() can lead to problems with the demonstration data. Finally, it shows how to avoid that problem with sklearn’s OneHotEncoder.

Three panda bears that look like country western cowboys. Two bears have hats. They’re on a green field.
Image Credit: Author’s illustration using text to image in Canva. Prompted: “Three panda bears dressed as country western cowboys.”

Here we have a simple dataset that includes a categorical feature called OS. The OS column lists computer operating systems. We will use this fictional data for demonstration: train_df holds the training data, and test_df holds the testing data.

In our fictional demonstration case, the testing set contains categorical values not present in the training set. This mismatch will cause problems.

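import pandas as pd

# Training data: three distinct operating systems
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
# Testing data: includes categories absent from the training data
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})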

In our training data, we have three operating systems: Windows, MacOS, and Linux. But our testing data contains three additional categories: Android, Unix, and iOS.
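Before encoding anything, it can help to confirm which categories differ between the two sets. Here is a minimal sketch using plain Python sets. It rebuilds the two data frames so it runs on its own, and the variable names unseen_in_train and missing_from_test are made up for illustration.

```python
import pandas as pd

# Same fictional demonstration data as in this article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})

# Distinct categories in each set
train_categories = set(train_df['OS'])
test_categories = set(test_df['OS'])

# Categories the testing set has that the training set never saw
unseen_in_train = sorted(test_categories - train_categories)
# Categories the training set has that the testing set lacks
missing_from_test = sorted(train_categories - test_categories)

print(unseen_in_train)    # ['Android', 'Unix', 'iOS']
print(missing_from_test)  # ['Linux']
```

Notice the mismatch runs in both directions: the testing set adds three categories and also lacks Linux.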

A model fit on pd.get_dummies(train_df) will not work with testing data from pd.get_dummies(test_df). The resulting columns do not match.

A wooden dummy model used in art, shown on a blue background.
Image Credit: Author’s illustration created in Canva using Canva stock images. An art supply dummy.

When applying the pd.get_dummies() function to both our training and testing datasets, here is what you’ll get.
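A minimal sketch of that comparison follows. It rebuilds the two data frames so it runs on its own; the names train_dummies and test_dummies are chosen here for illustration. Each call to pd.get_dummies() only knows about the categories in the frame it is handed, so the two results end up with different columns.

```python
import pandas as pd

# Same fictional demonstration data as in this article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})

# Each call builds dummy columns only from the values it sees
train_dummies = pd.get_dummies(train_df)
test_dummies = pd.get_dummies(test_df)

print(train_dummies.columns.tolist())
print(test_dummies.columns.tolist())

# The training frame gets 3 dummy columns, the testing frame gets 5
print(train_dummies.shape, test_dummies.shape)

# Columns present on one side but not the other
print(set(train_dummies.columns) ^ set(test_dummies.columns))
```

The training frame has an OS_Linux column that the testing frame lacks, while the testing frame has OS_Android, OS_Unix, and OS_iOS columns the training frame never saw. A model that learned from three columns has no sensible way to score five different ones.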


