Most educational and real-world datasets contain categorical features. Today we will cover gradient boosted decision trees from the CatBoost library, which provides native support for categorical data. We will use a dataset of mushrooms that are either edible or poisonous. The mushrooms are described by categorical features such as their color, odor, and shape, and the question we want to answer is:
Is it safe to eat this mushroom — based on its categorical features?
As you can see, the stakes are high. We want to make sure that we get the machine learning model right so that our mushroom omelet does not end in a disaster. As a bonus, at the end we will provide a feature importance ranking that tells you which categorical feature is the strongest predictor of mushroom safety.
Introducing the mushroom dataset
The mushroom dataset (https://archive.ics.uci.edu/dataset/73/mushroom) is hosted by the UCI Machine Learning Repository. The original file encodes every value as a cryptic single-letter code, so for clarity of presentation we load it into a pandas DataFrame, give it descriptive column names, and expand the codes into long-form values. For the expansion we use pandas' replace function with the mappings taken from the dataset description. The target variable can only take True and False values: the dataset creators played it safe and classified questionable mushrooms as inedible.
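The relabeling step can be sketched roughly as follows. This is a minimal illustration, not the exact preprocessing code: the three sample rows are hand-typed in the dataset's short-form encoding, only four of the columns are shown, and the column names (edible, cap_shape, cap_color, odor) and the long_form dictionary are our own illustrative choices. The code-to-label mappings follow the dataset description.

```python
import io

import pandas as pd

# Hypothetical sample: three hand-typed rows in the short-form encoding,
# limited to four columns (class, cap shape, cap color, odor).
raw = io.StringIO("p,x,n,p\ne,x,y,a\ne,b,w,l\n")

# Descriptive column names (our own choice; the raw file has no header).
columns = ["edible", "cap_shape", "cap_color", "odor"]
df = pd.read_csv(raw, header=None, names=columns)

# Per-column mappings from short codes to long-form values,
# following the dataset description.
long_form = {
    "cap_shape": {"b": "bell", "c": "conical", "x": "convex",
                  "f": "flat", "k": "knobbed", "s": "sunken"},
    "cap_color": {"n": "brown", "b": "buff", "c": "cinnamon", "g": "gray",
                  "r": "green", "p": "pink", "u": "purple", "e": "red",
                  "w": "white", "y": "yellow"},
    "odor": {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
             "f": "foul", "m": "musty", "n": "none", "p": "pungent",
             "s": "spicy"},
}

# A nested dict tells replace which mapping applies to which column, so the
# code "n" becomes "brown" in cap_color but "none" in odor.
df = df.replace(long_form)

# Turn the class column into a boolean target: e = edible, p = poisonous.
df["edible"] = df["edible"].map({"e": True, "p": False})

print(df)
```

The nested dictionary matters here because the same letter stands for different things in different columns, so a single flat mapping applied to the whole DataFrame would mislabel values.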