A guide to handling categorical variables in Python | by Andrea D’Agostino | Jun, 2023


A guide on how to approach categorical variables for machine learning and data science purposes


Handling categorical variables in a data science or machine learning project is no easy task. This type of work requires deep knowledge of the field of application and a broad understanding of the multiple methodologies available.

For this reason, this article focuses on the following concepts:

  • what categorical variables are and how to divide them into types
  • how to convert them to numeric values based on their type
  • tools and technologies for managing them, mainly Sklearn

Proper handling of categorical variables can greatly improve the result of our predictive model or analysis. In fact, most of the information relevant to learning and understanding data could be contained in the available categorical variables.

Just think of tabular data split by a variable such as gender or color. Depending on the number of categories, these splits can bring out significant differences between groups, which can inform the analyst or the learning algorithm.

Let’s start by defining what they are and how they can present themselves.

Categorical variables are a type of variable used in statistics and data science to represent qualitative or nominal data. These variables can be defined as a class or category of data that cannot be quantified continuously, but only discretely.

An example of a categorical variable is a person’s eye color, which can be blue, green, or brown.

Most learning models don’t work with data in a categorical format. We must first convert them into numeric format so that the information is preserved.

Categorical variables can be classified into two types:

Nominal variables are variables that are not constrained by a precise order. Gender, color, or brands are examples of nominal variables since they are not sortable.

Ordinal variables are instead categorical variables divided into logically orderable levels. A column in a dataset that consists of levels such as First, Second, and Third can be considered an ordinal categorical variable.

You can go deeper into the breakdown of categorical variables by considering binary and cyclic variables.

A binary variable is simple to understand: it is a categorical variable that can only take on two values.

A cyclic variable, on the other hand, is characterized by a repetition of its values. For example, the days of the week are cyclical, and so are the seasons.
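One common way to convert a cyclic variable into numbers is to project it onto a circle with a sine and a cosine, so that the last value ends up next to the first. The sketch below is only an illustration and is not used in the rest of this guide: the helper `encode_cyclic` and the 0 to 6 day numbering are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def encode_cyclic(values: pd.Series, period: int) -> pd.DataFrame:
    """Project values in [0, period) onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({"sin": np.sin(angle), "cos": np.cos(angle)})


# Assumed numbering: 0 = first day of the week, 6 = last day
days = pd.Series([0, 1, 2, 3, 4, 5, 6])
encoded = encode_cyclic(days, period=7)

# Day 6 is as close to day 0 as day 0 is to day 1
dist_last_first = np.linalg.norm(encoded.loc[6].values - encoded.loc[0].values)
dist_first_second = np.linalg.norm(encoded.loc[0].values - encoded.loc[1].values)
print(encoded.round(3))
print(dist_last_first, dist_first_second)
```

With a plain integer encoding, day 6 and day 0 would be the two most distant values, even though they are adjacent in the week.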

Now that we’ve defined what categorical variables are and what they look like, let’s tackle the question of transforming them using a practical example — a Kaggle dataset called cat-in-the-dat.

The dataset

This is an open source dataset from an introductory Kaggle competition on managing and modeling categorical variables, called the Categorical Feature Encoding Challenge II. You can download the data directly from the competition page.

The peculiarity of this dataset is that it contains exclusively categorical data. So it becomes the perfect use case for this guide. It includes nominal, ordinal, cyclic, and binary variables.

We will see techniques for transforming each variable into a format usable by a learning model.

The dataset looks like this

Image by author.

Since the target variable can only take on two values, this is a binary classification task. We will use the AUC metric to evaluate our model.

Now we are going to apply techniques for managing categorical variables using the mentioned dataset.

1. Label Encoding (mapping to an arbitrary number)

The simplest technique there is for converting a category into a usable format is to assign each category to an arbitrary number.

Take for example the ord_2 column which contains the categories

array(['Hot', 'Warm', 'Freezing', 'Lava Hot', 'Cold', 'Boiling Hot', nan],
dtype=object)

The mapping could be done like this using Python and Pandas:

df_train = train.copy()

mapping = {
    "Cold": 0,
    "Hot": 1,
    "Lava Hot": 2,
    "Boiling Hot": 3,
    "Freezing": 4,
    "Warm": 5,
}

df_train["ord_2"].map(mapping)

>>
0 1.0
1 5.0
2 4.0
3 2.0
4 0.0
...
599995 4.0
599996 3.0
599997 4.0
599998 5.0
599999 3.0
Name: ord_2, Length: 600000, dtype: float64

However, this method has a problem: you have to manually declare the mapping. For a small number of categories this is not a problem, but for a large number it could be.

For this we will use Scikit-Learn and the LabelEncoder object to achieve the same result in a more flexible way.

from sklearn import preprocessing

# we handle missing values
df_train["ord_2"] = df_train["ord_2"].fillna("NONE")
# init the sklearn encoder
le = preprocessing.LabelEncoder()
# fit + transform
df_train["ord_2"] = le.fit_transform(df_train["ord_2"])
df_train["ord_2"]

>>
0 3
1 6
2 2
3 4
4 1
..
599995 2
599996 0
599997 2
599998 6
599999 0
Name: ord_2, Length: 600000, dtype: int64

Mapping is controlled by Sklearn. We can visualize it like this:

mapping = {label: index for index, label in enumerate(le.classes_)}
mapping

>>
{'Boiling Hot': 0,
'Cold': 1,
'Freezing': 2,
'Hot': 3,
'Lava Hot': 4,
'NONE': 5,
'Warm': 6}

Note the .fillna("NONE") in the code snippet above. Sklearn’s label encoder does not handle empty values and will give an error when applied if any are found.

One of the most important things to keep in mind for the correct handling of categorical variables is to always handle the empty values. In fact, most of the relevant techniques don’t work if these aren’t taken care of.

The label encoder maps an arbitrary number to each category in the column, without an explicit declaration of the mapping. This is convenient, but introduces a problem for some predictive models: the numbers imply an order and a magnitude that the categories may not have, and a feature column encoded this way may also need to be scaled.

In fact, machine learning beginners often ask what the difference is between the label encoder and the one hot encoder, which we will see shortly. The label encoder, by design, should be applied to the labels, i.e. the target variable we want to predict, and not to the other columns.

Having said that, some very relevant models work well even with an encoding of this type. I’m talking about tree models, among which XGBoost and LightGBM stand out.

So feel free to use label encoders if you decide to use tree models, but otherwise, we have to use one hot encoding.
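For ordinal columns such as ord_2, an alternative worth knowing is Sklearn's OrdinalEncoder, which accepts an explicit category order instead of the alphabetical one. The following is a minimal sketch on a toy dataframe; the temperature ranking in `order` is my own assumption about how the levels should be sorted, not something given by the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ranking from coldest to hottest (an assumption for this example)
order = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]

sample = pd.DataFrame(
    {"ord_2": ["Hot", "Warm", "Freezing", "Lava Hot", "Cold", "Boiling Hot"]}
)

# categories takes one list per encoded column
encoder = OrdinalEncoder(categories=[order])
sample["ord_2_encoded"] = encoder.fit_transform(sample[["ord_2"]]).ravel()
print(sample)
```

Each category is replaced by its position in `order`, so the encoded numbers follow the ranking rather than the alphabet.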

2. One Hot Encoding

As I already mentioned in my article about vector representations in machine learning, one hot encoding is a very common and well-known vectorization technique (i.e. converting text into numbers).

It works like this: each category is mapped to a vector with one position per possible category, whose only possible values are 0 and 1. Stacking the vectors of all the categories gives a square matrix. The position holding the 1 tells the model which category, among all the possible ones, the observed row belongs to.

An example:

Category    |   |   |   |   |   |   
------------|---|---|---|---|---|---
Freezing    | 0 | 0 | 0 | 0 | 0 | 1
Warm        | 0 | 0 | 0 | 0 | 1 | 0
Cold        | 0 | 0 | 0 | 1 | 0 | 0
Boiling Hot | 0 | 0 | 1 | 0 | 0 | 0
Hot         | 0 | 1 | 0 | 0 | 0 | 0
Lava Hot    | 1 | 0 | 0 | 0 | 0 | 0
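
A table like this one can be reproduced on a toy input with Sklearn's OneHotEncoder. This is only a small sketch: the six values are typed by hand, and since Sklearn sorts the categories alphabetically, the positions of the 1s will differ from the table.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One row per category, shaped as a single-column 2D array
categories = np.array(
    ["Freezing", "Warm", "Cold", "Boiling Hot", "Hot", "Lava Hot"]
).reshape(-1, 1)

ohe_demo = OneHotEncoder()
one_hot = ohe_demo.fit_transform(categories).toarray()

# Column order follows the alphabetically sorted categories
print(ohe_demo.categories_[0])
print(one_hot)
```

Every row contains exactly one 1, placed in the column of its category.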

Each vector has size n_categories. This is very useful information, because one hot encoding typically requires a sparse representation of the converted data.

What does it mean? It means that for large numbers of categories, the matrix becomes equally large. Since it is populated only by 0s and 1s, and only one position per row holds a 1, the one hot representation is very redundant and cumbersome.

A sparse matrix solves this problem — only the positions of the 1’s are saved, while values equal to 0 are not saved. This simplifies the mentioned problem and allows us to save a huge array of information in exchange for very little memory usage.
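
To get a feeling for the memory saving, here is a rough sketch that builds the same one hot matrix in dense and in sparse form and compares their sizes. It assumes SciPy is available; the number of rows and categories is arbitrary and the data is random.

```python
import numpy as np
from scipy import sparse

n_rows, n_categories = 10_000, 100
rng = np.random.default_rng(0)

# Random category index for each row (toy data)
codes = rng.integers(0, n_categories, size=n_rows)

# Dense one hot matrix: every 0 is stored explicitly
dense = np.zeros((n_rows, n_categories), dtype=np.float64)
dense[np.arange(n_rows), codes] = 1.0

# Sparse CSR matrix: only the non-zero entries and their positions are stored
sparse_matrix = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = (
    sparse_matrix.data.nbytes
    + sparse_matrix.indices.nbytes
    + sparse_matrix.indptr.nbytes
)
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The gap grows with the number of categories, because the dense matrix gains a full column of zeros for each new category while the sparse one still stores a single entry per row.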

Let’s see what such an array looks like in Python, applying the code from before again

from sklearn import preprocessing

# we handle missing values
df_train["ord_2"] = df_train["ord_2"].fillna("NONE")
# init sklearn's encoder
ohe = preprocessing.OneHotEncoder()
# fit + transform
ohe.fit_transform(df_train["ord_2"].values.reshape(-1, 1))

>>
<600000x7 sparse matrix of type '<class 'numpy.float64'>'
with 600000 stored elements in Compressed Sparse Row format>

Sklearn returns a sparse matrix object by default, not an array of values. To get such an array, you need to use .toarray()

ohe.fit_transform(df_train["ord_2"].values.reshape(-1, 1)).toarray()

>>
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[1., 0., 0., ..., 0., 0., 0.]])

Don’t worry if you don’t fully understand the concept: we will soon see how to apply the label and one hot encoder to the dataset to train a predictive model.

Label encoding and one hot encoding are the most important techniques for handling categorical variables. Knowing these two techniques will allow you to handle most cases that involve categorical variables.

3. Transformations and aggregations

Another method of converting from categorical to numeric format is to perform a transformation or aggregation on the variable.

By grouping with .groupby() it is possible to use the count of the values present in the column as the output of the transformation.

df_train.groupby(["ord_2"])["id"].count()

>>
ord_2
Boiling Hot 84790
Cold 97822
Freezing 142726
Hot 67508
Lava Hot 64840
Warm 124239
Name: id, dtype: int64

Using .transform() we can write these counts into the corresponding cells

df_train.groupby(["ord_2"])["id"].transform("count")

>>
0 67508.0
1 124239.0
2 142726.0
3 64840.0
4 97822.0
...
599995 142726.0
599996 84790.0
599997 142726.0
599998 124239.0
599999 84790.0
Name: id, Length: 600000, dtype: float64

This logic can also be applied with other aggregations, such as the mean or the number of unique values. Which one most improves the performance of our model has to be tested.
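
As one possible variation, the count can be turned into a relative frequency by dividing it by the number of rows. The sketch below uses a small hand-made dataframe with the same column names as the dataset; the values are invented for the example.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame(
    {
        "id": range(6),
        "ord_2": ["Hot", "Cold", "Hot", "Warm", "Hot", "Cold"],
    }
)

# Count of each category, broadcast back to every row
counts = toy.groupby("ord_2")["id"].transform("count")

# Relative frequency of the category of each row
toy["ord_2_freq"] = counts / len(toy)
print(toy)
```

A frequency has the advantage of staying on the same scale when the number of rows changes, for instance between the train and test sets.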

4. Create new categorical features from categorical variables

We look at the ord_1 column together with ord_2

image by author.

We can create new categorical variables by merging existing variables. For example, we can merge ord_1 with ord_2 to create a new feature

df_train["new_1"] = df_train["ord_1"].astype(str) + "_" + df_train["ord_2"].astype(str)
df_train["new_1"]

>>
0 Contributor_Hot
1 Grandmaster_Warm
2 nan_Freezing
3 Novice_Lava Hot
4 Grandmaster_Cold
...
599995 Novice_Freezing
599996 Novice_Boiling Hot
599997 Contributor_Freezing
599998 Master_Warm
599999 Contributor_Boiling Hot
Name: new_1, Length: 600000, dtype: object

This technique can be applied in practically any case. The idea that must guide the analyst is to improve the performance of the model by making explicit information that was originally difficult for the learning model to pick up.

5. Use NaN as a categorical variable

Very often null values are removed. This is typically not a move I recommend, as the NaNs contain potentially useful information to our model.

One solution is to treat NaNs as a category in their own right.

Let’s look at the ord_2 column again

df_train["ord_2"].value_counts()

>>
Freezing 142726
Warm 124239
Cold 97822
Boiling Hot 84790
Hot 67508
Lava Hot 64840
Name: ord_2, dtype: int64

Now let’s try applying .fillna("NONE") to see how many empty cells exist

df_train["ord_2"].fillna("NONE").value_counts()

>>
Freezing 142726
Warm 124239
Cold 97822
Boiling Hot 84790
Hot 67508
Lava Hot 64840
NONE 18075

As a percentage, NONE represents about 3% of the entire column. It’s a pretty noticeable amount. Exploiting the NaN makes even more sense and can be done with the One Hot Encoder mentioned earlier.

Let’s remember what the OneHotEncoder does: it creates a sparse matrix with one row per observation and one column per unique category in the referenced column. This means that we must also take into account the categories that could be present in the test set and absent from the train set.

The situation is similar for the LabelEncoder: there may be categories in the test set that are not present in the training set, and this could create problems during the transformation.

We solve this problem by concatenating the datasets. This will allow us to apply the encoders to all data and not just the training data.

import pandas as pd
from sklearn import preprocessing

# mark the test rows so they can be separated again after encoding
test["target"] = -1
data = pd.concat([train, test]).reset_index(drop=True)
features = [f for f in train.columns if f not in ["id", "target"]]
for feature in features:
    le = preprocessing.LabelEncoder()
    # treat missing values as a category of their own before encoding
    temp_col = data[feature].fillna("NONE").astype(str).values
    data[feature] = le.fit_transform(temp_col)

train = data[data["target"] != -1].reset_index(drop=True)
test = data[data["target"] == -1].reset_index(drop=True)

Image by author.

This methodology helps us if we have the test set. If we don’t have the test set, we can reserve a value like NONE for any new category that was not part of our training set.
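
When the test set is not available, another option for one hot encoding is the `handle_unknown="ignore"` parameter of Sklearn's OneHotEncoder, which encodes a category never seen during fitting as a row of zeros instead of raising an error. A minimal sketch on toy values follows; "Lukewarm" is an invented category used only to simulate unseen data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_col = np.array(["Hot", "Cold", "Warm"]).reshape(-1, 1)
# "Lukewarm" is a made-up category that does not appear in train_col
new_col = np.array(["Cold", "Lukewarm"]).reshape(-1, 1)

ohe_safe = OneHotEncoder(handle_unknown="ignore")
ohe_safe.fit(train_col)

# The unseen category becomes an all-zero row
encoded_new = ohe_safe.transform(new_col).toarray()
print(ohe_safe.categories_[0])
print(encoded_new)
```

The all-zero row carries no information about the new category, so this is a safety net rather than a substitute for seeing the category during training.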

Now let’s move on to the training of a simple model. We will follow the steps from the article on how to design and implement a cross-validation.

We start from scratch, importing our data and creating our folds with Sklearn’s StratifiedKFold.

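import pandas as pd
from sklearn import model_selection

train = pd.read_csv("/kaggle/input/cat-in-the-dat-ii/train.csv")
test = pd.read_csv("/kaggle/input/cat-in-the-dat-ii/test.csv")

df = train.copy()

# placeholder fold column, filled in below
df["kfold"] = -1
# shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)
y = df.target.values

kf = model_selection.StratifiedKFold(n_splits=5)

# write the fold number on the validation rows of each split
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f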

This little snippet of code will create a Pandas dataframe with 5 groups to test our model against.
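
To see what the stratification does, the same procedure can be run on a small invented dataframe and the folds inspected afterwards. This is just a sketch: the 80/20 class split and the fixed `random_state` are assumptions chosen to make the check easy to read.

```python
import pandas as pd
from sklearn import model_selection

# Toy target with 80 negatives and 20 positives (invented data)
toy = pd.DataFrame({"target": [0] * 80 + [1] * 20})
toy["kfold"] = -1
toy = toy.sample(frac=1, random_state=42).reset_index(drop=True)

skf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy["target"].values)):
    toy.loc[valid_idx, "kfold"] = fold

# Size and share of positives in each fold
fold_summary = toy.groupby("kfold")["target"].agg(["count", "mean"])
print(fold_summary)
```

Each fold keeps the same proportion of positive targets as the full data, which is what makes the AUC comparable across folds.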

Image by author.

Now let’s define a function that will test a logistic regression model on each group.

from sklearn import linear_model, metrics, preprocessing

def run(fold: int) -> None:
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]

    # fill missing values first, then cast everything to string
    for feature in features:
        df[feature] = df[feature].fillna("NONE").astype(str)

    df_train = df[df["kfold"] != fold].reset_index(drop=True)
    df_valid = df[df["kfold"] == fold].reset_index(drop=True)

    ohe = preprocessing.OneHotEncoder()

    # fit on train + validation so that every category is known to the encoder
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    print("Fitting OHE on full data...")
    ohe.fit(full_data[features])

    x_train = ohe.transform(df_train[features])
    x_valid = ohe.transform(df_valid[features])
    print("Training the classifier...")
    model = linear_model.LogisticRegression()
    model.fit(x_train, df_train.target.values)

    # probability of the positive class
    valid_preds = model.predict_proba(x_valid)[:, 1]

    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)

    print(f"FOLD: {fold} | AUC = {auc:.3f}")

run(0)

>>
Fitting OHE on full data...
Training the classifier...
FOLD: 0 | AUC = 0.785

I invite the interested reader to read the article on cross-validation to understand in more detail the functioning of the code shown.

Now let’s see how to apply a tree model like XGBoost instead, which also works well with a LabelEncoder.

import xgboost
from sklearn import metrics, preprocessing

def run(fold: int) -> None:
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]

    # work on a copy so that repeated calls do not re-encode the global dataframe
    data = df.copy()

    # fill missing values first, then cast everything to string
    for feature in features:
        data[feature] = data[feature].fillna("NONE").astype(str)

    print("Fitting the LabelEncoder on the features...")
    for feature in features:
        le = preprocessing.LabelEncoder()
        data[feature] = le.fit_transform(data[feature])

    df_train = data[data["kfold"] != fold].reset_index(drop=True)
    df_valid = data[data["kfold"] == fold].reset_index(drop=True)

    x_train = df_train[features].values
    x_valid = df_valid[features].values

    print("Training the classifier...")
    model = xgboost.XGBClassifier(n_jobs=-1, n_estimators=300)
    model.fit(x_train, df_train.target.values)

    # probability of the positive class
    valid_preds = model.predict_proba(x_valid)[:, 1]

    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)

    print(f"FOLD: {fold} | AUC = {auc:.3f}")

# execute on 2 folds
for fold in range(2):
    run(fold)

>>
Fitting the LabelEncoder on the features...
Training the classifier...
FOLD: 0 | AUC = 0.768
Fitting the LabelEncoder on the features...
Training the classifier...
FOLD: 1 | AUC = 0.765

In conclusion, there are also other techniques worth mentioning for handling categorical variables:

  • Target-based encoding, where the category is converted into the average value assumed by the target variable in correspondence with it
  • The embeddings of a neural network, which can be used to represent each category as a dense vector
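
To give an idea of the first technique, here is a minimal sketch of target-based encoding on invented data. The means are computed on the training rows only and then mapped onto the validation rows; falling back to the global mean for unseen categories is one possible choice among several, not a fixed rule.

```python
import pandas as pd

# Toy train and validation data, invented for illustration
train_toy = pd.DataFrame(
    {
        "ord_2": ["Hot", "Hot", "Cold", "Cold", "Cold", "Warm"],
        "target": [1, 0, 0, 0, 1, 1],
    }
)
valid_toy = pd.DataFrame({"ord_2": ["Hot", "Cold", "Freezing"]})

# Average target per category, computed on the training rows only
global_mean = train_toy["target"].mean()
category_means = train_toy.groupby("ord_2")["target"].mean()

# Categories unseen in training ("Freezing" here) fall back to the global mean
valid_toy["ord_2_te"] = valid_toy["ord_2"].map(category_means).fillna(global_mean)
print(valid_toy)
```

Computing the means on the same rows that are then used for training leaks the target into the feature, which is why the encoding is usually fitted within each cross-validation fold.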

In summary, here are the essential steps for the correct management of categorical variables:

  • always treat null values
  • apply LabelEncoder or OneHotEncoder based on the type of variable and the model we want to use
  • reason in terms of variable enrichment, considering NaN or NONE as categories that can inform the model
  • model the data!

Thank you for your time,
Andrea


