Naive Bayes Classification. In-depth explanation of the Naive Bayes… | by Dr. Roi Yehoshua | Jun, 2023

The event models described above can also be combined when we have a heterogeneous data set, i.e., a data set that contains different types of features (for example, both categorical and continuous features).
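One possible way to combine event models is sketched below. It assumes we fit a GaussianNB on the continuous columns and a CategoricalNB on the categorical columns, then add their log-posteriors and subtract the log-prior once (each model already includes it). The synthetic data and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data (illustrative): two continuous features and one binary categorical feature
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_cont = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(n, 2))
X_cat = (rng.random((n, 1)) < np.where(y[:, None] == 1, 0.8, 0.2)).astype(int)

# Fit one event model per feature type
gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

# Each log-posterior contains log P(y) once, so subtract it once after adding them
log_prior = np.log(gnb.class_prior_)
combined = gnb.predict_log_proba(X_cont) + cnb.predict_log_proba(X_cat) - log_prior
y_pred = gnb.classes_[np.argmax(combined, axis=1)]

print('Training accuracy of the combined model:', (y_pred == y).mean())
```

The subtraction only matters when the class priors are unequal; with balanced classes the argmax is the same either way.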

The module sklearn.naive_bayes provides implementations for all four Naive Bayes classifiers mentioned above:

  1. BernoulliNB implements the Bernoulli Naive Bayes model.
  2. CategoricalNB implements the categorical Naive Bayes model.
  3. MultinomialNB implements the multinomial Naive Bayes model.
  4. GaussianNB implements the Gaussian Naive Bayes model.

The first three classes accept a parameter called alpha that defines the smoothing parameter (by default it is set to 1.0).

In the following demonstration we will use MultinomialNB to solve a document classification task. The data set we are going to use is the 20 newsgroups dataset, which consists of 18,846 newsgroups posts, partitioned (nearly) evenly across 20 different topics. This data set has been widely used in research of text applications in machine learning, including document classification and clustering.

Loading the Data Set

You can use the function fetch_20newsgroups() in Scikit-Learn to download the text documents with their labels. You can either download all the documents as one group, or download the training set and the test set separately (using the subset parameter). The split between the training and the test sets is based upon messages posted before or after a specific date.

By default, the text documents contain some metadata such as headers (e.g., the date of the post), footers (signatures) and quotes from other posts. Since these parts are not relevant for the text classification task, we will strip them out by using the remove parameter:

from sklearn.datasets import fetch_20newsgroups

train_set = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test_set = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

Note that the first time you call this function it may take a few minutes to download all the documents, after which they will be cached locally in the folder ~/scikit_learn_data.

The output of the function is a dictionary-like object (a Bunch) that contains the following attributes:

  • data — the set of documents
  • target — the target labels
  • target_names — the names of the document categories

Let’s store the documents and their labels in proper variables:

X_train, y_train = train_set.data, train_set.target
X_test, y_test = test_set.data, test_set.target

Data Exploration

Let’s do some basic exploration of the data. The number of documents we have in the training and the test sets is:

print('Documents in training set:', len(X_train))
print('Documents in test set:', len(X_test))
Documents in training set: 11314
Documents in test set: 7532

A simple calculation shows that 60% of the documents belong to the training set, and 40% to the test set.
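For completeness, the calculation can be reproduced from the two counts printed above (the variable names are only illustrative):

```python
# Document counts as printed above
n_train, n_test = 11314, 7532
total = n_train + n_test

print(f'Total documents: {total}')
print(f'Training share: {n_train / total:.1%}')
print(f'Test share: {n_test / total:.1%}')
```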

Let’s print the list of categories:

categories = train_set.target_names
print(categories)

As is evident, some of the categories are closely related to each other (e.g., comp.sys.mac.hardware and the other comp.* groups), while others are largely unrelated (e.g., sci.electronics and soc.religion.christian).

Finally, let’s examine one of the documents in the training set (e.g., the first one):

print(X_train[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Unsurprisingly, the label of this document is:

print(categories[y_train[0]])

Converting Text to Vectors

In order to feed text documents into machine learning models, we first need to convert them into vectors of numerical values (i.e., vectorize the text). This process typically involves preprocessing and cleaning of the text, and then choosing a suitable numerical representation for the words in the text.

Text preprocessing consists of various steps, amongst which the most common ones are:

  1. Cleaning and normalizing the text. This includes removing punctuation marks and special characters, and converting the text into lower-case.
  2. Text tokenization, i.e., splitting the text into individual words or terms.
  3. Removal of stop words. Stop words are a set of commonly used words in a given language. For example, stop words in English include words like “the”, “a”, “is”, “and”. These words are usually filtered out since they do not carry useful information.
  4. Stemming or lemmatization. Stemming reduces the word to its lexical root by removing or replacing its suffix, while lemmatization reduces the word to its canonical form (lemma) and also takes into account the context of the word (its part-of-speech). For example, the word computers has the lemma computer, but its lexical root is comput.

The following example demonstrates these steps on a given sentence:

Text preprocessing example
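The four steps above can be sketched in a few lines of Python. This is only a rough illustration: the example sentence is made up, the stop word list is borrowed from Scikit-Learn, and naive_stem is a hypothetical toy suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def naive_stem(token):
    # Toy suffix stripper for illustration only; not a real stemming algorithm
    for suffix in ('ing', 'ers', 'er', 'ed', 's'):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())             # 1. clean and normalize
    tokens = text.split()                                        # 2. tokenize
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # 3. remove stop words
    return [naive_stem(t) for t in tokens]                       # 4. (toy) stemming

print(preprocess('The computers are running, and the users loved them!'))
# ['comput', 'runn', 'user', 'lov']
```

A real pipeline would replace naive_stem with a stemmer or lemmatizer from a dedicated NLP library.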

After cleaning the text, we need to choose how to vectorize it into a numerical vector. The most common approaches are:

  1. Bag-of-words (BOW) model. In this model, each document is represented by a vector of word counts (similar to the one we have used in the spam filter example).
  2. TF-IDF (Term Frequency times Inverse Document Frequency) measures how relevant a word is to a document by multiplying two metrics:
    (a) TF (Term Frequency) — how many times the word appears in the document.
    (b) IDF (Inverse Document Frequency) — the inverse of the frequency in which the word appears in documents across the entire corpus.
    The idea is to decrease the weight of words that occur frequently in the corpus, while increasing the weight of words that occur rarely (and thus are more indicative of the document’s category).
  3. Word embeddings. In this approach, words are mapped into real-valued vectors in such a way that words with similar meaning have close representation in the vector space. This model is typically used in deep learning and will be discussed in a future post.

Scikit-Learn provides the following two transformers, which support both text preprocessing and vectorization:

  1. CountVectorizer uses the bag-of-words model.
  2. TfidfVectorizer uses the TF-IDF representation.

Important hyperparameters of these transformers include:

  • lowercase — whether to convert all the characters to lowercase before tokenizing (defaults to True).
  • token_pattern — the regular expression used to define what is a token (the default regex selects tokens of two or more alphanumeric characters).
  • stop_words — if ‘english’, uses a built-in stop word list for English. If None (the default), no stop words will be used. You can also provide your own custom stop words list.
  • max_features — if not None, build a vocabulary that includes only the top max_features with the highest term frequency across the training corpus. Otherwise, all the features are used (this is the default).

Note that these transformers do not provide advanced preprocessing techniques such as stemming or lemmatization. To apply these techniques, you will have to use other libraries such as NLTK (Natural Language Toolkit) or spaCy.

Since multinomial Naive Bayes often works well with TF-IDF representations in practice, we will use the TfidfVectorizer to convert the documents in the training set into TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)

The shape of the extracted TF-IDF vectors is:

print(X_train_vec.shape)
(11314, 101322)

That is, there are 101,322 unique tokens in the vocabulary of the corpus. We can examine these tokens by calling the method get_feature_names_out() of the vectorizer:

vocab = vectorizer.get_feature_names_out()
print(vocab[50000:50010]) # pick a subset of the tokens
['innacurate' 'innappropriate' 'innards' 'innate' 'innately' 'inneficient'
'inner' 'innermost' 'innertubes' 'innervation']

Evidently, there was no automatic spell checker back in the 90s 🙂

The TF-IDF vectors are very sparse, with an average of 67 non-zero components out of more than 100,000:

print(X_train_vec.nnz / X_train_vec.shape[0])

Let’s also vectorize the documents in the test set (note that on the test set we call the transform method instead of fit_transform):

X_test_vec = vectorizer.transform(X_test)

Building the Model

Let’s now build a multinomial Naive Bayes classifier and fit it to the training set:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.01)
clf.fit(X_train_vec, y_train)

Note that we need to set the smoothing parameter α to a very small number, since the TF-IDF values are scaled to be between 0 and 1, so the default α = 1 would cause a dramatic shift of the values.
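A tiny numeric sketch illustrates this shift. The two "documents" below carry made-up fractional weights that merely mimic the scale of TF-IDF values:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Two tiny "documents" with TF-IDF-like fractional weights over three tokens
X = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8]])
y = np.array([0, 1])

likelihoods = {}
for alpha in (1.0, 0.01):
    model = MultinomialNB(alpha=alpha).fit(X, y)
    likelihoods[alpha] = np.exp(model.feature_log_prob_)
    print(f'alpha={alpha}: P(token | class 0) = {likelihoods[alpha][0].round(3)}')
```

For class 0 the unsmoothed proportions are (0.9, 0.1, 0.0). With α = 1 they are flattened to (0.475, 0.275, 0.25), whereas with α = 0.01 they stay close to the data at roughly (0.883, 0.107, 0.010).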

Evaluating the Model

Next, let’s evaluate the model on both the training and the test sets.

The accuracy and F1 score of the model on the training set are:

from sklearn.metrics import f1_score

accuracy_train = clf.score(X_train_vec, y_train)
y_train_pred = clf.predict(X_train_vec)
f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'Accuracy (train): {accuracy_train:.4f}')
print(f'F1 score (train): {f1_train:.4f}')

Accuracy (train): 0.9595
F1 score (train): 0.9622

And the accuracy and F1 score on the test set are:

accuracy_test = clf.score(X_test_vec, y_test)
y_test_pred = clf.predict(X_test_vec)
f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'Accuracy (test): {accuracy_test:.4f}')
print(f'F1 score (test): {f1_test:.4f}')

Accuracy (test): 0.7010
F1 score (test): 0.6844
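As a reminder of what average='macro' means, the macro F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for a three-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 score per class
macro_f1 = f1_score(y_true, y_pred, average='macro')    # their unweighted mean

print('Accuracy:', accuracy_score(y_true, y_pred))
print('Per-class F1:', per_class_f1.round(4))
print('Macro F1:', round(macro_f1, 4))
```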

The scores on the test set are relatively low compared to the training set. To investigate where the errors come from, let’s plot the confusion matrix of the test documents:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(ax=ax, cmap='Blues')

The confusion matrix on the test set

As we can see, most of the confusions occur between highly correlated topics, for example:

  • 74 confusions between topic 0 (alt.atheism) and topic 15 (soc.religion.christian)
  • 92 confusions between topic 18 (talk.politics.misc) and topic 16 (talk.politics.guns)
  • 89 confusions between topic 19 (talk.religion.misc) and topic 15 (soc.religion.christian)
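Such pairs can also be extracted programmatically instead of being read off the plot. The helper below, top_confusions, is a hypothetical function written for this sketch; it assumes that a "confusion between two topics" counts errors in both directions, which may differ from how the numbers above were tallied. It is demonstrated on made-up labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def top_confusions(cm, labels, k=3):
    """Return the k most confused label pairs, counting errors in both directions."""
    errors = cm.copy()
    np.fill_diagonal(errors, 0)             # ignore correct predictions
    pair_counts = errors + errors.T         # confusions i->j plus j->i
    rows, cols = np.triu_indices_from(pair_counts, k=1)
    order = np.argsort(pair_counts[rows, cols])[::-1][:k]
    return [(labels[rows[i]], labels[cols[i]], int(pair_counts[rows[i], cols[i]]))
            for i in order]

# Made-up example with three classes
labels = ['a', 'b', 'c']
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 0, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
pairs = top_confusions(cm, labels, k=2)
print(pairs)
# [('a', 'b', 3), ('a', 'c', 1)]
```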

In light of these findings, it seems that the Naive Bayes classifier did a pretty good job. Let’s examine how it compares to other standard classification algorithms.


We will benchmark the Naive Bayes model against four other classifiers: logistic regression, KNN, random forest and AdaBoost.

Let’s first write a function that gets a set of classifiers and evaluates them on the given data set and also measures their training time:

import time

def benchmark(classifiers, names, X_train, y_train, X_test, y_test, verbose=True):
    evaluations = []

    for clf, name in zip(classifiers, names):
        evaluation = {}
        evaluation['classifier'] = name

        # Measure only the time it takes to fit the model
        start_time = time.time()
        clf.fit(X_train, y_train)
        evaluation['training_time'] = time.time() - start_time

        evaluation['accuracy'] = clf.score(X_test, y_test)
        y_test_pred = clf.predict(X_test)
        evaluation['f1_score'] = f1_score(y_test, y_test_pred, average='macro')

        if verbose:
            print(evaluation)
        evaluations.append(evaluation)
    return evaluations

We will now call this function with our five classifiers:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

classifiers = [clf, LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), AdaBoostClassifier()]
names = ['Multinomial NB', 'Logistic Regression', 'KNN', 'Random Forest', 'AdaBoost']

evaluations = benchmark(classifiers, names, X_train_vec, y_train, X_test_vec, y_test)

The output we get is:

{'classifier': 'Multinomial NB', 'training_time': 0.06482672691345215, 'accuracy': 0.7010090281465746, 'f1_score': 0.6844389919212164}
{'classifier': 'Logistic Regression', 'training_time': 39.38498568534851, 'accuracy': 0.6909187466808284, 'f1_score': 0.6778246092753284}
{'classifier': 'KNN', 'training_time': 0.003989696502685547, 'accuracy': 0.08218268720127456, 'f1_score': 0.07567337211476842}
{'classifier': 'Random Forest', 'training_time': 43.847145318984985, 'accuracy': 0.6233404142326076, 'f1_score': 0.6062667217793061}
{'classifier': 'AdaBoost', 'training_time': 6.09197473526001, 'accuracy': 0.36563993627190655, 'f1_score': 0.40123307742451064}

Let’s plot the accuracy and F1 scores of the classifiers:

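import pandas as pd

df = pd.DataFrame(evaluations).set_index('classifier')

# Horizontal bar chart of the test accuracy of each classifier
df['accuracy'].plot.barh()
plt.xlabel('Accuracy (test)')

Accuracy scores on the test set

# Horizontal bar chart of the macro F1 score of each classifier
df['f1_score'].plot.barh()
plt.xlabel('F1 score (test)')

F1 scores on the test set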

Multinomial NB achieves both the highest accuracy and the highest F1 score. Notice that the classifiers have been used with their default parameters without any tuning. For a fairer comparison, the algorithms should be compared after fine-tuning their hyperparameters. In addition, some algorithms such as KNN suffer from the curse of dimensionality, and dimensionality reduction is required in order to make them work well.
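As a hint of what such tuning could look like, the sketch below runs a cross-validated grid search over the smoothing parameter of a TF-IDF plus Multinomial NB pipeline. The twelve-sentence corpus, the grid of α values and the step names 'tfidf' and 'nb' are all made up for illustration; on the real data one would pass X_train and y_train instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: six documents about cars (0) and six about space (1)
docs = [
    'the engine and the brakes need service', 'new tires for my car',
    'the car dealer offered a good price', 'diesel engine oil change',
    'my car has a manual gearbox', 'ford builds cars and trucks',
    'the shuttle reached orbit', 'nasa plans a lunar launch',
    'the rocket engine failed before launch', 'a satellite in low earth orbit',
    'astronauts aboard the space station', 'the moon mission was delayed',
]
labels = [0] * 6 + [1] * 6

# Vectorizing inside the pipeline keeps each validation fold out of the fitted vocabulary
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
alphas = [0.001, 0.01, 0.1, 1.0]
grid = GridSearchCV(pipeline, {'nb__alpha': alphas}, cv=3, scoring='f1_macro')
grid.fit(docs, labels)

print('Best alpha:', grid.best_params_['nb__alpha'])
print('Best cross-validated macro F1:', round(grid.best_score_, 4))
```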

Let’s also plot the training times of the classifiers:

# Horizontal bar chart of the training time of each classifier
df['training_time'].plot.barh()
plt.xlabel('Training time (sec)')

Training time of the different classifiers

The training of Multinomial NB is so fast that we cannot even see its time in the graph! By examining the function’s output from above, we can see that its training time is only about 0.065 seconds. Note that the training of KNN is even faster (since no model is actually built and the training samples are merely stored), but its prediction time (not shown) is very slow.

In conclusion, Multinomial NB achieved the highest accuracy and F1 score of the five classifiers, while training orders of magnitude faster than logistic regression, random forest and AdaBoost.

Finding the Most Informative Features

The Naive Bayes model also allows us to get the most informative features of each class, i.e., the features with the highest likelihood P(xⱼ|y).

The MultinomialNB class has an attribute named feature_log_prob_, which provides the log probability of the features for each class in a matrix of shape (n_classes, n_features).

Using this attribute, let’s write a function to find the 10 most informative features (tokens) in each category:

import numpy as np

def show_top_n_features(clf, vectorizer, categories, n=10):
    feature_names = vectorizer.get_feature_names_out()

    for i, category in enumerate(categories):
        # Indices of the n features with the highest log-likelihood in class i
        top_n = np.argsort(clf.feature_log_prob_[i])[-n:]
        print(f"{category}: {' '.join(feature_names[top_n])}")

show_top_n_features(clf, vectorizer, categories)

The output we get is:

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does like circuit power use
sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg
sci.space: just lunar earth shuttle like moon launch orbit nasa space
soc.religion.christian: believe faith christian christ bible people christians church jesus god
talk.politics.guns: just law firearms government fbi don weapons people guns gun
talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel
talk.politics.misc: know state clinton president just think tax don government people
talk.religion.misc: think don koresh objective christians bible people christian jesus god

Most of the words seem to be strongly correlated with their corresponding category. However, there are a few generic words such as “just” and “does” that do not provide valuable information. This suggests that our model may be improved by having a better stop-words list. Indeed, Scikit-Learn recommends not to use its own default list, quoting from its documentation: “There are several known issues with ‘english’ and you should consider an alternative”. 😲

Let’s summarize the pros and cons of Naive Bayes as compared to other classification models:

Pros:

  • Extremely fast both in training and prediction
  • Provides class probability estimates
  • Can be used both for binary and multi-class classification problems
  • Requires a small amount of training data to estimate its parameters
  • Highly interpretable
  • Highly scalable (the number of parameters is linear in the number of features)
  • Works well with high-dimensional data
  • Robust to noise (the noisy samples are averaged out when estimating the conditional probabilities)
  • Can deal with missing values (the missing values are ignored when computing the likelihoods of the features)
  • Few hyperparameters to tune (in practice only the smoothing parameter α, although, as we have seen, it may need adjusting for TF-IDF features)

Cons:

  • Relies on the Naive Bayes assumption which does not hold in many real-world domains
  • Correlation between the features can degrade the performance of the model
  • Generally outperformed by more complex models
  • The zero frequency problem: if a categorical feature has a category that was not observed in the training set, the model will assign a zero probability to its occurrence. Smoothing alleviates this problem but does not solve it completely.
  • Cannot handle continuous attributes without discretization or making assumptions on their distribution
  • Can be used only for classification tasks

This is the longest article I have written on Medium so far. I hope you enjoyed reading it at least as much as I enjoyed writing it. Let me know in the comments if something was not clear.

You can find the code examples of this article on my github:

All images unless otherwise noted are by the author.

The 20 newsgroups data set info:
