Practical Introduction to Transformer Models: BERT | by Shashank Kapadia | Jul, 2023

In NLP, the transformer architecture has been revolutionary, greatly enhancing the ability of models to understand and generate text.

In this tutorial, we dig deep into BERT, a well-known transformer-based model, and walk through a hands-on example of fine-tuning the base BERT model for sentiment analysis.

BERT, introduced by researchers at Google in 2018, is a powerful language model built on the transformer architecture. Pushing past earlier architectures such as LSTM and GRU, which were either unidirectional or sequentially bidirectional, BERT considers context from both the left and the right of each word simultaneously. This is made possible by the “attention mechanism,” which lets the model weigh the importance of every other word in a sentence when building the representation of each word.
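
To make the weighting idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not BERT's actual implementation: real models apply learned query, key, and value projections and use multiple heads, whereas this toy version assumes the same hypothetical 2-dimensional embeddings serve as queries, keys, and values.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy embeddings, one per token (illustration only).
tokens = ["the", "movie", "was", "great"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each row of `weights` sums to 1 and says how much every token, to the left or to the right, contributes to the new representation of that row's token.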

The BERT model is pre-trained on the following two NLP tasks:

  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)

and is generally used as the base model for various downstream NLP tasks, such as sentiment analysis, which we cover in this tutorial.
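
As a rough illustration of the NSP task listed above, the sketch below shows how a sentence pair is packed into a single BERT-style input: a `[CLS]` token, the first sentence, a `[SEP]` token, the second sentence, and a final `[SEP]`, with segment ids marking which sentence each token belongs to. The helper name `build_nsp_input` and the example sentences are made up for illustration.

```python
def build_nsp_input(tokens_a, tokens_b):
    """Pack two tokenized sentences into one BERT-style sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


pair_tokens, segment_ids = build_nsp_input(
    ["the", "film", "was", "long"],
    ["i", "still", "liked", "it"],
)
print(pair_tokens)
print(segment_ids)
```

During pre-training, the model sees such pairs together with a binary label indicating whether the second sentence really followed the first in the original text.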

The power of BERT comes from its two-step process:

  • Pre-training is the phase where BERT is trained on large amounts of unlabeled text. It learns to predict masked words in a sentence (MLM task) and to predict whether one sentence follows another (NSP task). The output of this stage is a pre-trained NLP model with a general-purpose “understanding” of the language.
  • Fine-tuning is where the pre-trained BERT model is further trained on a specific task. The model is initialized with the pre-trained parameters, and the entire model is trained on a downstream task, allowing BERT to adapt its understanding of language to the specifics of the task at hand.
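
The masking step of the MLM task can be sketched in a few lines. This is a simplified version under stated assumptions: it hides roughly 15% of tokens and always substitutes `[MASK]`, whereas BERT's actual recipe sometimes substitutes a random token or leaves the chosen token unchanged. The function name `mask_tokens` and the example sentence are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was masked."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok        # the model must recover this token
            masked[i] = "[MASK]"   # simplification: always use [MASK]
    return masked, labels


sentence = "the movie was surprisingly good and i enjoyed it".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```

The training loss is computed only at the positions where `labels` is not `None`, which is what forces the model to use the surrounding context on both sides.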

The complete code is available as a Jupyter Notebook on GitHub

In this hands-on exercise, we train a sentiment analysis model on the IMDB movie reviews dataset [4] (license: Apache 2.0), in which each review is labeled as positive or negative. We load the model using Hugging Face’s transformers library.

Let’s load all the libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Variables to set the number of epochs and samples
num_epochs = 10
num_samples = 100 # set this to -1 to use all data

First, we need to load the dataset and the model tokenizer.

# Step 1: Load dataset and model tokenizer
dataset = load_dataset('imdb')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Next, we’ll create a plot to see the distribution of the positive and negative classes.

# Data Exploration
train_df = pd.DataFrame(dataset["train"])
sns.countplot(x='label', data=train_df)
plt.title('Class distribution')

Fig 1. Class distribution of the training dataset

Next, we preprocess our dataset by tokenizing the texts. We use BERT’s tokenizer, which will convert the text into tokens that correspond to BERT’s vocabulary.

# Step 2: Preprocess the dataset
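def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum sequence length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the tokenizer to every split in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)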
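
BERT's tokenizer is based on WordPiece, which splits words that are not in the vocabulary into smaller known pieces. The sketch below illustrates the greedy longest-match-first idea on a tiny made-up vocabulary; the real `bert-base-uncased` vocabulary has roughly 30,000 entries, and the library's tokenizer also handles lower-casing, punctuation, and special tokens, none of which is modeled here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"enjoy", "##ed", "##able", "movie", "##s"}

print(wordpiece_tokenize("enjoyed", toy_vocab))  # ['enjoy', '##ed']
print(wordpiece_tokenize("movies", toy_vocab))   # ['movie', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Each resulting piece is then mapped to its integer id in the vocabulary, which is what the model actually receives as `input_ids`.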

After that, we prepare our training and evaluation datasets. Remember, if you want to use all the data, you can set the num_samples variable to -1.

if num_samples == -1:
    # Use the full training and test splits
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
else:
    # Use a random subset of num_samples examples from each split
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(num_samples))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(num_samples))

Then, we load the pre-trained BERT model. We’ll use the AutoModelForSequenceClassification class, which loads the pre-trained BERT weights and adds a classification head on top for the number of labels we specify.

For this tutorial, we use the ‘bert-base-uncased’ version of BERT, which is trained on lower-cased English text.

# Step 3: Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Now, we’re ready to define our training arguments and create a Trainer instance to train our model.

# Step 4: Define training arguments
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch", no_cuda=True, num_train_epochs=num_epochs)

# Step 5: Create Trainer instance and train
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()
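
With the setup above, the per-epoch evaluation reports only the loss. One possible extension, not part of the original notebook, is to give the Trainer a metrics function through its `compute_metrics` argument. The sketch below assumes NumPy is available and that the function receives a (logits, labels) pair; the toy arrays are invented so that the function can be checked on its own.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn raw logits and gold labels into an accuracy score."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # pick the higher-scoring class
    return {"accuracy": float((preds == labels).mean())}


# Hypothetical logits for four reviews and their true labels.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the Trainer would then add the accuracy to the evaluation output after each epoch.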


Having trained our model, let’s evaluate it. We’ll calculate the confusion matrix and the ROC curve to understand how well our model performs.

# Step 6: Evaluation
predictions = trainer.predict(small_eval_dataset)

# Confusion matrix
cm = confusion_matrix(small_eval_dataset['label'], predictions.predictions.argmax(-1))
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(small_eval_dataset['label'], predictions.predictions[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(1.618 * 5, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")

Fig 2. Confusion Matrix
Fig 3. ROC curve

The confusion matrix gives a detailed breakdown of how our predictions measure up to the actual labels, while the ROC curve shows us the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at various threshold settings.
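
To see how the two views relate, the sketch below computes the confusion-matrix counts at a single threshold and derives one point of the ROC curve from them. The labels and scores are invented for illustration and are not results from the model trained above.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical gold labels and positive-class scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for threshold in (0.75, 0.5, 0.35):
    print(threshold, confusion_counts(y_true, scores, threshold), roc_point(y_true, scores, threshold))
```

Lowering the threshold catches more true positives at the cost of more false positives; sweeping it across all score values traces out the full curve that `roc_curve` returns.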

Finally, to see our model in action, let’s use it to infer the sentiment of a sample text.

# Step 7: Inference on a new sample
sample_text = "This is a fantastic movie. I really enjoyed it."
sample_inputs = tokenizer(sample_text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

# Move inputs to the same device as the model (GPU if available)
sample_inputs = sample_inputs.to(model.device)

# Make prediction
predictions = model(**sample_inputs)
predicted_class = predictions.logits.argmax(-1).item()

if predicted_class == 1:
    print("Positive sentiment")
else:
    print("Negative sentiment")
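
The model returns raw logits rather than probabilities. If a confidence score is wanted alongside the predicted class, one common approach is to pass the two logits through a softmax, as in the sketch below. The logit values here are made up, since the actual numbers depend on the trained model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [negative, positive]; not real model output.
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

label = "Positive" if probs[1] > probs[0] else "Negative"
print(f"{label} sentiment (p = {max(probs):.2f})")
```

With two classes, the positive-class probability equals the sigmoid of the difference between the two logits, so it is that difference, rather than either logit alone, that determines how confident the prediction is.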

By walking through an example of sentiment analysis on IMDb movie reviews, I hope you’ve gained a clear understanding of how to apply BERT to real-world NLP problems. The Python code I’ve included here can be adjusted and extended to tackle different tasks and datasets, paving the way for even more sophisticated and accurate language models.
