Domain Adaption: Fine-Tune Pre-Trained NLP Models | by Shashank Kapadia | Jul, 2023


The complete code is available as a Jupyter Notebook on GitHub

For the fine-tuning of pre-trained NLP models using this method, the training data should consist of pairs of text strings accompanied by similarity scores between them.

The training data follows the format shown below:

Fig 3. Sample Format for Training Data

In this tutorial, we use a dataset sourced from the ESCO classification dataset, which has been transformed to generate similarity scores based on the relationships between different data elements.

Preparing the training data is a crucial step in the fine-tuning process. It is assumed that you have access to the required data and a method to transform it into the specified format. Since the focus of this article is to demonstrate the fine-tuning process, we will omit the details of how the data was generated using the ESCO dataset.

The ESCO dataset is available for developers to freely utilize as a foundation for various applications that offer services like autocomplete, suggestion systems, job search algorithms, and job matching algorithms. The dataset used in this tutorial has been transformed and provided as a sample, allowing unrestricted usage for any purpose.

Let’s start by examining the training data:

import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv("./data/training_data.csv")

# Print head
data.head()

Fig 4. Sample data used for fine-tuning the model

To begin, we establish the multilingual universal sentence encoder as our baseline model. It is essential to set this baseline before proceeding with the fine-tuning process.

For this tutorial, we will use the STS benchmark and a sample similarity visualization as metrics to evaluate the changes and improvements achieved through the fine-tuning process.

The STS Benchmark dataset consists of English sentence pairs, each associated with a similarity score. During the model training process, we evaluate the model’s performance on this benchmark set. The persisted scores for each training run are the Pearson correlation between the predicted similarity scores and the actual similarity scores in the dataset.

These scores ensure that as the model is fine-tuned with our context-specific training data, it maintains some level of generalizability.

# Loads the Universal Sentence Encoder Multilingual module from TensorFlow Hub.
base_model_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
base_model = tf.keras.Sequential([
hub.KerasLayer(base_model_url,
input_shape=[],
dtype=tf.string,
trainable=False)
])

# Defines a list of test sentences. These sentences represent various job titles.
test_text = ['Data Scientist', 'Data Analyst', 'Data Engineer',
'Nurse Practitioner', 'Registered Nurse', 'Medical Assistant',
'Social Media Manager', 'Marketing Strategist', 'Product Marketing Manager']

# Creates embeddings for the sentences in the test_text list.
# The np.array() function is used to convert the result into a numpy array.
# The .tolist() function is used to convert the numpy array into a list, which might be easier to work with.
vectors = np.array(base_model.predict(test_text)).tolist()

# Calls the plot_similarity function to create a similarity plot.
plot_similarity(test_text, vectors, 90, "base model")

# Computes STS benchmark score for the base model
pearsonr = sts_benchmark(base_model)
print("STS Benachmark: " + str(pearsonr))

Fig 5. Similarity visualtions across test words

STS Benchmark (dev): 0.8325

The next step involves constructing the siamese model architecture using the baseline model and fine-tuning it with our domain-specific data.

# Load the pre-trained word embedding model
embedding_layer = hub.load(base_model_url)

# Create a Keras layer from the loaded embedding model
shared_embedding_layer = hub.KerasLayer(embedding_layer, trainable=True)

# Define the inputs to the model
left_input = keras.Input(shape=(), dtype=tf.string)
right_input = keras.Input(shape=(), dtype=tf.string)

# Pass the inputs through the shared embedding layer
embedding_left_output = shared_embedding_layer(left_input)
embedding_right_output = shared_embedding_layer(right_input)

# Compute the cosine similarity between the embedding vectors
cosine_similarity = tf.keras.layers.Dot(axes=-1, normalize=True)(
[embedding_left_output, embedding_right_output]
)

# Convert the cosine similarity to angular distance
pi = tf.constant(math.pi, dtype=tf.float32)
clip_cosine_similarities = tf.clip_by_value(
cosine_similarity, -0.99999, 0.99999
)
acos_distance = 1.0 - (tf.acos(clip_cosine_similarities) / pi)

# Package the model
encoder = tf.keras.Model([left_input, right_input], acos_distance)

# Compile the model
encoder.compile(
optimizer=tf.keras.optimizers.Adam(
learning_rate=0.00001,
beta_1=0.9,
beta_2=0.9999,
epsilon=0.0000001,
amsgrad=False,
clipnorm=1.0,
name="Adam",
),
loss=tf.keras.losses.MeanSquaredError(
reduction=keras.losses.Reduction.AUTO, name="mean_squared_error"
),
metrics=[
tf.keras.metrics.MeanAbsoluteError(),
tf.keras.metrics.MeanAbsolutePercentageError(),
],
)

# Print the model summary
encoder.summary()

Fig 6. Model architecture for fine-tuning

Fit the model

# Define early stopping callback
early_stop = keras.callbacks.EarlyStopping(
monitor="loss", patience=3, min_delta=0.001
)

# Define TensorBoard callback
logdir = os.path.join(".", "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

# Model Input
left_inputs, right_inputs, similarity = process_model_input(data)

# Train the encoder model
history = encoder.fit(
[left_inputs, right_inputs],
similarity,
batch_size=8,
epochs=20,
validation_split=0.2,
callbacks=[early_stop, tensorboard_callback],
)

# Define model input
inputs = keras.Input(shape=[], dtype=tf.string)

# Pass the input through the embedding layer
embedding = hub.KerasLayer(embedding_layer)(inputs)

# Create the tuned model
tuned_model = keras.Model(inputs=inputs, outputs=embedding)

Now that we have the fine-tuned model, let’s re-evaluate it and compare the results to those of the base model.

# Creates embeddings for the sentences in the test_text list. 
# The np.array() function is used to convert the result into a numpy array.
# The .tolist() function is used to convert the numpy array into a list, which might be easier to work with.
vectors = np.array(tuned_model.predict(test_text)).tolist()

# Calls the plot_similarity function to create a similarity plot.
plot_similarity(test_text, vectors, 90, "tuned model")

# Computes STS benchmark score for the tuned model
pearsonr = sts_benchmark(tuned_model)
print("STS Benachmark: " + str(pearsonr))

STS Benchmark (dev): 0.8349

Based on fine-tuning the model on the relatively small dataset, the STS benchmark score is comparable to that of the baseline model, indicating that the tuned model still exhibits generalizability. However, the similarity visualization demonstrates strengthened similarity scores between similar titles and a reduction in scores for dissimilar ones.

Fine-tuning pre-trained NLP models for domain adaptation is a powerful technique to improve their performance and precision in specific contexts. By utilizing quality, domain-specific datasets and leveraging siamese neural networks, we can enhance the model’s ability to capture semantic similarity.

This tutorial provided a step-by-step guide to the fine-tuning process, using the Universal Sentence Encoder (USE) model as an example. We explored the theoretical framework, data preparation, baseline model evaluation, and the actual fine-tuning process. The results demonstrated the effectiveness of fine-tuning in strengthening similarity scores within a domain.

By following this approach and adapting it to your specific domain, you can unlock the full potential of pre-trained NLP models and achieve better results in your natural language processing tasks



Source link

Leave a Comment