BERT vs GPT: Comparing the NLP Giants | by Thao Vu | Aug, 2023

How do their structures differ, and how do those differences affect each model’s abilities?

Image generated by the author using Stable Diffusion.

In 2018, NLP researchers were all amazed by the BERT paper [1]. The approach was simple, yet the result was impressive: it set new benchmarks for 11 NLP tasks.

In a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments, counting over 150 research publications analysing and improving the model. [2]

In 2022, ChatGPT [3] blew up the whole Internet with its ability to generate human-like responses. The model can comprehend a wide range of topics and carry the conversation naturally for an extended period, which sets it apart from all traditional chatbots.

BERT and ChatGPT are both significant breakthroughs in NLP, yet they take different approaches. How do their structures differ, and how do those differences affect the models’ abilities? Let’s dive in!

To understand the model structures fully, we must first recall the widely used attention mechanism. Attention mechanisms are designed to capture and model relationships between tokens in a sequence, which is one of the reasons they have been so successful in NLP tasks.

An intuitive understanding

  • Imagine you have n goods stored in boxes v_1, v_2, …, v_n. These are called “values”.
  • We have a query q, which asks us to take some suitable amount of goods from each box. Let’s call these amounts w_1, w_2, …, w_n (these are the “attention weights”).
  • How do we determine w_1, w_2, …, w_n? In other words, how do we know which of v_1, v_2, …, v_n should be taken more than others?
  • Remember, all the values are stored in boxes we cannot peek into, so we can’t directly judge whether v_i should be taken less or more.
  • Luckily, each box has a tag, k_1, k_2, …, k_n; these tags are called “keys”. The keys represent the characteristics of what is inside the boxes.
  • Based on the “similarity” of q and k_i (q*k_i), we can then decide how important v_i is (w_i) and how much of v_i we should take (w_i*v_i).
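
The box analogy above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used in BERT or GPT. It makes two assumptions that the list does not spell out: the similarity scores q*k_i are turned into weights with a softmax (the usual choice, so the weights are positive and sum to 1), and the pieces w_i*v_i are added together to form the result. The function name `attend` and the toy keys, values, and query are hypothetical.

```python
import math


def attend(query, keys, values):
    """Return (weights, output) for one query over n key/value pairs."""
    # Similarity of the query to each tag: the dot product q * k_i.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # Assumption: convert scores into weights w_i with a softmax.
    # Subtracting the max score first keeps exp() numerically stable.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Take w_i * v_i from each box and add the pieces together.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical toy data: three boxes with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print("weights:", weights)
print("output:", output)
```

Here the query is most similar to the first and third keys, so those two boxes receive equal, larger weights, while the second box contributes very little to the output.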
