Why are language models everywhere? | by Ajay Halthor | May, 2023

The answer lies in the 75 years of NLP history


Have you ever wondered about how we got here with ChatGPT and Large Language Models? The answer lies in the development of Natural Language Processing (NLP) itself. So let’s talk about it. Don’t worry; the history is more interesting than you think! Section 1 will describe the birth of AI and NLP. Section 2 will talk about the major pillars of the field. Sections 3 through 5 will go into detailed timelines for the past 75 years. And for the final section 6, we describe the convergence of all these fields into language modeling which has become so popular today!

In the beginning, there was Alan Turing’s 1950 publication Computing Machinery and Intelligence, where he poses the question “Can machines think?” This paper is often touted as the birth of Artificial Intelligence. Although it did not talk about natural language explicitly, it laid the groundwork for future research in NLP. This is why the earliest works in NLP sprang up in the 1950s.

NLP research has grown around a few major pillars:

  1. Machine Translation: An AI takes in a sentence in one language and outputs a sentence in another language. For example, Google Translate.
  2. Speech Processing: An AI takes audio as input and generates the corresponding text as output.
  3. Text Summarization: An AI takes in a story as input and generates a summary as output.
  4. Language Modeling: An AI is given a sequence of words and determines the next word.

There are far more pillars than these four. Over time, each pillar has converged on using language models to accomplish its task. In the following sections, let’s talk about each timeline.

Figure 1: Timelines of major pillars of NLP (image by author)

Rule Based Systems: In 1954, during the Cold War era, the Georgetown-IBM experiment demonstrated automatic translation from Russian to English. The idea was that the translation task could be broken down into a set of rules that convert one language to the other, i.e. a rule-based system. Another early rule-based system was Yehoshua Bar-Hillel’s “Analytical Engine” for translating Hebrew to English.

Statistical Approaches: The problem with rule-based systems is that they make a ton of assumptions. The more complex the problem, the more problematic these assumptions become, and translation is complex. From the 1980s, as more bilingual data became available and statistical methods became better established, we started applying statistical models to language translation. A paradigm called Statistical Machine Translation (SMT) became popular. SMT decomposed the problem into 2 sub-problems: a translation problem and a language modeling problem.
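To make the two sub-problems concrete, here is a minimal sketch assuming the classic noisy-channel formulation of SMT, where the best English sentence e for a source sentence f maximizes P(f | e) × P(e). The candidate sentences and every probability below are invented for illustration (imagine the source sentence is the German “das Haus ist klein”); a real system would estimate them from bilingual and monolingual data.

```python
import math

# Hypothetical toy numbers, made up for this example only.
# Translation model: P(source sentence | candidate English sentence).
TRANSLATION_MODEL = {
    "the house is small": 0.30,
    "small the house is": 0.35,
    "the home is little": 0.20,
}

# Language model: P(candidate English sentence), i.e. how fluent it is.
LANGUAGE_MODEL = {
    "the house is small": 0.020,
    "small the house is": 0.001,
    "the home is little": 0.005,
}


def best_translation(candidates):
    """Return the candidate maximizing log P(f | e) + log P(e)."""
    def score(e):
        return math.log(TRANSLATION_MODEL[e]) + math.log(LANGUAGE_MODEL[e])
    return max(candidates, key=score)


print(best_translation(list(TRANSLATION_MODEL)))
```

Notice that “small the house is” has the highest translation score in this toy table, but the language model penalizes its word order, so the fluent candidate wins overall. That division of labor is the point of the decomposition.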

Neural Approaches: Since 2015, SMT has largely been replaced by Neural Machine Translation, which uses neural networks to learn the task of translation directly. The architectures progressed from Recurrent Neural Networks to Transformer models. With the introduction of models like GPT, the baseline became a model pretrained on language modeling, which is then fine-tuned for translation.

Rule Based Systems: Speech processing also started back in the 1950s and 60s, with systems that recognized single digits and words. For example, Audrey by Bell Labs recognized digits, while IBM’s Shoebox performed arithmetic on voice command.

Statistical Approaches: However, converting speech to text is a complex problem; speakers differ in dialect, accent, and loudness. So the natural move was to break this complex problem down into subproblems. Around the 70s, after Hidden Markov Models were introduced, the complex problem of speech to text could be broken down into 3 simpler problems:

  • Language modeling: This determines which sequences of words and sentences are likely. These were n-gram models.
  • Pronunciation modeling: This associates words with phones. These are essentially simple models or even lookup tables.
  • Acoustic modeling: This captures the relationship between the speech signal and phones. These were Hidden Markov Models with Gaussian Mixture Models.

These 3 parts are trained separately and then used together. But this creates its own complexity.
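As an illustration of the language modeling component, here is a minimal sketch of a bigram model (an n-gram model with n = 2) that estimates the probability of the next word from raw counts. The three-sentence corpus and the function names are hypothetical; real systems trained on far larger corpora and added smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}


# Hypothetical toy corpus, invented for this example.
corpus = [
    "recognize speech with a model",
    "recognize speech with a table",
    "recognize speech quickly",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "speech")
print(max(probs, key=probs.get), probs)
```

In a speech recognizer, scores like these let the system prefer word sequences that are plausible in the language when the acoustic evidence alone is ambiguous.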

Neural Approaches: In the early 2000s, we saw these techniques replaced with neural networks. With the advent of large-scale corpora, neural networks started outperforming everything. They performed speech to text end to end, so we could directly optimize the objective of generating text from the input speech; this led to better performance. With further development in the field, we got Recurrent Neural Networks, Convolutional Neural Networks, and eventually fine-tuning of pretrained language models.

Rule Based Systems: Research started with Luhn’s 1958 publication The Automatic Creation of Literature Abstracts, which ranked the importance of sentences using word frequencies. This method selected sentences from the original text to construct a summary; the resulting summary is called an “extraction based summary”. The next significant leap in the field came in 1969 with Edmundson’s paper New Methods in Automatic Extracting. He claimed the importance of a sentence depended not only on word frequencies, but also on other factors such as the location of the sentence in the paragraph, whether the sentence has certain cue words, and whether the sentence shares words with the title. In the 1980s, we tried summarizing text as a human would, without reusing the original sentences. These were “abstractive summaries”. FRUMP (Fast Reading Understanding and Memory Program) and SUSY were early implementations of such systems. However, they too depended on hand-crafted rules, and the summaries were not high quality.
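The frequency-based idea can be sketched in a few lines. This is a simplified illustration in the spirit of Luhn’s approach, not a reproduction of his exact method: it scores each sentence by the average corpus frequency of its non-stopword tokens and keeps the top-scoring sentences in their original order. The stopword list, the function name, and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on", "are"}


def tokenize(text):
    """Lowercase the text and keep only non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Extractive summary: keep the sentences with the most frequent words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]


text = (
    "Language models predict words. "
    "Language models power translation and speech systems. "
    "The weather was nice."
)
print(summarize(text, n_sentences=1))
```

Edmundson-style extensions would add further terms to the score function, for example a bonus for sentences near the start of a paragraph or for sentences that share words with the title.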

Statistical Approaches: In the 90s and 2000s, we used statistical approaches to build classifiers that determine whether a sentence should be included in a summary or not. These classifiers could be a Logistic Regression, Decision Tree, SVM, or any other statistical model.

Neural Approaches: From 2015, neural networks made an impact with the introduction of A Neural Attention Model for Abstractive Sentence Summarization. This produced abstractive summaries, typically very short headlines. However, the incorporation of LSTM cells and a sequence-to-sequence architecture led to the ability to handle longer input sequences and generate proper summaries. From there, the field took a page from Machine Translation and adopted the pretraining and fine-tuning approach we see today.

The history of multiple pillars discussed in the previous sections shows some common patterns.

Rule-based systems dominated the early days of AI in the 1950s and 60s. Around the 70s, we saw the introduction of statistical models to solve these problems. However, since language is complex, these statistical models broke the complex tasks down into subtasks. With the advent of more data and better hardware in the 2000s, neural network approaches were on the rise.

Neural networks can learn complex language tasks end to end and hence perform better than statistical approaches. Transformer neural networks, introduced in 2017, could learn to solve language tasks effectively. But since they required a ton of data to train, BERT and GPT were introduced, using the concept of transfer learning to learn language tasks. The idea here is that language tasks don’t require much data for systems that already have some baseline understanding of language itself. GPT, for example, acquires this “understanding of language” by learning language modeling first and then fine-tuning on a specific language task. This is why modern NLP has converged on using language models at its core.

Hope you now know why Large Language Models like ChatGPT are super important and why we see language modeling everywhere. It took the better part of a century to get here. For more details on NLP and Language Modeling, check out this playlist of videos that delves into different concepts in the field. Happy learning!
