Language modeling (LM) aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. Language models have revolutionized natural language processing (NLP) in recent years. It is now well known that increasing the scale of language models (e.g., training compute and model parameters) can lead to better performance and sample efficiency on a range of downstream NLP tasks. The survey paper “*A Survey of Large Language Models*” [1] covers almost every aspect of large language models (LLMs). It provides an up-to-date review of the literature on LLMs and details the training mechanisms: pre-training approaches, instruction tuning techniques, and further alignment training with the recent RLHF approach. Instruction tuning and alignment tuning are used to adapt LLMs to specific goals.

*After pre-training or adaptation tuning, a major approach to using LLMs is to design suitable prompting strategies for solving various tasks.* *A typical prompting method, also known as in-context learning (ICL), formulates the task description and/or demonstrations (examples) in the form of natural language text.*

LLMs demonstrate an in-context learning (ICL) ability, that is, learning from a few examples in the context. Many studies have shown that LLMs can perform a series of complex tasks through ICL, such as solving mathematical reasoning problems.

The key idea of in-context learning is to learn from analogy. The figure below gives an example describing how language models make decisions with ICL. First, ICL requires a few examples to form a demonstration context. These examples are usually written in natural language templates. Then, ICL concatenates a query question and a piece of demonstration context together to form a prompt, which is then fed into the language model for prediction [2].

Different from supervised learning requiring a training stage that uses backward gradients to update model parameters, ICL does not conduct parameter updates and directly performs predictions on the pre-trained language models. The model is expected to learn the pattern hidden in the demonstration and accordingly make the right prediction.
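The prompt construction described above can be sketched in a few lines of Python. This is only a minimal illustration: the `Q:`/`A:` template, the function name `build_icl_prompt`, and the toy arithmetic demonstrations are assumptions made for this example, not something prescribed by the surveys.

```python
def build_icl_prompt(demonstrations, query, template="Q: {question}\nA: {answer}"):
    """Concatenate formatted demonstrations with the query to form one prompt."""
    blocks = [template.format(question=q, answer=a) for q, a in demonstrations]
    # The query reuses the template but leaves the answer empty for the model to fill in.
    blocks.append(template.format(question=query, answer="").rstrip())
    return "\n\n".join(blocks)


# Hypothetical demonstrations; in practice these come from the task at hand.
demos = [
    ("What is 2 + 3?", "5"),
    ("What is 7 - 4?", "3"),
]
prompt = build_icl_prompt(demos, "What is 6 + 1?")
print(prompt)
```

The resulting string is what gets fed to the language model. No parameters are updated anywhere in this process; only the text of the prompt changes from task to task.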

## What makes ICL attractive?

- Examples written in natural language provide an interpretable interface to communicate with LLMs. This paradigm makes it much easier to incorporate human knowledge into LLMs by changing the examples and templates.
- It is similar to the decision process of human beings by learning from analogy.
- Compared with supervised training, ICL is a training-free learning framework. This not only greatly reduces the computation costs for adapting the model to new tasks, but also makes language-model-as-a-service possible and can be easily applied to large-scale real-world tasks.

## But how does this work?

After pre-training, LLMs can exhibit intriguing ICL capabilities (emergent capabilities) without being updated [3]. While intuitively reasonable, the working mechanism of ICL remains unclear, and only a few studies have provided preliminary explanations for the following two questions.

## How does pre-training affect the ICL ability?

Researchers suggested that a pre-trained model acquires some emergent ICL abilities when it achieves a large scale of pre-training steps or model parameters [3]. Some studies also showed that the ICL ability grows as the parameters of LLMs increase from 0.1 billion to 175 billion. Research suggests that the design of training tasks is an important factor influencing the ICL capability of LLMs. Besides training tasks, recent studies have also investigated the relationship between ICL and the pre-training corpora. It has been shown that the performance of ICL heavily depends on the source of pre-training corpora rather than the scale.

## How do LLMs perform ICL during inference?

In the paper “*Why Can GPT Learn In-Context?*” [4], researchers figured out a dual form between Transformer attention and gradient descent and further proposed to understand ICL as implicit fine-tuning. They compared GPT-based ICL and explicit fine-tuning on real tasks and found that ICL behaves similarly to fine-tuning from multiple perspectives. Under this framework, the ICL process can be explained as follows: by means of forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform gradient descent via the attention mechanism.

Another perspective from Stanford research [5] explains “*In-context learning as Implicit Bayesian Inference*”. The authors provide a framework where the LM does in-context learning by using the prompt to “locate” the relevant concept it has learned during pre-training to do the task. We can theoretically view this as Bayesian inference of a latent concept conditioned on the prompt, and this capability comes from structure (long-term coherence) in the pre-training data.

Even though there are some answers, this research is still evolving to understand the mechanism and underlying reasons better.

Now let us explore some popular ICL methods.

- Chain of thought (COT)
- Self-consistency COT
- Tree of Thoughts

## Chain of thought (COT)

It is observed that standard prompting techniques (also known as general input-output prompting) do not perform well on complex reasoning tasks, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. CoT is an improved prompting strategy to boost the performance of LLMs on such non-trivial cases involving reasoning [6]. Instead of simply constructing the prompts with input-output pairs as in ICL, CoT incorporates intermediate reasoning steps that can lead to the final output into the prompts, as can be seen from the example below.

The figure above shows an example of a model producing a chain of thought to solve a math word problem that it would have otherwise gotten incorrect. On the left side, in ICL, the model is provided with examples or demonstrations of mathematical reasoning questions and a direct answer. But the model is not able to predict the correct answer.

On the right side, in CoT, each demonstration includes the intermediate reasoning steps that lead to its answer. When the model is then asked a similar reasoning question, it predicts the answer correctly, which illustrates the efficacy of the CoT approach for such use cases.

As seen above, CoT, like ICL in general, provides some examples to demonstrate the task; this is called **Few-shot (few examples)**. Another paper [7] introduced an interesting prompt, *“Let’s think step by step”*, that works without any examples demonstrating the task; this is called **Zero-shot (no examples)**.

In **Zero-shot CoT**, the LLM is first prompted by *“Let’s think step by step”* to generate reasoning steps and then prompted by *“Therefore, the answer is”* to derive the final answer. The authors find that such a strategy drastically boosts the performance when the model scale exceeds a certain size, but is not effective with small-scale models, showing a significant pattern of emergent abilities.

Above: Example inputs and outputs of GPT-3 with (a) standard Few-shot (ICL), (b) Few-shot-CoT, (c) standard Zero-shot (ICL), and (d) Zero-shot-CoT.

Similar to Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (blue text) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT using step-by-step reasoning examples per task, Zero-Shot does not need any examples and just uses the same prompt “Let’s think step by step” across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).

This research shows LLMs are decent zero-shot reasoners by adding a simple prompt, *Let’s think step by step*, to facilitate step-by-step thinking before answering each question.

## Let us see what happens underneath:

While Zero-shot-CoT is conceptually simple, it uses prompting twice to extract both reasoning and answer, as explained in the figure below.

The process involves two steps: first, “reasoning prompt extraction” extracts a full reasoning path from the language model, and then a second “answer prompt extraction” extracts the answer in the correct format from the reasoning text.

**1st prompt — reasoning extraction**

In this step, first modify the input question x into a prompt x’ using a simple template **“Q: [X]. A: [T]”**, where [X] is an input slot for x and [T] is a slot for a hand-crafted trigger sentence t that elicits a chain of thought to answer the question x. For example, if we use *“Let’s think step by step”* as the trigger sentence, the prompt x’ would be **“Q: [X]. A: Let’s think step by step.”** The prompted text x’ is then fed into a language model, which generates the subsequent sentence z. Any decoding strategy can be used.

Some other examples of such trigger sentences:

- Let’s think about this logically.
- Let’s solve this problem by splitting it into steps.
- Let’s think like a detective step by step.
- Before we dive into the answer.

**2nd prompt — answer extraction**

In the second step, the generated sentence z along with the prompted sentence x’ is used to extract the final answer from the language model. To be concrete, simply concatenate three elements as **“[X’] [Z] [A]”**: [X’] for the 1st prompt x’, [Z] for the sentence z generated at the first step, and [A] for a trigger sentence to extract the answer. The prompt for this step is self-augmented, since it contains the sentence z generated by the same language model. In the experiments, the authors use slightly different answer triggers depending on the answer format.

For example, the use of *“Therefore, among A through E, the answer is”* for **multi-choice QA**, and *“Therefore, the answer (Arabic numerals) is”* for math problems requiring a **numerical answer**.
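The two prompting stages can be sketched as follows. This is a hedged illustration rather than the authors' code: `zero_shot_cot` and `fake_lm` are hypothetical names, `fake_lm` returns canned text in place of a real model call, and the newline between the question and `A:` is a formatting assumption.

```python
from typing import Callable, Tuple

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (Arabic numerals) is"


def zero_shot_cot(question: str, lm: Callable[[str], str]) -> Tuple[str, str]:
    """Run both prompting stages and return (reasoning z, final answer)."""
    # Stage 1: build x' from the template "Q: [X]. A: [T]" and generate z.
    x_prime = f"Q: {question}\nA: {REASONING_TRIGGER}"
    z = lm(x_prime)
    # Stage 2: concatenate "[X'] [Z] [A]" and generate the final answer.
    answer_prompt = f"{x_prime} {z} {ANSWER_TRIGGER}"
    answer = lm(answer_prompt).strip()
    return z, answer


def fake_lm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 8."
    return "There are 3 apples and 5 more are added, so 3 + 5 = 8."


z, answer = zero_shot_cot("I have 3 apples and buy 5 more. How many apples do I have?", fake_lm)
print(z)
print(answer)
```

Swapping `ANSWER_TRIGGER` for the multiple-choice variant changes only the second stage; the reasoning stage stays the same across tasks.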

The paper [7] has other interesting ideas, such as a comparison of the performance of various prompts; please read it for more details.

**When does CoT work for LLMs?**

CoT only has a positive effect on sufficiently large models (e.g., typically containing 10B or more parameters), but not on small models. This phenomenon is referred to as the ‘*emergent abilities*’ of large language models. An ability is considered to be emergent if it is not present in smaller models but is present in larger models [3].

- It is mainly effective to improve the tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning.
- For other tasks that do not rely on complex reasoning, it might show worse performance than standard prompting. Interestingly, it seems that the performance gain brought by CoT prompting could be significant only when standard prompting yields poor results.

**Why Can LLMs Perform CoT Reasoning?**

- It is widely *hypothesized* that it can be attributed to training on code, since models trained on it show a strong reasoning ability. Intuitively, code data is well organized with algorithmic logic and programming flow, which may be useful to improve the reasoning performance of LLMs. **However, this hypothesis still lacks publicly reported evidence of ablation experiments (with and without training on code).**
- The major distinction between CoT prompting and standard prompting is the *incorporation of reasoning paths prior to the final answer*. Thus, some researchers investigate the effect of different components in the reasoning paths. Specifically, a recent study identifies three key components in CoT prompting, namely symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of tokens that are not symbols or patterns). It is shown that the latter two parts (i.e., patterns and text) are essential to the model performance, and removing either one would lead to a significant performance drop.

In summary, this is an active area of research. For an in-depth discussion, please read [2]. Another interesting study [8] discusses possible reasons for in-context learning in transformer models.

## Self-consistency COT

Instead of the greedy decoding used in chain-of-thought prompting, the authors in [9] propose another decoding strategy called self-consistency, which further improves language models’ reasoning performance by a significant margin. Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct answer. The more deliberate thinking and analysis a problem requires, the greater the diversity of reasoning paths that can recover the answer.

First, prompt the language model with chain-of-thought prompting. Then, instead of greedily decoding the optimal reasoning path, the authors propose a “sample-and-marginalize” decoding procedure.

The figure below illustrates the self-consistency method with an example.

First, sample from the language model’s decoder to generate a diverse set of reasoning paths. Each reasoning path might lead to a different final answer, so determine the optimal answer by marginalizing out the sampled reasoning paths to find the most consistent answer in the final answer set. In other words, by taking a majority vote over the answers sampled from the model’s decoder, we arrive at the most “consistent” answer among the final answer set.

Such an approach is analogous to the human experience that if multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct. Compared to other decoding methods, self-consistency avoids the repetitiveness and local optimality that plague greedy decoding, while mitigating the stochasticity of a single sampled generation.

Extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).

One **limitation** of self-consistency is that it incurs more computation cost. In practice, people can try a small number of paths (e.g., 5 or 10) as a starting point to realize most of the gains while not incurring too much cost, as in most cases the performance saturates quickly.
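The majority-vote step of self-consistency can be sketched as below. The sampled reasoning paths are hard-coded here (an assumption standing in for several sampled model generations), and taking the last number in each path as its final answer is a simplifying convention for this example only.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(reasoning_path: str) -> Optional[str]:
    """Take the last number in a reasoning path as its final answer (assumed convention)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning_path)
    return numbers[-1] if numbers else None


def self_consistent_answer(reasoning_paths: List[str]) -> Optional[str]:
    """Majority vote over the final answers of the sampled reasoning paths."""
    answers = [a for a in map(extract_answer, reasoning_paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical samples: two paths agree on 18, one path goes wrong and ends at 24.
sampled_paths = [
    "She has 16 - 3 - 4 = 9 eggs left, so she makes 9 * 2 = 18.",
    "16 - 3 = 13, then 13 - 4 = 9, and 9 * 2 = 18 dollars.",
    "She sells 16 - 4 = 12 eggs, earning 12 * 2 = 24.",
]
print(self_consistent_answer(sampled_paths))
```

The cost of the method is visible in the input: every extra path in `sampled_paths` is one more full generation from the model.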

## Tree of thoughts

The authors in [10] propose “*Tree of Thoughts*” (ToT), which generalizes over the “*Chain of Thoughts*” approach to prompting language models and enables exploration over coherent units of text (“thoughts”) that serve as intermediate steps toward problem-solving. ToT allows LMs to perform deliberate decision-making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. The experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.

Tree of Thoughts (ToT) allows LMs to explore multiple reasoning paths over thoughts (figure above). ToT frames any problem as a search over a tree, where each node is a state s = [x, z1, …, zi] representing a partial solution with the input x and the sequence of thoughts so far z1, …, zi. ToT involves four components: **thought decomposition, thought generator, state evaluator, and search algorithm**.

1. **Thought decomposition:** Decompose the intermediate process into thought steps:

While CoT samples thoughts coherently without explicit decomposition, ToT leverages problem properties to design and decompose intermediate thought steps. As *Table 1* shows, depending on the problem, a thought could be a couple of words (Crosswords), a line of equation (Game of 24), or a whole paragraph of writing plan (Creative Writing). It is like dividing the question into several tasks, where each task is a step Zn. Note that this part is only about decomposing the question into tasks. It is like planning: no thoughts are actually generated in this part.

2. **Thought generation:** After defining the task for each step in thought decomposition, we now actually generate the thoughts. We try to generate k candidate thoughts for a given step Zn. There are two ways of generating thoughts: sample and propose.

a. Sample i.i.d. thoughts from a CoT prompt. We repeat the generation process k times independently. This works better when the thought space is rich (e.g. each thought is a paragraph), and i.i.d. samples lead to diversity.

The figure above shows a step of deliberate search in a randomly picked **Creative Writing task**. Given the input, the LM samples 5 different plans, then votes 5 times to decide which plan is best. The majority choice is then used to write the output passage with the same sample-vote procedure.

b. Propose thoughts sequentially using a “propose prompt”. This works better when the thought space is more constrained (e.g. each thought is just a word or a line), so proposing different thoughts in the same context avoids duplication. Here, all k thoughts are generated in a single inference call, so they may not be independent.

3. **Evaluate states:** In this part, we define a state evaluation function: v(s). To expand the tree, we use this function to find the good path, like in chess programming. We evaluate the given path of the tree *s=[x, z1…i]*. There are two ways to define the evaluation function:

- Value each state independently: each state ‘s’ (or path) is evaluated independently. [*Example: Game of 24*]
- Vote across states: each state ‘s’ is evaluated given the set of all states S, by comparing the states in S to each other, as in self-consistency CoT. [*Example: creative writing task*]

**Example Game of 24:**

Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (+, -, *, /) to obtain 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”.

To frame ‘*Game of 24*’ in ToT, we decompose the thoughts into 3 steps, each an intermediate equation. As shown in the figure above (a), at each tree node, we extract the remaining (“left”) numbers and prompt the LM to propose some possible next steps. The same “propose prompt” is used for all 3 thought steps, though it only has one example with 4 input numbers. We perform a breadth-first search (BFS) in ToT, where at each step we keep the best b = 5 candidates. To perform deliberate BFS in ToT, as shown in figure (b), we prompt the LM to evaluate each thought candidate as “sure/maybe/impossible” with regard to reaching 24. The aim is to promote correct partial solutions that can be verified within a few look-ahead trials, eliminate impossible partial solutions based on “too big/small” commonsense, and keep the rest as “maybe”. We sample values 3 times for each thought.

4. **Search algorithm:** Finally, we expand the tree. Each leaf node is scored with the state evaluation function, and a search algorithm decides which leaf nodes to expand next. It could be a breadth-first search (BFS) or a depth-first search (DFS). One can plug and play different search algorithms depending on the tree structure.
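To make the four components concrete, here is a minimal sketch of breadth-first ToT on the Game of 24 input “4 9 10 13”. It is not the paper's implementation. In ToT both `propose` and `value` are LM prompts; here, as a loudly stated assumption, `propose` is replaced by exhaustive enumeration of arithmetic steps and `value` by an exact reachability check, so that the example runs without a model. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# A state is (remaining numbers, equations so far), i.e. s = [x, z1, ..., zi].


def propose(state):
    """Stand-in for the LM 'propose prompt': combine any two remaining numbers."""
    numbers, history = state
    children = []
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = tuple(n for k, n in enumerate(numbers) if k not in (i, j))
        results = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                   (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            results.append((a / b, f"{a} / {b}"))
        if a != 0:
            results.append((b / a, f"{b} / {a}"))
        for result, expr in results:
            children.append((rest + (result,), history + (f"{expr} = {result}",)))
    return children


def reachable(numbers):
    """Exact check whether 24 can still be reached from the remaining numbers."""
    if len(numbers) == 1:
        return numbers[0] == 24
    return any(reachable(child[0]) for child in propose((numbers, ())))


def value(state):
    """Stand-in for the LM state evaluator: 1.0 plays 'sure', 0.0 plays 'impossible'."""
    return 1.0 if reachable(state[0]) else 0.0


def tot_bfs(root, propose, value, steps, breadth):
    """Expand every kept state, score all candidates, keep the best `breadth`."""
    frontier = [root]
    for _ in range(steps):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=value, reverse=True)
        frontier = candidates[:breadth]
    return frontier


root = (tuple(Fraction(n) for n in (4, 9, 10, 13)), ())
best = tot_bfs(root, propose, value, steps=3, breadth=5)[0]
print(best[1])  # the three intermediate equations of one solution
```

Because `propose`, `value`, and the search loop are separate functions, each can be swapped independently, for example replacing `tot_bfs` with a depth-first variant or `value` with a voting-based evaluator.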

Conceptually, ToT has several benefits as a method for general problem-solving with LMs:

- **Generality**: IO, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e. trees of limited depth and breadth).
- **Modularity**: The base LM, as well as the thought decomposition, generation, evaluation, and search procedures, can all be varied independently.
- **Adaptability**: Different problem properties, LM capabilities, and resource constraints can be accommodated.
- **Convenience**: No extra training is needed, just a pre-trained LM is sufficient.

The ToT framework empowers LMs to make decisions and solve problems more autonomously and intelligently.

**Limitations**. ToT requires more resources (e.g. model API cost) than sampling methods in order to improve task performances, but the modular flexibility of ToT allows users to customize such performance-cost tradeoffs, and ongoing open-source efforts should readily reduce such costs in the near future.

Prompt engineering is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics. *Can we automate this process of prompt engineering?* This is an active research area and the following section discusses some attempts towards automatic prompt design approaches.

## Automatic Prompt Augmentation and Selection COT

The paper titled “*Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data*” [11] addresses this question. Most CoT studies rely on carefully designed human-annotated rational chains to prompt the language model, which poses challenges for real-world applications where labeled training data is available without human-annotated rational chains. To construct chain-of-thought prompts automatically, the authors suggest augment-prune-select, a three-step process:

- **Augment**: Generate multiple pseudo-chains of thought given a question using few-shot or zero-shot CoT prompts.
- **Prune**: Prune pseudo-chains based on whether generated answers match ground truths.
- **Select**: Apply a variance-reduced policy gradient strategy to learn the probability distribution over selected examples, while considering the probability distribution over examples as policy and the validation set accuracy as reward.
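Of the three steps, pruning is the simplest to illustrate. The sketch below assumes each pseudo-chain is a dictionary with hypothetical `question_id`, `rationale`, and `answer` fields, and it uses exact string matching after stripping whitespace, which is a simplification chosen for this example.

```python
from typing import Dict, List


def prune(pseudo_chains: List[Dict[str, str]], ground_truth: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only the pseudo-chains whose final answer matches the labeled answer."""
    return [
        chain for chain in pseudo_chains
        if chain["answer"].strip() == ground_truth[chain["question_id"]].strip()
    ]


# Hypothetical labeled answers and generated pseudo-chains.
ground_truth = {"q1": "8", "q2": "12"}
pseudo_chains = [
    {"question_id": "q1", "rationale": "3 + 5 = 8.", "answer": "8"},
    {"question_id": "q1", "rationale": "3 * 5 = 15.", "answer": "15"},
    {"question_id": "q2", "rationale": "4 * 3 = 12.", "answer": "12 "},
]
kept = prune(pseudo_chains, ground_truth)
print([chain["rationale"] for chain in kept])
```

The surviving chains form the candidate pool from which the Select step then learns which examples to put into the prompt.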

## Auto-CoT: Automatic Chain-of-Thought Prompting

In “*Automatic Chain-of-Thought Prompting in Large Language Models*” [12], the authors propose the Auto-CoT paradigm to automatically construct demonstrations with questions and reasoning chains. In this technique, the authors use clustering to sample questions and then generate reasoning chains for them. They observed that LLMs tend to make certain types of mistakes. Errors of one type can be similar in the embedding space and thus get grouped together. By only sampling one or a few from frequent-error clusters, we can prevent too many wrong demonstrations of one error type and collect a diverse set of examples.

**Auto-CoT** consists of the following main stages:

- **Question clustering**: Perform cluster analysis for a given set of questions Q. First compute a vector representation for each question in Q by Sentence-BERT. The contextualized vectors are averaged to form a fixed-size question representation. Then, the question representations are processed by the k-means clustering algorithm to produce k clusters of questions.
- **Demonstration selection**: Select a set of representative questions from each cluster, i.e. one demonstration from one cluster. Samples in each cluster are sorted by distance to the cluster centroid and those closer to the centroid are selected first.
- **Rationale generation**: Use zero-shot CoT to generate reasoning chains for the selected questions and construct a few-shot prompt to run inference.
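The demonstration selection stage can be sketched as follows. To keep the example self-contained, it assumes the embeddings and the k-means cluster labels are already computed; the 2-D vectors and labels below are made-up toy values rather than Sentence-BERT output, and `select_demonstrations` is a hypothetical helper name.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def select_demonstrations(
    questions: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> Dict[int, str]:
    """For each cluster, return the question whose embedding is closest to the centroid."""
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    selected = {}
    for label, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        # Centroid = per-dimension mean of the cluster's embeddings.
        centroid = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        closest = min(idxs, key=lambda i: math.dist(embeddings[i], centroid))
        selected[label] = questions[closest]
    return selected


# Toy inputs (assumed): two clusters of three questions each.
questions = [
    "What is 3 + 4?", "What is 12 + 30?", "What is 250 + 75?",
    "How many minutes are in 2 hours?", "How many days are in 3 weeks?",
    "How many hours are in 2 days?",
]
embeddings = [
    (0.0, 0.0), (1.0, 0.0), (5.0, 0.0),
    (10.0, 10.0), (10.0, 12.0), (10.0, 13.0),
]
labels = [0, 0, 0, 1, 1, 1]
selected = select_demonstrations(questions, embeddings, labels)
print(selected)
```

Each selected question would then be passed through zero-shot CoT to obtain its reasoning chain, and the resulting question-rationale pairs make up the few-shot prompt.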

LLMs have shown reasoning capabilities with CoT prompting. The superior performance of Manual-CoT hinges on the hand-crafting of demonstrations. To eliminate such manual designs, the proposed Auto-CoT automatically constructs demonstrations. It samples questions with diversity and generates reasoning chains to construct demonstrations. Experimental results on reasoning datasets showed that with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations.

In-context learning, or prompting, helps us communicate with an LLM to steer its behavior toward desired outcomes. It is an attractive approach to extracting information because you don’t need a large offline training set, you don’t need offline access to a model, and it feels intuitive even for non-engineers. Prompt engineering aims to utilize prompting as a way to build reliable functionality for real-world applications. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics. Prompting requires significant human effort to create prompts and adapt them to new datasets. The annotation process is nontrivial because humans need to not only select the questions but also carefully design the reasoning steps for each question, so there is a need for automation of the prompting techniques.

[1] A Survey of Large Language Models, https://arxiv.org/pdf/2303.18223.pdf

[2] A Survey on In-Context Learning, https://arxiv.org/pdf/2301.00234.pdf

[3] Emergent Abilities of Large Language Models, https://arxiv.org/pdf/2206.07682.pdf

[4] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers, https://arxiv.org/pdf/2212.10559.pdf

[5] An Explanation of In-context Learning as Implicit Bayesian Inference, http://ai.stanford.edu/blog/understanding-incontext/

[6] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/pdf/2201.11903.pdf

[7] Large Language Models are Zero-shot Reasoners, https://arxiv.org/pdf/2205.11916.pdf

[8] In-context Learning and Induction Heads, Transformer Circuits, 2022, https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

[9] Self-consistency improves chain-of-thought reasoning in LLM, https://arxiv.org/pdf/2203.11171.pdf

[10] Tree of Thoughts, https://arxiv.org/pdf/2305.10601.pdf

[11] Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, https://arxiv.org/pdf/2302.12822.pdf

[12] Automatic Chain-of-Thought Prompting in Large Language Models, https://arxiv.org/pdf/2210.03493.pdf

[13] Large Language Models Can Self-Improve, https://www.arxiv-vanity.com/papers/2210.11610/