Peter Chng

Token selection strategies: Top-K, Top-p, and Temperature

Many Large Language Models (LLMs) have inference-time parameters that control how “random” the output is. Typically these parameters are Top-K, Top-p, and Temperature. Let’s look at how each of these influences the output of an LLM.

LLMs output probability distributions

An LLM typically operates on a sequence of tokens, which could be words, letters, or sub-word units. (As an example, the OpenAI GPT models tokenize on sub-word units, where 100 tokens corresponds to roughly 75 words on average.) The set of possible tokens is called the vocabulary of the LLM.

The LLM takes in an input sequence of tokens and then tries to predict the next token. It does this by generating a discrete probability distribution over all possible tokens, using the softmax function as the last layer of the network. This is the raw output of the LLM.

For example, if we had a vocabulary of size 5 (most LLMs obviously have far larger vocabularies), the output might look like this: $$ t_0 \rightarrow 0.4 \\ t_1 \rightarrow 0.2 \\ t_2 \rightarrow 0.2 \\ t_3 \rightarrow 0.15 \\ t_4 \rightarrow 0.05 \\ $$

Since this is a probability distribution, all the values will sum to $1$. Once we have this probability distribution, we can decide how to sample from it, and that’s where Top-K and Top-p come in.
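To make this concrete, here is a minimal Python sketch of sampling a next token from the example distribution above (the token names and probabilities are just the illustrative values from this post):

```python
import random

# Example vocabulary and next-token probabilities from the text above.
vocab = ["t0", "t1", "t2", "t3", "t4"]
probs = [0.4, 0.2, 0.2, 0.15, 0.05]

# Draw one token according to the distribution.
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(next_token)
```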

Top-K sampling

Top-K sampling works like this:

  1. Order the tokens in descending order of probability.
  2. Select the first $K$ tokens to create a new distribution.
  3. Sample from those tokens.

For example, let’s say we sampled using a Top-3 strategy from our above example. The top 3 tokens are:

$$ t_0 \rightarrow 0.4 \\ t_1 \rightarrow 0.2 \\ t_2 \rightarrow 0.2 \\ $$

However, these probabilities no longer add up to $1$, so we have to renormalize them by the sum of the probabilities of the top 3 tokens. This means we divide each probability by $0.4+0.2+0.2 = 0.8$. This gives us a new probability distribution over just the top 3 tokens:

$$ t_0 \rightarrow 0.5 \\ t_1 \rightarrow 0.25 \\ t_2 \rightarrow 0.25 \\ $$

We can now select a token by sampling from this new distribution. (This is just sampling from a multinomial distribution)

If you set $K=1$, then you get what’s called a greedy strategy, where the most-likely token is always picked.
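Putting the steps above together, here is a minimal sketch of Top-K sampling, assuming the model’s output is given as a mapping from token to probability (the function name top_k_sample is just illustrative):

```python
import random

def top_k_sample(probs: dict[str, float], k: int) -> str:
    """Sample a token from the k most probable tokens, after renormalization."""
    # 1. Order the tokens in descending order of probability and keep the top k.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    # 2. random.choices normalizes the weights by their sum, which is exactly
    #    the renormalization step described above.
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"t0": 0.4, "t1": 0.2, "t2": 0.2, "t3": 0.15, "t4": 0.05}
print(top_k_sample(probs, k=3))  # samples from {t0, t1, t2} with weights 0.5, 0.25, 0.25
```

With $K=1$ this sketch reduces to the greedy strategy: the single most probable token always wins.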

Top-p sampling

This strategy (also called nucleus sampling) is similar to Top-K, but instead of picking a fixed number of tokens, we select just enough tokens to “cover” a certain amount of probability mass, defined by the parameter $p$, in the following manner:

  1. Order the tokens in descending order of probability.
  2. Select the smallest number of top tokens such that their cumulative probability is at least p.
  3. Sample from those tokens.

For example, let’s say we sampled with a Top-p strategy using $p = 0.5$, again from our example above. The process would be:

  1. The top token, $t_0$ is selected. It has a probability of $0.4$, and our cumulative probability is also $0.4$.
  2. The cumulative probability is less than $p = 0.5$, so we select the next token.
  3. The next token, $t_1$ has a probability of $0.2$, and now our cumulative probability is $0.6$.
  4. The cumulative probability is at least the value of $p = 0.5$, so we stop.

This results in only the top 2 tokens being selected:

$$ t_0 \rightarrow 0.4 \\ t_1 \rightarrow 0.2 \\ $$

Again, we have to normalize the probability by dividing by the sum $0.4 + 0.2 = 0.6$, giving us:

$$ t_0 \rightarrow 0.67 \\ t_1 \rightarrow 0.33 \\ $$

We can now sample from this distribution, as we did previously with Top-K.
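The same idea in code: a minimal sketch of Top-p sampling under the same assumptions as the Top-K sketch above (illustrative function name, probabilities passed in as a mapping):

```python
import random

def top_p_sample(probs: dict[str, float], p: float) -> str:
    """Sample from the smallest set of top tokens whose cumulative probability is >= p."""
    # 1. Order the tokens in descending order of probability.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # 2. Accumulate tokens until their cumulative probability reaches p.
    selected, cumulative = [], 0.0
    for token, prob in ranked:
        selected.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # 3. Sample from the selected tokens; random.choices renormalizes the weights.
    tokens, weights = zip(*selected)
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"t0": 0.4, "t1": 0.2, "t2": 0.2, "t3": 0.15, "t4": 0.05}
print(top_p_sample(probs, p=0.5))  # samples from {t0, t1} with weights ~0.67, ~0.33
```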

Temperature

Temperature affects how “random” the model’s output is, and it works differently from the previous two parameters. While Top-K and Top-p operate directly on the output probabilities, temperature affects the softmax function itself, so it’s worth a short review of how that function works.

The softmax function takes in a vector of $n$ real numbers, and then normalizes it into a discrete probability distribution across those $n$ elements. The probabilities will sum to $1$.

The standard/unit softmax function is defined as follows:

$$ \sigma(\vec{x})_{i} = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

This function is applied to each element of the input vector $\vec{x}$ to produce the corresponding element of the output vector. Namely:

  1. The exponential function is applied to the element $x_i$.
  2. The resultant value is then normalized by the sum of exponentials across all elements $x_j$.
    • This ensures that the resultant values sum to $1$, making the output vector a probability distribution.

This is how we obtained the token probability distributions that Top-K/Top-p operated on. The layer before the softmax produces a vector $\vec{x}$ with one element per token in the vocabulary, but these elements are raw, unnormalized scores (logits) that cannot be interpreted as a probability distribution. The softmax, as the last layer, converts them into a probability distribution over all possible tokens.
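As a quick illustration, here is a minimal softmax sketch in plain Python; subtracting the maximum input before exponentiating is a standard numerical-stability trick and not something required by the math above:

```python
import math

def softmax(x: list[float]) -> list[float]:
    """Standard softmax: exponentiate each element, then normalize by the sum."""
    # Subtracting the max is a numerical-stability trick; it doesn't change the
    # result because softmax is invariant to shifting all inputs by a constant.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.5]))  # the outputs sum to 1
```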

Besides converting the output into a probability distribution, softmax also alters the relative differences between the elements. The effect of the softmax function depends on the magnitude of the input elements, $x_i$:

  • If the input elements being compared are both small ($x_i \lt 1$), then the relative differences between them tend to be reduced.
  • If at least one of the elements being compared is $\gt 1$, then the differences between them tend to be amplified. This can make the model more “certain” about its predictions.

Let’s look at the input and output values of this standard softmax function to see how relative differences are altered. When input values are less than $1$, the relative differences are reduced in the output values:

[Figure: Softmax transformation when inputs are less than 1]

By contrast, when some of the input values are greater than $1$, the differences between them are amplified in the output values:

[Figure: Softmax transformation when some inputs are greater than 1]
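To make the effect shown in the figures concrete, here is a small numeric check (the input pairs $0.5/0.25$ and $4/2$ are arbitrary illustrative values, both with the same $2\times$ ratio):

```python
import math

def softmax(x: list[float]) -> list[float]:
    # Same standard softmax as in the sketch above (stability trick omitted for brevity).
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Both inputs below 1: the 2x input ratio shrinks to about 1.3x in the output.
print(softmax([0.5, 0.25]))  # ~[0.562, 0.438]

# Same 2x input ratio, but larger inputs: the difference is amplified to about 7.4x.
print(softmax([4.0, 2.0]))   # ~[0.881, 0.119]
```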

This reduction or amplification in the output values affects how “certain” the model’s predictions are. How can we control this “certainty” in the probability distribution output by softmax? That’s where the Temperature parameter comes in. Consider a “scaled” softmax function of the form:

$$ \sigma(\vec{x})_{i} = \frac{e^{x_i / T}}{\sum_{j=1}^{n} e^{x_j / T}} $$

The only difference is the inverse scaling factor $1/T$ applied inside the exponential function, where $T$ is defined as the Temperature. Let’s consider the impact of $T$ on the output:

  • If $0 \lt T \lt 1$, then the scaled values $x_i/T$ get pushed further away from $0$, and differences are amplified.
  • If $T \gt 1$, then the scaled values $x_i/T$ get pushed toward $0$, and differences are reduced.

Let’s again plot the output of the softmax function, but this time we’ll compare different values of $T$:

[Figure: How temperature affects softmax]

Basically, with smaller values of the temperature $T$, the differences between the input values are amplified in the output; with larger values of $T$, the differences are reduced.

You can also consider what happens in the extremes of $T$ to get a better intuitive sense of how temperature affects the output:

  • If $T \rightarrow 0$ then we will be dealing with extremely large exponentials, and the $x_i$ element with the largest value will dominate, i.e. its probability will be close to $1$ and all others will be close to $0$. (This would be equivalent to a greedy strategy where the top token is always selected)
  • If $T \rightarrow \infty$ then the exponentials all become $e^0 = 1$. This turns the output into a uniform distribution, i.e. all probabilities become $1\over{n}$. That is, all tokens are equally probable. (This obviously isn’t a useful model anymore)
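To see this numerically, here is a small sketch that applies the temperature-scaled softmax at a low, a neutral, and a high temperature (the logit values are arbitrary and chosen only for illustration):

```python
import math

def softmax(x: list[float]) -> list[float]:
    # Same standard softmax as above.
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_temperature(x: list[float], t: float) -> list[float]:
    """Divide each input by T before applying the standard softmax."""
    return softmax([v / t for v in x])

logits = [2.0, 1.0, 0.5]  # arbitrary illustrative values
for t in (0.1, 1.0, 10.0):
    probs = [round(p, 3) for p in softmax_with_temperature(logits, t)]
    print(f"T={t}: {probs}")
# T=0.1 concentrates almost all probability on the largest logit (near-greedy);
# T=10 flattens the distribution toward uniform.
```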

Essentially, the temperature changes the shape of the probability distribution. As temperature increases, differences in probability are reduced, resulting in more “random” output from the model. This manifests in an LLM as more “creative” output. Conversely, a lower temperature makes the output more deterministic.

As a side note, the parameter is probably called “temperature” in relation to the concept from thermodynamics: at higher temperatures, concentrations of gas or fluid diffuse (spread out) faster than at lower temperatures. (See also the concept of temperature in simulated annealing)

Summary

Top-K, Top-p, and Temperature are all inference-time parameters that affect how tokens are generated by an LLM. They all influence the probability distribution from which the next token is sampled.

  • Top-K and Top-p are just sampling strategies. They aren’t specific to LLMs, or even to neural networks; they are simply ways to sample from a discrete probability distribution.
  • Top-K limits us to a certain number ($K$) of the top tokens to consider.
  • Top-p limits us to the top tokens within a certain probability mass ($p$).

By contrast, temperature works differently:

  • Temperature is not a sampling strategy, but instead is a parameter of the softmax function, which is the last layer in the network.
  • Temperature affects the shape of the probability distribution.
  • High temperatures make token probabilities closer to each other, meaning less-likely tokens could show up. This makes the output more “creative” or random.
  • Low temperatures make the model more “certain” by amplifying probability differences. This makes the output more deterministic.