Peter Chng

Why do LLM input tokens cost less than output tokens?

If you look at the pricing of the current SOTA models from both OpenAI and Anthropic, you’ll notice something interesting: Input tokens cost considerably less than output tokens. Why is that? Let’s look at the compute cost.

According to various LLM scaling laws (which are nicely summarized here), the floating-point operations (FLOPs) required for a single forward pass of a LLM can be approximated by:

$$ C_{forward} \approx 2N $$

Where $N$ is the number of parameters in the model.¹ So for inference of $D$ tokens, the compute cost in FLOPs is simply:

$$ C_{forward} \approx 2DN $$
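As a rough sanity check on the magnitude, here's a minimal sketch of this formula in Python (the 70B parameter count and 1,000-token count are just illustrative assumptions):

```python
# C_forward ≈ 2 * D * N: weights-only FLOPs estimate for a forward pass over
# D tokens. Ignores context-length-dependent attention terms (and see footnote 1).
N = 70e9   # hypothetical model size: 70B parameters
D = 1_000  # number of tokens processed

flops = 2 * D * N
print(f"{flops:.2e} FLOPs")  # 1.40e+14, i.e. ~140 TFLOPs for this pass
```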

The number of compute ops (FLOPs) is therefore the same whether we are processing $D$ input tokens or generating $D$ output tokens, assuming a KV cache is used to prevent repeated computation during generation. So why the difference in dollar cost? We’ve looked at compute, but we haven’t looked at the memory cost:

  • During generation of output tokens, there must be one forward pass for each token generated. So for generating $D$ tokens, we’d need $D$ forward passes.
  • Contrast this with processing $D$ input tokens: This can be done in a single forward pass.

Although the cumulative compute ops would be similar in both cases, doing the work across $D$ forward passes is much more memory-intensive, especially for large models like LLMs. Essentially, for each forward pass, the entire set of model weights must be loaded from GPU global memory to the streaming multiprocessors (SMs) where the computation is actually done.

So, although the compute cost of generating $D$ output tokens is the same (or similar) as processing $D$ input tokens, the memory cost (in terms of reading from global memory to on-chip/SM memory like registers) is far higher. How much does this impact things?

As it stands, the hardware specifications of current GPUs tend to make LLM inference memory-bound. Take, for example, the specs of an H100:

  • Compute: 989.5 TFLOPS using FP16/BF16²
  • Memory Bandwidth: 3.35TB/s

If we divide the compute by the memory bandwidth, we get a ratio of FLOPs to each byte read from memory. This value turns out to be ~295.4 FLOPs/byte. However, since an FP16/BF16 value takes two bytes (16 bits), let’s multiply this by 2 to get the FLOPs per FP16/BF16 value read, which yields ~591 FLOPs/value.
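As a quick check of that arithmetic (a minimal sketch using the H100 figures quoted above):

```python
# H100 roofline ratio: dense BF16 compute divided by HBM memory bandwidth.
compute = 989.5e12   # FLOPS, FP16/BF16 dense (sparse figure divided by 2)
bandwidth = 3.35e12  # bytes/s (3.35 TB/s)

flops_per_byte = compute / bandwidth
flops_per_value = flops_per_byte * 2  # an FP16/BF16 value is 2 bytes

print(f"{flops_per_byte:.1f} FLOPs/byte")    # ~295.4
print(f"{flops_per_value:.0f} FLOPs/value")  # ~591
```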

This value represents the optimal ratio of compute to memory usage. That is, if you want to fully-utilize the GPU’s resources, for each FP16/BF16 value you read, you should do 591 FLOPs of compute. This leads to the following corollaries:

  1. If you do less than this for each value read, you will be memory-limited and won’t realize the full compute of the GPU: The SMs will be bottlenecked waiting on reads from (and writes to) memory.
  2. If you do more than this for each value read, then you will be underutilizing the GPU’s memory bandwidth, as the SMs will be spending more time on compute.

In practice for LLMs, usually (1) is the issue: The memory bandwidth can’t keep up with how much compute the SMs can do.

Going back to our question of input vs. output tokens cost:

  1. Processing $D$ input tokens can be done in one forward pass: This means the model’s weights only need to be loaded once, and a bunch of compute can be done at once. This is essentially a batch mode of operation.
  2. Generating $D$ output tokens: This takes $D$ forward passes, and during each forward pass, relatively little compute is done. This is bad since we are loading the model’s weights each time but not doing enough compute to “amortize” the heavy cost of doing so (see the sketch below).
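To make the asymmetry concrete, here is a rough sketch of the weight traffic in each case (the 70B-parameter BF16 model and 1,000-token count are illustrative assumptions; KV-cache and activation reads are ignored):

```python
# Bytes of model weights read from GPU global memory:
# prefill reads the weights once; decode reads them once per generated token.
N = 70e9             # hypothetical 70B-parameter model
D = 1_000            # number of tokens
bytes_per_param = 2  # BF16

prefill_bytes = N * bytes_per_param     # one forward pass over all D tokens
decode_bytes = D * N * bytes_per_param  # D forward passes, one per token

print(f"prefill: {prefill_bytes / 1e12:.2f} TB of weight reads")  # 0.14 TB
print(f"decode:  {decode_bytes / 1e12:.0f} TB of weight reads")   # 140 TB
```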

So, the extra cost you pay for output tokens reflects the additional memory cost it takes to generate them, and not necessarily the compute cost. In practice, LLM-serving companies will usually batch together multiple user requests (which is covered in this article) to increase GPU utilization, since batch size 1 inference is so inefficient in this way.

Understanding this, we see why it’s important to have a sufficient batch size during token generation: You don’t want to pay the cost to load the model weights for a forward pass to only generate a token for a single sequence; you should be generating them for multiple sequences, and the number of sequences that is optimal depends on the LLM and the optimal “compute:memory” ratio of the underlying GPU hardware!
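A rough way to estimate "sufficient" here, under the same simplifications as before (weight reads only, no KV-cache or activation traffic): during decode with batch size $B$, each weight loaded from memory is used for about 2 FLOPs per sequence, so the batch size needs to reach roughly the FLOPs-per-value ratio divided by 2.

```python
import math

# Decode-time roofline estimate: with batch size B, each BF16 weight read from
# HBM is reused for ~2*B FLOPs (a multiply and an add per sequence).
# The decode step stops being weight-read-bound once 2*B exceeds ~591.
flops_per_value_target = 591  # H100 dense BF16 ratio computed earlier
flops_per_value_per_seq = 2   # multiply + add per weight, per sequence

min_batch_size = math.ceil(flops_per_value_target / flops_per_value_per_seq)
print(min_batch_size)  # ~296 sequences, ignoring KV-cache and activation reads
```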

This also explains why OpenAI’s “Batch API” can charge 50% less to deliver tokens asynchronously:

  • During off-peak hours, online requests may not be able to form a batch of sufficient size to fully utilize compute.
  • This batch needs to be sent to the GPU quickly, since it’s for an online request.
  • If there is a queue of asynchronous requests (from the Batch API), it can be used to “top up” the batch before it’s sent to the GPU, allowing for more effective compute usage with (relatively) minimal impact on memory usage.

So the marginal cost of doing this is probably quite low, as they need not provision any extra capacity - they’re just using the extra compute that would have otherwise been wasted.

Aside: Estimating max tokens/s from memory bandwidth for an LLM

Since we know that LLMs are usually memory-bound, we can estimate an upper bound on the token generation speed based on the model size and GPU memory bandwidth. Say we have a 70B-parameter model quantized to INT8: This will take 70 GB of space. Let’s also assume we have an H100’s worth of memory bandwidth: 3.35 TB/s. Then:

  • To generate one token, we need one forward pass, so we have to read the entire model (70 GB) from global memory to on-chip/SM memory.
  • If the memory bandwidth is 3.35 TB/s, then: (3.35 TB/s) / (70 GB/token) => 47.9 tokens/second.

This represents an upper bound on token generation speed, since we haven’t accounted for things like reading the KV cache, attention computation, or other overheads.
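The same arithmetic as a small sketch (model size and bandwidth are the assumed figures above):

```python
# Upper bound on decode speed for a memory-bound model:
# tokens/s ≈ memory bandwidth / bytes read per forward pass (≈ model size).
model_bytes = 70e9   # 70B parameters at INT8: ~70 GB
bandwidth = 3.35e12  # H100 HBM bandwidth: 3.35 TB/s

max_tokens_per_s = bandwidth / model_bytes
print(f"{max_tokens_per_s:.1f} tokens/s")  # ~47.9
```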


  1. Some methods include embeddings and the last layer which maps to $n_{vocab}$, and some do not, but in a large enough model, this doesn’t make a huge difference since the parameter count of the attention and feedforward layers dominates. ↩︎

  2. The Nvidia specifications for an H100 say it can do 1,979 teraFLOPS for fp16/bf16, but has a notable caveat that this is with “sparsity”. This refers to Nvidia’s “structured sparsity” which means that in a matrix each group of “four contiguous values” must have at least two which are zero. This is their “2:4” sparsity pattern, resulting in 50% of the values being 0. To get the FLOPS for a “dense” matrix, you should divide this value by 2, which I have done here. ↩︎