From Boring to Brilliant: A Guide to LLM Sampling Techniques

When you ask an LLM a question, it doesn’t just “know” the answer. It predicts the next word, one by one. But how it chooses that next word makes the difference between a robotic, repetitive answer and a creative, human-sounding response.

This logic is defined by Sampling Techniques. Here is a breakdown of the most common methods, how they work, and when to use them.


1. The Deterministic Route: Searching for the “Best” Answer

These methods aim to find the most probable sequence of text, or at least a close approximation of it. There is no randomness here; the same input will always yield the same output.

Greedy Search

This is the simplest approach. At every step, the model looks at the probability distribution over the next token and picks the single most probable one (argmax).

  • Pros: It is fast, computationally cheap, and produces very coherent, on-topic text.
  • Cons: It kills creativity. It is prone to getting stuck in repetitive loops (e.g., “I am I am I am…”) and is “brittle”: one bad word choice early on can derail the whole sentence, because the choice is never revisited.
  • Best For: Strictly factual tasks like arithmetic, translation, or Q&A where you want the single “correct” answer.
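
As a minimal sketch of the argmax step described above: the distribution below is a made-up example (not output from any real model), and `greedy_pick` is a hypothetical helper name used only for illustration.

```python
def greedy_pick(probs: dict[str, float]) -> str:
    """Return the single most probable token (argmax over the distribution)."""
    return max(probs, key=probs.get)


# Hypothetical next-token probabilities, invented for illustration.
next_token_probs = {"mat": 0.40, "rug": 0.30, "floor": 0.15, "dog": 0.01}

choice = greedy_pick(next_token_probs)  # "mat"
```

Calling `greedy_pick` on the same distribution always returns the same token, which is exactly why greedy output is repeatable but never surprising.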

Beam Search

Think of this as a “smarter” Greedy Search. Instead of only keeping the single best word at each step, it keeps track of the $k$ most probable sequences (called “beams”) simultaneously and extends each of them at every step.

  • Pros: It offers high fluency and is better at maintaining overall sentence logic compared to Greedy.
  • Cons: It is computationally expensive (roughly $k$ times more work than Greedy) and can still result in generic, safe text.
  • Best For: Machine Translation and Summarization, where the overall quality of the full sequence is the top priority.
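
The beam idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production decoder: `TOY_MODEL` and `toy_next_probs` are hypothetical stand-ins for a real model call, and the probabilities are invented. Scores are summed log-probabilities, which is a common way to compare whole sequences.

```python
import math


def beam_search(next_probs, start, beam_width, steps):
    """Keep the `beam_width` highest-scoring sequences at every step.

    `next_probs(seq)` stands in for a model call: it takes a tuple of tokens
    and returns {token: probability} for the next token (all values > 0).
    """
    beams = [(0.0, tuple(start))]  # (cumulative log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + (token,)))
        # Prune: only the top `beam_width` candidates survive this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy "model": a lookup table of next-token probabilities.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.55, "ran": 0.45},
    ("the", "dog"): {"barked": 0.9, "sat": 0.1},
}


def toy_next_probs(seq):
    return TOY_MODEL[seq]


beams = beam_search(toy_next_probs, ["the"], beam_width=2, steps=2)
best_score, best_seq = beams[0]
# best_seq is ("the", "dog", "barked") with probability about 0.36
```

In this toy table, Greedy would commit to “cat” first (0.6) and end with “the cat sat” at about 0.33, while a beam of width 2 keeps “dog” alive long enough to find the higher-scoring “the dog barked”.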

2. The Stochastic Route: Adding Creativity

To make text feel human, we need randomness (stochasticity). These methods sample from the probability distribution rather than just maximizing it.

Top-K Sampling

This method truncates the “tail” of the probability distribution. It filters the options to the Top $k$ most probable tokens (e.g., the top 50), and then samples randomly from that specific group.

  • Pros: It balances creativity with coherence by filtering out “absurd” low-probability words.
  • Cons: It isn’t adaptive, because the cutoff $k$ is fixed. If only 3 words are genuinely good but $k$ is 50, a bad word can still be sampled. If 200 words are good, the other 150 are cut off unnecessarily.
  • Best For: Creative tasks like story writing, chatbots, and brainstorming.
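
A minimal sketch of the filter-then-sample step, using only the Python standard library. The helper names `top_k_filter` and `top_k_sample` and the example distribution are assumptions made for illustration; real decoders usually work on logits over the whole vocabulary rather than a small dictionary.

```python
import random


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


def top_k_sample(probs: dict[str, float], k: int, rng=random) -> str:
    """Sample one token from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities, invented for illustration.
dist = {"mat": 0.40, "rug": 0.30, "floor": 0.15, "sofa": 0.10, "dog": 0.01}

token = top_k_sample(dist, k=3)  # one of "mat", "rug", "floor"
```

Note that `k` stays the same no matter how peaked or flat `dist` is, which is the lack of adaptivity described in the cons above.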

Top-P (Nucleus) Sampling

This is often treated as the “gold standard” for open-ended generation. Instead of keeping a fixed number of words, it keeps the smallest set of most probable words whose cumulative probability is $\ge p$ (e.g., $p=0.92$), and then samples from that set.

  • Pros: It is highly adaptive.
    • If the model is certain (one word has 95% probability), the sample pool shrinks to just that word (acting like Greedy).
    • If the model is uncertain, the pool expands to include many options (acting like Top-K).
  • Best For: A common modern default for chatbots (including ChatGPT), coding assistants, and creative writing.
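
The adaptive behavior can be seen in a small sketch. The helper name `top_p_filter` and both example distributions are assumptions invented for illustration; the function only builds the nucleus, and sampling from it would work the same way as for any other weighted choice.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}


# Hypothetical distributions, invented for illustration.
certain = {"Paris": 0.95, "London": 0.03, "Rome": 0.02}
uncertain = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.10}

small_pool = top_p_filter(certain, p=0.92)    # only "Paris" survives
large_pool = top_p_filter(uncertain, p=0.92)  # all five tokens survive
```

The same `p` produces a pool of one token in the first case and five in the second, without any change to the setting.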

Deep Dive: Beam Search vs. Top-K Sampling

A common point of confusion is that both Beam Search and Top-K use a number “$k$.” However, they are fundamentally different tools for different jobs.

| Feature | Beam Search ($k=3$) | Top-K Sampling ($k=3$) |
| --- | --- | --- |
| Type | Search Algorithm (Deterministic) | Sampling Method (Stochastic) |
| Goal | Find the single, most probable complete sentence | Pick one random, “good” next word |
| What is $k$? | Beam Width: Number of parallel sequences to track | Filter Size: Number of top tokens to choose from |

The “Cat” Example

Imagine the model is completing the sentence: “The cat sat on the…”

1. How Beam Search handles it ($k=2$)

Beam search ranks candidates by the “Total Score” (the cumulative probability) of the entire sentence so far.

  • It scores every token in the vocabulary (50,000+ possibilities) but only keeps the Top 2 paths: “The cat sat on the mat” and “The cat sat on the rug.” It permanently discards “floor” and “dog”.
  • It then explores the next word for those two specific paths to see which yields the highest total probability for the full sentence.
  • Result: No randomness. It optimizes for the “best” path.

2. How Top-K handles it ($k=3$)

Top-K makes a decision for right now, with no memory of other paths.

  • It takes the top 3 words: mat (0.4), rug (0.3), floor (0.15).
  • It rescales their probabilities to sum to 100% by dividing each by their total of 0.85 (mat becomes ~47%, rug ~35%, floor ~18%).
  • It rolls the dice. There is a real chance it picks “rug” or “floor,” even if “mat” was the most likely.
  • Result: Randomness is introduced. The model creates something new.
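
The arithmetic in this example can be checked with a short simulation. This is a sketch using the three probabilities from the list above; the fixed seed and the 10,000 draws are arbitrary choices made only so the run is repeatable.

```python
import random
from collections import Counter

# The top 3 tokens and their original probabilities from the example.
top3 = {"mat": 0.40, "rug": 0.30, "floor": 0.15}

# Renormalize: divide each probability by the total of the kept tokens (0.85).
total = sum(top3.values())
renormalized = {token: p / total for token, p in top3.items()}
# mat ≈ 0.47, rug ≈ 0.35, floor ≈ 0.18

# "Roll the dice" many times to see how often each word is picked.
rng = random.Random(0)  # arbitrary fixed seed for repeatability
draws = rng.choices(
    list(renormalized), weights=list(renormalized.values()), k=10_000
)
counts = Counter(draws)
```

Inspecting `counts` shows “mat” winning most often, with “rug” and “floor” still being picked a substantial share of the time.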

Conclusion

Choosing the right sampling technique depends entirely on your goal. If you need the most probable, consistent output, as in translation, stick to Beam Search. If you want a chatbot to write a poem or brainstorm ideas, Top-P (Nucleus) Sampling is your best friend.

This post is licensed under CC BY 4.0 by the author.