Introduction to the Problem of Language Modeling
Predicting is difficult — especially about the future.
Language models like the Transformer have taken the world by storm. I want to spend some time discussing the fundamental problem these models try to solve and a simple way to measure their performance.
For the purposes of this discussion, I will define a language model as a machine learning model that generates fluent language. By fluent, I mean language that sounds natural: the kind of language a human considered fluent in that language would produce.
Problem Definition
More formally, a language model is a model that, given a sequence of tokens, generates the next token. How does the model decide what the next token will be? Well, it generates the token that has the highest probability of appearing given all the previous tokens. This is called next token prediction.
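To make this concrete, we can write next token prediction as an argmax over the vocabulary, writing $V$ for the vocabulary and $w_1, \ldots, w_{t-1}$ for the tokens observed so far:

$$\hat{w}_t = \underset{w \in V}{\arg\max}\ P(w \mid w_1, \ldots, w_{t-1})$$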
n-gram Language Models
How does a model accurately predict the next token? A first approximation is to look at the relative likelihoods of certain sequences of tokens. This is what an n-gram language model seeks to do. An n-gram language model predicts the next token based on the preceding $n-1$ tokens, approximating the probability of a token given its full history by the probability given this limited context. We can express this in terms of a conditional probability as $P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$. Using the chain rule of probability, we can express the probability of a whole sequence of tokens as a product of these conditionals:

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

In the above expression, $P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$ is the probability that we see a token ($w_t$) given that some sequence of $n-1$ tokens ($w_{t-n+1}, \ldots, w_{t-1}$) has already been observed.
Now we can count up all the different token sequences in a corpus and use the resulting probabilities to compute the probability of observing a sequence of tokens plus the next token. We select the token that maximizes the probability of the full sequence, given the preceding context. We can increase or decrease the value of $n$ to consider more or less of the preceding context.
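To make the counting approach concrete, here is a minimal sketch of a bigram model ($n = 2$) in Python. The toy corpus, whitespace tokenization and function names are illustrative assumptions for the sake of the example, not a reference implementation:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count adjacent token pairs and normalize into P(next | prev)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]  # boundary markers
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Turn raw counts into conditional probabilities P(next | prev).
    return {
        prev: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for prev, nexts in counts.items()
    }

def predict_next(model, prev):
    """Return the most probable next token given the previous token."""
    nexts = model.get(prev)
    if nexts is None:  # unseen context: the model has no estimate at all
        return None
    return max(nexts, key=nexts.get)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram_model(corpus)
print(predict_next(model, "sat"))  # 'on'
```

The `None` branch already hints at the problem discussed next: the model has nothing to say about a context it has never counted.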
n-gram language models are simple but don't perform well, for a variety of reasons. They don't take into account long-range dependencies between tokens, and they don't generalize well because they simply memorize sequences of tokens: any n-gram that never appeared in the training corpus is assigned zero probability, so these models perform poorly on sequences they haven't been trained on.
Evaluating Language Models
The standard way to train a model is to break a data set into three parts: the training, validation and test sets. A model is trained on the training set and uses the validation set to iteratively adjust hyperparameters and improve performance. Once training is complete, the model's performance is measured on the test set.
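As a minimal sketch of such a split (the 80/10/10 fractions here are a common convention I've assumed, not a requirement):

```python
import random

def three_way_split(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a dataset and split it into train / validation / test sets."""
    examples = examples[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(examples)
    n_train = int(len(examples) * train_frac)
    n_val = int(len(examples) * val_frac)
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])
```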
Perplexity is the standard metric by which we measure the performance of language models. It measures how well a model predicts the correct next token in the test set. A good model will assign high probabilities to the correct next tokens in the test set.
A lower perplexity corresponds to a better model that is less "surprised" by the correct next token.
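Formally, perplexity is the exponentiated average negative log-likelihood that the model assigns to the correct tokens of a test set of length $T$:

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1})\right)$$

A tiny sketch of the computation (the function name is mine, for illustration):

```python
import math

def perplexity(log_probs):
    """log_probs: the model's log-probability for each correct test token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.25 to every correct token has a
# perplexity of 4: it is as "surprised" as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 100))  # ~4.0
```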
In Closing
This was a theoretical introduction to the core problem language models try to solve. In a future post I will discuss more complex language models like RNNs, LSTMs and the Transformer. Language models have come a long way, from simple n-gram models to architectures capable of capturing long-range dependencies and emergent behaviors, all by optimizing for the deceptively simple task of next token prediction.