Paper Walkthroughs

The Linear Representation Hypothesis and the Geometry of Large Language Models

Informally, the "linear representation hypothesis" is the idea that high-level concepts are represented linearly as directions in some representation space.

Background on Language Models

A language model is fundamentally a probability distribution over sequences of tokens. Given a context sequence \(x = (x_1, x_2, \ldots, x_t)\), the model predicts the next token \(y\) by computing \(\mathbb{P}(y \mid x)\).

Modern LLMs implement this distribution through a two-stage process:

  1. Embedding Stage: The context \(x\) is mapped to a representation vector (or embedding vector) \(\lambda(x) \in \Lambda \simeq \mathbb{R}^d\), where \(\Lambda\) is the representation space and \(d\) is the model's hidden dimension. This embedding captures the semantic and syntactic information from the context.
  2. Unembedding Stage: Each possible output word \(y\) in the vocabulary is associated with an unembedding vector \(\gamma(y) \in \Gamma \simeq \mathbb{R}^d\), where \(\Gamma\) is the unembedding space. The probability of generating word \(y\) is then given by the softmax distribution:
$$\mathbb{P}(y \mid x) \propto \exp\left(\lambda(x)^\top \gamma(y)\right) = \exp\left(\langle \lambda(x), \gamma(y) \rangle\right)$$

More precisely, normalizing over the entire vocabulary \(\mathcal{V}\):

$$\mathbb{P}(y \mid x) = \frac{\exp\left(\lambda(x)^\top \gamma(y)\right)}{\sum_{y' \in \mathcal{V}} \exp\left(\lambda(x)^\top \gamma(y')\right)}$$
(Figure: language model architecture.)
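To make this two-stage picture concrete, here is a minimal sketch in NumPy with toy random vectors (the hidden dimension, the four-word vocabulary, and all variable names are illustrative, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # hidden dimension of the toy model
vocab = ["king", "queen", "kings", "queens"]

lam = rng.normal(size=d)                    # lambda(x): embedding of some context x
Gamma = rng.normal(size=(len(vocab), d))    # gamma(y): one unembedding vector per word

logits = Gamma @ lam                        # inner products lambda(x)^T gamma(y)
probs = np.exp(logits - logits.max())       # softmax, shifted for numerical stability
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"P({word!r} | x) = {p:.3f}")
```

Roughly speaking, in a real transformer \(\lambda(x)\) is the final hidden state at the last position and the rows of the unembedding (output) matrix play the role of \(\gamma(y)\); everything downstream of the softmax is the same.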

What is a Concept?

Before we can talk about whether concepts are represented linearly, we need to be precise about what a "concept" even means in the context of LLMs. The key insight is surprisingly simple: a concept is anything you can change about an output while keeping everything else the same.

Concepts as Factors of Variation

Think about the sentence "The king rules the kingdom." We can transform this in various independent ways:

  - translate it into another language: "Le roi gouverne le royaume."
  - swap the gender of the subject: "The queen rules the kingdom."
  - make the subject plural: "The kings rule the kingdom."
  - shift it into the past tense: "The king ruled the kingdom."

Each of these transformations changes one aspect of the output while leaving the others intact. These aspects—language, gender, number, tense—are what we call concepts.
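If it helps to see "factors of variation" as data, here is a toy table of counterfactual sentence pairs, one per concept (the specific sentences are my own illustrations):

```python
# Counterfactual pairs: each concept changes exactly one aspect of the
# sentence "The king rules the kingdom." while leaving the rest fixed.
concept_pairs = {
    "language (English -> French)": ("The king rules the kingdom.",
                                     "Le roi gouverne le royaume."),
    "gender (male -> female)":      ("The king rules the kingdom.",
                                     "The queen rules the kingdom."),
    "number (singular -> plural)":  ("The king rules the kingdom.",
                                     "The kings rule the kingdom."),
    "tense (present -> past)":      ("The king rules the kingdom.",
                                     "The king ruled the kingdom."),
}

for concept, (original, transformed) in concept_pairs.items():
    print(f"{concept}:\n  {original}\n  {transformed}")
```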

Formalizing Concepts with Causal Language

To make this precise, we model a concept as a concept variable \(W\) that:

  - is determined (probabilistically) by the context \(X\),
  - in turn influences the output \(Y\), and
  - can be varied independently of other concepts.

This gives us a simple causal chain: \(X \to W \to Y\).

For simplicity, let's focus on binary concepts—concepts that take two values. For example:

  - gender: male vs. female ("king" vs. "queen"),
  - number: singular vs. plural ("king" vs. "kings"),
  - tense: present vs. past ("rules" vs. "ruled"),
  - language: English vs. French ("king" vs. "roi").

To make the math cleaner, we'll encode binary concepts numerically. For instance, we might set male \(\Rightarrow\) 0 and female \(\Rightarrow\) 1. The choice of which value is 0 or 1 is arbitrary, but it will affect the sign of concept vectors we discover (more on this later).
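A tiny numerical illustration of that sign remark, with made-up vectors standing in for real unembedding weights: swapping which value we call 0 and which we call 1 simply negates the direction we would extract.

```python
import numpy as np

gamma_king = np.array([0.2, -1.0, 0.5])   # toy stand-in for the unembedding vector of "king"
gamma_queen = np.array([0.3, 1.0, 0.4])   # toy stand-in for the unembedding vector of "queen"

# Encoding male => 0, female => 1: the concept direction runs from "king" to "queen".
direction_male_to_female = gamma_queen - gamma_king

# Flipping the encoding (female => 0, male => 1) flips the sign of the direction.
direction_female_to_male = gamma_king - gamma_queen

print(np.allclose(direction_female_to_male, -direction_male_to_female))  # True
```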

The Linear Representation Hypothesis

The Cone of a Vector

Before stating the definition, we need a geometric tool. Given a vector \(v \in \mathbb{R}^d\), its cone is:

$$\text{Cone}(v) = \{\alpha v : \alpha > 0\}$$
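In words, \(\text{Cone}(v)\) is the open ray of all strictly positive multiples of \(v\): two vectors lie in the same cone exactly when they point in the same direction, regardless of length. A quick sketch of a membership check (the helper name in_cone and its tolerance are my own choices):

```python
import numpy as np

def in_cone(u, v, tol=1e-8):
    """Return True if u = alpha * v for some alpha > 0 (up to tolerance)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    if np.linalg.norm(u) < tol or np.linalg.norm(v) < tol:
        return False                      # the cone excludes the zero vector
    alpha = (u @ v) / (v @ v)             # least-squares scale factor
    return alpha > 0 and np.allclose(u, alpha * v, atol=tol)

v = np.array([1.0, 2.0])
print(in_cone(2.5 * v, v))      # True: a positive multiple of v
print(in_cone(-v, v))           # False: opposite direction
print(in_cone([1.0, 0.0], v))   # False: not parallel to v
```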

Definition: Unembedding Representation

A vector \(\widetilde{\gamma}_W \in \Gamma\) is called an unembedding representation of the concept \(W\) if the unembedding difference between counterfactual outputs always lies in its cone:

$$\gamma(Y(1)) - \gamma(Y(0)) \in \text{Cone}(\widetilde{\gamma}_W)$$

where \(Y(0)\) and \(Y(1)\) are outputs that differ only in the value of the concept (for the gender concept: "king" and "queen", "man" and "woman", and so on).

What does this mean geometrically?

All of the counterfactual difference vectors \(\gamma(Y(1)) - \gamma(Y(0))\) point in the same direction: they may differ in length, but not in orientation. That shared direction, \(\widetilde{\gamma}_W\), is the concept's representation in the unembedding space.
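Here is a minimal numerical sketch of the definition, using simulated difference vectors rather than a real model's unembedding matrix, and a simple average-of-normalized-differences estimate (an illustration, not the paper's estimation procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Simulate the hypothesis: every counterfactual difference gamma(Y(1)) - gamma(Y(0))
# (queen-king, woman-man, she-he, ...) is close to a positive multiple of one direction.
true_direction = rng.normal(size=d)
diffs = [rng.uniform(0.5, 2.0) * true_direction + 0.01 * rng.normal(size=d)
         for _ in range(3)]

# Estimate the concept direction as the average of the normalized differences.
unit = lambda v: v / np.linalg.norm(v)
gamma_tilde_W = unit(np.mean([unit(diff) for diff in diffs], axis=0))

for diff in diffs:
    print(f"cosine with estimated direction: {unit(diff) @ gamma_tilde_W:+.4f}")
# All close to +1.0: each difference lies (approximately) in Cone(gamma_tilde_W).
```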

Connection to Measurement

The first major result of the paper is that this unembedding representation is intimately connected to measuring concepts using linear probes:

Theorem: Connection to Measurement

Informally: if \(\widetilde{\gamma}_W\) is an unembedding representation of \(W\), then whenever the output is known to be one of two values that differ only in the concept \(W\), the log-odds between those two values are a linear function of the representation \(\lambda\), with coefficient vector proportional to \(\widetilde{\gamma}_W\).

What does this theorem say?

Consider a concrete scenario: suppose we know the output token will be either "king" or "queen" (say, because the context is about a monarch). The theorem tells us that the probability of outputting "king" (versus "queen") is logit-linear in the language model representation \(\lambda\), with regression coefficients given by \(\widetilde{\gamma}_W\).

More formally, the log-odds satisfy

$$\log \frac{\mathbb{P}(\text{output is "king"})}{\mathbb{P}(\text{output is "queen"})} = \lambda^\top \left(\gamma(\text{"king"}) - \gamma(\text{"queen"})\right) = \alpha\, \lambda^\top \widetilde{\gamma}_W$$

for some nonzero constant \(\alpha\). (With our encoding male \(\Rightarrow\) 0, female \(\Rightarrow\) 1, the direction \(\widetilde{\gamma}_W\) points from "king" toward "queen", so \(\alpha\) is negative here; flipping the encoding flips the sign, as promised earlier.)
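Continuing with toy vectors, here is a sketch of the middle step above: conditional on the output being one of the two tokens, the softmax normalizer cancels and the log-odds reduce to an inner product between \(\lambda\) and the unembedding difference (the vocabulary indices and names here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab_size = 8, 50

Gamma = rng.normal(size=(vocab_size, d))   # toy unembedding vectors
king, queen = 0, 1                         # pretend rows 0 and 1 are "king" and "queen"
lam = rng.normal(size=d)                   # a context embedding lambda(x)

# Full softmax over the vocabulary.
logits = Gamma @ lam
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The normalizer cancels in the ratio, leaving an inner product with the difference.
lhs = np.log(probs[king] / probs[queen])
rhs = lam @ (Gamma[king] - Gamma[queen])
print(np.isclose(lhs, rhs))  # True: the log-odds are linear in lambda
```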

Embedding vs Unembedding

Recall that a language model involves two spaces:

  - the embedding space \(\Lambda\), where the context representations \(\lambda(x)\) live, and
  - the unembedding space \(\Gamma\), where the word vectors \(\gamma(y)\) live.

We've just seen that if concept differences \(\gamma(Y(1)) - \gamma(Y(0))\) align in the unembedding space, we get linear measurement. What happens if we look at context differences in the embedding space?

By analogy with the unembedding case, a vector \(\overline{\lambda}_W \in \Lambda\) is called an embedding representation of \(W\) if the differences between context embeddings that change the value of \(W\) (while leaving unrelated concepts alone) lie in \(\text{Cone}(\overline{\lambda}_W)\). The paper shows that the embedding and unembedding representations of a concept are dual to one another, linked through an appropriate choice of inner product on the representation space.

Theorem 2.5 tells us what happens when we add the embedding representation \(\overline{\lambda}_W\) to a context embedding: roughly, it pushes the model's prediction of the concept \(W\) toward one value (say, from male toward female) while leaving concepts that are causally separate from \(W\) undisturbed. This is the mathematical foundation for model steering or activation engineering.
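Finally, a cartoon of the steering idea in the same toy setting. For illustration I use the raw unembedding difference \(\gamma(\text{"queen"}) - \gamma(\text{"king"})\) as the steering vector; the paper constructs embedding representations more carefully (relating the two sides through a particular choice of inner product), so treat this as a sketch of the intuition behind Theorem 2.5, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab_size = 8, 50

Gamma = rng.normal(size=(vocab_size, d))   # toy unembedding vectors
king, queen = 0, 1
lam = rng.normal(size=d)                   # a context embedding lambda(x)

def next_token_probs(embedding):
    logits = Gamma @ embedding
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Heuristic steering direction (illustrative only, not the paper's construction).
lam_bar_W = Gamma[queen] - Gamma[king]

before = next_token_probs(lam)
after = next_token_probs(lam + 2.0 * lam_bar_W)   # push the representation "toward female"

print(f"P(queen)/P(king) before steering: {before[queen] / before[king]:.3f}")
print(f"P(queen)/P(king) after steering:  {after[queen] / after[king]:.3f}")   # larger
```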

Summary

In short: a concept is an independent factor of variation of the output. Its unembedding representation \(\widetilde{\gamma}_W\) is a direction whose cone contains every counterfactual difference \(\gamma(Y(1)) - \gamma(Y(0))\), which is what makes linear measurement with log-odds probes work. Its embedding representation \(\overline{\lambda}_W\) is the dual object on the context side, and adding it to a context embedding is what makes linear steering work.

More results and experiments are in the paper. Check it out at arxiv.org/abs/2311.03658!