Embeddings in Machine Learning

Follow the letter 'A' from raw text all the way into a dense vector — learning what tokens are, why we use them, and how gradient descent bakes semantic meaning into numbers.

Estimated time: ~12 min
Difficulty: intermediate
Sources: 4 sources

Your name is in the model’s training data. Somewhere in a weight matrix with 50,000 rows and 768 columns there is a cluster of numbers that “knows” your name sits near other names — near “Alice” and “Bob” and “Dr.” — not near “Thursday” or “kilograms.” Nobody told it that. It figured it out from text alone, by learning to predict what comes next.

What Is a Token — and Why Not Just Use Characters?

Before a model can touch your text, it has to slice it into pieces. The obvious choices are:

Characters: simple, but the sentence “The quick brown fox” becomes 19 separate inputs (counting the spaces). The model has to learn that q, u, i, c, k together mean something, over and over again.
Words: intuitive, but the English vocabulary is enormous. “run”, “runs”, “running”, and “ran” are all different entries. Rare words, typos, and new words become unknown tokens.

The industry settled on a middle path called subword tokenization. The core algorithm, BPE (Byte Pair Encoding), starts from individual characters and repeatedly merges the most frequent adjacent pair. ^{[Neural Machine Translation of Rare Words with Subword Units]}

BPE in three steps

Suppose the only training corpus is: low low low lower lowest

Start: l o w · l o w · l o w · l o w e r · l o w e s t
Most frequent pair: l o (5 occurrences, tied with o w; the tie is broken arbitrarily) → merge into lo. Corpus: lo w · lo w · lo w · lo w e r · lo w e s t
Most frequent pair: lo w → merge into low. Now “low” is a single token.
Continue until vocabulary reaches target size (~50,000 for GPT-style models).
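
The merge loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the helper names `most_frequent_pair` and `merge_pair` are made up for this example, ties are broken by first occurrence, and real BPE implementations also track word frequencies and usually operate on bytes.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    # Among tied pairs, most_common returns the one encountered first
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = "low low low lower lowest".split()
words = [list(w) for w in corpus]   # start from individual characters

merges = []
for _ in range(2):                  # two merges, matching the walkthrough
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)   # the learned merge rules, in order
print(words)    # the corpus after both merges
```

After the two merges, "low" is a single symbol everywhere it appears, while "lower" and "lowest" are "low" followed by leftover characters. Running the loop for more iterations would keep growing the vocabulary.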

The result: common words stay whole; rare words split into familiar pieces. “unbelievable” might become un + believ + able (the exact pieces depend on the merges a particular tokenizer learned). Each piece appears in many other words, so the model has seen plenty of training signal for each one.

Token definition

A token is the atomic unit a language model operates on — produced by a subword tokenizer. It is typically a frequent character sequence (often a word, a word-fragment, a punctuation mark, or a whitespace-prefixed word). Most modern LLMs use between 32k and 200k tokens in their vocabulary.

Check your understanding

Why does BPE tokenization handle the word 'ChatGPT' worse than the word 'hello'?

From Token ID to Dense Vector: The Embedding Lookup

After tokenization, each token becomes an integer — its token ID. For GPT-4 the token hello is ID 15339. That number means nothing on its own; the model needs to look up a learned vector for it.

This is where the embedding matrix comes in. It is a table with one row per token in the vocabulary. Each row is a list of floating-point numbers — the embedding vector for that token. A typical size:

$\text{vocab size } 50{,}000 \times \text{embedding dimension } 768 \approx 38.4\text{M parameters}$

To get the embedding for token 15339, the model simply reads row 15339. That row is the model’s learned representation of hello.

Before learning: the one-hot vector. If you tried to give the model the token ID directly as a number, it would imply that token 15340 is “slightly more” than 15339 — meaningless. The conceptually cleaner starting point is a one-hot vector: a vector of all zeros except a single 1 at the token’s index. It is sparse (50,000 dimensions, almost all zero) and says only “I am this token, nothing else.”

The embedding lookup is exactly equivalent to multiplying a one-hot vector by the embedding matrix — you get one row back. But rather than carrying around those enormous sparse vectors at runtime, we skip straight to the row lookup.
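
The equivalence between the one-hot product and the row lookup can be checked directly. The sketch below uses a tiny random matrix (6 tokens, 4 dimensions) instead of a real 50,000 × 768 one; the names `one_hot` and `vec_times_matrix` are illustrative helpers, not part of any library.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4   # toy sizes; the text's example is 50,000 x 768

# Embedding matrix: one row of DIM floats per token, randomly initialised
E = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

def one_hot(token_id, size):
    """All zeros except a single 1.0 at the token's index."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector times matrix: sum_i vec[i] * matrix[i][j] for each column j."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

token_id = 3
via_matmul = vec_times_matrix(one_hot(token_id, VOCAB_SIZE), E)
via_lookup = E[token_id]          # the shortcut used in practice

print(via_matmul)
print(via_lookup)
print(VOCAB_SIZE * DIM)           # parameter count of this toy matrix
```

Both printed vectors are the same row. The product does `VOCAB_SIZE × DIM` multiplications to get there, almost all of them by zero, which is why real implementations index the row instead.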

Common misconception

Embeddings are manually designed to encode semantic meaning.

What's actually true

No one hand-crafted the meaning in embedding vectors. They start as random numbers. Training — specifically gradient descent on a prediction task — pushes similar-context words toward each other in the vector space. Semantic structure is an emergent consequence of the training objective, not a designed feature.

Check your understanding

A vocabulary has 100,000 tokens and each embedding has 512 dimensions. How many parameters are in the embedding matrix?

How Embeddings Learn: Gradient Descent and Semantic Neighborhoods

Embeddings are not special — they are just weights in a neural network, trained exactly like any other weights. The training signal comes from a prediction task.

Word2Vec intuition. The original word2vec model ^{[Efficient Estimation of Word Representations in Vector Space]} framed learning as: given surrounding words, predict the center word (CBOW) or given the center word, predict surrounding words (skip-gram).

If “cat” and “dog” frequently appear in the same contexts — “I feed my ___”, “the ___ ran across the yard” — backpropagation nudges their embedding vectors closer together. Over millions of examples, the geometry of the space converges to reflect statistical co-occurrence.
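
To make "same contexts" concrete, here is a small sketch of how skip-gram training pairs could be generated from a sentence. The function name `skipgram_pairs` and the window size of 2 are assumptions for illustration; word2vec implementations differ in details such as window sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs: each word is paired with its neighbours
    up to `window` positions away on either side."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

cat_pairs = skipgram_pairs("i feed my cat".split())
dog_pairs = skipgram_pairs("i feed my dog".split())

# The context words each animal is asked to predict
cat_contexts = {ctx for centre, ctx in cat_pairs if centre == "cat"}
dog_contexts = {ctx for centre, ctx in dog_pairs if centre == "dog"}

print(cat_contexts)
print(dog_contexts)
```

Both sets contain the same words, so the two embeddings receive gradients that pull them toward the same output vectors. Repeated over a large corpus, that shared pressure is what moves "cat" and "dog" close together.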

The famous result: the geometry captures semantic relationships as vector arithmetic.

$\text{king} - \text{man} + \text{woman} \approx \text{queen}$

Nobody encoded royalty or gender. Those axes emerged because the data forced them to.
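
The arithmetic itself is easy to try on toy vectors. In the sketch below the three coordinates (royalty, gender, fruit) are set by hand purely so the numbers are readable. That is exactly what a trained model does not do: learned embeddings have no such clean axes, and the result there is only approximately "queen". The `analogy` helper is a hypothetical name for this example.

```python
import math

# Hand-built toy vectors: (royalty, gender, fruit). NOT learned embeddings.
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is a common convention when evaluating analogies; without it, the nearest vector is often one of the inputs.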

The formal gradient update for an embedding lookup

In the skip-gram model, the objective is to maximise the log probability of context words given the center word. For center word $w_c$ with embedding $\mathbf{v}_{w_c}$ and context word $w_o$ with output vector $\mathbf{u}_{w_o}$ , the loss for one pair is:

$L = -\log\frac{\exp(\mathbf{u}_{w_o}^\top \mathbf{v}_{w_c})}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^\top \mathbf{v}_{w_c})}$

The gradient with respect to $\mathbf{v}_{w_c}$ is:

$\frac{\partial L}{\partial \mathbf{v}_{w_c}} = \sum_{w=1}^{V} P(w | w_c)\, \mathbf{u}_w - \mathbf{u}_{w_o}$

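At each step, this gradient touches only one row of the input embedding matrix, the row for $w_c$; every other input row is unchanged. The output vectors are a different story: the sum over all $V$ words in the softmax denominator gives every $\mathbf{u}_w$ a gradient, which is expensive with a 50k vocabulary. Word2vec therefore replaces the full softmax with cheaper approximations (hierarchical softmax or negative sampling), so that each step updates only a handful of rows.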
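
The gradient formula can be checked numerically. The sketch below builds a tiny random model (5 words, 3 dimensions; all values are arbitrary), computes the analytic gradient $\sum_w P(w | w_c)\,\mathbf{u}_w - \mathbf{u}_{w_o}$, and compares it with a finite-difference estimate of the same loss. It uses the full softmax, so it is meant as a check of the math, not as an efficient training loop.

```python
import math
import random

random.seed(0)
V, D = 5, 3   # toy vocabulary size and embedding dimension
U = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]  # output vectors u_w
v_c = [random.uniform(-1, 1) for _ in range(D)]                    # centre-word embedding
w_o = 2                                                            # index of the context word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_probs(v):
    """P(w | w_c) for every word w in the vocabulary."""
    scores = [dot(u, v) for u in U]
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v):
    """Negative log probability of the true context word."""
    return -math.log(softmax_probs(v)[w_o])

def analytic_grad(v):
    """sum_w P(w | w_c) * u_w  -  u_{w_o}"""
    p = softmax_probs(v)
    return [sum(p[w] * U[w][j] for w in range(V)) - U[w_o][j] for j in range(D)]

def numeric_grad(v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(D):
        plus, minus = v[:], v[:]
        plus[j] += eps
        minus[j] -= eps
        grad.append((loss(plus) - loss(minus)) / (2 * eps))
    return grad

g = analytic_grad(v_c)
max_diff = max(abs(a - n) for a, n in zip(g, numeric_grad(v_c)))
print(max_diff)                            # should be tiny

# One gradient-descent step on the centre-word row only
lr = 0.1
v_new = [x - lr * gj for x, gj in zip(v_c, g)]
print(loss(v_c), loss(v_new))              # loss before and after the step
```

The two gradients agree to within floating-point noise, and a single small step along the negative gradient lowers the loss for this pair.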

In transformer LLMs the same principle applies, but the embedding matrix is only the first layer. ^{[Attention Is All You Need]} Subsequent attention layers produce contextual embeddings: the vector for “bank” in “river bank” differs from the vector for “bank” in “savings bank” because attention mixes in information from the surrounding tokens.
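
A stripped-down sketch of how attention makes one static vector context-dependent is given below. It is heavily simplified and every simplification is an assumption of this example: a single head, no learned query/key/value projections (Q = K = V = the input vectors), no positional encoding, no residual connection, and random stand-in embeddings instead of trained ones.

```python
import math
import random

random.seed(0)
DIM = 4
vocab = ["river", "savings", "bank"]
# Static (context-free) embeddings: random stand-ins for a trained matrix
E = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = X.
    Real transformers first apply learned projections to get Q, K and V."""
    X = [E[t] for t in tokens]
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in X]
        weights = softmax(scores)          # how much each token contributes
        out.append([sum(w * x[j] for w, x in zip(weights, X)) for j in range(DIM)])
    return out

bank_in_river = self_attention(["river", "bank"])[1]
bank_in_savings = self_attention(["savings", "bank"])[1]

print(E["bank"])          # static embedding: identical in both phrases
print(bank_in_river)      # contextual vector after attention
print(bank_in_savings)    # a different contextual vector
```

The lookup returns the same row for "bank" in both phrases; the two output vectors differ only because each is a weighted mix of "bank" with a different neighbour.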

Check your understanding

What training objective causes word2vec to place 'cat' and 'dog' near each other in embedding space?

Where Embeddings Live Today

Embeddings are everywhere in modern ML pipelines.

Example	Use case	Input modality	Architecture	Embedding dimension
GPT-style text	Language generation	Subword tokens	Transformer decoder	up to 12,288 (GPT-3 175B; GPT-4's is unpublished)
BERT search	Semantic search / RAG	Subword tokens	Transformer encoder	768 (BERT-base)
CLIP images	Image–text retrieval	Image patches + tokens	Vision + text encoders	512–1024
Whisper audio	Speech recognition	Mel-spectrogram frames	Transformer encoder-decoder	384–1280
Recommendations	Product / user matching	Item IDs	Two-tower network	64–256

Embedding types across modern architectures

Retrieval-Augmented Generation (RAG) — the technique powering many AI assistants — is fundamentally about embeddings. A document is split into chunks, each chunk is embedded into a vector, and all vectors are stored in a vector database. When you ask a question, it is also embedded; the nearest stored vectors (by cosine similarity) are retrieved and fed as context to the LLM.
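
The retrieval step can be sketched end to end with a stand-in embedding. Everything here is a simplifying assumption: `embed` is a plain bag-of-words counter rather than a trained model, the "vector database" is a Python list scanned linearly, and the chunks are invented for illustration. A real system would call an embedding model and an approximate nearest-neighbour index.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector.
    A real RAG system would call a trained embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "tokens are produced by a subword tokenizer",
    "the embedding matrix has one row per token",
    "cosine similarity measures the angle between two vectors",
]
index = [(chunk, embed(chunk)) for chunk in chunks]   # the toy "vector database"

def retrieve(question, k=1):
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "how many rows does the embedding matrix have"
context = retrieve(question)
prompt = "Context: " + context[0] + "\nQuestion: " + question
print(prompt)
```

Note that the bag-of-words stand-in sees "rows" and "row" as unrelated words. Closing that gap, matching on meaning instead of exact strings, is what learned embeddings contribute to the pipeline.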

Analogy: a vector database is like a database index

A traditional database index tells you “row 4,521 has the value ‘Paris’.” A vector database index tells you “these 10 embeddings are geometrically closest to your query embedding” — no exact match needed, just proximity in learned meaning-space.

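Cosine similarity measures the angle between two vectors, ignoring their magnitudes. A value near 1.0 means the two embeddings point in almost the same direction, which for learned embeddings usually signals similar meaning; a value near 0 means they are orthogonal, that is, unrelated. Values near −1.0 mean opposite directions, but these are rare in practice and do not reliably indicate opposite meaning: antonyms such as “hot” and “cold” appear in similar contexts and often score high. Cosine similarity is the most common metric for comparing embeddings.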

$\operatorname{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$
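
A direct transcription of the formula, with three hand-picked pairs of vectors that show the extreme cases. The function name `cosine_similarity` and the example vectors are chosen for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 6.0])    # a scaled by 2
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # perpendicular vectors
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])       # a negated

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Doubling the length of a vector leaves the score at approximately 1, which is the sense in which the metric ignores magnitude. The function divides by zero for an all-zero vector, so production code usually guards against that case.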

Check your understanding

In a RAG system, what is the query encoded into before searching the vector database?

End-of-lesson check (question 1 of 4)

A language model uses a vocabulary of 32,000 tokens and 1,024-dimensional embeddings. How many parameters are in its embedding matrix?

Ownable artifact — derive it yourself.

Sketch the full pipeline for the string "Hello world" on paper:

Write out the characters, then group them into tokens (use your intuition — “hello” is one, “world” is one, space is one).
Assign each token a made-up integer ID.
Draw the 3×4 embedding matrix (3 tokens, 4 dimensions). Fill each row with random numbers.
“Look up” each token: circle its row.
Now ask: if you trained this system to predict “world” given “Hello”, which row’s numbers would backpropagation update?

That sketch captures the entire mechanism: text → tokens → IDs → embedding lookup → gradient update.