Embeddings in Machine Learning
Follow the letter 'A' from raw text all the way into a dense vector — learning what tokens are, why we use them, and how gradient descent bakes semantic meaning into numbers.
- Estimated time: ~12 min
- Difficulty: intermediate
- Sources: 4 sources
Your name is in the model’s training data. Somewhere in a weight matrix with 50,000 rows and 768 columns there is a cluster of numbers that “knows” your name sits near other names — near “Alice” and “Bob” and “Dr.” — not near “Thursday” or “kilograms.” Nobody told it that. It figured it out from text alone, by learning to predict what comes next.
What Is a Token — and Why Not Just Use Characters?
Before a model can touch your text, it has to slice it into pieces. The obvious choices are:
- Characters: simple, but the sentence “The quick brown fox” becomes 19 separate inputs (spaces included). The model has to learn that q, u, i, c, k together mean something, over and over again.
- Words: intuitive, but the English vocabulary is enormous. “run”, “runs”, “running”, “ran” are all different entries. Rare words, typos, and new words become unknown tokens.
The industry settled on a middle path called subword tokenization. The core algorithm, BPE (Byte Pair Encoding), starts from individual characters and repeatedly merges the most frequent adjacent pair. [Neural Machine Translation of Rare Words with Subword Units]
BPE in three steps
Suppose the only training corpus is: low low low lower lowest
- Start: l o w · l o w · l o w · l o w e r · l o w e s t
- Most frequent pair: l o (five occurrences, tied with o w; this walkthrough merges l o first) → merge into lo. Corpus: lo w · lo w · lo w · lo w e r · lo w e s t
- Most frequent pair: lo w → merge into low. Now “low” is a single token.
- Continue until the vocabulary reaches the target size (~50,000 for GPT-style models).
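The merge loop in this walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are made up for this example, words are split on whitespace, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(tuple(out))
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = [tuple(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # max() keeps the first pair that reaches the highest count,
        # so ties are broken by first occurrence (an assumption of this sketch).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
```

On this toy corpus the two learned merges are `('l', 'o')` followed by `('lo', 'w')`, after which every word starts with the single symbol `low`. A real tokenizer would keep merging until the vocabulary hits its target size.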
The result: common words stay whole; rare words split into familiar pieces. For example, “unbelievable” might split as un + believ + able (the exact split depends on the merges a given tokenizer learned). Each piece appears in many other words, so the model has seen plenty of training signal for each one.
A token is the atomic unit a language model operates on — produced by a subword tokenizer. It is typically a frequent character sequence (often a word, a word-fragment, a punctuation mark, or a whitespace-prefixed word). Most modern LLMs use between 32k and 200k tokens in their vocabulary.
Check your understanding
Why does BPE tokenization handle the word 'ChatGPT' worse than the word 'hello'?
From Token ID to Dense Vector: The Embedding Lookup
After tokenization, each token becomes an integer — its token ID. For GPT-4 the token hello is ID 15339. That number means nothing on its own; the model needs to look up a learned vector for it.
This is where the embedding matrix comes in. It is a table with one row per token in the vocabulary. Each row is a list of floating-point numbers: the embedding vector for that token. A typical size, as in the opening example, is 50,000 rows × 768 columns, or about 38.4 million learned numbers.
To get the embedding for token 15339, the model simply reads row 15339. That row is the model’s learned representation of hello.
Before learning: the one-hot vector. If you tried to give the model the token ID directly as a number, it would imply that token 15340 is “slightly more” than 15339 — meaningless. The conceptually cleaner starting point is a one-hot vector: a vector of all zeros except a single 1 at the token’s index. It is sparse (50,000 dimensions, almost all zero) and says only “I am this token, nothing else.”
The embedding lookup is exactly equivalent to multiplying a one-hot vector by the embedding matrix — you get one row back. But rather than carrying around those enormous sparse vectors at runtime, we skip straight to the row lookup.
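The equivalence between a row lookup and a one-hot multiplication can be checked directly. The sketch below uses a deliberately tiny matrix (8 tokens, 4 dimensions) filled with random values; those sizes and the helper names are assumptions chosen for readability, not taken from any real model.

```python
import random

VOCAB_SIZE = 8   # toy value; the running example in the text uses 50,000
EMBED_DIM = 4    # toy value; the running example in the text uses 768

rng = random.Random(0)
# Embedding matrix: one row per token ID, random starting values.
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                    for _ in range(VOCAB_SIZE)]

def lookup(token_id):
    """Direct row read: what happens at runtime."""
    return embedding_matrix[token_id]

def one_hot(token_id):
    """All zeros except a single 1 at the token's index."""
    return [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]

def vector_times_matrix(vector, matrix):
    """Row vector (length V) times matrix (V x D) gives a length-D vector."""
    return [sum(vector[row] * matrix[row][col] for row in range(len(matrix)))
            for col in range(len(matrix[0]))]

token_id = 5
via_lookup = lookup(token_id)
via_one_hot = vector_times_matrix(one_hot(token_id), embedding_matrix)
```

Both paths return row 5 of the matrix. The lookup reads 4 numbers; the multiplication touches all 32, which is why the one-hot form stays a mental model rather than something computed at runtime.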
Common misconception
Embeddings are manually designed to encode semantic meaning.
What's actually true
No one hand-crafted the meaning in embedding vectors. They start as random numbers. Training — specifically gradient descent on a prediction task — pushes similar-context words toward each other in the vector space. Semantic structure is an emergent consequence of the training objective, not a designed feature.
Check your understanding
A vocabulary has 100,000 tokens and each embedding has 512 dimensions. How many parameters are in the embedding matrix?
How Embeddings Learn: Gradient Descent and Semantic Neighborhoods
Embeddings are not special — they are just weights in a neural network, trained exactly like any other weights. The training signal comes from a prediction task.
Word2Vec intuition. The original word2vec model [Efficient Estimation of Word Representations in Vector Space] framed learning as: given surrounding words, predict the center word (CBOW) or given the center word, predict surrounding words (skip-gram).
If “cat” and “dog” frequently appear in the same contexts — “I feed my ___”, “the ___ ran across the yard” — backpropagation nudges their embedding vectors closer together. Over millions of examples, the geometry of the space converges to reflect statistical co-occurrence.
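The famous result: the geometry captures semantic relationships as vector arithmetic. The standard example is
$$\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})$$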
Nobody encoded royalty or gender. Those axes emerged because the data forced them to.
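To see what "relationships as vector arithmetic" means mechanically, here is a toy analogy lookup. The 2-D vectors below are written by hand purely to illustrate the arithmetic, so the royalty and gender axes are planted on purpose; in a trained model nobody labels the axes, and the vectors come from training. The `analogy` helper and the word list are assumptions of this sketch.

```python
import math

# Hand-written 2-D vectors: [royalty, gender]. Invented for illustration only.
toy_vectors = {
    "king":  [0.9,  0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    # Euclidean distance is used here for simplicity.
    return min(candidates, key=lambda word: math.dist(vectors[word], target))

result = analogy("king", "man", "woman", toy_vectors)
```

With these planted vectors the nearest remaining word is "queen". Trained embeddings reproduce this pattern only approximately, and only for some relationships.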
Show the formal gradient update for an embedding lookup
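In the skip-gram model, the objective is to maximise the log probability of context words given the center word. For center word $c$ with embedding $v_c$ and context word $o$ with output vector $u_o$, in a vocabulary of $V$ words, the loss for one pair is:
$$L = -\log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$
The gradient with respect to $v_c$ is:
$$\frac{\partial L}{\partial v_c} = -u_o + \sum_{w=1}^{V} P(w \mid c)\, u_w, \qquad P(w \mid c) = \frac{\exp(u_w^\top v_c)}{\sum_{w'=1}^{V} \exp(u_{w'}^\top v_c)}$$
At each step we update only the row of the embedding matrix corresponding to $c$; all other rows are unchanged. This sparsity keeps the embedding update cheap even with a 50k vocabulary. (The sum over all $V$ output vectors is the costly part, which word2vec sidesteps with hierarchical softmax or negative sampling.)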
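A single update step can be sketched as follows, assuming a full softmax over a toy vocabulary of 6 words. The notation is an assumption of this example: `input_embeddings` holds the center-word rows and `output_vectors` holds the context-side vectors. Only the center word's row is updated here; the output vectors are held fixed to keep the focus on the embedding lookup.

```python
import math
import random

VOCAB_SIZE, EMBED_DIM, LEARNING_RATE = 6, 3, 0.1   # toy values
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

input_embeddings = random_matrix(VOCAB_SIZE, EMBED_DIM)   # one row per center word
output_vectors = random_matrix(VOCAB_SIZE, EMBED_DIM)     # one row per context word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_probs(center_id):
    """P(w | center) for every word w, via a softmax over dot products."""
    scores = [dot(u, input_embeddings[center_id]) for u in output_vectors]
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(center_id, context_id):
    return -math.log(softmax_probs(center_id)[context_id])

def sgd_step_on_center(center_id, context_id):
    """Gradient w.r.t. the center row: sum_w P(w|c) * u_w - u_context."""
    probs = softmax_probs(center_id)
    grad = [sum(probs[w] * output_vectors[w][d] for w in range(VOCAB_SIZE))
            - output_vectors[context_id][d]
            for d in range(EMBED_DIM)]
    row = input_embeddings[center_id]
    input_embeddings[center_id] = [x - LEARNING_RATE * g for x, g in zip(row, grad)]

center_id, context_id = 2, 4
before = [row[:] for row in input_embeddings]
loss_before = loss(center_id, context_id)
sgd_step_on_center(center_id, context_id)
loss_after = loss(center_id, context_id)
changed_rows = [i for i in range(VOCAB_SIZE) if input_embeddings[i] != before[i]]
```

After the step, `changed_rows` contains only the center word's index, and the loss for that pair goes down. A full implementation would also update the output vectors and would typically replace the full softmax with a cheaper approximation.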
In transformer LLMs, the same principle applies, with a richer result. [Attention Is All You Need] The embedding matrix is now just the first layer. Subsequent attention layers produce contextual embeddings: the vector for “bank” in “river bank” differs from the vector for “bank” in “savings bank” because attention mixes in context.
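The "mixing in context" idea can be shown with a stripped-down version of dot-product attention. This sketch leaves out the learned query, key and value projections, the scaling factor, and multiple heads that a real transformer layer has, and the static vectors are invented by hand, so treat it as an illustration of the mechanism only.

```python
import math

# Hand-written static embeddings, invented for illustration only.
static_embeddings = {
    "river":   [1.0, 0.0],
    "savings": [0.0, 1.0],
    "bank":    [0.5, 0.5],
}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vector(tokens, position):
    """Blend every token's static vector, weighted by similarity to the token at `position`."""
    query = static_embeddings[tokens[position]]
    scores = [sum(q * k for q, k in zip(query, static_embeddings[t])) for t in tokens]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * static_embeddings[t][d] for w, t in zip(weights, tokens))
            for d in range(dim)]

bank_near_river = contextual_vector(["river", "bank"], position=1)
bank_near_savings = contextual_vector(["savings", "bank"], position=1)
```

The same static row for "bank" goes in both times, but the output leans toward "river" in one sentence and toward "savings" in the other.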
Check your understanding
What training objective causes word2vec to place 'cat' and 'dog' near each other in embedding space?
Where Embeddings Live Today
Embeddings are everywhere in modern ML pipelines.
| Example | Use case | Input modality | Architecture | Embedding dimension |
|---|---|---|---|---|
| GPT-style text | Language generation | Subword tokens | Transformer decoder | ~12,288 (GPT-3 175B; GPT-4's size is not published) |
| BERT search | Semantic search / RAG | Subword tokens | Transformer encoder | 768 |
| CLIP images | Image–text retrieval | Image patches + tokens | Vision + text encoders | 512–1024 |
| Whisper audio | Speech recognition | Mel-spectrogram frames | Transformer encoder | 512–1280 |
| Recommendations | Product / user matching | Item IDs | Two-tower network | 64–256 |
Retrieval-Augmented Generation (RAG) — the technique powering many AI assistants — is fundamentally about embeddings. A document is split into chunks, each chunk is embedded into a vector, and all vectors are stored in a vector database. When you ask a question, it is also embedded; the nearest stored vectors (by cosine similarity) are retrieved and fed as context to the LLM.
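The retrieval half of that pipeline can be sketched end to end. The embedder below is a stand-in: it builds unit-length word-count vectors instead of calling a learned embedding model, and the "vector database" is a plain Python list scanned in full. All function names are hypothetical. Because the vectors are unit length, their dot product equals their cosine similarity.

```python
import math

def chunk_text(text, words_per_chunk):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def tokenize(text):
    return [word.strip(".,?!").lower() for word in text.split()]

def build_toy_embedder(chunks):
    """Stand-in for a learned embedding model: unit-length word-count vectors."""
    vocab = sorted({word for chunk in chunks for word in tokenize(chunk)})
    index = {word: i for i, word in enumerate(vocab)}

    def embed(text):
        vec = [0.0] * len(vocab)
        for word in tokenize(text):
            if word in index:            # words outside the toy vocabulary are ignored
                vec[index[word]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    return embed

def retrieve(query, store, embed, k):
    """Return the k chunks whose vectors score highest against the query vector."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

document = ("Paris is the capital of France. "
            "Embeddings map tokens to dense vectors. "
            "Cosine similarity compares vector directions.")
chunks = chunk_text(document, words_per_chunk=6)
embed = build_toy_embedder(chunks)
store = [(chunk, embed(chunk)) for chunk in chunks]   # the toy "vector database"
top = retrieve("What is the capital of France?", store, embed, k=1)
```

Here `top` holds the chunk about Paris, which would then be passed to the LLM as context. Word-count vectors only match on shared words; a learned embedding model is what lets retrieval work when the query and the chunk use different wording.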
Analogy — Address index is like Vector database
A traditional database index tells you “row 4,521 has the value ‘Paris’.” A vector database index tells you “these 10 embeddings are geometrically closest to your query embedding” — no exact match needed, just proximity in learned meaning-space.
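Cosine similarity measures the angle between two vectors, ignoring their magnitudes. Two embeddings with cosine similarity near 1.0 point in almost the same direction and are usually semantically similar; near 0, they are unrelated; near −1.0, they point in opposite directions, which is rare for learned embeddings and is not the same as being antonyms. It is the most common metric for embedding comparison.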
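A minimal implementation makes the "angle, not magnitude" point concrete. The three vector pairs are arbitrary examples chosen to hit the parallel, perpendicular and opposite cases.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [3.0, 6.0])    # parallel, different lengths
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 5.0])     # at a right angle
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])        # reversed direction
```

The first pair scores approximately 1 even though one vector is three times longer, the second scores 0, and the third approximately −1. This sketch assumes neither vector is all zeros, since that would divide by zero.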
Check your understanding
In a RAG system, what is the query encoded into before searching the vector database?
End-of-lesson check
A language model uses a vocabulary of 32,000 tokens and 1,024-dimensional embeddings. How many parameters are in its embedding matrix?
Ownable artifact — derive it yourself.
Sketch the full pipeline for the string "Hello world" on paper:
- Write out the characters, then group them into tokens (use your intuition — “hello” is one, “world” is one, space is one).
- Assign each token a made-up integer ID.
- Draw the 3×4 embedding matrix (3 tokens, 4 dimensions). Fill each row with random numbers.
- “Look up” each token: circle its row.
- Now ask: if you trained this system to predict “world” given “Hello”, which row’s numbers would backpropagation update?
That sketch captures the entire mechanism: text → tokens → IDs → embedding lookup → gradient update.