Have you ever wanted to understand how Large Language Models (LLMs) actually work under the hood? The best way to learn is to build one. Not a massive 175-billion-parameter giant that requires a supercomputer, but a "tiny" LLM - one small enough to train on a single GPU (or even a CPU, if you're patient) yet complex enough to teach you the core concepts.
In this tutorial, we'll build a miniature decoder-only transformer (approx. 1-10M parameters) from scratch. We'll cover the tokenizer, the model architecture, the training loop, and generation.
What We're Building
We are building a TinyGPT:
- Architecture: Decoder-only Transformer (like GPT-2/3)
- Size: 1-10 million parameters
- Compute: Single consumer GPU (e.g., RTX 3060) or CPU
- Goal: Understand the moving parts: tokenization, embeddings, attention, and training dynamics.
Project Structure
Here is how we'll organize our code:
tiny-llm/
├── data/
│   └── tiny_corpus.txt
├── tokenizer.py
├── model.py
├── train.py
├── inference.py
└── config.py
Step 1: The Corpus
For a tiny model, we need a tiny dataset. You can use public-domain books (Sherlock Holmes, for example), the "TinyStories" dataset, or even your own notes. About 1-10 MB of text is enough.
Save your text file as data/tiny_corpus.txt.
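Before moving on, it's worth eyeballing the corpus for stray junk (HTML, license boilerplate, odd encodings). Here is a minimal sanity check, assuming the layout above; the file name check_corpus.py is just illustrative:

# check_corpus.py (optional, illustrative)
from pathlib import Path

corpus_path = Path("data/tiny_corpus.txt")
text = corpus_path.read_text(encoding="utf-8")

print(f"Size on disk: {corpus_path.stat().st_size / 1e6:.1f} MB")
print(f"Characters:   {len(text):,}")
print(f"Unique chars: {len(set(text))}")
print(text[:200])  # skim the opening for headers or markup that should be stripped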
Step 2: The Tokenizer
We'll use a Byte-Pair Encoding (BPE) tokenizer. The tokenizers library from Hugging Face makes this easy.
# tokenizer.py
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<bos>", "<eos>"])
# Pre-tokenize by splitting on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# Train on our tiny corpus
tokenizer.train(["data/tiny_corpus.txt"], trainer)
# Save for later
tokenizer.save("tokenizer.json")
A vocabulary size of 8,000 is a reasonable choice at this scale: large enough to cover the common words and subwords of a small corpus, yet small enough that the embedding table stays around a million parameters (8,000 × 128).
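Once the tokenizer is saved, it's worth loading it back and round-tripping a sentence to confirm it behaves sensibly. A quick check, assuming the tokenizer.json produced above:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

enc = tokenizer.encode("The quick brown fox jumps over the lazy dog.")
print(enc.tokens)                 # the subword pieces (exact splits depend on your corpus)
print(enc.ids)                    # the integer IDs the model will actually see
print(tokenizer.decode(enc.ids))  # should roughly reconstruct the sentence (modulo spacing)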
Step 3: The Model Architecture
We'll define a minimal configuration and the model structure using PyTorch.
Configuration (config.py):
model_config = {
    "vocab_size": 8000,   # Must match the tokenizer's vocabulary size
    "n_embd": 128,        # Embedding dimension
    "n_heads": 4,         # Number of attention heads
    "n_layers": 6,        # Number of transformer layers
    "block_size": 128,    # Context window size (max sequence length)
}
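Before writing any model code, it's worth checking that these numbers land inside the 1-10M parameter budget. Here is a rough estimate that ignores biases and LayerNorms, assuming the architecture we build next (with an output head untied from the embedding):

from config import model_config as cfg

emb = cfg["vocab_size"] * cfg["n_embd"]     # token embedding table
pos = cfg["block_size"] * cfg["n_embd"]     # learned positional embeddings
attn = 4 * cfg["n_embd"] ** 2               # Q, K, V and output projections per layer
ff = 8 * cfg["n_embd"] ** 2                 # two feed-forward linears with a 4x hidden width
head = cfg["n_embd"] * cfg["vocab_size"]    # final projection back to the vocabulary

total = emb + pos + cfg["n_layers"] * (attn + ff) + head
print(f"~{total / 1e6:.1f}M parameters")    # about 3.2M with the values above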
The Model (model.py):
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Token embedding table plus a learned positional embedding table
        self.tok_embed = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.pos_embed = nn.Parameter(torch.zeros(1, config["block_size"], config["n_embd"]))
        self.blocks = nn.ModuleList([
            DecoderBlock(config) for _ in range(config["n_layers"])
        ])
        self.ln_f = nn.LayerNorm(config["n_embd"])
        self.head = nn.Linear(config["n_embd"], config["vocab_size"], bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_embed(idx)        # (B, T, n_embd)
        pos = self.pos_embed[:, :T, :]   # (1, T, n_embd), broadcast over the batch
        x = tok + pos
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)            # (B, T, vocab_size)
        return logits

class DecoderBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=config["n_embd"],
            num_heads=config["n_heads"],
            batch_first=True
        )
        self.ln1 = nn.LayerNorm(config["n_embd"])
        self.ff = nn.Sequential(
            nn.Linear(config["n_embd"], 4 * config["n_embd"]),
            nn.GELU(),
            nn.Linear(4 * config["n_embd"], config["n_embd"])
        )
        self.ln2 = nn.LayerNorm(config["n_embd"])

    def forward(self, x):
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future)
        attn_mask = torch.triu(
            torch.ones(x.size(1), x.size(1)), diagonal=1
        ).bool().to(x.device)
        attn_output, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = x + attn_output   # residual connection around attention
        x = self.ln1(x)
        ff_output = self.ff(x)
        x = x + ff_output     # residual connection around the feed-forward network
        x = self.ln2(x)
        return x
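A quick smoke test for model.py: push a batch of random token IDs through the model and confirm the parameter count and output shape. This assumes the config.py and model.py from above:

import torch
from config import model_config
from model import TinyGPT

model = TinyGPT(model_config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

# A batch of 4 random "sentences" of 32 token IDs each
dummy = torch.randint(0, model_config["vocab_size"], (4, 32))
logits = model(dummy)
print(logits.shape)  # expected: torch.Size([4, 32, 8000]) -- one score per vocab entry per position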
Step 4: Training Loop
Now we bring it all together in a training loop.
# train.py
import torch
from model import TinyGPT
from tokenizers import Tokenizer
from config import model_config

# Load the tokenizer we trained and saved in Step 2 (re-importing tokenizer.py would retrain it)
tokenizer = Tokenizer.from_file("tokenizer.json")
# ... (Load data and encode helper functions here) ...
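# One possible (illustrative) version of those helpers: encode the whole corpus once,
# then sample random (input, shifted-by-one target) windows from the token stream.
import random

with open("data/tiny_corpus.txt", encoding="utf-8") as f:
    encoded_data = torch.tensor(tokenizer.encode(f.read()).ids, dtype=torch.long)

def get_batches(data, batch_size, block_size):
    # Yield roughly one epoch's worth of random batches; tweak to taste.
    for _ in range(len(data) // (batch_size * block_size)):
        ix = [random.randint(0, len(data) - block_size - 2) for _ in range(batch_size)]
        x = torch.stack([data[i:i + block_size] for i in ix])
        y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # next-token targets
        yield x, y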
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyGPT(model_config).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Simple training loop
for epoch in range(10):
    # Iterate over mini-batches of (input, next-token target) pairs
    for x, y in get_batches(encoded_data, batch_size=32, block_size=128):
        x, y = x.to(device), y.to(device)
        logits = model(x)
        # Flatten (B, T, vocab) logits and (B, T) targets for cross-entropy
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            y.view(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"Epoch {epoch} Loss: {loss.item()}")
torch.save(model.state_dict(), "tiny_llm.pt")
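A useful sanity check on the training dynamics: with an 8,000-token vocabulary, a freshly initialized model should report a loss near that of uniform guessing, ln(8000) ≈ 9.0, and any loss value maps to a perplexity via the exponential. A quick way to see both numbers:

import math

print(math.log(8000))  # ~8.99: the expected cross-entropy loss before any training
print(math.exp(4.0))   # ~54.6: the perplexity corresponding to a loss of 4.0

If your very first reported loss is far from ~9, it's worth double-checking the batching and target alignment before training further.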
Step 5: Generation
Once trained, we can generate text!
# inference.py
import torch
from tokenizers import Tokenizer
from model import TinyGPT
from config import model_config
device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50):
    model.eval()
    idx = torch.tensor(tokenizer.encode(prompt).ids, dtype=torch.long)[None].to(device)
    for _ in range(max_new_tokens):
        # Feed at most block_size tokens: the positional table is only that long
        logits = model(idx[:, -model_config["block_size"]:])
        # Greedy decoding: pick the most likely next token
        next_token = torch.argmax(logits[:, -1, :], dim=-1)
        idx = torch.cat([idx, next_token.unsqueeze(1)], dim=1)
    return tokenizer.decode(idx[0].tolist())

# Load the tokenizer and the weights saved by train.py
tokenizer = Tokenizer.from_file("tokenizer.json")
model = TinyGPT(model_config).to(device)
model.load_state_dict(torch.load("tiny_llm.pt", map_location=device))
print(generate(model, tokenizer, "The story begins", 80))
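Greedy decoding tends to loop and repeat on tiny models. A common alternative is temperature sampling: scale the logits, turn them into probabilities, and sample. The sketch below reuses the model, tokenizer, and device set up in inference.py; the name generate_sampled is just for illustration.

import torch
import torch.nn.functional as F
from config import model_config

@torch.no_grad()
def generate_sampled(model, tokenizer, prompt, max_new_tokens=50, temperature=0.8):
    model.eval()
    idx = torch.tensor(tokenizer.encode(prompt).ids, dtype=torch.long)[None].to(device)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -model_config["block_size"]:])
        # Lower temperature sharpens the distribution; higher temperature flattens it
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (B, 1) sampled token IDs
        idx = torch.cat([idx, next_token], dim=1)
    return tokenizer.decode(idx[0].tolist())

print(generate_sampled(model, tokenizer, "The story begins", 80))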
Why Do This?
Building a tiny LLM demystifies the "magic" of AI. You learn:
- Tokenization: How text becomes numbers.
- Attention: How the model relates words to each other.
- Training Dynamics: How loss decreases and the model "learns" structure.
Once you have this running, you can experiment with scaling up, adding more layers, or trying different datasets. Happy coding!