Have you ever wanted to understand how Large Language Models (LLMs) actually work under the hood? The best way to learn is to build one. Not a massive 175-billion-parameter giant that requires a supercomputer, but a "tiny" LLM - one small enough to train on a single GPU (or even a CPU, if you're patient) yet complex enough to teach you the core concepts.
In this tutorial, we'll build a miniature decoder-only transformer (approx. 1-10M parameters) from scratch. We'll cover the tokenizer, the model architecture, the training loop, and generation.
What We're Building
We are building a TinyGPT:
- Architecture: Decoder-only Transformer (like GPT-2/3)
- Size: 1-10 million parameters
- Compute: Single consumer GPU (e.g., RTX 3060) or CPU
- Goal: Understand the moving parts: tokenization, embeddings, attention, and training dynamics.
Project Structure
Here is how we'll organize our code:
tiny-llm/
├── data/
│   └── tiny_corpus.txt
├── tokenizer.py
├── model.py
├── train.py
├── inference.py
└── config.py
Step 1: The Corpus
For a tiny model, we need a tiny dataset. You can use public-domain books (Sherlock Holmes, for example), the "TinyStories" dataset, or even your own notes. About 1-10 MB of text is enough.
Save your text file as data/tiny_corpus.txt.
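Before moving on, it's worth eyeballing the corpus for stray junk (HTML, license boilerplate, odd encodings). Here is a minimal sanity check, assuming the layout above; the file name check_corpus.py is just illustrative:

# check_corpus.py (optional, illustrative)
from pathlib import Path

corpus_path = Path("data/tiny_corpus.txt")
text = corpus_path.read_text(encoding="utf-8")

print(f"Size on disk: {corpus_path.stat().st_size / 1e6:.1f} MB")
print(f"Characters:   {len(text):,}")
print(f"Unique chars: {len(set(text))}")
print(text[:200])  # skim the opening for headers or markup that should be stripped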
Step 2: The Tokenizer
We'll use a Byte-Pair Encoding (BPE) tokenizer. The tokenizers library from Hugging Face makes this easy.
# tokenizer.py
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<bos>", "<eos>"])
# Pre-tokenize by splitting on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# Train on our tiny corpus
tokenizer.train(["data/tiny_corpus.txt"], trainer)
# Save for later
tokenizer.save("tokenizer.json")
A vocabulary size of 8,000 is a reasonable choice at this scale: large enough to cover the common words and subwords of a small corpus, yet small enough that the embedding table stays around a million parameters (8,000 × 128).
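Once the tokenizer is saved, it's worth loading it back and round-tripping a sentence to confirm it behaves sensibly. A quick check, assuming the tokenizer.json produced above:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

enc = tokenizer.encode("The quick brown fox jumps over the lazy dog.")
print(enc.tokens)                 # the subword pieces (exact splits depend on your corpus)
print(enc.ids)                    # the integer IDs the model will actually see
print(tokenizer.decode(enc.ids))  # should roughly reconstruct the sentence (modulo spacing)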
Step 3: The Model Architecture
We'll define a minimal configuration and the model structure using PyTorch.
Configuration (config.py):
model_config = {
    "vocab_size": 8000,   # Must match the tokenizer's vocabulary size
    "n_embd": 128,        # Embedding dimension
    "n_heads": 4,         # Number of attention heads
    "n_layers": 6,        # Number of transformer layers
    "block_size": 128,    # Context window size (max sequence length)
}
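Before writing any model code, it's worth checking that these numbers land inside the 1-10M parameter budget. Here is a rough estimate that ignores biases and LayerNorms, assuming the architecture we build next (with an output head untied from the embedding):

from config import model_config as cfg

emb = cfg["vocab_size"] * cfg["n_embd"]     # token embedding table
pos = cfg["block_size"] * cfg["n_embd"]     # learned positional embeddings
attn = 4 * cfg["n_embd"] ** 2               # Q, K, V and output projections per layer
ff = 8 * cfg["n_embd"] ** 2                 # two feed-forward linears with a 4x hidden width
head = cfg["n_embd"] * cfg["vocab_size"]    # final projection back to the vocabulary

total = emb + pos + cfg["n_layers"] * (attn + ff) + head
print(f"~{total / 1e6:.1f}M parameters")    # about 3.2M with the values above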
The Model (model.py):
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Token embedding table plus a learned positional embedding table
        self.tok_embed = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.pos_embed = nn.Parameter(torch.zeros(1, config["block_size"], config["n_embd"]))
        self.blocks = nn.ModuleList([
            DecoderBlock(config) for _ in range(config["n_layers"])
        ])
        self.ln_f = nn.LayerNorm(config["n_embd"])
        self.head = nn.Linear(config["n_embd"], config["vocab_size"], bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_embed(idx)        # (B, T, n_embd)
        pos = self.pos_embed[:, :T, :]   # (1, T, n_embd), broadcast over the batch
        x = tok + pos
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)            # (B, T, vocab_size)
        return logits

class DecoderBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=config["n_embd"],
            num_heads=config["n_heads"],
            batch_first=True
        )
        self.ln1 = nn.LayerNorm(config["n_embd"])
        self.ff = nn.Sequential(
            nn.Linear(config["n_embd"], 4 * config["n_embd"]),
            nn.GELU(),
            nn.Linear(4 * config["n_embd"], config["n_embd"])
        )
        self.ln2 = nn.LayerNorm(config["n_embd"])

    def forward(self, x):
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future)
        attn_mask = torch.triu(
            torch.ones(x.size(1), x.size(1)), diagonal=1
        ).bool().to(x.device)
        attn_output, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = x + attn_output   # residual connection around attention
        x = self.ln1(x)
        ff_output = self.ff(x)
        x = x + ff_output     # residual connection around the feed-forward network
        x = self.ln2(x)
        return x
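A quick smoke test for model.py: push a batch of random token IDs through the model and confirm the parameter count and output shape. This assumes the config.py and model.py from above:

import torch
from config import model_config
from model import TinyGPT

model = TinyGPT(model_config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

# A batch of 4 random "sentences" of 32 token IDs each
dummy = torch.randint(0, model_config["vocab_size"], (4, 32))
logits = model(dummy)
print(logits.shape)  # expected: torch.Size([4, 32, 8000]) -- one score per vocab entry per position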
Step 4: Training Loop
Now we bring it all together in a training loop.
# train.py
import torch
from model import TinyGPT
from tokenizers import Tokenizer
from config import model_config

# Load the tokenizer we trained and saved in Step 2 (re-importing tokenizer.py would retrain it)
tokenizer = Tokenizer.from_file("tokenizer.json")
# ... (Load data and encode helper functions here) ...
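# One possible (illustrative) version of those helpers: encode the whole corpus once,
# then sample random (input, shifted-by-one target) windows from the token stream.
import random

with open("data/tiny_corpus.txt", encoding="utf-8") as f:
    encoded_data = torch.tensor(tokenizer.encode(f.read()).ids, dtype=torch.long)

def get_batches(data, batch_size, block_size):
    # Yield roughly one epoch's worth of random batches; tweak to taste.
    for _ in range(len(data) // (batch_size * block_size)):
        ix = [random.randint(0, len(data) - block_size - 2) for _ in range(batch_size)]
        x = torch.stack([data[i:i + block_size] for i in ix])
        y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # next-token targets
        yield x, y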
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyGPT(model_config).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Simple training loop
for epoch in range(10):
    # Iterate over mini-batches of (input, next-token target) pairs
    for x, y in get_batches(encoded_data, batch_size=32, block_size=128):
        x, y = x.to(device), y.to(device)
        logits = model(x)
        # Flatten (B, T, vocab) logits and (B, T) targets for cross-entropy
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            y.view(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"Epoch {epoch} Loss: {loss.item()}")
torch.save(model.state_dict(), "tiny_llm.pt")
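A useful sanity check on the training dynamics: with an 8,000-token vocabulary, a freshly initialized model should report a loss near that of uniform guessing, ln(8000) ≈ 9.0, and any loss value maps to a perplexity via the exponential. A quick way to see both numbers:

import math

print(math.log(8000))  # ~8.99: the expected cross-entropy loss before any training
print(math.exp(4.0))   # ~54.6: the perplexity corresponding to a loss of 4.0

If your very first reported loss is far from ~9, it's worth double-checking the batching and target alignment before training further.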
Step 5: Generation
Once trained, we can generate text!
# inference.py
import torch
from tokenizers import Tokenizer
from model import TinyGPT
from config import model_config
device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50):
    model.eval()
    idx = torch.tensor(tokenizer.encode(prompt).ids, dtype=torch.long)[None].to(device)
    for _ in range(max_new_tokens):
        # Feed at most block_size tokens: the positional table is only that long
        logits = model(idx[:, -model_config["block_size"]:])
        # Greedy decoding: pick the most likely next token
        next_token = torch.argmax(logits[:, -1, :], dim=-1)
        idx = torch.cat([idx, next_token.unsqueeze(1)], dim=1)
    return tokenizer.decode(idx[0].tolist())

# Load the tokenizer and the weights saved by train.py
tokenizer = Tokenizer.from_file("tokenizer.json")
model = TinyGPT(model_config).to(device)
model.load_state_dict(torch.load("tiny_llm.pt", map_location=device))
print(generate(model, tokenizer, "The story begins", 80))
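Greedy decoding tends to loop and repeat on tiny models. A common alternative is temperature sampling: scale the logits, turn them into probabilities, and sample. The sketch below reuses the model, tokenizer, and device set up in inference.py; the name generate_sampled is just for illustration.

import torch
import torch.nn.functional as F
from config import model_config

@torch.no_grad()
def generate_sampled(model, tokenizer, prompt, max_new_tokens=50, temperature=0.8):
    model.eval()
    idx = torch.tensor(tokenizer.encode(prompt).ids, dtype=torch.long)[None].to(device)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -model_config["block_size"]:])
        # Lower temperature sharpens the distribution; higher temperature flattens it
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (B, 1) sampled token IDs
        idx = torch.cat([idx, next_token], dim=1)
    return tokenizer.decode(idx[0].tolist())

print(generate_sampled(model, tokenizer, "The story begins", 80))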
Why Do This?
Building a tiny LLM demystifies the "magic" of AI. You learn:
- Tokenization: How text becomes numbers.
- Attention: How the model relates words to each other.
- Training Dynamics: How loss decreases and the model "learns" structure.
Once you have this running, you can experiment with scaling up, adding more layers, or trying different datasets. Happy coding!