A clean, educational implementation of a character-level GPT-style transformer from scratch in PyTorch. This project teaches transformer concepts through a small but complete implementation that you can run in minutes.
This is a tiny transformer model that learns to generate text at the character level. It implements the core components of modern language models:
- Embeddings: Token and positional embeddings for input representation
- Self-Attention: Multi-head causal attention mechanism
- Feed-Forward Networks: Position-wise MLPs with GELU activation
- Residual Connections: Skip connections for training stability
- Layer Normalization: Pre-layer norm architecture (GPT-2 style)
- Weight Tying: Shared weights between input embeddings and output head
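The heart of the components above is multi-head causal self-attention. Here is a minimal, illustrative sketch of how it can be implemented in PyTorch — this is an assumption for teaching purposes, not the project's exact `src/model.py` code (the class name and layout here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to Q, K, V at once
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # Lower-triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) for per-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

The causal mask is what makes this a language model: each position can only attend to earlier positions, so the model learns to predict the next character.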
The model is trained on a small Shakespeare excerpt and learns to generate Shakespeare-ish text.
- Small but complete: ~80K parameters by default, trains in minutes on CPU
- Well-commented code: Every component explained for learning
- No heavy dependencies: Just PyTorch, with optional rich/wandb for extras
- Multiple interfaces: Command line, Python API, and Jupyter notebook
- Comprehensive tests: Shape tests, masking tests, tokenizer tests
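For a flavor of the simplest component, a character-level tokenizer like the one in `src/data.py` can be sketched in a few lines — this is an illustrative sketch, and the project's actual class and method names may differ:

```python
class CharTokenizer:
    """Maps each unique character in the corpus to an integer id (sketch)."""

    def __init__(self, text: str):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
        self.itos = {i: ch for i, ch in enumerate(chars)}  # id -> char
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[c] for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)
```

Because the vocabulary is just the set of characters in the training text, round-tripping any in-vocabulary string through `encode` and `decode` is lossless.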
```shell
# Create virtual environment
python -m venv .venv

# Activate (Linux/Mac)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```shell
# Train with default settings (2-3 minutes on CPU)
python -m src.train

# Or use the Makefile
make train
```

```shell
# Generate text sample
python -m src.generate --ckpt_path checkpoints/best.ckpt --tokens 400

# Or use the Makefile
make sample
```

```shell
# Run all tests
pytest -q

# Or use the Makefile
make test
```

After training for ~1200 iterations (default), you might see output like:
```
--- SAMPLE ---
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd...
```
Note: With the small dataset and model size, the generated text will be somewhat repetitive but should capture the general style and structure.
You can customize the model and training through command-line arguments:
```shell
# Larger model with a longer context
python -m src.train --n_layer 4 --n_head 4 --n_embd 256 --block_size 256

# Different training hyperparameters
python -m src.train --max_iters 2000 --lr 1e-3 --batch_size 64 --dropout 0.2

# Generation with custom settings
python -m src.generate --ckpt_path checkpoints/best.ckpt \
    --tokens 500 --temperature 0.8 --start "To be or not"
```

Experiment with these parameters to see how they affect training and generation:
- Model Size:
  - `--n_layer 4`: More layers for deeper representations
  - `--n_head 8`: More attention heads for richer patterns
  - `--n_embd 256`: Larger embedding dimension
- Sequence Length:
  - `--block_size 256`: Longer context for better coherence
- Training:
  - `--max_iters 5000`: Longer training for better quality
  - `--dropout 0.0`: Remove dropout to see overfitting effects
  - `--lr 1e-4`: Lower learning rate for more stable training
- Generation:
  - `--temperature 0.5`: More focused, less random output
  - `--temperature 1.5`: More creative, more random output
  - `--top_k 10`: Limit sampling to the top 10 tokens
```shell
# Optional: log metrics to Weights & Biases
pip install wandb
python -m src.train --use_wandb --project_name my-tiny-transformer

# Use a learning-rate scheduler
python -m src.train --use_scheduler

# Resume training from the latest checkpoint
python -m src.train --resume
```

Open notebooks/0_quick_start.ipynb for an interactive demo that runs in 4 cells.
```
tiny-transformer-buildml/
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── Makefile                 # Convenience commands
├── .gitignore               # Git ignore patterns
├── src/
│   ├── data.py              # Character tokenizer and data loading
│   ├── model.py             # Transformer model components
│   ├── train.py             # Training loop and evaluation
│   ├── generate.py          # Text generation
│   ├── config.py            # Configuration management
│   └── utils.py             # Utilities and helpers
├── notebooks/
│   └── 0_quick_start.ipynb  # Interactive demo
├── tests/
│   ├── test_shapes.py       # Tensor shape tests
│   ├── test_masking.py      # Attention masking tests
│   └── test_tokenizer.py    # Tokenizer tests
└── checkpoints/             # Saved model checkpoints
```
With default settings:
- Training time: 2-3 minutes on CPU, 30 seconds on GPU
- Model size: ~80K parameters, ~0.3MB
- Memory usage: <100MB during training
- Final loss: ~1.5-2.0 (character-level cross-entropy)
The model will generate somewhat coherent Shakespeare-ish text but with repetitions due to the small dataset and model size.
```shell
# Reduce batch size
python -m src.train --batch_size 16

# Reduce model size
python -m src.train --n_embd 64 --block_size 64
```

```shell
# Reduce training iterations for quick demo
python -m src.train --max_iters 200 --eval_interval 50

# Use smaller model
python -m src.train --n_layer 1 --n_head 1 --n_embd 64
```

Make sure you're in the project root directory and have activated your virtual environment.
This tiny transformer demonstrates the same core principles used in large language models:
- Scaling: Large models use the same architecture with more layers (96+ vs 2), more heads (96+ vs 2), and larger embeddings (4096+ vs 128)
- Data: Large models train on billions of tokens vs our ~2000 characters
- Training: Large models train for weeks/months vs our minutes
- Techniques: Same attention mechanism, residual connections, layer norm, and weight tying
This is an educational project. Feel free to:
- Add more text data sources
- Implement additional generation strategies (beam search, nucleus sampling)
- Add more comprehensive visualizations
- Improve documentation and examples
MIT License - feel free to use for educational purposes.
Inspired by:
- Andrej Karpathy's nanoGPT
- "Attention Is All You Need" paper
- BuildML's transformer tutorial
- The GPT and GPT-2 papers
| Command | Description |
|---|---|
| `make setup` | Create virtual environment and install dependencies |
| `make train` | Train the model with default settings |
| `make sample` | Generate text sample from best checkpoint |
| `make test` | Run all tests |
| `make format` | Format code with ruff (if available) |
| `make clean` | Clean up generated files |
Happy learning! 🚀