A clean, educational implementation of a character-level GPT-style transformer from scratch in PyTorch. This project teaches transformer concepts through a small but complete implementation that you can run in minutes.
This is a tiny transformer model that learns to generate text at the character level. It implements the core components of modern language models:
- Embeddings: Token and positional embeddings for input representation
- Self-Attention: Multi-head causal attention mechanism
- Feed-Forward Networks: Position-wise MLPs with GELU activation
- Residual Connections: Skip connections for training stability
- Layer Normalization: Pre-layer norm architecture (GPT-2 style)
- Weight Tying: Shared weights between input embeddings and output head
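The heart of the components above is multi-head causal self-attention. Here is a minimal, illustrative sketch of how it can be implemented in PyTorch — this is an assumption for teaching purposes, not the project's exact `src/model.py` code (the class name and layout here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to Q, K, V at once
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # Lower-triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) for per-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

The causal mask is what makes this a language model: each position can only attend to earlier positions, so the model learns to predict the next character.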
The model is trained on a small Shakespeare excerpt and learns to generate Shakespeare-ish text.
- Small but complete: ~80K parameters by default, trains in minutes on CPU
- Well-commented code: Every component explained for learning
- No heavy dependencies: Just PyTorch, with optional rich/wandb for extras
- Multiple interfaces: Command line, Python API, and Jupyter notebook
- Comprehensive tests: Shape tests, masking tests, tokenizer tests
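For a flavor of the simplest component, a character-level tokenizer like the one in `src/data.py` can be sketched in a few lines — this is an illustrative sketch, and the project's actual class and method names may differ:

```python
class CharTokenizer:
    """Maps each unique character in the corpus to an integer id (sketch)."""

    def __init__(self, text: str):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
        self.itos = {i: ch for i, ch in enumerate(chars)}  # id -> char
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[c] for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)
```

Because the vocabulary is just the set of characters in the training text, round-tripping any in-vocabulary string through `encode` and `decode` is lossless.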
```shell
# Create virtual environment
python -m venv .venv

# Activate (Linux/Mac)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```shell
# Train with default settings (2-3 minutes on CPU)
python -m src.train

# Or use the Makefile
make train
```

```shell
# Generate text sample
python -m src.generate --ckpt_path checkpoints/best.ckpt --tokens 400

# Or use the Makefile
make sample
```

```shell
# Run all tests
pytest -q

# Or use the Makefile
make test
```

After training for ~1200 iterations (default), you might see output like:
```
--- SAMPLE ---
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd...
```
Note: With the small dataset and model size, the generated text will be somewhat repetitive but should capture the general style and structure.
You can customize the model and training through command-line arguments:
```shell
# Larger model with a longer context
python -m src.train --n_layer 4 --n_head 4 --n_embd 256 --block_size 256

# Different training hyperparameters
python -m src.train --max_iters 2000 --lr 1e-3 --batch_size 64 --dropout 0.2

# Generation with custom settings
python -m src.generate --ckpt_path checkpoints/best.ckpt \
    --tokens 500 --temperature 0.8 --start "To be or not"
```

Experiment with these parameters to see how they affect training and generation:
- Model Size:
  - `--n_layer 4`: More layers for deeper representations
  - `--n_head 8`: More attention heads for richer patterns
  - `--n_embd 256`: Larger embedding dimension
- Sequence Length:
  - `--block_size 256`: Longer context for better coherence
- Training:
  - `--max_iters 5000`: Longer training for better quality
  - `--dropout 0.0`: Remove dropout to see overfitting effects
  - `--lr 1e-4`: Lower learning rate for more stable training
- Generation:
  - `--temperature 0.5`: More focused, less random output
  - `--temperature 1.5`: More creative, more random output
  - `--top_k 10`: Limit sampling to the top 10 tokens
```shell
# Optional: log metrics to Weights & Biases
pip install wandb
python -m src.train --use_wandb --project_name my-tiny-transformer

# Use a learning-rate scheduler
python -m src.train --use_scheduler

# Resume training from the latest checkpoint
python -m src.train --resume
```

Open notebooks/0_quick_start.ipynb for an interactive demo that runs in 4 cells.
```
tiny-transformer-buildml/
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── Makefile                 # Convenience commands
├── .gitignore               # Git ignore patterns
├── src/
│   ├── data.py              # Character tokenizer and data loading
│   ├── model.py             # Transformer model components
│   ├── train.py             # Training loop and evaluation
│   ├── generate.py          # Text generation
│   ├── config.py            # Configuration management
│   └── utils.py             # Utilities and helpers
├── notebooks/
│   └── 0_quick_start.ipynb  # Interactive demo
├── tests/
│   ├── test_shapes.py       # Tensor shape tests
│   ├── test_masking.py      # Attention masking tests
│   └── test_tokenizer.py    # Tokenizer tests
└── checkpoints/             # Saved model checkpoints
```
With default settings:
- Training time: 2-3 minutes on CPU, 30 seconds on GPU
- Model size: ~80K parameters, ~0.3MB
- Memory usage: <100MB during training
- Final loss: ~1.5-2.0 (character-level cross-entropy)
The model will generate somewhat coherent Shakespeare-ish text but with repetitions due to the small dataset and model size.
```shell
# Reduce batch size
python -m src.train --batch_size 16

# Reduce model size
python -m src.train --n_embd 64 --block_size 64
```

```shell
# Reduce training iterations for quick demo
python -m src.train --max_iters 200 --eval_interval 50

# Use smaller model
python -m src.train --n_layer 1 --n_head 1 --n_embd 64
```

Make sure you're in the project root directory and have activated your virtual environment.
This tiny transformer demonstrates the same core principles used in large language models:
- Scaling: Large models use the same architecture with more layers (96+ vs 2), more heads (96+ vs 2), and larger embeddings (4096+ vs 128)
- Data: Large models train on billions of tokens vs our ~2000 characters
- Training: Large models train for weeks/months vs our minutes
- Techniques: Same attention mechanism, residual connections, layer norm, and weight tying
This is an educational project. Feel free to:
- Add more text data sources
- Implement additional generation strategies (beam search, nucleus sampling)
- Add more comprehensive visualizations
- Improve documentation and examples
MIT License - feel free to use for educational purposes.
Inspired by:
- Andrej Karpathy's nanoGPT
- "Attention Is All You Need" paper
- BuildML's transformer tutorial
- The GPT and GPT-2 papers
| Command | Description |
|---|---|
| `make setup` | Create virtual environment and install dependencies |
| `make train` | Train the model with default settings |
| `make sample` | Generate text sample from best checkpoint |
| `make test` | Run all tests |
| `make format` | Format code with ruff (if available) |
| `make clean` | Clean up generated files |
Happy learning! 🚀