Tiny Transformer BuildML

A clean, educational implementation of a character-level GPT-style transformer from scratch in PyTorch. This project teaches transformer concepts through a small but complete implementation that you can run in minutes.

What is This?

This is a tiny transformer model that learns to generate text at the character level. It implements the core components of modern language models:

  • Embeddings: Token and positional embeddings for input representation
  • Self-Attention: Multi-head causal attention mechanism
  • Feed-Forward Networks: Position-wise MLPs with GELU activation
  • Residual Connections: Skip connections for training stability
  • Layer Normalization: Pre-layer norm architecture (GPT-2 style)
  • Weight Tying: Shared weights between input embeddings and output head

The model is trained on a small Shakespeare excerpt and learns to generate Shakespeare-ish text.
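The sketch below shows how these pieces fit together in a pre-layer-norm transformer block, written in PyTorch. It is an illustration of the architecture described above, not a copy of src/model.py; names and default sizes are placeholders.

import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-layer-norm transformer block (GPT-2 style); illustrative sketch only."""

    def __init__(self, n_embd=128, n_head=2, block_size=128, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(              # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # Causal mask: True above the diagonal = "may not attend to future positions".
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):                      # x: (batch, seq_len, n_embd)
        t = x.size(1)
        h = self.ln1(x)                        # pre-LN: normalize before attention
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:t, :t])
        x = x + attn_out                       # residual connection around attention
        x = x + self.mlp(self.ln2(x))          # pre-LN + residual around the MLP
        return x

# The full model stacks a few of these blocks between token/positional embeddings and a
# linear output head; weight tying simply makes that head share its weight matrix with
# the token embedding table.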

Why Educational?

  • Small but complete: ~80K parameters by default, trains in minutes on CPU
  • Well-commented code: Every component explained for learning
  • No heavy dependencies: Just PyTorch, with optional rich/wandb for extras
  • Multiple interfaces: Command line, Python API, and Jupyter notebook
  • Comprehensive tests: Shape tests, masking tests, tokenizer tests

Quick Start

1. Setup Environment

# Create virtual environment
python -m venv .venv

# Activate (Linux/Mac)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Train the Model

# Train with default settings (2-3 minutes on CPU)
python -m src.train

# Or use the Makefile
make train

3. Generate Text

# Generate text sample
python -m src.generate --ckpt_path checkpoints/best.ckpt --tokens 400

# Or use the Makefile
make sample

4. Run Tests

# Run all tests
pytest -q

# Or use the Makefile
make test
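
For a flavor of what the masking tests verify, the self-contained snippet below checks that causal attention puts exactly zero weight on future positions. It is a generic illustration of the idea, not the contents of tests/test_masking.py.

import torch
import torch.nn.functional as F

def test_causal_mask_blocks_future():
    T, D = 8, 16
    q, k = torch.randn(1, T, D), torch.randn(1, T, D)
    scores = q @ k.transpose(-2, -1) / D ** 0.5                  # (1, T, T) attention logits
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    # Every attention weight strictly above the diagonal must be exactly zero.
    assert torch.all(weights.masked_select(future) == 0)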

Example Output

After training for ~1200 iterations (default), you might see output like:

--- SAMPLE ---
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd...

Note: With the small dataset and model size, the generated text will be somewhat repetitive but should capture the general style and structure.

Configuration Options

You can customize the model and training through command-line arguments:

Model Architecture

python -m src.train --n_layer 4 --n_head 4 --n_embd 256 --block_size 256

Training Settings

python -m src.train --max_iters 2000 --lr 1e-3 --batch_size 64 --dropout 0.2

Generation Options

python -m src.generate --ckpt_path checkpoints/best.ckpt \
  --tokens 500 --temperature 0.8 --start "To be or not"

Tweaks to Try

Experiment with these parameters to see how they affect training and generation:

  1. Model Size:

    • --n_layer 4: More layers for deeper representations
    • --n_head 8: More attention heads for richer patterns
    • --n_embd 256: Larger embedding dimension
  2. Sequence Length:

    • --block_size 256: Longer context for better coherence
  3. Training:

    • --max_iters 5000: Longer training for better quality
    • --dropout 0.0: Remove dropout to see overfitting effects
    • --lr 1e-4: Lower learning rate for more stable training
  4. Generation:

    • --temperature 0.5: More focused, less random output
    • --temperature 1.5: More creative, more random output
    • --top_k 10: Limit sampling to the 10 most likely tokens (the sketch after this list shows how temperature and top-k combine)
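
Temperature and top-k interact at sampling time roughly as in the sketch below: the logits are divided by the temperature, optionally truncated to the k largest entries, and the next character is drawn from the resulting distribution. This mirrors the standard GPT sampling loop; the exact code lives in src/generate.py and may differ in detail.

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """logits: (vocab_size,) raw next-character scores from the model."""
    logits = logits / max(temperature, 1e-8)                       # <1.0 sharpens, >1.0 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]                 # value of the k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))   # discard everything below it
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()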

Advanced Features

Weights & Biases Integration

pip install wandb
python -m src.train --use_wandb --project_name my-tiny-transformer

Learning Rate Scheduling

python -m src.train --use_scheduler
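
The README does not say which schedule --use_scheduler enables; a common choice in nanoGPT-style trainers is linear warmup followed by cosine decay, sketched below purely as an assumption (check src/train.py for the real schedule).

import math

def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_iters=100, max_iters=1200):
    """Hypothetical warmup + cosine-decay schedule; all values here are placeholders."""
    if step < warmup_iters:                                        # linear warmup
        return max_lr * (step + 1) / warmup_iters
    progress = min(1.0, (step - warmup_iters) / max(1, max_iters - warmup_iters))
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))             # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)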

Resume Training

python -m src.train --resume

Jupyter Notebook

Open notebooks/0_quick_start.ipynb for an interactive demo that runs in 4 cells.

Project Structure

tiny-transformer-buildml/
├── README.md              # This file
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Project configuration
├── Makefile              # Convenience commands
├── .gitignore            # Git ignore patterns
├── src/
│   ├── data.py           # Character tokenizer and data loading
│   ├── model.py          # Transformer model components
│   ├── train.py          # Training loop and evaluation
│   ├── generate.py       # Text generation
│   ├── config.py         # Configuration management
│   └── utils.py          # Utilities and helpers
├── notebooks/
│   └── 0_quick_start.ipynb  # Interactive demo
├── tests/
│   ├── test_shapes.py    # Tensor shape tests
│   ├── test_masking.py   # Attention masking tests
│   └── test_tokenizer.py # Tokenizer tests
└── checkpoints/          # Saved model checkpoints
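
The character tokenizer in src/data.py is the simplest piece of the pipeline: a bijection between the characters that appear in the training text and integer ids. The sketch below captures the idea; the actual file also handles data loading.

class CharTokenizer:
    """Minimal character-level tokenizer (illustrative sketch of the idea behind src/data.py)."""

    def __init__(self, text):
        chars = sorted(set(text))                  # vocabulary = every distinct character in the corpus
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, s):
        return [self.stoi[ch] for ch in s]         # string -> list of integer ids

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)  # list of ids -> string

# Usage: tok = CharTokenizer(corpus_text); tok.decode(tok.encode("To be")) == "To be"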

Expected Performance

With default settings:

  • Training time: 2-3 minutes on CPU, 30 seconds on GPU
  • Model size: ~80K parameters, ~0.3MB
  • Memory usage: <100MB during training
  • Final loss: ~1.5-2.0 (character-level cross-entropy)

The model generates somewhat coherent Shakespeare-ish text, with some repetition due to the small dataset and model size.
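
To put the final-loss figure in context: character-level cross-entropy is measured in nats per character, so it converts directly into a per-character perplexity of exp(loss). The quick calculation below also compares against a uniform guess over a ~65-symbol character vocabulary, which is an assumption (the typical size for Shakespeare text), not a figure from this project.

import math

for loss in (1.5, 2.0):
    print(f"loss {loss:.1f} nats/char -> perplexity {math.exp(loss):.1f}")   # ~4.5 and ~7.4
# Baseline: guessing uniformly over an assumed ~65-character vocabulary
print(f"uniform baseline loss: {math.log(65):.2f} nats/char")                # ~4.17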

Troubleshooting

CUDA Out of Memory

# Reduce batch size
python -m src.train --batch_size 16

# Reduce model size
python -m src.train --n_embd 64 --block_size 64

CPU Too Slow

# Reduce training iterations for quick demo
python -m src.train --max_iters 200 --eval_interval 50

# Use smaller model
python -m src.train --n_layer 1 --n_head 1 --n_embd 64

Import Errors

Make sure you're in the project root directory and have activated your virtual environment.

How This Maps to Large LLMs

This tiny transformer demonstrates the same core principles used in large language models:

  • Scaling: Large models use the same architecture with more layers (96+ vs 2), more heads (96+ vs 2), and larger embeddings (4096+ vs 128)
  • Data: Large models train on billions of tokens vs our ~2000 characters
  • Training: Large models train for weeks/months vs our minutes
  • Techniques: Same attention mechanism, residual connections, layer norm, and weight tying

Contributing

This is an educational project. Feel free to:

  • Add more text data sources
  • Implement additional generation strategies (beam search, nucleus sampling)
  • Add more comprehensive visualizations
  • Improve documentation and examples

License

MIT License - feel free to use for educational purposes.

Acknowledgments

Inspired by:

  • Andrej Karpathy's nanoGPT
  • "Attention Is All You Need" paper
  • BuildML's transformer tutorial
  • The GPT and GPT-2 papers

Commands Reference

Command        Description
-------        -----------
make setup     Create virtual environment and install dependencies
make train     Train the model with default settings
make sample    Generate text sample from best checkpoint
make test      Run all tests
make format    Format code with ruff (if available)
make clean     Clean up generated files

Happy learning! 🚀
