What happens inside a Vision Transformer when you fine-tune it for a new task?
This project explores how attention patterns change when a pretrained ViT (trained on ImageNet) is fine-tuned for galaxy morphology classification (Galaxy Zoo), as a case study in adapting a transformer to a highly specific domain.
| Metric | Before | After | Change |
|---|---|---|---|
| CLS Attention Entropy | 4.81 | 3.09 | -1.72 (more focused) |
| Top-10 Patch Concentration | 20.8% | 79.4% | +58.6 pp |
After fine-tuning, the model concentrates nearly 80% of its attention on just 10 of its 196 patches; before fine-tuning, attention was spread across the whole image.
Entropy measures how spread out attention is — lower means the model focuses on fewer patches.
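Both summary metrics can be computed from the CLS token's attention row. A minimal sketch (the repo's own metric code may differ in details such as head averaging):

```python
import torch

def cls_attention_entropy(attn: torch.Tensor) -> float:
    """Shannon entropy (nats) of CLS attention over the patches.

    attn: 1-D tensor of CLS->patch attention weights (need not be normalized).
    Lower entropy = attention concentrated on fewer patches.
    """
    p = attn / attn.sum()
    return -(p * torch.log(p + 1e-12)).sum().item()

def top_k_concentration(attn: torch.Tensor, k: int = 10) -> float:
    """Fraction of CLS attention mass on the k most-attended patches."""
    p = attn / attn.sum()
    return p.topk(k).values.sum().item()

# Uniform attention over 196 patches is the extreme "unfocused" case:
# entropy hits its maximum, log(196) ~ 5.28, and the top-10 patches
# hold only 10/196 ~ 5.1% of the mass.
uniform = torch.full((196,), 1.0 / 196)
```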
Layer 11 (the final layer) shows the biggest change: entropy drops from 4.8 to 3.1. Early layers (0-3) barely change — they extract general features that work for any image task.
Locality measures whether attention heads look at nearby patches (local) or distant ones (global).
| Layer | Change |
|---|---|
| 10-11 | More local (focus on nearby patches) |
| 0-1 | Slightly more global |
Late layers learn to focus on spatially coherent regions — like the spiral arms of a galaxy.
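One way to quantify locality — sketched here; `03_spatial_attention.py` may use a different definition — is the mean spatial distance between each query patch and the patches it attends to:

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int = 14) -> float:
    """Attention-weighted average Euclidean distance (in patch units)
    from each query patch to the patches it attends to.

    attn: (N, N) patch-to-patch attention with rows summing to 1,
    where N = grid * grid. Lower values = more local attention.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)           # (N, N) pairwise patch distances
    return (attn * dist).sum(dim=-1).mean().item()

# A head that attends only to its own patch is perfectly local (distance 0);
# uniform attention over the 14x14 grid averages several patches away.
```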
| Split | Samples | Accuracy |
|---|---|---|
| Train | 4,937 | - |
| Validation | 549 | 86.5% (best) |
| Test | 568 | 86.3% |
| Baseline | - | 22.6% |
Fine-tuned for 5 epochs (~18 minutes on Apple M4 Max).
Note on dataset size: We requested 10,000 train + 1,000 test samples, but only ~55% survived filtering. Samples with empty summary fields were excluded.
ViT has 12 attention heads per layer, each learning to look at different things. These heatmaps show where each head focuses.
Before fine-tuning — attention is diffuse:

After fine-tuning — each head focuses on specific regions:

Single-layer attention shows where the model looks at one layer. Rollout accumulates attention through all 12 layers to show total information flow.
After fine-tuning:
- Rollout changes modestly (entropy: 4.85 -> 4.74)
- Last layer changes dramatically (entropy: 4.81 -> 3.09)
This suggests the model learns task-specific focus primarily in the final layers, while earlier layers maintain general-purpose attention patterns.
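A minimal rollout implementation, following Abnar & Zuidema's formulation (head-averaged attention mixed with the identity to account for the residual connection); the repo's `attention_rollout.py` presumably does something similar:

```python
import torch

def attention_rollout(attentions: list[torch.Tensor]) -> torch.Tensor:
    """Accumulate attention across layers (attention rollout).

    attentions: per-layer tensors of shape (heads, T, T), where
    T = 1 CLS token + 196 patches. Returns a (T, T) matrix whose
    row 0 gives total CLS -> patch information flow through all layers.
    """
    T = attentions[0].shape[-1]
    rollout = torch.eye(T)
    for layer_attn in attentions:
        a = layer_attn.mean(dim=0)            # average over heads
        a = 0.5 * a + 0.5 * torch.eye(T)      # mix in the residual stream
        a = a / a.sum(dim=-1, keepdim=True)   # renormalize rows
        rollout = a @ rollout                 # compose with earlier layers
    return rollout
```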
The CLS token is a special token that aggregates information from all patches for the final classification decision.
CLS attention at early (L0), middle (L5), and late (L11) layers — before and after fine-tuning.
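Given the tuple of per-layer attentions returned by a HuggingFace ViT called with `output_attentions=True`, the CLS heatmap at one layer can be extracted like this (a sketch; the function name is ours, not the repo's):

```python
import torch

def cls_attention_map(attentions, layer: int, grid: int = 14) -> torch.Tensor:
    """CLS-to-patch attention at one layer, reshaped to the patch grid.

    attentions: tuple of (batch, heads, T, T) tensors, as returned in
    `model(pixel_values, output_attentions=True).attentions`.
    Returns a (grid, grid) head-averaged heatmap for the first image.
    """
    a = attentions[layer][0].mean(dim=0)   # average heads -> (T, T)
    return a[0, 1:].reshape(grid, grid)    # row 0 = CLS; drop the CLS->CLS entry
```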
```
examples/
  01_manual_attention.py          # Verify attention math
  02_fine_tune_galaxy.py          # Train the model
  03_spatial_attention.py         # Locality analysis
  04_cls_attention.py             # CLS entropy by layer
  05_attention_rollout.py         # Rollout vs last layer
  06_before_after_comparison.py   # Combined analysis
src/
  model_loader.py                 # Load ViT, extract weights
  attention_manual.py             # Manual attention computation
  attention_rollout.py            # Cumulative attention flow
  visualization.py                # Plotting utilities
  vit_circuits.py                 # OV/QK circuit analysis
  dataset_utils.py                # Galaxy Zoo loading
  fine_tuning.py                  # HuggingFace Trainer config
```
```bash
# Install dependencies
uv sync

# Run fine-tuning (skip if using existing model)
uv run python examples/02_fine_tune_galaxy.py

# Run analysis
uv run python examples/06_before_after_comparison.py
```

- Model: `google/vit-base-patch16-224` (86M parameters)
- Dataset: Galaxy Zoo 2 (`mwalmsley/gz2`) — 6 morphology classes
- Hardware: Apple M4 Max (MPS backend)
- Framework: PyTorch + HuggingFace Transformers
- Fine-tuning changes late layers most — early layers learn general features that transfer well
- Attention becomes focused — the model learns what to ignore
- Locality increases — task-relevant features are often spatially coherent