What happens inside a Vision Transformer when you fine-tune it for a new task?
This project explores how attention patterns change when a pretrained ViT (trained on ImageNet) is fine-tuned for galaxy morphology classification (Galaxy Zoo), as a case study in adapting a transformer to a highly specific domain.
| Metric | Before | After | Change |
|---|---|---|---|
| CLS Attention Entropy | 4.81 | 3.09 | -1.72 (more focused) |
| Top-10 Patch Concentration | 20.8% | 79.4% | +58.6 pp |
After fine-tuning, the model concentrates nearly 80% of its attention on just 10 of its 196 patches; before fine-tuning, attention was spread across the whole image.
Entropy measures how spread out attention is — lower means the model focuses on fewer patches.
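Both summary metrics can be computed from the CLS token's attention row. A minimal sketch (the repo's own metric code may differ in details such as head averaging):

```python
import torch

def cls_attention_entropy(attn: torch.Tensor) -> float:
    """Shannon entropy (nats) of CLS attention over the patches.

    attn: 1-D tensor of CLS->patch attention weights (need not be normalized).
    Lower entropy = attention concentrated on fewer patches.
    """
    p = attn / attn.sum()
    return -(p * torch.log(p + 1e-12)).sum().item()

def top_k_concentration(attn: torch.Tensor, k: int = 10) -> float:
    """Fraction of CLS attention mass on the k most-attended patches."""
    p = attn / attn.sum()
    return p.topk(k).values.sum().item()

# Uniform attention over 196 patches is the extreme "unfocused" case:
# entropy hits its maximum, log(196) ~ 5.28, and the top-10 patches
# hold only 10/196 ~ 5.1% of the mass.
uniform = torch.full((196,), 1.0 / 196)
```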
Layer 11 (the final layer) shows the biggest change: entropy drops from 4.8 to 3.1. Early layers (0-3) barely change — they extract general features that work for any image task.
Locality measures whether attention heads look at nearby patches (local) or distant ones (global).
| Layer | Change |
|---|---|
| 10-11 | More local (focus on nearby patches) |
| 0-1 | Slightly more global |
Late layers learn to focus on spatially coherent regions — like the spiral arms of a galaxy.
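One way to quantify locality — sketched here; `03_spatial_attention.py` may use a different definition — is the mean spatial distance between each query patch and the patches it attends to:

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int = 14) -> float:
    """Attention-weighted average Euclidean distance (in patch units)
    from each query patch to the patches it attends to.

    attn: (N, N) patch-to-patch attention with rows summing to 1,
    where N = grid * grid. Lower values = more local attention.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)           # (N, N) pairwise patch distances
    return (attn * dist).sum(dim=-1).mean().item()

# A head that attends only to its own patch is perfectly local (distance 0);
# uniform attention over the 14x14 grid averages several patches away.
```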
| Split | Samples | Accuracy |
|---|---|---|
| Train | 4,937 | - |
| Validation | 549 | 86.5% (best) |
| Test | 568 | 86.3% |
| Baseline | - | 22.6% |
Fine-tuned for 5 epochs (~18 minutes on Apple M4 Max).
Note on dataset size: We requested 10,000 train + 1,000 test samples, but only ~55% survived filtering. Samples with empty summary fields were excluded.
ViT has 12 attention heads per layer, each learning to look at different things. These heatmaps show where each head focuses.
Before fine-tuning — attention is diffuse:

After fine-tuning — each head focuses on specific regions:

Single-layer attention shows where the model looks at one layer. Rollout accumulates attention through all 12 layers to show total information flow.
After fine-tuning:
- Rollout changes modestly (entropy: 4.85 -> 4.74)
- Last layer changes dramatically (entropy: 4.81 -> 3.09)
This suggests the model learns task-specific focus primarily in the final layers, while earlier layers maintain general-purpose attention patterns.
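A minimal rollout implementation, following Abnar & Zuidema's formulation (head-averaged attention mixed with the identity to account for the residual connection); the repo's `attention_rollout.py` presumably does something similar:

```python
import torch

def attention_rollout(attentions: list[torch.Tensor]) -> torch.Tensor:
    """Accumulate attention across layers (attention rollout).

    attentions: per-layer tensors of shape (heads, T, T), where
    T = 1 CLS token + 196 patches. Returns a (T, T) matrix whose
    row 0 gives total CLS -> patch information flow through all layers.
    """
    T = attentions[0].shape[-1]
    rollout = torch.eye(T)
    for layer_attn in attentions:
        a = layer_attn.mean(dim=0)            # average over heads
        a = 0.5 * a + 0.5 * torch.eye(T)      # mix in the residual stream
        a = a / a.sum(dim=-1, keepdim=True)   # renormalize rows
        rollout = a @ rollout                 # compose with earlier layers
    return rollout
```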
The CLS token is a special token that aggregates information from all patches for the final classification decision.
CLS attention at early (L0), middle (L5), and late (L11) layers — before and after fine-tuning.
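Given the tuple of per-layer attentions returned by a HuggingFace ViT called with `output_attentions=True`, the CLS heatmap at one layer can be extracted like this (a sketch; the function name is ours, not the repo's):

```python
import torch

def cls_attention_map(attentions, layer: int, grid: int = 14) -> torch.Tensor:
    """CLS-to-patch attention at one layer, reshaped to the patch grid.

    attentions: tuple of (batch, heads, T, T) tensors, as returned in
    `model(pixel_values, output_attentions=True).attentions`.
    Returns a (grid, grid) head-averaged heatmap for the first image.
    """
    a = attentions[layer][0].mean(dim=0)   # average heads -> (T, T)
    return a[0, 1:].reshape(grid, grid)    # row 0 = CLS; drop the CLS->CLS entry
```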
```
examples/
  01_manual_attention.py          # Verify attention math
  02_fine_tune_galaxy.py          # Train the model
  03_spatial_attention.py         # Locality analysis
  04_cls_attention.py             # CLS entropy by layer
  05_attention_rollout.py         # Rollout vs last layer
  06_before_after_comparison.py   # Combined analysis
src/
  model_loader.py                 # Load ViT, extract weights
  attention_manual.py             # Manual attention computation
  attention_rollout.py            # Cumulative attention flow
  visualization.py                # Plotting utilities
  vit_circuits.py                 # OV/QK circuit analysis
  dataset_utils.py                # Galaxy Zoo loading
  fine_tuning.py                  # HuggingFace Trainer config
```
```bash
# Install dependencies
uv sync

# Run fine-tuning (skip if using existing model)
uv run python examples/02_fine_tune_galaxy.py

# Run analysis
uv run python examples/06_before_after_comparison.py
```

- Model: `google/vit-base-patch16-224` (86M parameters)
- Dataset: Galaxy Zoo 2 (`mwalmsley/gz2`) — 6 morphology classes
- Hardware: Apple M4 Max (MPS backend)
- Framework: PyTorch + HuggingFace Transformers
- Fine-tuning changes late layers most — early layers learn general features that transfer well
- Attention becomes focused — the model learns what to ignore
- Locality increases — task-relevant features are often spatially coherent