inkybubble/mi_02_vit_attention_patterns

# ViT Attention Patterns: Before & After Fine-Tuning

What happens inside a Vision Transformer when you fine-tune it for a new task?

This project explores how attention patterns change when a pretrained ViT (trained on ImageNet) is fine-tuned for galaxy morphology classification (Galaxy Zoo), as a case study in adapting a transformer to a highly specific domain.

## Before/After Comparison


## Key Findings

### The model learns to focus

| Metric | Before | After | Change |
|---|---|---|---|
| CLS attention entropy | 4.81 | 3.09 | −1.72 (more focused) |
| Top-10 patch concentration | 20.8% | 79.4% | +58.6 pp |

After fine-tuning, the model concentrates nearly 80% of its attention on just 10 of the 196 patches. Before, attention was spread across the whole image.
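Both metrics can be computed directly from a CLS-to-patch attention vector. This is a minimal sketch with hypothetical helper names (not the repo's actual code), assuming the attention weights over the 196 patches are available as a NumPy array:

```python
import numpy as np

def cls_attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy (in nats) of a CLS-to-patch attention distribution.

    Lower entropy means attention mass is concentrated on fewer patches.
    """
    p = attn / attn.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def top_k_concentration(attn: np.ndarray, k: int = 10) -> float:
    """Fraction of total attention mass held by the k most-attended patches."""
    p = attn / attn.sum()
    return float(np.sort(p)[-k:].sum())

# Uniform attention over 196 patches: entropy is log(196) ~ 5.28,
# and the top 10 patches hold only 10/196 ~ 5.1% of the mass.
uniform = np.ones(196)

# Focused attention: 10 patches dominate, so entropy drops and
# top-10 concentration rises toward the ~80% seen after fine-tuning.
focused = np.ones(196)
focused[:10] = 100.0
```

Uniform attention gives the maximum possible entropy, log(196) ≈ 5.28, so the observed 4.81 → 3.09 drop is a move from near-uniform to sharply peaked.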

### Late layers change the most

*(figure: CLS entropy by layer)*

Entropy measures how spread out attention is — lower means the model focuses on fewer patches.

Layer 11 (the final layer) shows the biggest change: entropy drops from 4.8 to 3.1. Early layers (0-3) barely change — they extract general features that work for any image task.

### Attention becomes more local

*(figure: locality scores by layer)*

Locality measures whether attention heads look at nearby patches (local) or distant ones (global).

| Layers | Change |
|---|---|
| 10–11 | More local (focus on nearby patches) |
| 0–1 | Slightly more global |

Late layers learn to focus on spatially coherent regions — like the spiral arms of a galaxy.
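One simple way to quantify locality, sketched here with an assumed helper name rather than the project's actual metric, is the attention-weighted average distance between query and key patches on the 14×14 patch grid:

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray, grid: int = 14) -> float:
    """Attention-weighted mean Euclidean distance (in patch units) between
    each query patch and the patches it attends to. Lower = more local.

    `attn` is a (n_patches, n_patches) patch-to-patch attention matrix,
    with n_patches = grid * grid (CLS token already removed).
    """
    n = grid * grid
    # (n, 2) array of (row, col) grid coordinates, one row per patch.
    coords = np.indices((grid, grid)).reshape(2, n).T
    # (n, n) matrix of pairwise distances between patch positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    p = attn / attn.sum(axis=-1, keepdims=True)
    return float((p * dist).sum(axis=-1).mean())
```

Under this definition, a head that attends only to each patch's own position scores 0, while a head spreading attention uniformly over the image scores near the grid's mean pairwise distance (~6.8 patches for a 14×14 grid).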


## Training Results

*(figure: training curves)*

| Split | Samples | Accuracy |
|---|---|---|
| Train | 4,937 | — |
| Validation | 549 | 86.5% (best) |
| Test | 568 | 86.3% |
| Baseline | — | 22.6% |

Fine-tuned for 5 epochs (~18 minutes on Apple M4 Max).

Note on dataset size: We requested 10,000 train + 1,000 test samples, but only ~55% survived filtering. Samples with empty summary fields were excluded.
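For reference, a HuggingFace Trainer setup consistent with the run above might look like the following. This is an illustrative sketch, not the repo's `fine_tuning.py`: the output path, batch size, and learning rate are assumptions, and the MPS backend is picked up automatically by recent PyTorch/Transformers versions.

```python
from transformers import TrainingArguments

# Illustrative configuration (assumed values, not the repo's actual ones):
# 5 epochs with per-epoch evaluation, keeping the checkpoint with the best
# validation accuracy, matching the "86.5% (best)" row above.
args = TrainingArguments(
    output_dir="checkpoints/vit-galaxy",  # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=32,       # assumed batch size
    learning_rate=5e-5,                   # assumed learning rate
    eval_strategy="epoch",                # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```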


## Visualizations

### Per-Head Attention (Layer 11)

ViT-Base has 12 attention heads per layer, each of which can learn to attend to different image features. These heatmaps show where each head focuses.

Before fine-tuning, attention is diffuse: *(figure: per-head attention, before)*

After fine-tuning, each head focuses on specific regions: *(figure: per-head attention, after)*

### Rollout vs Single-Layer Attention

Single-layer attention shows where the model looks at one layer. Rollout accumulates attention through all 12 layers to show total information flow.
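Rollout (Abnar & Zuidema, 2020) can be sketched as follows: average attention over heads, mix in the identity matrix to account for the residual connection, renormalize, and matrix-multiply through the layers. Function and variable names here are illustrative, not the repo's `attention_rollout.py`:

```python
import numpy as np

def attention_rollout(attns: list) -> np.ndarray:
    """Cumulative attention flow through the network.

    `attns` holds one (heads, tokens, tokens) attention matrix per layer,
    each row summing to 1 (softmax output).
    """
    n = attns[0].shape[-1]
    rollout = np.eye(n)
    for a in attns:
        a = a.mean(axis=0)               # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)    # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout            # accumulate layer by layer
    return rollout
```

Because every factor is row-stochastic, the rollout matrix is too: each row is still a probability distribution over input tokens, now describing total (multi-hop) information flow rather than a single layer's attention.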

*(figure: rollout comparison)*

After fine-tuning:

- Rollout changes modestly (entropy: 4.85 -> 4.74)
- The last layer changes dramatically (entropy: 4.81 -> 3.09)

This suggests the model learns task-specific focus primarily in the final layers, while earlier layers maintain general-purpose attention patterns.

### CLS Attention Across Layers

The CLS token is a special token that aggregates information from all patches for the final classification decision.

*(figure: CLS spatial attention at layers 0, 5, and 11)*

CLS attention at early (L0), middle (L5), and late (L11) layers — before and after fine-tuning.


## Project Structure

```
examples/
  01_manual_attention.py         # Verify attention math
  02_fine_tune_galaxy.py         # Train the model
  03_spatial_attention.py        # Locality analysis
  04_cls_attention.py            # CLS entropy by layer
  05_attention_rollout.py        # Rollout vs last layer
  06_before_after_comparison.py  # Combined analysis

src/
  model_loader.py                # Load ViT, extract weights
  attention_manual.py            # Manual attention computation
  attention_rollout.py           # Cumulative attention flow
  visualization.py               # Plotting utilities
  vit_circuits.py                # OV/QK circuit analysis
  dataset_utils.py               # Galaxy Zoo loading
  fine_tuning.py                 # HuggingFace Trainer config
```

## Quick Start

```bash
# Install dependencies
uv sync

# Run fine-tuning (skip if using existing model)
uv run python examples/02_fine_tune_galaxy.py

# Run analysis
uv run python examples/06_before_after_comparison.py
```

## Technical Details

- **Model:** google/vit-base-patch16-224 (86M parameters)
- **Dataset:** Galaxy Zoo 2 (mwalmsley/gz2) — 6 morphology classes
- **Hardware:** Apple M4 Max (MPS backend)
- **Framework:** PyTorch + HuggingFace Transformers

## What I Learned

1. Fine-tuning changes late layers most — early layers learn general features that transfer well.
2. Attention becomes focused — the model learns what to ignore.
3. Locality increases — task-relevant features are often spatially coherent.
