
feat(finetune): add bf16 mixed-precision via amp_dtype config #288

Open
Burton-David wants to merge 1 commit into shiyu-coder:master from Burton-David:feat/bf16-amp

Conversation

@Burton-David

Summary

Adds optional bf16 mixed-precision to train_tokenizer.py and train_predictor.py, gated by a new Config.amp_dtype field.

```python
# finetune/config.py
self.amp_dtype = "bfloat16"   # default is None -> plain FP32
```

torch.autocast(device_type="cuda", dtype=torch.bfloat16, ...) wraps the forward pass and the loss. The backward pass, gradient clipping, and optimizer.step() stay outside autocast. AdamW master weights remain FP32, and bf16 keeps FP32's exponent range, so no GradScaler is needed.
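
For context, a minimal sketch of that placement in a single training step. The names below (model, optimizer, batch, compute_loss) and the clip value are placeholders, not the exact identifiers in train_predictor.py:

```python
import torch

# Sketch only: model, optimizer, batch, and compute_loss are placeholders,
# not the exact identifiers used in train_predictor.py.
def train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)

    # Forward pass and loss run under bf16 autocast.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model(batch)
        loss = compute_loss(outputs, batch)

    # Backward, gradient clipping, and the optimizer step stay outside
    # autocast; AdamW's FP32 master weights plus bf16's FP32-sized exponent
    # range mean no GradScaler is involved.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()
```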

Numbers

Measured on an RTX 4090 and an A100 80GB, torch 2.4.1+cu124, against NeoQuasar/Kronos-base and NeoQuasar/Kronos-Tokenizer-base. Inputs are SPY 1-min OHLCV bars from yfinance, z-scored per window, seq=512. Timings are the median of 30 timed iterations after 5 warmup iterations.
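
Roughly, the timing protocol looks like the sketch below. This is an illustration of the stated protocol (5 warmup iterations, median of 30 timed iterations, peak memory from CUDA stats), not the exact benchmark script; step_fn stands in for one full training step at a fixed batch size:

```python
import statistics
import torch

# Sketch of the measurement protocol, not the exact benchmark harness.
# step_fn stands in for one full training step at a fixed batch size.
def time_step(step_fn, warmup=5, iters=30):
    for _ in range(warmup):
        step_fn()
    torch.cuda.reset_peak_memory_stats()
    times_ms = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return statistics.median(times_ms), peak_gb
```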

Predictor step (tokenize, forward, loss, backward, clip, optimizer.step):

| Batch | 4090 fp32 -> bf16 (ms) | 4090 speedup | A100 fp32 -> bf16 (ms) | A100 speedup |
| --- | --- | --- | --- | --- |
| 10 | 131.6 -> 64.8 | 2.03x | 248.8 -> 68.7 | 3.62x |
| 25 | 322.0 -> 157.1 | 2.05x | 597.6 -> 144.6 | 4.13x |
| 50 | 640.0 -> 331.5 | 1.93x | 1156.2 -> 266.1 | 4.34x |
| 100 | OOM (24 GB) | n/a | 2344.6 -> 508.1 | 4.61x |

Peak memory drops 24-25% at every batch size on both GPUs. On the A100 80GB, bf16 lets B=100 fit at a 32 GB peak, where FP32 needs 43 GB.

Tokenizer step (forward, recon + BSQ loss, backward, clip, optimizer.step):

| Batch | 4090 fp32 -> bf16 (ms) | 4090 speedup | A100 fp32 -> bf16 (ms) | A100 speedup |
| --- | --- | --- | --- | --- |
| 10 | 21.9 -> 26.0 | 0.84x | 26.7 -> 19.8 | 1.35x |
| 25 | 37.6 -> 26.3 | 1.43x | 64.9 -> 24.2 | 2.69x |
| 50 | 73.7 -> 35.5 | 2.08x | 114.7 -> 45.0 | 2.55x |
| 100 | 157.7 -> 85.4 | 1.85x | 199.9 -> 82.8 | 2.41x |

At B=10 on the 4090, the tokenizer runs slower in bf16. The tokenizer is small (5.5M params), and at that batch size the per-step autocast overhead outweighs the compute savings on Ada; from B=25 upward bf16 wins on both GPUs. Tokenizer peak memory drops 31-35% across the sweep.

Single-step loss deviation vs FP32, from identical pretrained weights and identical inputs: predictor 1.3-1.7% relative, tokenizer 0.8-1.2% relative.
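
That check amounts to running one identical step in both modes from the same checkpoint and batch, then comparing the losses. A sketch, where run_single_step is a placeholder rather than a helper added by this PR:

```python
# Sketch of the parity check: one step in FP32 and one in bf16 from the
# same pretrained weights and the same batch, then compare losses.
# run_single_step is a placeholder, not a function from this PR.
loss_fp32 = run_single_step(amp_dtype=None)
loss_bf16 = run_single_step(amp_dtype="bfloat16")
rel_diff = abs(loss_bf16 - loss_fp32) / abs(loss_fp32)
print(f"relative loss difference: {rel_diff:.2%}")
```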

Compatibility

amp_dtype = None is the default and keeps the training loop bit-exact with current behavior. Opt in via Config.amp_dtype = "bfloat16".
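
The gating itself is cheap to express. A sketch of the idea, assuming a no-op contextlib.nullcontext when amp_dtype is None; the helper name autocast_ctx is illustrative:

```python
import contextlib
import torch

# Sketch of the opt-in gating: amp_dtype=None falls back to a no-op context,
# leaving the existing FP32 path untouched.
def autocast_ctx(config):
    if config.amp_dtype is None:
        return contextlib.nullcontext()
    return torch.autocast(device_type="cuda",
                          dtype=getattr(torch, config.amp_dtype))

# Usage inside the training loop:
# with autocast_ctx(config):
#     outputs = model(batch)
#     loss = compute_loss(outputs, batch)
```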

Wraps the forward + loss in torch.autocast for both the tokenizer and predictor training loops, gated by a new Config.amp_dtype field. Setting "bfloat16" enables bf16 autocast; None (default) keeps the existing FP32 path bit-exact. bf16 has the same exponent range as FP32, so AdamW master weights need no scaling and no GradScaler is wired in.