
feat(finetune): add bf16 mixed-precision via amp_dtype config #288

Open
Burton-David wants to merge 1 commit into shiyu-coder:master from Burton-David:feat/bf16-amp

Conversation

@Burton-David

Summary

Adds optional bf16 mixed-precision to train_tokenizer.py and train_predictor.py, gated by a new Config.amp_dtype field.

```python
# finetune/config.py
self.amp_dtype = "bfloat16"   # default is None -> plain FP32
```

torch.autocast(device_type="cuda", dtype=torch.bfloat16, ...) wraps the forward pass and the loss. The backward pass, gradient clipping, and optimizer.step() stay outside autocast. AdamW master weights remain FP32, and bf16 keeps FP32's exponent range, so no GradScaler is needed.
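
For context, a minimal sketch of that placement in a single training step. The names below (model, optimizer, batch, compute_loss) and the clip value are placeholders, not the exact identifiers in train_predictor.py:

```python
import torch

# Sketch only: model, optimizer, batch, and compute_loss are placeholders,
# not the exact identifiers used in train_predictor.py.
def train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)

    # Forward pass and loss run under bf16 autocast.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model(batch)
        loss = compute_loss(outputs, batch)

    # Backward, gradient clipping, and the optimizer step stay outside
    # autocast; AdamW's FP32 master weights plus bf16's FP32-sized exponent
    # range mean no GradScaler is involved.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()
```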

Numbers

Measured on an RTX 4090 and an A100 80GB, torch 2.4.1+cu124, against NeoQuasar/Kronos-base and NeoQuasar/Kronos-Tokenizer-base. Inputs are SPY 1-min OHLCV bars from yfinance, z-scored per window, seq=512. Timings are the median of 30 timed iterations after 5 warmup iterations.
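
Roughly, the timing protocol looks like the sketch below. This is an illustration of the stated protocol (5 warmup iterations, median of 30 timed iterations, peak memory from CUDA stats), not the exact benchmark script; step_fn stands in for one full training step at a fixed batch size:

```python
import statistics
import torch

# Sketch of the measurement protocol, not the exact benchmark harness.
# step_fn stands in for one full training step at a fixed batch size.
def time_step(step_fn, warmup=5, iters=30):
    for _ in range(warmup):
        step_fn()
    torch.cuda.reset_peak_memory_stats()
    times_ms = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return statistics.median(times_ms), peak_gb
```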

Predictor step (tokenize, forward, loss, backward, clip, optimizer.step):

| Batch | 4090 fp32 -> bf16 (ms) | 4090 speedup | A100 fp32 -> bf16 (ms) | A100 speedup |
| --- | --- | --- | --- | --- |
| 10 | 131.6 -> 64.8 | 2.03x | 248.8 -> 68.7 | 3.62x |
| 25 | 322.0 -> 157.1 | 2.05x | 597.6 -> 144.6 | 4.13x |
| 50 | 640.0 -> 331.5 | 1.93x | 1156.2 -> 266.1 | 4.34x |
| 100 | OOM (24 GB) | n/a | 2344.6 -> 508.1 | 4.61x |

Peak memory drops 24-25% at every batch size on both GPUs. On the A100 80GB, bf16 lets B=100 fit at a 32 GB peak, where FP32 needs 43 GB.

Tokenizer step (forward, recon + BSQ loss, backward, clip, optimizer.step):

| Batch | 4090 fp32 -> bf16 (ms) | 4090 speedup | A100 fp32 -> bf16 (ms) | A100 speedup |
| --- | --- | --- | --- | --- |
| 10 | 21.9 -> 26.0 | 0.84x | 26.7 -> 19.8 | 1.35x |
| 25 | 37.6 -> 26.3 | 1.43x | 64.9 -> 24.2 | 2.69x |
| 50 | 73.7 -> 35.5 | 2.08x | 114.7 -> 45.0 | 2.55x |
| 100 | 157.7 -> 85.4 | 1.85x | 199.9 -> 82.8 | 2.41x |

At B=10 on the 4090, the tokenizer runs slower in bf16. The tokenizer is small (5.5M params), and at that batch size the per-step autocast overhead outweighs the compute savings on Ada; from B=25 upward bf16 wins on both GPUs. Tokenizer peak memory drops 31-35% across the sweep.

Single-step loss deviation vs FP32, from identical pretrained weights and identical inputs: predictor 1.3-1.7% relative, tokenizer 0.8-1.2% relative.
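
That check amounts to running one identical step in both modes from the same checkpoint and batch, then comparing the losses. A sketch, where run_single_step is a placeholder rather than a helper added by this PR:

```python
# Sketch of the parity check: one step in FP32 and one in bf16 from the
# same pretrained weights and the same batch, then compare losses.
# run_single_step is a placeholder, not a function from this PR.
loss_fp32 = run_single_step(amp_dtype=None)
loss_bf16 = run_single_step(amp_dtype="bfloat16")
rel_diff = abs(loss_bf16 - loss_fp32) / abs(loss_fp32)
print(f"relative loss difference: {rel_diff:.2%}")
```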

Compatibility

amp_dtype = None is the default and keeps the training loop bit-exact with current behavior. Opt in via Config.amp_dtype = "bfloat16".
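
The gating itself is cheap to express. A sketch of the idea, assuming a no-op contextlib.nullcontext when amp_dtype is None; the helper name autocast_ctx is illustrative:

```python
import contextlib
import torch

# Sketch of the opt-in gating: amp_dtype=None falls back to a no-op context,
# leaving the existing FP32 path untouched.
def autocast_ctx(config):
    if config.amp_dtype is None:
        return contextlib.nullcontext()
    return torch.autocast(device_type="cuda",
                          dtype=getattr(torch, config.amp_dtype))

# Usage inside the training loop:
# with autocast_ctx(config):
#     outputs = model(batch)
#     loss = compute_loss(outputs, batch)
```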

Wraps the forward + loss in torch.autocast for both the tokenizer and predictor training loops, gated by a new Config.amp_dtype field. Setting "bfloat16" enables bf16 autocast; None (default) keeps the existing FP32 path bit-exact. bf16 has the same exponent range as FP32, so AdamW master weights need no scaling and no GradScaler is wired in.