Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs by brosequist · Pull Request #90 · TheTom/turboquant_plus

brosequist · 2026-05-09T08:44:41Z

Hi @TheTom — thanks for the friendly note on #61 back in April. I'd left the original six PRs (#61, #62, #63, #64, #65, #66) sitting open for a few weeks and decided to close them today and rebundle here as a single PR, hoping the smaller review surface helps you triage when you have time. All six commits still apply cleanly against main (zero rebase needed) and are preserved as separate commits in this branch so the individual rationale and git blame story stay intact.

If you'd rather see them re-opened individually, closing this PR is also fine; happy to follow whatever workflow works for you.

What's in this PR (6 commits, original PR refs in parens)

1. `fix: V-norm in memory_stats, SeedSequence PRNG, streaming API, serialization` (was #61)

KVCacheCompressor.memory_stats() was omitting the 32-bit float norm stored per V vector, inflating the reported compression ratio. Adds v_bits_total += n_vectors * 32.
Adds compressed_size_bits() to TurboQuantMSE (was missing; TurboQuant already had it).
Replaces seed + 1000 offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between the PolarQuant and QJL stages.
Adds compress_token() / get_compressed_cache() streaming API to KVCacheCompressor for auto-regressive token-by-token inference.
Adds CompressedVector.to_bytes() / from_bytes() for disk / network serialisation.
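To make the V-norm accounting point concrete, here is a minimal sketch of the arithmetic. It is not the repository's `memory_stats()` code: the helper names, the fp16 baseline, the head dimension of 128 and the 3 bits per coordinate are all assumptions chosen for illustration. Only the `n_vectors * 32` term comes from the change described above.

```python
def v_cache_bits(n_vectors: int, head_dim: int, bits_per_coord: int,
                 include_norm: bool = True) -> int:
    """Bits used by a compressed V cache (hypothetical helper, not the repo API)."""
    bits = n_vectors * head_dim * bits_per_coord  # quantized coordinates
    if include_norm:
        bits += n_vectors * 32  # one float32 norm stored per V vector
    return bits


def compression_ratio(n_vectors: int, head_dim: int, bits_per_coord: int,
                      include_norm: bool, baseline_bits: int = 16) -> float:
    """Ratio of an uncompressed cache (assumed fp16) to the compressed one."""
    original = n_vectors * head_dim * baseline_bits
    return original / v_cache_bits(n_vectors, head_dim, bits_per_coord, include_norm)


if __name__ == "__main__":
    # Assumed shape: 1000 vectors, head_dim=128, 3 bits per coordinate.
    without_norm = compression_ratio(1000, 128, 3, include_norm=False)
    with_norm = compression_ratio(1000, 128, 3, include_norm=True)
    print(f"ratio ignoring norms: {without_norm:.2f}")  # about 5.33
    print(f"ratio counting norms: {with_norm:.2f}")     # about 4.92
```

Under these assumed numbers, leaving out the norm overstates the ratio by roughly 8%; the gap grows as the head dimension or bits per coordinate shrink.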

2. `test: document QJL regression in test_turboquant_improves_over_polarquant` (was #62)

The existing test had no assertion — it only print()'d, silently allowing QJL to be worse than PolarQuant. Adds a regression-guard assertion documenting the empirical finding (TQ 2-bit avg ≈ 0.091 vs PQ 2-bit avg ≈ 0.041 inner-product distortion). If QJL is ever fixed to actually improve over PQ, the test will fail loudly and prompt re-evaluation of the production path.
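The shape of such a regression guard can be sketched as follows. This is an illustration of the pattern only, not the test in `tests/test_distortion.py`: the function name is hypothetical, and the two constants are simply the averages quoted above rather than values measured here.

```python
# Averages quoted in the PR description for 2-bit inner-product distortion.
TQ_2BIT_AVG = 0.091  # TurboQuant (PolarQuant + QJL)
PQ_2BIT_AVG = 0.041  # PolarQuant alone


def check_qjl_regression(tq_avg: float, pq_avg: float) -> None:
    """Fail loudly if TurboQuant stops being worse than PolarQuant (hypothetical helper)."""
    assert tq_avg > pq_avg, (
        f"TurboQuant distortion ({tq_avg:.3f}) is no longer worse than "
        f"PolarQuant ({pq_avg:.3f}); re-evaluate the production path."
    )


if __name__ == "__main__":
    check_qjl_regression(TQ_2BIT_AVG, PQ_2BIT_AVG)  # passes while the regression holds
```

The assertion is deliberately written against the current, undesirable behaviour, so a future fix to QJL turns the test red and forces someone to revisit the decision instead of letting the change pass unnoticed.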

3. `test: add correctness and round-trip tests for fast rotation functions` (was #63)

Three property tests for fast_rotate / fast_unrotate (none of which existed previously):
1. Round-trip invertibility — fast_unrotate(fast_rotate(x)) ≈ x
2. Batch consistency — row-by-row equals all-at-once
3. Energy distribution — roughly uniform per-coordinate variance after rotation
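The three properties listed above can be illustrated with a self-contained stand-in. The `fast_rotate` / `fast_unrotate` below are assumptions, not the repository's functions: they implement a random sign flip followed by an orthonormal Walsh-Hadamard transform, which is one common way to build a fast rotation. The energy check uses a single spike vector, a deterministic special case of the "roughly uniform variance" property.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Walsh-Hadamard transform over the last axis (power-of-two length)."""
    x = np.array(x, dtype=np.float64)  # copy so the caller's array is untouched
    d = x.shape[-1]
    if d == 0 or d & (d - 1):
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < d:
        # Split into blocks of size 2h and apply the butterfly to each half.
        blocks = x.reshape(*lead, d // (2 * h), 2, h)
        top, bottom = blocks[..., 0, :], blocks[..., 1, :]
        x = np.stack([top + bottom, top - bottom], axis=-2).reshape(*lead, d)
        h *= 2
    return x / np.sqrt(d)


def fast_rotate(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Stand-in rotation: random sign flip, then Hadamard transform."""
    return fwht(np.asarray(x) * signs)


def fast_unrotate(y: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Inverse of the stand-in: the normalized Hadamard matrix is its own inverse."""
    return fwht(y) * signs


def check_rotation_properties(d: int = 64, batch: int = 8, seed: int = 0) -> None:
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=d)
    x = rng.standard_normal((batch, d))

    # 1. Round-trip invertibility.
    assert np.allclose(fast_unrotate(fast_rotate(x, signs), signs), x)

    # 2. Batch consistency: row-by-row equals all-at-once.
    rows = np.stack([fast_rotate(row, signs) for row in x])
    assert np.allclose(rows, fast_rotate(x, signs))

    # 3. Energy distribution: a single spike is spread evenly over all coordinates.
    spike = np.zeros(d)
    spike[3] = 5.0
    assert np.allclose(fast_rotate(spike, signs) ** 2, 25.0 / d)


if __name__ == "__main__":
    check_rotation_properties()
```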

4. `feat: add calibrate() to OutlierTurboQuant for data-driven channel split` (was #64)

OutlierTurboQuant.calibrate(calibration_vectors) computes per-channel RMS across a calibration set and marks channels whose RMS exceeds 3× the median as outlier channels, updating the compressor's split in place.
Follows the dynamic-threshold approach from the LLM.int8() / SmoothQuant literature.
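The thresholding rule described above can be sketched in a few lines. This is a standalone illustration rather than the `OutlierTurboQuant.calibrate()` implementation: the function name and the `ratio` parameter are assumptions, and the sketch only returns the outlier indices instead of updating a compressor in place.

```python
import numpy as np


def find_outlier_channels(calibration_vectors, ratio: float = 3.0) -> np.ndarray:
    """Return indices of channels whose RMS exceeds `ratio` times the median RMS.

    Hypothetical helper: expects an array of shape (n_vectors, n_channels).
    """
    x = np.asarray(calibration_vectors, dtype=np.float64)
    rms = np.sqrt(np.mean(x ** 2, axis=0))  # per-channel RMS over the calibration set
    threshold = ratio * np.median(rms)
    return np.flatnonzero(rms > threshold)


if __name__ == "__main__":
    data = np.ones((4, 8))
    data[:, 5] = 10.0  # one channel with 10x the magnitude of the rest
    print(find_outlier_channels(data))  # channel 5 is flagged

    # All-inlier edge case: no channel exceeds 3x the median, so nothing is flagged.
    print(find_outlier_channels(np.ones((4, 8))))
```

Using the median rather than the mean keeps the threshold stable when a few channels are very large, since those channels barely move the median.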

5. `chore: add ruff linting to pyproject.toml and CI workflow` (was #65)

[tool.ruff] block in pyproject.toml (line-length=120, E/W/F, ignoring E501/E741).
.github/workflows/lint.yml runs ruff check on push / PR.
Pure tooling — zero behavioural changes.

6. `docs: add HIP/AMD NaN warning for q8_0/turbo3 on large K-norm models` (was #66)

Prominent warning block in docs/turboquant-recommendations.md documenting observed NaN divergence when using q8_0 or turbo3 on models with large K-vector norms (e.g. Qwen2.5-7B) on AMD/ROCm (HIP) backends. Recommends turbo2 / turbo4 or pre-quantisation K-norm clipping.
Likely related to the user-reported symptoms in #86 ("[ROCm] Scale operation fails... during Gemma 4 loading") and #60 ("Numerical instability with Qwen 2.5/3.5 models using turbo4").

Test plan

pytest tests/test_kv_cache.py — covers V-norm accounting, streaming API, serialisation round-trip
pytest tests/test_distortion.py::TestDistortionScaling::test_turboquant_improves_over_polarquant — QJL regression assertion
pytest tests/test_rotation.py — fast-rotation property tests
pytest tests/test_outlier.py — calibrate() plus all-inlier / all-outlier edge cases
ruff check . — passes (and the new GH Actions workflow runs it on every push)
Docs-only changes (#6): nothing to test

🤖 Generated with Claude Code

…essed_size_bits

KVCacheCompressor.memory_stats() omitted the float32 norm stored per V vector, inflating the reported compression ratio. Add v_bits_total += n_vectors * 32 to account for it. Also adds compressed_size_bits() to TurboQuantMSE (was missing; TurboQuant already had it), fixing the asymmetry between the two classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uant

The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x), batch vs single-vector consistency, and energy distribution uniformity after rotation. All three property tests were previously untested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previously the outlier/inlier channel split was set at construction time and never adjusted. calibrate(calibration_vectors) now computes per-channel RMS, flags channels whose RMS exceeds 3× the median as outliers, and updates the split on the compressor — matching the dynamic-threshold approach described in the LLM.int8() and SmoothQuant literature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules, ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml) that runs ruff check on every push and pull request. Replaces ad-hoc style discussions with an enforced, zero-config lint gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a prominent WARNING block to turboquant-recommendations.md documenting the observed NaN divergence when using q8_0 or turbo3 compression on models with large K-vector norms (e.g. Qwen2.5-7B) on AMD/ROCm (HIP) backends. The root cause is the int8 overflow path that differs between HIP and CUDA. Recommended mitigations: switch to turbo2/turbo4 or add pre-quantization K-norm clipping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The lint workflow added in 46efe26 ran 'ruff check .' against the whole repo and failed immediately because the existing codebase has 233 pre-existing ruff violations (78 F401 unused imports, 68 I001 import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.) across benchmarks/ and scripts/. Adding a CI gate that the legacy code doesn't pass is unhelpful, so remove .github/workflows/lint.yml. Keep the [tool.ruff] block in pyproject.toml as opt-in documentation: anyone running 'ruff check' locally still gets the configured rules, and the workflow can be re-enabled later once the legacy violations are addressed (187 of the 233 are auto-fixable via 'ruff check --fix').

@brosequist

Subset of @brosequist's #90 commit 0fd5de9, keeping the actual fixes and deferring the streaming + serialization API surface until a production caller exists.

Included:
- KVCacheCompressor.memory_stats() was omitting the float32 norm stored per V vector, inflating reported compression ratio. Adds v_bits_total += n_vectors * 32.
- TurboQuantMSE.compressed_size_bits(), which was missing (TurboQuant already had it).
- Replaces seed + 1000 magic offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between PolarQuant and QJL stages, and between K and V quantizers.

Deferred (not in this commit):
- compress_token() / get_compressed_cache() streaming API
- CompressedVector.to_bytes() / from_bytes() binary serialization
- CompressedKVCache.save() / load() npz serialization
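The difference between the `seed + 1000` offset and `SeedSequence.spawn` can be shown with plain NumPy. The variable names below are illustrative and the snippet is not taken from the repository; it only demonstrates why an additive offset lets two nominally different seeds share a stream, and how spawned child sequences avoid that.

```python
import numpy as np

# Offset scheme: the second stage of user seed 0 reuses the first stage of user seed 1000.
stage2_of_seed_0 = np.random.default_rng(0 + 1000).standard_normal(4)
stage1_of_seed_1000 = np.random.default_rng(1000).standard_normal(4)
offsets_collide = np.array_equal(stage2_of_seed_0, stage1_of_seed_1000)

# SeedSequence scheme: two child sequences derived from one user seed.
polar_seq, qjl_seq = np.random.SeedSequence(0).spawn(2)
polar_rng = np.random.default_rng(polar_seq)
qjl_rng = np.random.default_rng(qjl_seq)
polar_draw = polar_rng.standard_normal(4)
qjl_draw = qjl_rng.standard_normal(4)

# Spawning is deterministic, so results stay reproducible for a given user seed.
polar_seq_again, _ = np.random.SeedSequence(0).spawn(2)
polar_draw_again = np.random.default_rng(polar_seq_again).standard_normal(4)

if __name__ == "__main__":
    print("offset streams collide:", offsets_collide)
    print("spawned streams differ:", not np.array_equal(polar_draw, qjl_draw))
    print("spawned streams reproducible:", np.array_equal(polar_draw, polar_draw_again))
```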

TheTom · 2026-05-09T15:49:27Z

hey @brosequist, first off, big apology for the delay on these. you opened the originals back in april, i sat on them way too long, and the rebundle made it much easier to review. really appreciate the patience and the diligence on the rebundle work.

i landed a curated subset in #91 with you as author on the cherry-picks. quick rundown:

merging from #90 (you authored, cherry-picked):

✅ V-norm fix in memory_stats + TurboQuantMSE.compressed_size_bits. split out of 0fd5de9 to keep just the fixes. real bug, accounting was off.
✅ SeedSequence PRNG cleanup. also split from 0fd5de9. cleaner than the magic offset.
✅ QJL regression-guard test. locks the production reality from docs/papers/turbo4-resurrection.md. good test.
✅ rotation property tests. pure additive, real coverage.
✅ ruff config (kept config, dropped the workflow per your follow-up).

deferred from #90:

⏸ streaming API (compress_token / get_compressed_cache) and binary/npz serialization (to_bytes / from_bytes / save / load). split out of 0fd5de9. solid code, but i don't have a production caller yet and want to design these for whatever ends up wiring them up. holding until that lands.
⏸ OutlierTurboQuant.calibrate(). the calibrate code itself is clean, but OutlierTurboQuant is a dead path on my end per docs/turboquant-plus-experiments.md (kurtosis stays high after channel removal, WHT rotation handles tails better, WUSH paper confirmed). holding for now.
⏸ HIP/AMD NaN docs. the root-cause story (large K norms causing NaN) actually contradicts what i'm seeing in docs/papers/asymmetric-kv-compression.md, which finds extreme K norms compress better because the post-normalization distribution becomes more Gaussian (boundary layers are ideal for Lloyd-Max). real cause is HIP-kernel-specific. happy to revisit after kernel triage.

also added a parallel K-norm accounting fix on top of yours in #91. compressed_size_bits for TurboQuant K was undercounting too (it stores two norms, not one), so #91 has both sides corrected. you flagged the V side, that pulled my attention to the K side.

thanks again for sticking with this. let me know if anything in the curation feels off, or if you'd like to take another swing at any of the deferred items with the production-caller / kernel context in mind.

brosequist · 2026-05-13T20:52:51Z

Thank you Tom, and no problem with these not being merged right away.

I saw that these weren't reflected on the last main fork and thought they were helpful enough fixes to re-bump, to make sure they weren't accidentally rejected or lost. I'll take a look at what you've raised on the deferred items, including anything I missed the first time through, and see if there's anything else from there.

…JL fix

Pulls in 3 upstream commits since merge-base 1224fef:
- c46f6b9 docs(papers): block-selector sparse attention WIP log
- 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93)
- 280b466 README: mark QJL as reference-only

Clean auto-merge. Only file touched by both sides was turboquant.py; upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots in alongside our V-norm/MSE accounting fix without conflict.

Our fork-local commits retained: V-norm in memory_stats, SeedSequence PRNG, MSE compressed_size_bits, QJL regression test, rotation tests, ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc.

PR TheTom#91 (ship/pr-90-curated), TheTom's curated cherry-pick of 5 of these, remains open; once it merges to upstream/main we'll want to rebase/reset to drop redundant commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…annel split"

This reverts 3215eb3. Dropping per fork-curation review:
- TheTom deferred this from PR TheTom#91 because OutlierTurboQuant is on a deprecated path (turboquant-plus-experiments.md: outlier channeling loses to WHT rotation, which drives kurtosis 8-50 → 2.9).
- Our llama-cpp-turboquant implementation never wired in OutlierTurboQuant; production kernels (q8_0-turbo3_0, turbo4_0-turbo2_0) use the standard TurboQuant / TurboQuantMSE path. Zero consumers downstream.

Keeping the commit hash on record in docs/fork-notes/upstream-pr-90-status.md for credit; the code itself has no reason to survive future upstream churn on a dead module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upstream renamed pyproject.toml to package refract-llm only; the [dev] extra was reduced to pytest/coverage/build/twine. The in-repo `turboquant` Python package (dev-only, not shipped) still depends on numpy + scipy, so `python -m pytest tests/` fails at import in CI with ModuleNotFoundError. Adding numpy>=1.21 and scipy>=1.7 to [dev] keeps `pip install -e ".[dev]"` self-contained for contributors and gets the GitHub Actions matrix green. Fork-local for now; candidate upstream PR after TheTom#91 merges, since the same failure is hitting `TheTom/turboquant_plus` main (every run since 2026-05-09 fails the same way). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… models"

This reverts f74cb67. Dropping per fork-curation review + live validation on 2026-05-28:
- TheTom deferred this from PR TheTom#91 because the int8-overflow root-cause story contradicts asymmetric-kv-compression.md:218 (large K norms compress *better* post-normalization, not worse).
- Validation on llama-cpp-turboquant b1-5aeb2fd (the production HIP build on rog-mega-pc) ran the exact failing config -ctk q8_0 -ctv turbo3 on Qwen2.5-7B-Instruct Q4_K_M across 5 prompts (smoke through 3700-token near-capacity, both temp=0 and temp=0.9 sampling) on both Navi44 (gfx1200) and Navi48 (gfx1201) GPUs. Results byte-identical between cards and coherent in all cases. No NaN, no gibberish.

Either the bug was fixed in the codebase since the doc was written or it was always specific to AMD hardware we don't run (MI300X gfx942 likely). Leaving the warning in place would mislead RDNA4 users about a config that works fine for them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

brett and others added 7 commits May 9, 2026 04:20

TheTom mentioned this pull request May 9, 2026

Curated subset of #90 + K-side norm accounting fix #91

Open

2 tasks

brett and others added 4 commits May 28, 2026 15:26

brosequist wants to merge 11 commits into TheTom:main from brosequist:main
