Curated subset of #90 + K-side norm accounting fix#91
Open
TheTom wants to merge 6 commits into
Open
Conversation
Subset of @brosequist's #90 commit 0fd5de9 — keeping the actual fixes, deferring the streaming + serialization API surface until a production caller exists. Included: - KVCacheCompressor.memory_stats() was omitting the float32 norm stored per V vector, inflating reported compression ratio. Adds v_bits_total += n_vectors * 32. - TurboQuantMSE.compressed_size_bits() — was missing (TurboQuant already had it). - Replaces seed + 1000 magic offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between PolarQuant and QJL stages, and between K and V quantizers. Deferred (not in this commit): - compress_token() / get_compressed_cache() streaming API - CompressedVector.to_bytes() / from_bytes() binary serialization - CompressedKVCache.save() / load() npz serialization
…uant The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x), batch vs single-vector consistency, and energy distribution uniformity after rotation. All three property tests were previously untested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules, ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml) that runs ruff check on every push and pull request. Replaces ad-hoc style discussions with an enforced, zero-config lint gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The lint workflow added in 46efe26 ran 'ruff check .' against the whole repo and failed immediately because the existing codebase has 233 pre-existing ruff violations (78 F401 unused imports, 68 I001 import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.) across benchmarks/ and scripts/. Adding a CI gate that the legacy code doesn't pass is unhelpful, so remove .github/workflows/lint.yml. Keep the [tool.ruff] block in pyproject.toml as opt-in documentation: anyone running 'ruff check' locally still gets the configured rules, and the workflow can be re-enabled later once the legacy violations are addressed (most are auto-fixable via 'ruff check --fix' across 187 of the 233).
TurboQuant.CompressedVector stores TWO float32 norms per vector (vector_norms = ||x||_2, residual_norms = ||residual||_2), but compressed_size_bits and KVCacheCompressor.memory_stats only counted one (32 bits instead of 64). Pre-existing on main, parallel to the V-side undercount fixed in the previous commit. V uses TurboQuantMSE which stores a single norm — 32 is correct there. K uses full TurboQuant which stores two norms. Effect: K compressed size was understated by 32 bits per vector, inflating reported compression ratio. With d=128 b=3 the TurboQuant ratio drops from 4.92× → 4.57× (true value), and the combined KV ratio at d=128 k=v=3 drops from ~2.46× → ~2.37×. No quantization-output changes, accounting only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
brosequist
pushed a commit
to brosequist/turboquant_plus
that referenced
this pull request
May 28, 2026
…JL fix Pulls in 3 upstream commits since merge-base 1224fef: - c46f6b9 docs(papers): block-selector sparse attention WIP log - 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93) - 280b466 README: mark QJL as reference-only Clean auto-merge. Only file touched by both sides was turboquant.py; upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots in alongside our V-norm/MSE accounting fix without conflict. Our fork-local commits retained: V-norm in memory_stats, SeedSequence PRNG, MSE compressed_size_bits, QJL regression test, rotation tests, ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc. PR TheTom#91 (ship/pr-90-curated) — TheTom's curated cherry-pick of 5 of these — remains open; once it merges to upstream/main we'll want to rebase/reset to drop redundant commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist
pushed a commit
to brosequist/turboquant_plus
that referenced
this pull request
May 28, 2026
…annel split" This reverts 3215eb3. Dropping per fork-curation review: - TheTom deferred this from PR TheTom#91 because OutlierTurboQuant is on a deprecated path (turboquant-plus-experiments.md — outlier channeling loses to WHT rotation, which drives kurtosis 8-50 → 2.9). - Our llama-cpp-turboquant implementation never wired in OutlierTurboQuant — production kernels (q8_0-turbo3_0, turbo4_0-turbo2_0) use the standard TurboQuant / TurboQuantMSE path. Zero consumers downstream. Keeping the commit hash on record in docs/fork-notes/upstream-pr-90-status.md for credit; the code itself has no reason to survive future upstream churn on a dead module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist
pushed a commit
to brosequist/turboquant_plus
that referenced
this pull request
May 28, 2026
Upstream renamed pyproject.toml to package refract-llm only; the [dev] extra was reduced to pytest/coverage/build/twine. The in-repo `turboquant` Python package (dev-only, not shipped) still depends on numpy + scipy, so `python -m pytest tests/` fails at import in CI with ModuleNotFoundError. Adding numpy>=1.21 and scipy>=1.7 to [dev] keeps `pip install -e ".[dev]"` self-contained for contributors and gets the GitHub Actions matrix green. Fork-local for now; candidate upstream PR after TheTom#91 merges, since the same failure is hitting `TheTom/turboquant_plus` main (every run since 2026-05-09 fails the same way). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist
pushed a commit
to brosequist/turboquant_plus
that referenced
this pull request
May 28, 2026
… models" This reverts f74cb67. Dropping per fork-curation review + live validation on 2026-05-28: - TheTom deferred this from PR TheTom#91 because the int8-overflow root-cause story contradicts asymmetric-kv-compression.md:218 (large K norms compress *better* post-normalization, not worse). - Validation on llama-cpp-turboquant b1-5aeb2fd (the production HIP build on rog-mega-pc) ran the exact failing config -ctk q8_0 -ctv turbo3 on Qwen2.5-7B-Instruct Q4_K_M across 5 prompts (smoke through 3700-token near-capacity, both temp=0 and temp=0.9 sampling) on both Navi44 (gfx1200) and Navi48 (gfx1201) GPUs. Results byte-identical between cards and coherent in all cases. No NaN, no gibberish. Either the bug was fixed in the codebase since the doc was written or it was always specific to AMD hardware we don't run (MI300X gfx942 likely). Leaving the warning in place would mislead RDNA4 users about a config that works fine for them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Selectively merging from @brosequist's #90 — keeping the actual fixes, deferring API surface, plus an additional K-side fix that #90 missed.
Thanks @brosequist for the bundle — review surface was much easier this way. Re-bundling further per the curation below.
What's included
b75813bmemory_stats, SeedSequence PRNG,TurboQuantMSE.compressed_size_bits1074625test_turboquant_improves_over_polarquantf23570e3e375720ca5bcc8afc4bfBrett's first commit was split — credit preserved
The original
0fd5de9bundle (5 changes) was cherry-picked with the streaming + serialization API deferred. What's kept:KVCacheCompressor.memory_stats()TurboQuantMSE.compressed_size_bits()(was missing;TurboQuantalready had it)SeedSequence.spawn(2)replacingseed + 1000magic offsetWhat's deferred (no caller yet, want to design for the production integration):
KVCacheCompressor.compress_token()/get_compressed_cache()streaming APICompressedVector.to_bytes()/from_bytes()binary serializationCompressedKVCache.save()/load()npz serializationThe split commit retains @brosequist as author.
What's added
8afc4bfis a parallel fix to the V-norm fix inb75813b.TurboQuant.CompressedVectorstores two float32 norms (vector_norms = ||x||_2andresidual_norms = ||residual||_2), butTurboQuant.compressed_size_bitsandKVCacheCompressor.memory_statsonly counted one. V usesTurboQuantMSE(single norm — 32 is correct). K uses fullTurboQuant(two norms — 64 is correct).Numerical effect (verified live):
TurboQuant(d=128, b=3).compression_ratio(): 4.92× → 4.57× (true)KVCacheCompressor(d=128, k=3, v=3).memory_stats(...)['compression_ratio']: ~2.46× → 2.37× (true)No quantization-output changes — accounting only.
What's not included from #90
feat: add calibrate() to OutlierTurboQuant—OutlierTurboQuantis a deprecated path perdocs/turboquant-plus-experiments.md("Outlier channeling doesn't work… kurtosis stays 8-50… WHT rotation gets it to 2.9"). Calibrate code is well-written, just on a dead module.docs: HIP/AMD NaN warning— root-cause story ("large K norms → NaN") contradictsdocs/papers/asymmetric-kv-compression.md:218, which finds extreme K norms compress better (more Gaussian after normalization). Real cause is HIP-kernel-specific. Will revisit after kernel triage.Consistency with
docs/papers/why-mse-fails-for-kv-quantization.mdThe new MSE paper argues MSE is a broken proxy for K cache quantization in deployment because attention is non-linear and sparse. Brett's QJL regression-guard test (
1074625) measures inner-product distortion on synthetic Gaussian pairs (d=256) — the linear-operator regime where the new paper explicitly says IP/MSE does proxy quality (alongside RaBitQ-style top-k IP search). The test is a regression-guard for IP distortion only; the production decision to drop QJL is justified separately bydocs/papers/turbo4-resurrection.md's PPL ablation. No conflict.Test plan
pytest tests/ refract/tests/— 982 passed, 1 skipped (7 fewer than Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs #90 baseline because streaming/serialization tests deferred with their code)TurboQuant.compression_ratio()andKVCacheCompressor.memory_stats()numerics against expected values🤖 Generated with Claude Code