feat: integrate KVPress for KV cache compression (#366) #623
llcnt merged 11 commits into PrunaAI:main from
Not up to standards ⛔
🔴 Issues
| Category | Results |
|---|---|
| Documentation | 5 minor |
| Security | 5 high |
🟢 Metrics 9 complexity · 0 duplication
Hi @kschwethelm! Thanks for the contribution. This branch has conflicts with |
Force-pushed from 7f8b282 to da9199e.
Hi @minettekaum, thanks! I've rebased onto main and resolved the conflicts. The pytest import error should also be resolved now. The test job failed with |
Thank you for the contribution @kschwethelm! The tests have passed now. We will review soon!
simlang left a comment
Thank you so much for tackling this integration. First iteration already looks great! 🚀
Force-pushed from a700a2b to 4fd566c.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4fd566c.
Add NVIDIA KVPress as an optional dependency, enabling 31 KV cache compression strategies for causal language models. Includes algorithm class, test tester, and compatibility updates across existing LLM algorithms.
kvpress 0.5.2 relaxes the datasets<3 constraint and reverts to transformers>=4.56, resolving the dependency conflict. uv sync --extra kvpress now works without workarounds.
Allow passing additional keyword arguments to the press constructor via the press_kwargs hyperparameter, enabling fine-grained control over press-specific settings like window_size, n_sink, etc.
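To illustrate the forwarding described in this commit, here is a minimal sketch. `ToySnapKVPress`, the `PRESS_TYPES` dict shown here, and `build_press` are hypothetical stand-ins invented for the example; they are not the kvpress classes or the code in this PR.

```python
from dataclasses import dataclass


@dataclass
class ToySnapKVPress:
    """Hypothetical stand-in for a press class; not the real kvpress API."""

    compression_ratio: float = 0.0
    window_size: int = 64


# Hypothetical registry mapping press names to classes.
PRESS_TYPES = {"ToySnapKVPress": ToySnapKVPress}


def build_press(press_type, compression_ratio, press_kwargs=None):
    """Instantiate a press, forwarding extra keyword arguments to its constructor."""
    if press_type not in PRESS_TYPES:
        raise ValueError(f"Unknown press type: {press_type!r}")
    kwargs = dict(press_kwargs or {})
    # Assumption for this sketch: the ratio has its own hyperparameter,
    # so passing it again through press_kwargs is rejected.
    if "compression_ratio" in kwargs:
        raise ValueError("Set compression_ratio via its own hyperparameter")
    return PRESS_TYPES[press_type](compression_ratio=compression_ratio, **kwargs)


press = build_press("ToySnapKVPress", 0.5, {"window_size": 32})
```

Keeping `press_kwargs` as a plain dict means press-specific settings such as `window_size` do not each need their own hyperparameter.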
- Replace tags.QUANTIZER with explicit LLM algorithm names to avoid false symmetry matches with diffuser algorithms
- Fix SmashConfig.add() dict flattening: only flatten when key is a registered algorithm name, not for dict-valued hyperparameters
- Remove wrapper/special presses from PRESS_TYPES (CriticalKVPress and others that don't accept compression_ratio directly)
- Add unit tests for press type validation and kwargs forwarding
- Add SnapKV integration test with press_kwargs
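The flattening fix can be sketched roughly as follows. This is not the actual `SmashConfig.add()` code: `REGISTERED_ALGORITHMS`, `flatten_config`, and the `<algorithm>_<hyperparameter>` key format are assumptions made for the illustration.

```python
# Hypothetical registry of algorithm names (assumption for this sketch).
REGISTERED_ALGORITHMS = {"kvpress", "gptq"}


def flatten_config(entries):
    """Flatten nested dicts only when the key names a registered algorithm."""
    flat = {}
    for key, value in entries.items():
        if isinstance(value, dict) and key in REGISTERED_ALGORITHMS:
            # Algorithm entry: expand its hyperparameters into prefixed keys.
            for hp_name, hp_value in value.items():
                flat[f"{key}_{hp_name}"] = hp_value
        else:
            # Plain hyperparameter: keep the value untouched, even if it is a dict.
            flat[key] = value
    return flat


nested = flatten_config(
    {"kvpress": {"compression_ratio": 0.5, "press_kwargs": {"window_size": 32}}}
)
direct = flatten_config({"kvpress_press_kwargs": {"window_size": 32}})
```

In both calls the `press_kwargs` dict survives intact, which is the behaviour a dict-valued hyperparameter needs.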
Add a new KV_CACHER algorithm tag for KV cache compression algorithms, separate from CACHER (used by diffuser cachers). Use the tag in all LLM algorithm compatibility lists instead of explicit "kvpress" strings.
Drop the dedicated KV_COMPRESSOR tag and use tags.PRUNER as kvpress's group tag, matching how other pruners are categorized. Replace all tags.KV_COMPRESSOR references in compatible_before/after lists with the string "kvpress" to align with the repo convention of naming specific algorithms in compatibility lists.
Add pipeline guard at the top of _apply to delegate to _apply_to_model_within_transformers_pipeline when the model is a TextGenerationPipeline, matching the pattern used by gptq, torch_compile, and other algorithms.
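The guard pattern looks roughly like this. `ToyPipeline` and `ToyAlgorithm` are hypothetical stand-ins, and a plain dict plays the role of the model; the real code checks for a `TextGenerationPipeline` and calls `_apply_to_model_within_transformers_pipeline`.

```python
class ToyPipeline:
    """Hypothetical stand-in for a text-generation pipeline wrapping a model."""

    def __init__(self, model):
        self.model = model


class ToyAlgorithm:
    def _apply(self, model):
        # Guard first: unwrap pipelines before touching model internals.
        if isinstance(model, ToyPipeline):
            return self._apply_to_model_within_pipeline(model)
        # Placeholder for the real work done on a bare model.
        model["compressed"] = True
        return model

    def _apply_to_model_within_pipeline(self, pipeline):
        # Apply to the inner model and hand back the same pipeline object.
        pipeline.model = self._apply(pipeline.model)
        return pipeline


pipe = ToyAlgorithm()._apply(ToyPipeline({}))
```

Putting the check at the top of `_apply` means callers get back the same kind of object they passed in.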
simlang left a comment
Looks good to me! Thank you so much for introducing KVPress to pruna!
LGTM! 🚀
llcnt left a comment
Thank you for the work :)
Thank you so much! It was fun to contribute and discuss with you :))

Description
Integrate KVPress into Pruna, making 20 KV cache compression strategies available for causal language models. KVPress compresses the key-value cache during the prefill phase, reducing memory usage for long-context inference.
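As a rough intuition for what a compression ratio means here, the toy sketch below drops the lowest-scoring fraction of cached key-value pairs. It uses plain Python lists and made-up scores; `compress_kv` is a hypothetical helper, not a kvpress function, and real presses compute their scores from the model's attention state.

```python
def compress_kv(keys, values, scores, compression_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order."""
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    n_keep = int(len(keys) * (1 - compression_ratio))
    # Rank positions by score, then restore positional order for the survivors.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [keys[i] for i in keep], [values[i] for i in keep]


keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
scores = [0.1, 0.9, 0.4, 0.8]  # made-up importance scores
kept_keys, kept_values = compress_kv(keys, values, scores, compression_ratio=0.5)
```

With a ratio of 0.5, half of the four entries are kept, so the cache for the remaining decode steps is half the size.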
Key implementation details:
- `kvpress` algorithm module following the `PrunaAlgorithmBase` pattern
- `compression_ratio` and `press_kwargs` for press-specific parameters
- `KV_CACHER` algorithm tag for the cache compression category
- `reapply` save strategy: the press is re-applied on model load

Excluded press types: Wrapper presses (ChunkPress, AdaKVPress, PerLayerCompressionPress, DMSPress, etc.) are not included in this initial integration. These require a nested `ScorerPress` instance as a constructor argument, which doesn't fit the current single-class design. Similarly, ThinKPress is excluded as it compresses along the channel dimension with a different parameter interface. These could be added in a follow-up if needed.

Some downstream evaluation results are available in repo kschwethelm/pruna-kvpress-eval.
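A small sketch of why wrapper presses do not fit a single-class design: a builder that only knows a class, a `compression_ratio`, and flat keyword arguments has nowhere to put the nested press. `ToyScorerPress`, `ToyWrapperPress`, and `build_from_flat_config` are hypothetical stand-ins, not kvpress classes.

```python
from dataclasses import dataclass


@dataclass
class ToyScorerPress:
    """Stand-in for a press configured directly by a compression ratio."""

    compression_ratio: float = 0.0


@dataclass
class ToyWrapperPress:
    """Stand-in for a wrapper press that needs another press instance."""

    press: ToyScorerPress
    chunk_length: int = 1024


def build_from_flat_config(cls, compression_ratio, press_kwargs):
    """Single-class builder: one class, one ratio, flat keyword arguments."""
    return cls(compression_ratio=compression_ratio, **press_kwargs)


scorer = build_from_flat_config(ToyScorerPress, 0.5, {})

try:
    build_from_flat_config(ToyWrapperPress, 0.5, {"chunk_length": 512})
    wrapper_built = True
except TypeError:
    # The wrapper takes a press instance, not a compression_ratio.
    wrapper_built = False
```

Supporting wrappers would need a second level of configuration (which inner press, with which arguments), which is presumably why they are left for a follow-up.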
Related Issue
Fixes #366
Type of Change
Testing
`uv run pytest -m "cpu and not slow"`

Unit tests added in `tests/algorithms/test_kvpress.py` with a dedicated tester in `tests/algorithms/testers/kvpress.py`. Integration evaluated in a separate repo; see the evaluation report.

Checklist