feat: integrate KVPress for KV cache compression (#366) by kschwethelm · Pull Request #623 · PrunaAI/pruna

kschwethelm · 2026-04-10T14:58:20Z

Description

Integrate KVPress into Pruna, making 20 KV cache compression strategies available for causal language models. KVPress compresses the key-value cache during the prefill phase, reducing memory usage for long-context inference.

Key implementation details:

New kvpress algorithm module following the PrunaAlgorithmBase pattern
Supports 20 scorer presses (ExpectedAttention, SnapKV, StreamingLLM, TOVA, KVzip, etc.)
Configurable compression_ratio and press_kwargs for press-specific parameters
New KV_CACHER algorithm tag for the cache compression category
Compatibility defined with quantization algorithms (before) and torch_compile (after)
Uses reapply save strategy — press is re-applied on model load
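The ordering constraint listed above (quantization before, torch_compile after) can be pictured with a small, self-contained sketch. The `COMPATIBLE_BEFORE` and `COMPATIBLE_AFTER` tables and the `check_order` helper are hypothetical stand-ins for illustration, not Pruna's actual data structures, and `hqq` is used only because it is the quantizer tested later in this thread.

```python
# Hypothetical ordering tables: which algorithms may run before / after kvpress.
COMPATIBLE_BEFORE = {"kvpress": {"hqq"}}            # quantizers run first
COMPATIBLE_AFTER = {"kvpress": {"torch_compile"}}   # compilation runs last


def check_order(pipeline: list[str], target: str = "kvpress") -> bool:
    """Return True if every algorithm around `target` sits on an allowed side."""
    if target not in pipeline:
        return True  # nothing to check when the target is not used
    idx = pipeline.index(target)
    before_ok = all(name in COMPATIBLE_BEFORE[target] for name in pipeline[:idx])
    after_ok = all(name in COMPATIBLE_AFTER[target] for name in pipeline[idx + 1:])
    return before_ok and after_ok
```

Under these assumed tables, `["hqq", "kvpress", "torch_compile"]` is accepted, while placing `torch_compile` ahead of `kvpress` is rejected.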

Excluded press types: Wrapper presses (ChunkPress, AdaKVPress, PerLayerCompressionPress, DMSPress, etc.) are not included in this initial integration. These require a nested ScorerPress instance as a constructor argument, which doesn't fit the current single-class design. Similarly, ThinKPress is excluded as it compresses along the channel dimension with a different parameter interface. These could be added in a follow-up if needed.
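To illustrate why wrapper presses do not fit a "one class name plus flat parameters" design, here is a minimal sketch with toy classes. `ToyScorerPress`, `ToyWrapperPress`, and `build_from_flat_config` are hypothetical names invented for this example; they only mimic the constructor shapes described above and are not KVPress or Pruna code.

```python
from dataclasses import dataclass


@dataclass
class ToyScorerPress:
    """Stand-in for a scorer press: configured by plain values only."""
    compression_ratio: float = 0.0


@dataclass
class ToyWrapperPress:
    """Stand-in for a wrapper press: needs another press instance."""
    press: ToyScorerPress


def build_from_flat_config(cls, compression_ratio: float, **press_kwargs):
    """Single-class style construction: one class plus flat parameters."""
    return cls(compression_ratio=compression_ratio, **press_kwargs)


scorer = build_from_flat_config(ToyScorerPress, 0.5)

try:
    build_from_flat_config(ToyWrapperPress, 0.5)
    wrapper_built = True
except TypeError:
    # The toy wrapper takes no compression_ratio and requires a nested press.
    wrapper_built = False
```

Supporting the wrapper case would need a nested specification (a press inside a press), which is the extension the description defers to a follow-up.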

Some downstream evaluation results are available in repo kschwethelm/pruna-kvpress-eval.

Related Issue

Fixes #366

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor (no functional change)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Testing

I added or updated tests covering my changes
Existing tests pass locally (uv run pytest -m "cpu and not slow")

Unit tests were added in tests/algorithms/test_kvpress.py, with a dedicated tester in tests/algorithms/testers/kvpress.py. Integration was evaluated in a separate repo; see the evaluation report.

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my code, especially for agent-assisted changes
I updated the documentation where necessary

codacy-production · 2026-04-10T14:59:32Z

Not up to standards ⛔

🔴 Issues 5 high · 5 minor

Alerts:
⚠ 10 issues (≤ 0 issues of at least minor severity)

Results:
10 new issues

Category Results

Documentation 5 minor

Security 5 high

🟢 Metrics 9 complexity · 0 duplication

Metric Results

Complexity 9

Duplication 0

minettekaum · 2026-04-13T16:25:45Z

Hi @kschwethelm! Thanks for the contribution. This branch has conflicts with main. Could you rebase onto main, resolve the conflicts, and push the updated branch? Could you also review the test results and make any necessary changes? Let @sdiazlor or me know if you need any help :)

kschwethelm · 2026-04-13T17:15:05Z

Hi @minettekaum, thanks! I've rebased onto main and resolved the conflicts. The pytest import error should also be resolved now. The test job failed with ModuleNotFoundError: No module named 'kvpress' because the base job doesn't install the kvpress extra.

sdiazlor · 2026-04-15T11:19:48Z

Thank you for the contribution @kschwethelm! The tests have passed now. We will review soon!

simlang

Thank you so much for tackling this integration. First iteration already looks great! 🚀

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

llcnt

I tested a combo hqq+kvpress(snapkv)+torch_compile, and it works well!
Thanks for the contribution, this is very cool :):)
Could you address this comment? After that we will be good to go ;)

- Replace tags.QUANTIZER with explicit LLM algorithm names to avoid false symmetry matches with diffuser algorithms
- Fix SmashConfig.add() dict flattening: only flatten when key is a registered algorithm name, not for dict-valued hyperparameters
- Remove wrapper/special presses from PRESS_TYPES (CriticalKVPress and others that don't accept compression_ratio directly)
- Add unit tests for press type validation and kwargs forwarding
- Add SnapKV integration test with press_kwargs
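The SmashConfig.add() flattening fix mentioned above can be sketched as follows. This is an illustration of the stated rule only: `REGISTERED_ALGORITHMS`, `flatten_config`, and the `algorithm_parameter` key naming are assumptions made for the example and do not reproduce the actual SmashConfig implementation.

```python
# Hypothetical registry of algorithm names known to the config.
REGISTERED_ALGORITHMS = {"kvpress", "hqq"}


def flatten_config(request: dict) -> dict:
    """Flatten {"algo": {"param": v}} into {"algo_param": v}; keep other dicts whole."""
    flat = {}
    for key, value in request.items():
        if isinstance(value, dict) and key in REGISTERED_ALGORITHMS:
            # The key names an algorithm, so its dict holds hyperparameters.
            for param, param_value in value.items():
                flat[f"{key}_{param}"] = param_value
        else:
            # Dict-valued hyperparameters such as press_kwargs stay intact.
            flat[key] = value
    return flat
```

With this rule, a nested `press_kwargs` dict is passed through as a single value instead of being split into separate keys.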

Drop the dedicated KV_COMPRESSOR tag and use tags.PRUNER as kvpress's group tag, matching how other pruners are categorized. Replace all tags.KV_COMPRESSOR references in compatible_before/after lists with the string "kvpress" to align with the repo convention of naming specific algorithms in compatibility lists.

simlang

Looks good to me! Thank you so much for introducing KVPress to pruna!

LGTM! 🚀

llcnt

Thank you for the work :)

kschwethelm · 2026-05-06T15:19:19Z

Thank you so much! It was fun to contribute and discuss with you :))

kschwethelm force-pushed the feat/kvpress branch from 7f8b282 to da9199e Compare April 13, 2026 17:12

sdiazlor requested review from gsprochette and simlang April 15, 2026 11:17

simlang requested changes Apr 15, 2026

Comment thread src/pruna/algorithms/kvpress.py Outdated

Comment thread src/pruna/algorithms/kvpress.py

Comment thread tests/algorithms/test_kvpress.py Outdated

kschwethelm force-pushed the feat/kvpress branch 2 times, most recently from a700a2b to 4fd566c Compare April 17, 2026 06:56

cursor Bot reviewed Apr 17, 2026

Comment thread src/pruna/algorithms/kvpress.py

gsprochette requested review from llcnt and removed request for gsprochette April 24, 2026 08:03

gsprochette reviewed Apr 29, 2026

Comment thread pyproject.toml

llcnt requested changes Apr 29, 2026

Comment thread src/pruna/algorithms/kvpress.py

kschwethelm added 10 commits May 2, 2026 10:30

feat: integrate KVPress for KV cache compression

c14f724

Add NVIDIA KVPress as an optional dependency, enabling 31 KV cache compression strategies for causal language models. Includes algorithm class, test tester, and compatibility updates across existing LLM algorithms.

feat: bump kvpress to >=0.5.2, add FastKVzipPress

becc89d

kvpress 0.5.2 relaxes the datasets<3 constraint and reverts to transformers>=4.56, resolving the dependency conflict. uv sync --extra kvpress now works without workarounds.

feat: add press_kwargs for press-specific parameters

1546da9

Allow passing additional keyword arguments to the press constructor via the press_kwargs hyperparameter, enabling fine-grained control over press-specific settings like window_size, n_sink, etc.
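A minimal sketch of how press_kwargs forwarding could work, assuming a name-to-class registry. The toy press classes, their default values, and the `build_press` helper are hypothetical; only the `PRESS_TYPES`, `compression_ratio`, `press_kwargs`, `window_size`, and `n_sink` names come from this PR.

```python
from dataclasses import dataclass


@dataclass
class ToySnapKVPress:
    """Toy stand-in; the default values here are illustrative only."""
    compression_ratio: float = 0.0
    window_size: int = 64


@dataclass
class ToyStreamingLLMPress:
    """Toy stand-in; the default values here are illustrative only."""
    compression_ratio: float = 0.0
    n_sink: int = 4


# Hypothetical registry contents; the real PRESS_TYPES is not shown on this page.
PRESS_TYPES = {"snapkv": ToySnapKVPress, "streaming_llm": ToyStreamingLLMPress}


def build_press(press_type: str, compression_ratio: float, press_kwargs=None):
    """Validate the press name, then forward extra kwargs to its constructor."""
    if press_type not in PRESS_TYPES:
        raise ValueError(f"Unknown press type: {press_type!r}")
    extra = dict(press_kwargs or {})
    return PRESS_TYPES[press_type](compression_ratio=compression_ratio, **extra)
```

Forwarding the kwargs unchanged keeps the integration thin: an unsupported key surfaces as a `TypeError` from the press constructor rather than being silently dropped.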

feat: add KV_CACHER tag, replace explicit kvpress references

469e5b3

Add a new KV_CACHER algorithm tag for KV cache compression algorithms, separate from CACHER (used by diffuser cachers). Use the tag in all LLM algorithm compatibility lists instead of explicit "kvpress" strings.

refactor: rename KV_CACHER tag to KV_COMPRESSOR, improve docstrings

1ab2e3b

docs: document excluded wrapper presses in kvpress docstring

63a1315

fix: handle transformers pipeline in kvpress _apply

e046d4a

Add pipeline guard at the top of _apply to delegate to _apply_to_model_within_transformers_pipeline when the model is a TextGenerationPipeline, matching the pattern used by gptq, torch_compile, and other algorithms.
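The pipeline guard described above follows a simple delegation pattern, sketched here with toy classes. `ToyModel`, `ToyTextGenerationPipeline`, and the helper functions are hypothetical stand-ins; the real `_apply` and `_apply_to_model_within_transformers_pipeline` signatures are not shown on this page and are not reproduced here.

```python
class ToyModel:
    """Stand-in for a causal language model."""

    def __init__(self):
        self.press_applied = False


class ToyTextGenerationPipeline:
    """Stand-in for a pipeline object that wraps a model."""

    def __init__(self, model):
        self.model = model


def _apply_to_model(model):
    # Placeholder for the step that attaches the press to the model.
    model.press_applied = True
    return model


def _apply_within_pipeline(pipeline):
    # Apply to the wrapped model, then hand back the same pipeline object.
    pipeline.model = _apply_to_model(pipeline.model)
    return pipeline


def apply(model_or_pipeline):
    # Guard at the top: unwrap pipelines before touching the model.
    if isinstance(model_or_pipeline, ToyTextGenerationPipeline):
        return _apply_within_pipeline(model_or_pipeline)
    return _apply_to_model(model_or_pipeline)
```

Returning the pipeline rather than the bare model means callers get back the same kind of object they passed in.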

ci: register requires_kvpress marker for optional extra

c0b1fa8
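A marker for an optional extra usually rests on a check for whether the package is importable. The sketch below shows only that check, using the standard library; `is_installed` and `KVPRESS_AVAILABLE` are hypothetical names, and how the requires_kvpress marker is actually wired into the test suite is not shown on this page.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


# True only in environments where the optional kvpress extra is installed.
KVPRESS_AVAILABLE = is_installed("kvpress")
```

A test runner can use such a flag to skip kvpress tests in jobs that do not install the extra, instead of failing with ModuleNotFoundError.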

kschwethelm force-pushed the feat/kvpress branch from 90cd1f8 to c0b1fa8 Compare May 3, 2026 07:38

sdiazlor requested review from llcnt and simlang May 5, 2026 15:02

simlang approved these changes May 6, 2026

llcnt approved these changes May 6, 2026

gsprochette force-pushed the feat/kvpress branch from ad4577b to c0b1fa8 Compare May 6, 2026 13:46

test: mark kvpress as require_kvpress

9302ef2

llcnt merged commit b210fdb into PrunaAI:main May 6, 2026
5 checks passed

kschwethelm deleted the feat/kvpress branch May 6, 2026 15:18

6 participants
