Add tutorial for KV cache compression with TurboQuant #438
Open
kacperlukawski wants to merge 5 commits into main from turboquant-tutorial
+368 −0
Commits (all by kacperlukawski):
- 65416cd Add tutorial for KV cache compression with TurboQuant
- 0b86083 Make it clear that we use unofficial turboquant implementation
- abc01d4 Specify Python version in tutorial configuration
- 47a33b0 Remove HF_TOKEN ref
- ec27d0c Add cell outputs
tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb (355 additions, 0 deletions)
{
Reviewer (Contributor): Can you leave the outputs, especially when we print a result? I find these very useful. (Reply via ReviewNB)
Author (Member): Added all the cell outputs.
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compress the KV Cache with TurboQuant and Haystack\n",
"\n",
"- **Level**: Advanced\n",
"- **Time to complete**: 20 min\n",
"- **Components Used**: [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator)\n",
"- **Goal**: Apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"Every time an LLM generates a token, it reads and writes a **key-value (KV) cache** - a growing table of intermediate activations that lets the model attend to previous tokens without recomputing them. On long contexts or large models, this cache becomes the dominant consumer of GPU memory.\n",
"\n",
"[TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) is a KV cache compression algorithm from Google Research (ICLR 2026) that shrinks those vectors to 3–4 bits per coordinate without any retraining. It works in two stages:\n",
"\n",
"1. **PolarQuant** - a random orthogonal rotation maps cache vectors to a more uniform distribution, then quantizes them in polar coordinates using Lloyd-Max optimal centroids.\n",
"2. **QJL** (Quantized Johnson-Lindenstrauss) - a single extra bit per vector corrects residual errors in attention score computation, preserving accuracy at extreme compression ratios.\n",
"\n",
"The result: KV memory can drop from 1,639 MiB to 435 MiB (3.76x) on an RTX 4090, with ≥6x reduction validated on server hardware, and near-identical output quality.\n",
"\n",
"In this tutorial you will use [`turboquant-vllm`](https://github.com/Alberto-Codes/turboquant-vllm), a community implementation of the TurboQuant algorithm, to wire `CompressedDynamicCache` into Haystack's [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator), run a generation, and measure time-to-first-token, throughput, and live VRAM usage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing Haystack and TurboQuant\n",
"\n",
"First, let's install `haystack-ai` and [`turboquant-vllm`](https://github.com/Alberto-Codes/turboquant-vllm), a community implementation of the TurboQuant algorithm that provides the `CompressedDynamicCache` wrapper."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T13:26:59.381856Z",
"start_time": "2026-04-03T13:26:24.012627Z"
}
},
"source": [
"%%bash\n",
"\n",
"pip install -q haystack-ai turboquant-vllm"
],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m A new release of pip is available: \u001B[0m\u001B[31;49m25.0.1\u001B[0m\u001B[39;49m -> \u001B[0m\u001B[32;49m26.0.1\u001B[0m\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m To update, run: \u001B[0m\u001B[32;49mpip install --upgrade pip\u001B[0m\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting Up a Streaming Callback\n",
"\n",
"To measure **time-to-first-token (TTFT)** and throughput, we pass a streaming callback that timestamps each arriving token. The first call marks TTFT, while the last marks the end of generation."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T13:26:59.418920Z",
"start_time": "2026-04-03T13:26:59.390162Z"
}
},
"source": [
"import time\n",
"\n",
"first_token_time = None\n",
"last_token_time = None\n",
"\n",
"def timing_callback(chunk):\n",
"    global first_token_time, last_token_time\n",
"    now = time.perf_counter()\n",
"    if first_token_time is None:\n",
"        first_token_time = now\n",
"    last_token_time = now"
],
"outputs": [],
"execution_count": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compressing the KV Cache\n",
"\n",
"Next, let's create the compressed cache. We start with HuggingFace's standard `DynamicCache` and wrap it with `CompressedDynamicCache`, which intercepts cache writes and applies TurboQuant compression in place.\n",
"\n",
"Two parameters control the compression:\n",
"- `head_dim` - the dimensionality of each attention head's key/value vectors\n",
"- `bits` - the target bit-width per coordinate\n",
"\n",
"> **Note**: Pass the original `cache` object to the generator - not `compressed`. `CompressedDynamicCache` modifies `cache` internally, so both variables point to the same compressed state."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T13:27:40.845609Z",
"start_time": "2026-04-03T13:26:59.424193Z"
}
},
"source": [
"from transformers import DynamicCache\n",
"from turboquant_vllm import CompressedDynamicCache\n",
"\n",
"# CompressedDynamicCache modifies the wrapped DynamicCache in place,\n",
"# so we later pass `cache` (not `compressed`) to the generator.\n",
"cache = DynamicCache()\n",
"compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)"
],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/kacper.lukawski/Projects/haystack-tutorials/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
"  from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"execution_count": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initializing the Generator\n",
"\n",
"Now let's set up [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator) with a model of your choice, such as `Qwen/Qwen3-4B-Thinking-2507`. We pass the compressed `cache` via `generation_kwargs` so that every decoding step writes through TurboQuant."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T13:28:00.561590Z",
"start_time": "2026-04-03T13:27:40.873264Z"
}
},
"source": [
"from haystack.components.generators.chat import HuggingFaceLocalChatGenerator\n",
"\n",
"generator = HuggingFaceLocalChatGenerator(\n",
"    model=\"Qwen/Qwen3-4B-Thinking-2507\",\n",
"    task=\"text-generation\",\n",
"    generation_kwargs={\n",
"        \"past_key_values\": cache,\n",
"        \"use_cache\": True,\n",
"    },\n",
"    streaming_callback=timing_callback,\n",
")"
],
"outputs": [],
"execution_count": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the Generator\n",
"\n",
"Let's run a generation and record the total wall time."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T14:06:57.416718Z",
"start_time": "2026-04-03T13:28:00.574408Z"
}
},
"source": [
"from haystack.dataclasses import ChatMessage\n",
"\n",
"start = time.perf_counter()\n",
"output = generator.run(messages=[\n",
"    ChatMessage.from_user(\"What is the capital of France?\"),\n",
"])\n",
"total_time = time.perf_counter() - start"
],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 55.96it/s]\n",
"Device set to use mps\n"
]
}
],
"execution_count": 5
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T14:06:59.733732Z",
"start_time": "2026-04-03T14:06:58.328820Z"
}
},
"source": [
"reply = output[\"replies\"][0]\n",
"print(reply.text)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Okay, the user is asking, \"What is the capital of France?\" This seems like a straightforward geography question. Let me recall... I know that Paris is the capital of France. But wait, I should make sure I'm not mixing it up with other countries. For example, London is the capital of the UK, and Berlin is for Germany. Yeah, France's capital is definitely Paris.\n",
"\n",
"Hmm, why would someone ask this? Maybe they're a student studying for a test, or someone learning English as a second language. They might be confirming basic facts. Or perhaps they're testing me to see if I know the answer correctly. Either way, I should give a clear and confident answer.\n",
"\n",
"I remember that historically, Paris has been the capital since the Middle Ages. It's also known for the Eiffel Tower, the Louvre Museum, and the French Revolution. So, there's no doubt here. But I should double-check to avoid any mistakes. Let me think... Yes, all reliable sources say Paris is the capital. No recent changes either—France hasn't moved its capital. \n",
"\n",
"The user might also be confused if they heard about \"République française\" or something else, but no, the capital is still Paris. I should mention it's in the Île-de-France region to be precise, but maybe that's extra. The question is simple, so the answer should be concise.\n",
"\n",
"Wait, is there any trick here? Like, does France have multiple capitals? No, Paris is the only capital. Sometimes people confuse it with other cities like Lyon or Marseille, but those are major cities, not capitals. So, no confusion here.\n",
"\n",
"I think the best response is to state clearly that Paris is the capital of France. Maybe add a brief note about its significance to be helpful. Like, it's the political, cultural, and economic center. But the user just asked for the capital, so keep it short. Don't overcomplicate it.\n",
"\n",
"Also, the user might be non-native English speaker, so I should use simple language. No need for complex terms. Just \"Paris\" is enough, but adding \"the capital city of France\" makes it clear.\n",
"\n",
"Let me phrase it: \"The capital of France is Paris.\" Done. That's accurate and straightforward. No need for more details unless the user asks follow-ups. \n",
"\n",
"Wait, just to be thorough—did France ever have another capital? Like, during the French Revolution, they moved temporarily? No, Paris remained the capital. There\n"
]
}
],
"execution_count": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading the Metrics\n",
"\n",
"Three metrics to check:\n",
"\n",
"- **TTFT** (time-to-first-token) - latency to the first output token - a proxy for perceived responsiveness.\n",
"- **Throughput** (tok/s) - tokens decoded per second. TurboQuant's memory savings reduce cache read pressure, which can improve this on memory-bandwidth-bound hardware.\n",
"- **Total time** - end-to-end wall time including model loading overhead."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T14:07:00.078667Z",
"start_time": "2026-04-03T14:06:59.919485Z"
}
},
"source": [
"tokens = reply.meta[\"usage\"][\"completion_tokens\"]\n",
"if first_token_time is not None and last_token_time is not None:\n",
"    generation_time = last_token_time - first_token_time\n",
"    print(f\"TTFT: {first_token_time - start:.3f}s\")\n",
"    print(f\"Tokens: {tokens}\")\n",
"    print(f\"Speed: {tokens / generation_time:.1f} tok/s\")\n",
"print(f\"Total time: {total_time:.3f}s\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TTFT: 81.985s\n",
"Tokens: 512\n",
"Speed: 0.4 tok/s\n",
"Total time: 1425.370s\n"
]
}
],
"execution_count": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking VRAM Usage\n",
"\n",
"`vram_bytes()` returns the byte footprint of all compressed KV tensors. Compare it against an uncompressed `DynamicCache` to verify the reduction reported in the TurboQuant paper."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-04-03T14:07:00.526515Z",
"start_time": "2026-04-03T14:07:00.121042Z"
}
},
"source": [
"compressed.vram_bytes()"
],
"outputs": [
{
"data": {
"text/plain": [
"20680704"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🎉 Congratulations! You've successfully run a local LLM with TurboQuant KV cache compression through Haystack and measured its real-world memory and throughput impact."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
| } | ||
Reviewer (Contributor): Components Used:..
Goal: After completing this tutorial, you will have learned how to apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack. (Reply via ReviewNB)
Author: Right, good catch! I created an issue (#443) to update the template so that we use consistent terminology.
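A side note on the notebook's metrics cells: the streaming-callback timing pattern can be sanity-checked without loading any model by feeding the callback fake chunks. A minimal sketch; the plain-string `chunk` stands in for Haystack's streaming chunk object, since the timing logic never inspects it:

```python
import time

first_token_time = None
last_token_time = None

def timing_callback(chunk):
    """Record the arrival time of each streamed chunk; the first call marks TTFT."""
    global first_token_time, last_token_time
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now
    last_token_time = now

# Simulate a generator emitting 5 tokens roughly 10 ms apart.
start = time.perf_counter()
for token in ["Paris", " is", " the", " capital", "."]:
    time.sleep(0.01)  # stand-in for decode latency
    timing_callback(token)

ttft = first_token_time - start
decode_span = last_token_time - first_token_time
print(f"TTFT: {ttft:.3f}s, decode span for the remaining 4 tokens: {decode_span:.3f}s")
```

Dividing the token count by `decode_span` then gives the tok/s figure the notebook reports.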