perf(video): cache caption tokenization + batch past-bookmark lookup#648
Open
mircealungu wants to merge 1 commit into
Open
perf(video): cache caption tokenization + batch past-bookmark lookup#648mircealungu wants to merge 1 commit into
mircealungu wants to merge 1 commit into
Conversation
`GET /user_video?video_id=` returned ~1 MB and took ~2s end-to-end on
captioned videos with hundreds of segments. Two dominant costs:
1. `Video.video_info(with_content=True)` ran Stanza tokenization on every
caption on every request -- no cache. Articles have
`article_tokenization_cache`; videos didn't.
2. `UserVideo.user_video_info` then looped over the freshly-tokenized
captions and made one DB query per caption to fetch past bookmarks
(N+1).
This PR adds:
- `caption_tokenization_cache` table (caption_id PK, tokenized_text
MEDIUMTEXT, created_at) mirroring the article cache. Captions are
immutable post-ingestion so entries are populated lazily on first read
and never invalidated. Foreign-key cascade on delete.
- `CaptionTokenizationCache` model with `find_or_create` (race-safe via
flush+IntegrityError-then-fetch, same pattern as the article one),
`get_many` for batched reads, and `delete_older_than` for cleanup.
- `Video.video_info` now batch-fetches the cache, parses cached JSON
when present, runs Stanza only on misses, persists the new rows, and
commits once at the end.
- `VideoCaptionContext.get_user_bookmarks_grouped_by_caption` -- one
IN-clause query that replaces the per-caption helper; returns a
{caption_id: [bookmark_json, ...]} dict.
- `UserVideo.user_video_info` uses the grouped result instead of the
per-caption helper. The single-caption helper is left in place for
other callers.
Expected effect after first warm cache: /user_video drops from ~2s to
roughly the time of one batched SELECT plus the JSON serialisation. The
N+1 collapses to one query regardless of caption count.
Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql`
must be run before/with deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Problem
`GET /user_video?video_id=` was returning ~1 MB and taking ~2 seconds end-to-end for any captioned video. Observed live on video 2478 (16-min auto-captioned Danish DR clip) -- multiple back-to-back refreshes all took the same ~2s, no caching warm-up.
Two costs dominated the response, both tied to the number of captions (hundreds for a long video):
Changes
`caption_tokenization_cache` table + model
Mirrors `ArticleTokenizationCache` 1:1. Captions are immutable after a video is ingested, so entries are populated lazily on first read and never invalidated; `delete_older_than` is provided for housekeeping. Race-safe `find_or_create` using the same flush+`IntegrityError`-then-fetch pattern as the article version. `get_many` is the batched read.
Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` (FK cascade on delete).
`Video.video_info(with_content=True)` now uses the cache
Batched past-bookmark lookup
Added `VideoCaptionContext.get_user_bookmarks_grouped_by_caption(user_id, caption_ids)` -- one query with `caption_id IN (...)`, returns `{caption_id: [bookmark_json, ...]}`. `UserVideo.user_video_info` now uses the grouped result. The original single-caption helper is left in place for other callers.
Expected impact
After one warm-up request per video, `/user_video` should drop from ~2s to roughly the time of one batched `SELECT` plus JSON serialisation. The N+1 collapses to one query regardless of caption count. Response size is unchanged (~1 MB) -- that's a separate concern; a lighter preview endpoint for callers that don't need the full transcript is future work.
Deploy
Run `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` before/with the deploy.
Out of scope
🤖 Generated with Claude Code