perf(video): cache caption tokenization + batch past-bookmark lookup by mircealungu · Pull Request #648 · zeeguu/api

mircealungu · 2026-06-01T20:34:14Z

Problem

`GET /user_video?video_id=` was returning ~1 MB and taking ~2 seconds end-to-end for any captioned video. Observed live on video 2478 (16-min auto-captioned Danish DR clip) -- multiple back-to-back refreshes all took the same ~2s, no caching warm-up.

Two costs dominated the response, both tied to the number of captions (hundreds for a long video):

Stanza re-tokenization on every request. `Video.video_info(with_content=True)` ran `tokenize_for_reading` for each caption inline, with no cache. Articles have `article_tokenization_cache`; videos didn't.
N+1 `past_bookmarks` query. `UserVideo.user_video_info` then looped over the captions and called `VideoCaptionContext.get_all_user_bookmarks_for_caption` once per caption -- one DB query each.

Changes

`caption_tokenization_cache` table + model

Mirrors `ArticleTokenizationCache` 1:1. Captions are immutable after a video is ingested, so entries are populated lazily on first read and never invalidated; `delete_older_than` is provided for housekeeping. Race-safe `find_or_create` using the same flush+`IntegrityError`-then-fetch pattern as the article version. `get_many` is the batched read.

Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` (FK cascade on delete).

`Video.video_info(with_content=True)` now uses the cache

Collects all caption ids upfront.
One `SELECT ... WHERE caption_id IN (...)` against the cache.
Cache hit: parse cached JSON, no Stanza work.
Cache miss: run Stanza, persist via `find_or_create`, set `tokenized_text`.
One `db.session.commit()` at the end if any new rows were populated.

Batched past-bookmark lookup

Added `VideoCaptionContext.get_user_bookmarks_grouped_by_caption(user_id, caption_ids)` -- one query with `caption_id IN (...)`, returns `{caption_id: [bookmark_json, ...]}`. `UserVideo.user_video_info` now uses the grouped result. The original single-caption helper is left in place for other callers.

Expected impact

After one warm-up request per video, `/user_video` should drop from ~2s to roughly the time of one batched `SELECT` plus JSON serialisation. The N+1 collapses to one query regardless of caption count. Response size is unchanged (~1 MB) -- that's a separate concern; a lighter preview endpoint for callers that don't need the full transcript is future work.

Deploy

Run `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` before/with the deploy.

Out of scope

Lighter `/video_preview` endpoint returning only `{id, title, duration}` for the share-flow's intermediate screen.
Whether to gzip the `/user_video` response (the 1 MB transports as ~150 KB compressed; worth checking that nginx/cloudflare is doing this).
`tokenized_title` is still re-computed each request (single Stanza call -- negligible compared to the captions).

🤖 Generated with Claude Code

`GET /user_video?video_id=` returned ~1 MB and took ~2s end-to-end on captioned videos with hundreds of segments. Two dominant costs: 1. `Video.video_info(with_content=True)` ran Stanza tokenization on every caption on every request -- no cache. Articles have `article_tokenization_cache`; videos didn't. 2. `UserVideo.user_video_info` then looped over the freshly-tokenized captions and made one DB query per caption to fetch past bookmarks (N+1). This PR adds: - `caption_tokenization_cache` table (caption_id PK, tokenized_text MEDIUMTEXT, created_at) mirroring the article cache. Captions are immutable post-ingestion so entries are populated lazily on first read and never invalidated. Foreign-key cascade on delete. - `CaptionTokenizationCache` model with `find_or_create` (race-safe via flush+IntegrityError-then-fetch, same pattern as the article one), `get_many` for batched reads, and `delete_older_than` for cleanup. - `Video.video_info` now batch-fetches the cache, parses cached JSON when present, runs Stanza only on misses, persists the new rows, and commits once at the end. - `VideoCaptionContext.get_user_bookmarks_grouped_by_caption` -- one IN-clause query that replaces the per-caption helper; returns a {caption_id: [bookmark_json, ...]} dict. - `UserVideo.user_video_info` uses the grouped result instead of the per-caption helper. The single-caption helper is left in place for other callers. Expected effect after first warm cache: /user_video drops from ~2s to roughly the time of one batched SELECT plus the JSON serialisation. The N+1 collapses to one query regardless of caption count. Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` must be run before/with deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-01T20:35:10Z

ArchLens detected architectural changes in the following views:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(video): cache caption tokenization + batch past-bookmark lookup#648

perf(video): cache caption tokenization + batch past-bookmark lookup#648
mircealungu wants to merge 1 commit into
masterfrom
perf/video-tokenization-cache

mircealungu commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mircealungu commented Jun 1, 2026

Problem

Changes

`caption_tokenization_cache` table + model

`Video.video_info(with_content=True)` now uses the cache

Batched past-bookmark lookup

Expected impact

Deploy

Out of scope

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant