Skip to content

perf(video): cache caption tokenization + batch past-bookmark lookup#648

Open
mircealungu wants to merge 1 commit into
masterfrom
perf/video-tokenization-cache
Open

perf(video): cache caption tokenization + batch past-bookmark lookup#648
mircealungu wants to merge 1 commit into
masterfrom
perf/video-tokenization-cache

Conversation

@mircealungu
Copy link
Copy Markdown
Member

Problem

`GET /user_video?video_id=` was returning ~1 MB and taking ~2 seconds end-to-end for any captioned video. Observed live on video 2478 (16-min auto-captioned Danish DR clip) -- multiple back-to-back refreshes all took the same ~2s, no caching warm-up.

Two costs dominated the response, both tied to the number of captions (hundreds for a long video):

  1. Stanza re-tokenization on every request. `Video.video_info(with_content=True)` ran `tokenize_for_reading` for each caption inline, with no cache. Articles have `article_tokenization_cache`; videos didn't.
  2. N+1 `past_bookmarks` query. `UserVideo.user_video_info` then looped over the captions and called `VideoCaptionContext.get_all_user_bookmarks_for_caption` once per caption -- one DB query each.

Changes

`caption_tokenization_cache` table + model

Mirrors `ArticleTokenizationCache` 1:1. Captions are immutable after a video is ingested, so entries are populated lazily on first read and never invalidated; `delete_older_than` is provided for housekeeping. Race-safe `find_or_create` using the same flush+`IntegrityError`-then-fetch pattern as the article version. `get_many` is the batched read.

Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` (FK cascade on delete).

`Video.video_info(with_content=True)` now uses the cache

  • Collects all caption ids upfront.
  • One `SELECT ... WHERE caption_id IN (...)` against the cache.
  • Cache hit: parse cached JSON, no Stanza work.
  • Cache miss: run Stanza, persist via `find_or_create`, set `tokenized_text`.
  • One `db.session.commit()` at the end if any new rows were populated.

Batched past-bookmark lookup

Added `VideoCaptionContext.get_user_bookmarks_grouped_by_caption(user_id, caption_ids)` -- one query with `caption_id IN (...)`, returns `{caption_id: [bookmark_json, ...]}`. `UserVideo.user_video_info` now uses the grouped result. The original single-caption helper is left in place for other callers.

Expected impact

After one warm-up request per video, `/user_video` should drop from ~2s to roughly the time of one batched `SELECT` plus JSON serialisation. The N+1 collapses to one query regardless of caption count. Response size is unchanged (~1 MB) -- that's a separate concern; a lighter preview endpoint for callers that don't need the full transcript is future work.

Deploy

Run `tools/migrations/26-06-01--add_caption_tokenization_cache.sql` before/with the deploy.

Out of scope

  • Lighter `/video_preview` endpoint returning only `{id, title, duration}` for the share-flow's intermediate screen.
  • Whether to gzip the `/user_video` response (the 1 MB transports as ~150 KB compressed; worth checking that nginx/cloudflare is doing this).
  • `tokenized_title` is still re-computed each request (single Stanza call -- negligible compared to the captions).

🤖 Generated with Claude Code

`GET /user_video?video_id=` returned ~1 MB and took ~2s end-to-end on
captioned videos with hundreds of segments. Two dominant costs:

1. `Video.video_info(with_content=True)` ran Stanza tokenization on every
   caption on every request -- no cache. Articles have
   `article_tokenization_cache`; videos didn't.
2. `UserVideo.user_video_info` then looped over the freshly-tokenized
   captions and made one DB query per caption to fetch past bookmarks
   (N+1).

This PR adds:

- `caption_tokenization_cache` table (caption_id PK, tokenized_text
  MEDIUMTEXT, created_at) mirroring the article cache. Captions are
  immutable post-ingestion so entries are populated lazily on first read
  and never invalidated. Foreign-key cascade on delete.
- `CaptionTokenizationCache` model with `find_or_create` (race-safe via
  flush+IntegrityError-then-fetch, same pattern as the article one),
  `get_many` for batched reads, and `delete_older_than` for cleanup.
- `Video.video_info` now batch-fetches the cache, parses cached JSON
  when present, runs Stanza only on misses, persists the new rows, and
  commits once at the end.
- `VideoCaptionContext.get_user_bookmarks_grouped_by_caption` -- one
  IN-clause query that replaces the per-caption helper; returns a
  {caption_id: [bookmark_json, ...]} dict.
- `UserVideo.user_video_info` uses the grouped result instead of the
  per-caption helper. The single-caption helper is left in place for
  other callers.

Expected effect after first warm cache: /user_video drops from ~2s to
roughly the time of one batched SELECT plus the JSON serialisation. The
N+1 collapses to one query regardless of caption count.

Migration: `tools/migrations/26-06-01--add_caption_tokenization_cache.sql`
must be run before/with deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

ArchLens detected architectural changes in the following views:
diff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant