Reclaim orphaned article content; stop prune_old_articles leaking it #633
Merged
Article deletion was leaving ~95% of the content tables orphaned. Two causes, both fixed here:
1. prune_old_articles.py deletes with FOREIGN_KEY_CHECKS=0 (needed to get past article's NO ACTION children). With FK checks off, the ON DELETE CASCADE children are never cleaned up. delete_in_batches now deletes article's cascade-owned children explicitly (article_fragment + its article_fragment_context, article_tokenization_cache, article_cefr_assessment, and the rest of the CASCADE set) so pruning no longer leaks them.
2. The shared, deduplicated content tables (new_text / source / source_text) have no cascade path at all, because a row can be shared across articles and user data. New tool cleanup_orphaned_content.py reclaims them (and clears the historical backlog) by deleting only rows not referenced by any surviving article fragment, bookmark_context, caption, bookmark, user_activity_data, or video.
Dry-run by default; --execute to apply, --optimize to reclaim disk.
Measured on a production snapshot: ~32M deletable rows, shrinking the dump from ~40 GB to ~4 GB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
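The first cause can be reproduced in miniature. The sketch below is illustrative only: it uses an in-memory SQLite database as a stand-in for the real MySQL schema, with `PRAGMA foreign_keys` playing the role of `FOREIGN_KEY_CHECKS`, and a two-table schema that is an assumption rather than the project's actual one.

```python
import sqlite3

# Autocommit mode, so the foreign_keys PRAGMA is never issued inside an
# open transaction (SQLite ignores it there).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY);
    CREATE TABLE article_fragment (
        id INTEGER PRIMARY KEY,
        article_id INTEGER NOT NULL REFERENCES article(id) ON DELETE CASCADE
    );
    INSERT INTO article (id) VALUES (1), (2);
    INSERT INTO article_fragment (id, article_id) VALUES (10, 1), (11, 1), (20, 2);
""")


def fragment_count(article_id):
    """Number of fragment rows still pointing at the given article id."""
    row = conn.execute(
        "SELECT COUNT(*) FROM article_fragment WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    return row[0]


# FK enforcement ON: deleting the article cascades to its fragments.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("DELETE FROM article WHERE id = 1")
cascaded = fragment_count(1)

# FK enforcement OFF: the same delete leaves the fragment behind as an orphan.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("DELETE FROM article WHERE id = 2")
leaked = fragment_count(2)

print(cascaded, leaked)  # prints "0 1"
```

With enforcement off, the cascade action never fires, which is why the pruning script had to delete the cascade-owned children itself in this commit.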
ArchLens - No architecturally relevant changes to the existing views
…h data
When pruning an article, only delete its regenerable/computed children (cefr_assessment, classification, tokenization_cache, topic_map, url_keyword_map, difficulty_lingo_rank, broken_code_map, grammar_correction_log) plus fragments.
Deliberately stop deleting the user/research/teacher cascade children, which are worth preserving even after an article is pruned: user_activity_data (learning analytics), personal_copy, cohort_article_map, article_topic_user_feedback, user_article_broken_report.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…api into cleanup-orphaned-article-content
Strengthen referenced_article_ids() so an article is never pruned while it is pointed at by personal_copy, cohort_article_map, user_activity_data, article_topic_user_feedback, or user_article_broken_report.
These tables hold only pointers (no content of their own), so the article must survive for them to mean anything; protecting the article also guarantees none of the user/research cascade children we intentionally don't delete can become a dangling orphan. This exactly complements CASCADE_CHILDREN (the derived/computed children we do delete).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
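A minimal sketch of the protection idea described in this commit, not the project's actual implementation. It assumes each pointer table has an `article_id` column (hypothetical), uses SQLite in place of MySQL, and introduces a helper named `prunable` purely for illustration.

```python
import sqlite3

# Pointer tables named in the commit message; the article_id column name
# is an assumption made for this sketch.
POINTER_TABLES = (
    "personal_copy",
    "cohort_article_map",
    "user_activity_data",
    "article_topic_user_feedback",
    "user_article_broken_report",
)


def referenced_article_ids(conn):
    """Union of all article ids that any pointer table still references."""
    ids = set()
    for table in POINTER_TABLES:
        rows = conn.execute(
            f"SELECT DISTINCT article_id FROM {table} WHERE article_id IS NOT NULL"
        )
        ids.update(row[0] for row in rows)
    return ids


def prunable(conn, candidate_ids):
    """Hypothetical helper: drop candidates that are still referenced."""
    protected = referenced_article_ids(conn)
    return [a for a in candidate_ids if a not in protected]


conn = sqlite3.connect(":memory:")
for table in POINTER_TABLES:
    conn.execute(f"CREATE TABLE {table} (article_id INTEGER)")
conn.execute("INSERT INTO personal_copy VALUES (2)")
conn.execute("INSERT INTO user_activity_data VALUES (4), (NULL)")

result = prunable(conn, [1, 2, 3, 4])
print(result)  # prints "[1, 3]"
```

Articles 2 and 4 stay because a pointer row would otherwise dangle; only 1 and 3 remain candidates for pruning.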
Replaces the fragile FOREIGN_KEY_CHECKS=0 + manual cascade approach (which
crashed mid-run when the session setting was lost across pooled connections,
and silently cascade-deleted data we meant to keep) with FK-checks-ON deletion:
- DB CASCADE removes the derived/regenerable children automatically; the
manual CASCADE_CHILDREN list and delete_article_owned_children are gone.
- Migration 26-05-26 switches the data we keep (personal_copy,
user_activity_data, cohort_article_map, article_topic_user_feedback,
user_article_broken_report) from ON DELETE CASCADE to RESTRICT, so deleting
a referenced article is blocked instead of silently destroying that data.
- parent_article_id stays CASCADE; prune additionally protects an original
whose AI-simplification is referenced, so families stay intact and no
simplification is orphaned. Unreferenced simplifications cascade out with
their pruned original.
- referenced_article_ids() pre-filters the blocking tables (so we don't
attempt doomed deletes); if it ever drifts, delete_in_batches ABORTS LOUDLY
naming the blocking FK rather than skipping — so the gap is noticed.
Validated on a full un-anonymized prod mirror (2.47M articles): prune --apply
ran to completion with zero new dangling references across every article-child
table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
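The FK-checks-ON behaviour described above can be sketched as follows. This is an illustration under stated assumptions: SQLite stands in for MySQL, the two child tables are reduced to a single column, and `PruneAborted` and this `delete_in_batches` signature are hypothetical. SQLite's error text is generic, whereas the commit message says the real tool names the blocking FK.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY);
    -- derived data: removed automatically with its article
    CREATE TABLE article_tokenization_cache (
        article_id INTEGER REFERENCES article(id) ON DELETE CASCADE);
    -- data we keep: blocks deletion of the article it points at
    CREATE TABLE personal_copy (
        article_id INTEGER REFERENCES article(id) ON DELETE RESTRICT);
    INSERT INTO article VALUES (1), (2), (3);
    INSERT INTO article_tokenization_cache VALUES (1), (2), (3);
    INSERT INTO personal_copy VALUES (3);
""")


class PruneAborted(RuntimeError):
    """Raised when a delete is blocked, i.e. the pre-filter missed a reference."""


def delete_in_batches(conn, article_ids, batch_size=2):
    """Delete articles batch by batch; abort on the first FK violation."""
    deleted = 0
    for start in range(0, len(article_ids), batch_size):
        batch = article_ids[start:start + batch_size]
        marks = ",".join("?" * len(batch))
        conn.execute("BEGIN")
        try:
            cur = conn.execute(f"DELETE FROM article WHERE id IN ({marks})", batch)
            conn.execute("COMMIT")
        except sqlite3.IntegrityError as exc:
            conn.execute("ROLLBACK")
            # Do not skip the batch: a blocked delete means the pre-filter drifted.
            raise PruneAborted(f"batch {batch} blocked by a foreign key: {exc}") from exc
        deleted += cur.rowcount  # counts article rows only, not cascaded children
    return deleted


deleted = delete_in_batches(conn, [1, 2])
remaining_cache = conn.execute(
    "SELECT COUNT(*) FROM article_tokenization_cache").fetchone()[0]

try:
    delete_in_batches(conn, [3])
    aborted = False
except PruneAborted as exc:
    aborted = True
    print(exc)
```

Raising instead of skipping is the design choice the commit message emphasises: a skipped batch would hide the fact that the pre-filter and the schema have drifted apart.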
Pruning an article orphans its de-duplicated new_text/source/source_text (no FK from article, so the cascade can't reach them). Rather than relying on a separate sweep, prune now reclaims them per batch: it captures the text_id / source_id / source_text_id the batch points at, deletes the articles, then deletes those rows that are now referenced by nobody (same NOT EXISTS guards as cleanup_orphaned_content.py, but scoped by id -> no full-table scan).
Net: one command does the whole job. tools/cleanup_orphaned_content.py stays for the one-time historical backlog and the anonymization pipeline.
Smoke-tested: pruning one article reclaimed 16 new_text + 1 source + 1 source_text. Article-deletion path validated at 595K-article scale earlier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
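The capture, delete, reclaim sequence might look roughly like this. It is a sketch under simplifying assumptions: SQLite instead of MySQL, only new_text is modelled (source and source_text would follow the same pattern), only two referencing tables are guarded, and the schema and the `prune_batch` name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY);
    CREATE TABLE new_text (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE article_fragment (
        article_id INTEGER REFERENCES article(id) ON DELETE CASCADE,
        text_id INTEGER);
    CREATE TABLE bookmark_context (text_id INTEGER);
    INSERT INTO article VALUES (1), (2);
    INSERT INTO new_text VALUES
        (100, 'only in article 1'),
        (101, 'shared by articles 1 and 2'),
        (102, 'article 1 plus a bookmark'),
        (103, 'only in article 2');
    INSERT INTO article_fragment VALUES (1, 100), (1, 101), (1, 102), (2, 101), (2, 103);
    INSERT INTO bookmark_context VALUES (102);
""")


def prune_batch(conn, article_ids):
    """Delete a batch of articles, then reclaim text rows nobody references."""
    marks = ",".join("?" * len(article_ids))
    # 1. Capture the shared rows this batch points at, before they become unreachable.
    text_ids = [row[0] for row in conn.execute(
        f"SELECT DISTINCT text_id FROM article_fragment WHERE article_id IN ({marks})",
        article_ids)]
    conn.execute("BEGIN")
    # 2. Delete the articles; the cascade removes their fragments.
    conn.execute(f"DELETE FROM article WHERE id IN ({marks})", article_ids)
    # 3. Reclaim only captured ids that no surviving row references.
    reclaimed = 0
    if text_ids:
        id_marks = ",".join("?" * len(text_ids))
        cur = conn.execute(f"""
            DELETE FROM new_text
            WHERE id IN ({id_marks})
              AND NOT EXISTS (SELECT 1 FROM article_fragment f WHERE f.text_id = new_text.id)
              AND NOT EXISTS (SELECT 1 FROM bookmark_context b WHERE b.text_id = new_text.id)
        """, text_ids)
        reclaimed = cur.rowcount
    conn.execute("COMMIT")
    return reclaimed


reclaimed = prune_batch(conn, [1])
surviving = [row[0] for row in conn.execute("SELECT id FROM new_text ORDER BY id")]
print(reclaimed, surviving)  # prints "1 [101, 102, 103]"
```

Row 101 survives because article 2 still uses it and row 102 because a bookmark context does; restricting the delete to the captured ids is what avoids scanning the whole table.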
anonymize_users.py deleted its unreferenced articles with the same leaky
pattern prune used to (FOREIGN_KEY_CHECKS=0, no content reclaim) — which is what
created the ~15M-row orphan backlog that bloated every anon backup.
Extract the validated FK-checks-ON deletion into zeeguu/core/article_pruning.py
(referenced_article_ids, reclaim_shared_content, delete_articles_in_batches) and
use it from BOTH prune_old_articles.py and anonymize_users.py. Now:
- neither path disables FK checks or leaves orphans;
- both protect exactly the same referenced set (incl. bookmarks, simplification
parents) and abort loudly on a pre-filter gap;
- the anon DB comes out clean, so the backup is small without a separate sweep.
Smoke-tested via prune --apply (shared path): deleted + reclaimed content, zero
orphans across all article-child tables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
A production snapshot revealed that ~95–98% of every article-content table was orphaned — rows belonging to articles that had already been deleted:
This bloats both production storage (~25 GB) and the anonymized backup dump.
Root causes (both fixed here)
1. prune_old_articles.py deletes with FOREIGN_KEY_CHECKS = 0 (necessary to get past article's NO ACTION children like user_article/user_reading_session). But FK-checks-off also suppresses the ON DELETE CASCADE children, so fragments, tokenization cache, CEFR assessment, etc. were never cleaned. delete_in_batches now deletes the cascade-owned children explicitly (incl. article_fragment_context, which is nested under article_fragment).
2. new_text/source/source_text have no cascade path at all and are content-deduplicated (one row shared across articles and user data), so no per-article delete can safely remove them. New tool cleanup_orphaned_content.py reclaims them, and clears the historical backlog, deleting only rows not referenced by any surviving article fragment, bookmark_context, caption, bookmark, user_activity_data, or video. Dry-run by default; --execute to apply, --optimize to return disk to the OS.
Validation (on a real production snapshot)
cleanup_orphaned_content.py --execute deleted 56,396,735 rows with no FK errors; every table landed exactly on its computed keep-count.
Follow-up
Recommend running cleanup_orphaned_content.py in cron right after prune_old_articles.py --apply to reclaim shared content on an ongoing basis.
🤖 Generated with Claude Code