fix(migrations): make spans.task_id rollout safe for large tables #223
declan-scale merged 6 commits into main.
Conversation
Force-pushed 9d81691 to cec6afa.
mohammadatallah-scale left a comment:
What would happen if a migration adds an operation that cannot run inside a transaction without an autocommit_block, such as CREATE INDEX CONCURRENTLY?
Postgres rejects it immediately with `ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block`. So it fails loudly and fast, which is the right outcome — much better than a silent corrupt state. The author has to wrap the statement in an autocommit_block.
Net: this is one of the better failure modes in the migration safety story. The wrong shape is mechanically impossible to deploy successfully — you find out at the latest at pod startup, with no DB damage to clean up.
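This failure mode is also mechanically checkable ahead of deploy. Below is a minimal sketch of the kind of check the planned migration linter might perform; the function name and the prefix list are hypothetical, and the list is deliberately non-exhaustive:

```python
# Hypothetical linter-style check (not the repo's actual code): flag SQL
# statements that PostgreSQL refuses to run inside a transaction block and
# that therefore must be wrapped in alembic's autocommit_block().
NON_TRANSACTIONAL_PREFIXES = (
    "CREATE INDEX CONCURRENTLY",
    "DROP INDEX CONCURRENTLY",
    "VACUUM",
)

def needs_autocommit_block(sql: str) -> bool:
    """Return True if the statement must run outside a transaction."""
    # Collapse whitespace and case so formatting does not hide a match.
    normalized = " ".join(sql.upper().split())
    return normalized.startswith(NON_TRANSACTIONAL_PREFIXES)
```

A migration that executes a flagged statement without an autocommit_block will fail at pod startup exactly as described above; a check like this just moves the discovery to review time.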
The original 57c5ed4f59ae migration combined a multi-million-row UPDATE backfill, a foreign-key add (full-table validation under AccessExclusiveLock), and a non-concurrent CREATE INDEX. On a sufficiently large spans table this combination exhausts the application connection pool while concurrent writes pile up behind the lock.

Changes:

* Reduce `57c5ed4f59ae` to a fast, idempotent `ADD COLUMN IF NOT EXISTS`. Adding a nullable column with no default is metadata-only on PG >= 11 and does not block writes. Idempotency makes it a no-op on environments where the original heavier version already completed.
* Add follow-up migration `a9959ebcbe98` that finalizes the FK and index using non-blocking operations: `ADD CONSTRAINT NOT VALID` (skips the full-table scan) and `CREATE INDEX CONCURRENTLY` (does not block writes). Both are guarded with `pg_constraint` / `IF NOT EXISTS` so the migration is a no-op on environments that already finished the original migration.
* Skip the in-band backfill. The application reads tolerate NULL `task_id` by ORing on `trace_id` at query time (`SpanRepository.list`), which returns the same set of spans for task-scoped traces. Full backfill is now an operator-driven, batched runbook (separate PR).
* Set `transaction_per_migration=True` so individual migrations can use alembic's `autocommit_block()` for `CREATE INDEX CONCURRENTLY` etc.
* Apply default `lock_timeout=3s` and `statement_timeout=30s` per migration via `SET LOCAL` inside the transaction. This prevents future long-running migrations from queueing behind active writes and caps total runtime so they abort cleanly instead of blocking pod startup. `autocommit_block()` statements run outside the transaction and bypass these timeouts deliberately (they are inherently long but non-blocking).
A `None` task_id filter would expand to `(task_id IS NULL OR trace_id IS NULL)`, which on a partially-backfilled spans table where most historical rows have `task_id NULL` would return nearly every row instead of the caller's expected NULL-scoped subset.

Use `filters.get("task_id") is not None` to gate the OR fallback so a `None` value falls through to the parent repository's normal `IS NULL` handling. Adds a test that creates a row with `task_id NULL` and asserts filtering by `task_id=None` does not pick up rows whose only NULL column is task_id-via-trace_id-fallback.
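The guard above can be sketched as plain predicate logic. This is an illustrative string-building stand-in, not the repository's actual SQLAlchemy code; the function name is hypothetical:

```python
# Sketch of the None-guard from this commit. The real SpanRepository.list
# builds SQLAlchemy expressions; strings here just make the branches visible.
def build_task_id_predicate(filters: dict) -> str:
    task_id = filters.get("task_id")
    if task_id is not None:
        # Non-null filter: OR on trace_id so pre-backfill rows still match.
        return "(task_id = :task_id OR trace_id = :task_id)"
    if "task_id" in filters:
        # Explicit None falls through to the parent's normal IS NULL handling,
        # never to the OR fallback (which would match nearly every old row).
        return "task_id IS NULL"
    return ""  # task_id not filtered at all
```

The key detail is `filters.get("task_id") is not None` rather than a truthiness check, so an explicit `None` is distinguished from an absent key.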
…cape hatch

Two follow-ups to the runner-default-timeouts work:

* Add a per-migration `SET LOCAL idle_in_transaction_session_timeout = '10s'` alongside `lock_timeout` and `statement_timeout`. Without it, a stalled migration that has already acquired AccessExclusiveLock can hold the lock indefinitely until its connection drops — the other two timeouts do not catch the "open transaction making no progress" case. 10 s is short enough to recover quickly, long enough to absorb normal pauses between statements within a single migration.
* Document the escape-hatch convention (`# migration-unsafe-ack: <reason>` top-of-file directive paired with a `migration-unsafe-ack` PR label) that the planned migration linter will enforce. Until the linter ships, this comment is the canonical signal that a migration's author has consciously opted out of the runner defaults and expects review under maintenance-window assumptions.
Updates the env.py escape-hatch comment to reference the linter as actually shipped rather than as planned work.
Switches the migration runner timeout setup to mirror the egp-api-backend implementation so the two repos are aligned on outcome:

* Use a `DEFAULT_MIGRATION_TIMEOUTS` dict (lock_timeout, statement_timeout, idle_in_transaction_session_timeout) and a `_format_set_statements` helper, so the values are easy to keep in sync between repos.
* Apply session-level `SET` (no LOCAL) at connection start instead of per-transaction `SET LOCAL` via a SQLAlchemy begin event listener. Session-level values persist across per-migration transactions and across autocommit_block boundaries on the same connection — the latter matters because autocommit_block is exactly what `CREATE INDEX CONCURRENTLY` uses, and we want the timeouts to cover those statements too.
* Apply timeouts in both online and offline modes (offline emits the SET statements at the top of the generated SQL via `context.execute`).
* Update the docstring to reference `scripts/ci_tools/migration_lint.py` (the linter location aligned with sgp's layout) and the `migration-unsafe-ack` PR label as the documented escape hatch.

Adds `tests/unit/database/test_alembic_env_timeouts.py` with a focused sanity-check on the constants, the helper shape, and the both-modes wiring — same shape as the sgp test.
`exec_driver_sql` for the session SET statements autobegins a SQLAlchemy transaction. `context.configure()` then sees the connection as already in a transaction and sets `_in_external_transaction=True`, which silently disables `transaction_per_migration`: `begin_transaction()` returns `nullcontext()` and `self._transaction` is never assigned. Any migration that opens an `autocommit_block()` then trips its `assert self._transaction is not None` check (the connection IS in a txn, but alembic's handle to it is None).

Committing immediately after the SETs closes the autobegun txn so `configure()` sees a clean connection. Postgres `SET` is session-level, so the timeouts persist past the commit.

Surfaced by finalize_spans_task_id_a9959ebcbe98, the first migration in this tree to use autocommit_block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed fd14699 to acfea10.
Summary
Splits the broken `57c5ed4f59ae` migration into a safe, idempotent column add plus a non-blocking finalize migration, and adds `lock_timeout`/`statement_timeout` defaults to the migration runner. Eliminates the in-band backfill by tolerating NULL `task_id` at read time.

Why
The original `57c5ed4f59ae` combined three heavy operations on a large, write-heavy `spans` table in a single Alembic revision:

* `UPDATE` backfill inside the migration transaction (held row locks, prevented autovacuum, bloated the table).
* `ADD CONSTRAINT … FOREIGN KEY` (full-table validation under `AccessExclusiveLock`).
* `CREATE INDEX` (blocks writes during build).

On a sufficiently large table this combination exhausts the application connection pool while concurrent span writes pile up behind the lock. The fix splits the work into safe pieces.
Changes
Migration `57c5ed4f59ae` — reduced to an idempotent column add only:

`ALTER TABLE spans ADD COLUMN IF NOT EXISTS task_id VARCHAR;`

Metadata-only on PG ≥ 11; runs in milliseconds. Idempotency makes it a no-op on environments where the original (heavier) version already completed.
New migration `a9959ebcbe98` (tail after `e9c4ff9e6542`) — finalizes the FK + index using non-blocking operations:

* `ALTER TABLE … ADD CONSTRAINT … FOREIGN KEY … NOT VALID` (skips full-table scan, brief lock only).
* `CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_spans_task_id` (does not block writes).
* `pg_constraint` lookup / `IF NOT EXISTS` for idempotency.
* `op.get_context().autocommit_block()` so they run outside the migration transaction.
* `VALIDATE CONSTRAINT` is intentionally not run. The FK is enforced on all new inserts/updates and `ON DELETE SET NULL` still applies — `NOT VALID` only skips the historical scan. Acceptable tradeoff to avoid a multi-minute table scan on a large prod table.

No in-band backfill.
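For illustration, the guarded statements described above might look like the following. The constraint and index names come from this PR; the DO-block shape and the `tasks (id)` reference target are assumptions. In the real migration these would be issued via `op.execute()` inside `op.get_context().autocommit_block()`:

```python
# Illustrative sketch of the finalize migration's SQL, not the actual file.
# pg_constraint guard makes the FK add a no-op where it already exists.
ADD_FK_NOT_VALID = """
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM pg_constraint WHERE conname = 'fk_spans_task_id_tasks'
    ) THEN
        ALTER TABLE spans
            ADD CONSTRAINT fk_spans_task_id_tasks
            FOREIGN KEY (task_id) REFERENCES tasks (id)
            ON DELETE SET NULL
            NOT VALID;
    END IF;
END $$;
"""

# IF NOT EXISTS makes the concurrent build idempotent across environments.
CREATE_INDEX = (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_spans_task_id "
    "ON spans (task_id)"
)
```

`NOT VALID` means only new writes are checked against the FK; no `VALIDATE CONSTRAINT` pass over historical rows is performed, matching the tradeoff stated above.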
`SpanRepository.list` now ORs on `trace_id` when filtering by `task_id`, returning historical task-scoped spans correctly without populating `task_id` on every old row. Both columns are indexed. The full backfill (cleanup, optional) is a separate operator-driven runbook (companion PR).

`env.py`:

* `transaction_per_migration=True` so individual migrations can use `autocommit_block()`.
* `SET LOCAL lock_timeout = '3s'` and `SET LOCAL statement_timeout = '30s'` applied per migration via a SQLAlchemy `begin` listener. Future long-running migrations now abort cleanly instead of queueing behind active writes. `autocommit_block()` ops bypass these (no transaction = no listener trigger), which is the right behavior for `CREATE INDEX CONCURRENTLY` etc.

Behavior matrix
| Starting state | Result |
| --- | --- |
| `alembic_version` at `4a9b7787ccd7` | `57c5ed4f59ae` adds the column fast, intermediate migrations run as before, `a9959ebcbe98` adds NOT VALID FK + concurrent index |
| `alembic_version` past `57c5ed4f59ae` (FK + index already in place) | `57c5ed4f59ae` does not re-run; `a9959ebcbe98` runs but the `pg_constraint` and `IF NOT EXISTS` guards make every operation a no-op |

Follow-ups (separate PRs)
* `trace_id` fallback in `SpanRepository.list`.

Test plan
* `make test FILE=tests/unit/repositories/test_span_repository.py` passes (3 new tests cover the OR fallback)
* `make dev`: pod starts, migrations apply, `\d spans` shows column + FK + index
* `alembic upgrade head` against a prod-shaped DB (20M+ rows on a perf clone) to confirm the new migrations don't block writes

Greptile Summary
* Splits the `spans` migration into a fast column-add revision (`57c5ed4f59ae`, idempotent `ADD COLUMN IF NOT EXISTS`) and a new finalize migration (`a9959ebcbe98`) that adds the FK with `NOT VALID` and builds the index `CONCURRENTLY` inside `autocommit_block()`, avoiding write-blocking operations on large tables.
* `SpanRepository.list` now ORs `task_id` and `trace_id` at read time for non-null filters, correctly tolerating pre-backfill rows without a mass in-migration `UPDATE`. The `None` guard is present and tested.
* `env.py` applies session-level `lock_timeout`/`statement_timeout` defaults; the session-level `statement_timeout = '30s'` persists into `autocommit_block()` and will abort `CREATE INDEX CONCURRENTLY` on a large prod table — the PR description's claim that `autocommit_block()` bypasses it is incorrect.

Confidence Score: 3/5
Not safe to merge until the statement_timeout vs CREATE INDEX CONCURRENTLY conflict is resolved — the finalize migration will reliably fail on large prod tables.
One P1 remains: the session-level statement_timeout=30s set in env.py persists across autocommit_block() boundaries and will cancel CREATE INDEX CONCURRENTLY after 30s on a 20M+ row table. The env.py comment itself confirms the session-level persistence. Two P2s (offline mode missing transaction_per_migration, lock wait vs lock_timeout race) are low probability but worth fixing. The span repository fix and migration splitting are correct.
agentex/database/migrations/alembic/env.py — the timeout application strategy needs to distinguish between transactional statements and long-running concurrent operations.
Important Files Changed
* `env.py`: `transaction_per_migration=True`; session-level `statement_timeout=30s` will abort `CREATE INDEX CONCURRENTLY` in the finalize migration, and offline mode is missing `transaction_per_migration=True`.
* Migration `57c5ed4f59ae`: `ALTER TABLE spans ADD COLUMN IF NOT EXISTS task_id VARCHAR;` removes the original heavy backfill, FK, and non-concurrent index creation.
* Migration `a9959ebcbe98`: `NOT VALID` and `CREATE INDEX CONCURRENTLY` inside `autocommit_block()`; both are idempotent, but the session-level `statement_timeout=30s` from `env.py` will abort the index build on large tables.
* `span_repository.py`: tolerates `trace_id`-as-`task_id` rows; correctly guards against `None` `task_id` to prevent an accidental full-table scan, and delegates remaining filters to the base class.
* Tests cover the new `list()` override.
```mermaid
sequenceDiagram
    participant Pod as Migration Pod
    participant PG as PostgreSQL
    Note over Pod,PG: env.py run_migrations_online()
    Pod->>PG: SET lock_timeout = '3s' (session-level)
    Pod->>PG: SET statement_timeout = '30s' (session-level)
    Pod->>PG: SET idle_in_transaction_session_timeout = '10s' (session-level)
    Pod->>PG: COMMIT (persists session GUCs)
    Note over Pod,PG: Migration 57c5ed4f59ae
    Pod->>PG: BEGIN
    Pod->>PG: ALTER TABLE spans ADD COLUMN IF NOT EXISTS task_id VARCHAR
    Pod->>PG: COMMIT (fast, metadata-only on PG>=11)
    Note over Pod,PG: Migration a9959ebcbe98 (autocommit_block)
    Pod->>PG: autocommit=True on connection
    Pod->>PG: DO block - ADD CONSTRAINT fk_spans_task_id_tasks NOT VALID
    Pod->>PG: CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_spans_task_id ON spans(task_id)
    Note over Pod,PG: statement_timeout=30s still active - CIC on 20M+ rows exceeds 30s and is aborted
    Note over Pod,PG: SpanRepository.list(filters={task_id: id})
    Pod->>PG: SELECT ... WHERE (task_id = :v OR trace_id = :v) AND remaining filters
    PG-->>Pod: rows (new-style + historical pre-backfill)
```

Comments Outside Diff (3)
agentex/src/domain/repositories/span_repository.py, line 48 — `task_id` filter returns massive unintended result set

When a caller passes `filters={"task_id": None}`, SQLAlchemy translates `SpanORM.task_id == None` → `task_id IS NULL` and `SpanORM.trace_id == None` → `trace_id IS NULL`. The generated predicate becomes `WHERE (task_id IS NULL OR trace_id IS NULL)`. On the 24M-row spans table where virtually all pre-backfill rows have `task_id NULL`, this silently returns an enormous result set instead of the caller's expected empty/null-scoped result. The `test_list_by_task_id_falls_back_to_trace_id` test only exercises a non-null UUID, so the None path is untested.
agentex/database/migrations/alembic/env.py, lines 34-38 — `statement_timeout` is session-level and will abort `CREATE INDEX CONCURRENTLY` on large tables

`SET statement_timeout = '30s'` is applied once as a session-level GUC via `connection.exec_driver_sql`. Unlike `SET LOCAL`, this persists across all subsequent statements on that connection — including those executed inside `autocommit_block()`. `CREATE INDEX CONCURRENTLY` is a single statement in PostgreSQL's view and the timer runs for its entire duration; on a 20M+ row table it will reliably exceed 30 s and be cancelled.

The PR description states that `autocommit_block()` ops "bypass" the timeout because "no transaction = no listener trigger", but no listener is used here — the settings are applied directly at session scope — so they are not bypassed. The finalize migration is specifically designed to handle the large-table case, yet the timeout introduced in this same PR will kill it.
agentex/database/migrations/alembic/env.py, lines 150-152 — `statement_timeout` aborts `CREATE INDEX CONCURRENTLY` on large tables

`connection.exec_driver_sql("SET statement_timeout = '30s'")` sets a session-level GUC. Unlike `SET LOCAL`, session-level GUCs persist across transaction boundaries — including the `autocommit_block()` entered by the finalize migration. The `env.py` comment itself confirms this: "These are session-level so they persist across… autocommit_block boundaries on the same connection." The PR description's claim that `autocommit_block()` "bypasses" the timeout because "no transaction = no listener trigger" is incorrect — there is no listener; the GUC is set directly on the connection. `CREATE INDEX CONCURRENTLY` on a 20M+ row `spans` table will run well past 30 s, and PostgreSQL will abort it with `ERROR: canceling statement due to statement timeout`, leaving the index in an invalid state. The `IF NOT EXISTS` guard won't prevent a re-run from failing the same way.

Fix: apply `statement_timeout` via `SET LOCAL` inside each transactional migration instead of at session level, or add `SET statement_timeout = 0` before entering `autocommit_block()` inside the finalize migration.
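One way the second suggested fix might look, sketched as an ordered statement list (the helper name is hypothetical); the real migration would issue these via `op.execute()` around and inside `autocommit_block()`:

```python
# Sketch of the reviewer's suggested fix: zero out the session-level
# statement_timeout so the 30s runner default cannot cancel the concurrent
# index build, then restore the default afterwards. Helper name is made up.
def autocommit_statements(default_timeout: str = "30s") -> list:
    return [
        "SET statement_timeout = 0",  # disable the cap for the long build
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_spans_task_id "
        "ON spans (task_id)",
        # Restore the runner default so later statements stay capped.
        f"SET statement_timeout = '{default_timeout}'",
    ]
```

Ordering is the point: the reset must precede the `CREATE INDEX CONCURRENTLY`, because a session-level GUC applies to every subsequent statement on the connection.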
Reviews (8). Last reviewed commit: "fix(migrations): commit autobegun txn be..."