Skip to content

ci(migrations): add migration safety linter#225

Merged
declan-scale merged 7 commits intomainfrom
declan-scale/sgp-5785-migration-linter
May 6, 2026
Merged

ci(migrations): add migration safety linter#225
declan-scale merged 7 commits intomainfrom
declan-scale/sgp-5785-migration-linter

Conversation

@declan-scale
Copy link
Copy Markdown
Collaborator

@declan-scale declan-scale commented May 6, 2026

Summary

Adds a CI linter that fails PRs introducing the four migration anti-patterns the runtime timeouts cannot reliably catch on a quiet hour: blocking index builds, validating FK adds, in-band UPDATE/DELETE backfills, and SET/RESET overrides of the runner's default timeouts.

Closes the third sub-issue under the migration-safety umbrella, the linter-on-PR layer that complements the runtime defaults (already shipped) and the CLAUDE.md docs (companion PR).

What's caught

Anti-pattern Rule Why
op.create_index without postgresql_concurrently=True (or raw CREATE INDEX outside autocommit_block) no-concurrently Plain CREATE INDEX takes ShareLock for the entire build, blocking writes.
op.create_index(postgresql_concurrently=True) outside autocommit_block concurrently-outside-autocommit CONCURRENTLY cannot run inside a transaction; would fail at runtime.
op.create_foreign_key without postgresql_not_valid=True (or raw ADD CONSTRAINT FOREIGN KEY without NOT VALID) fk-without-not-valid Validating FK creation full-scans the table under ShareRowExclusiveLock.
op.execute("UPDATE ...") / DELETE data backfills in-band-backfill Hold row locks for the migration transaction; prevent autovacuum; should be operator-driven runbooks.
SET / RESET of lock_timeout / statement_timeout / idle_in_transaction_session_timeout forbidden-set The runner sets these defaults intentionally; quietly overriding them defeats the purpose.

Why custom AST instead of squawk

squawk operates on rendered SQL, which loses the surrounding Alembic context — autocommit_block wrapping and kwargs like postgresql_concurrently / postgresql_not_valid are exactly the signals needed to distinguish safe and unsafe shapes here. Custom AST inspection of the migration source is tighter, has zero external dependencies, runs in milliseconds, and produces messages that point at the correct fix.

The acceptance criteria in the linter ticket explicitly says "squawk or equivalent" — this delivers the equivalent rule set against the right surface.

Escape hatch

Two signals required, both deliberate:

  1. # migration-unsafe-ack: <one-line reason> directive at the top of the migration file. The directive must include a non-empty reason on the same line — bare # migration-unsafe-ack: does not suppress.
  2. migration-unsafe-ack label on the PR. When present, the workflow runs the linter in --warn-only mode so violations surface in CI logs without blocking merge.

Use the escape hatch for "this needs a maintenance window with traffic shifted away," not "I want to ship faster."

Scope

  • agentex/scripts/lint_migrations.py — the linter (Python 3.12 stdlib only; no external deps).
  • .github/workflows/migration-lint.yml — workflow triggered on PRs touching agentex/database/migrations/alembic/versions/**.
  • agentex/tests/unit/scripts/test_lint_migrations.py — 12 tests covering each rule, the autocommit_block detection, and the ack-directive escape hatch (including that bare directives don't suppress).

The CI step lints only files changed in the PR vs origin/<base>. Existing migrations are not retro-flagged. New migrations going forward must follow the rules.

Test plan

  • All 12 linter unit tests pass locally
  • Ran the linter against a synthetic bad migration with all four anti-patterns and confirmed each rule fires at the right line number
  • Confirmed the rewritten migrations from the companion code PR pass the linter
  • Confirmed the legacy migration corpus is not flagged by --base-ref mode (only NEW changes vs base ref are checked)
  • First post-merge PR touching a migration exercises the workflow end-to-end (label-based escape hatch tested in a follow-up if needed)

Greptile Summary

  • Adds a CI migration safety linter (migration_lint.py) and matching GitHub Actions workflow that blocks PRs introducing blocking index builds, validating FK adds, in-band backfills, or timeout overrides; previous-review gaps (AST-based autocommit_block span detection, per-occurrence concurrently checks, python-comment guards, raw-SQL op.execute paths, fresh-table exclusions) are all addressed in this revision with corresponding tests.
  • The escape hatch has been simplified to a single MIGRATION_UNSAFE_ACK env var (set by the workflow when the migration-unsafe-ack PR label is present) plus per-line # noqa: migration-lint; the previous file-level directive that could bypass CI without the label is gone.
  • uv.lock adds pyyaml to requires-dist (implying an undiffed pyproject.toml change) and downgrades aiohttp from 3.13.5 to 3.13.4; both appear unrelated to the linter and should be verified before merge.

Confidence Score: 5/5

Safe to merge once the uv.lock noise is confirmed or cleaned up — no logic defects remain in the linter itself.

All previously identified P1 issues are addressed in this revision. The only remaining finding is a P2 about unexplained lock file changes that do not affect linter correctness. All six rules have comprehensive tests, the AST-based autocommit_block detection is sound, and the CI workflow is correctly scoped with minimal permissions.

uv.lock — contains an unexplained pyyaml dependency addition and aiohttp downgrade that should be confirmed or reverted.

Important Files Changed

Filename Overview
agentex/scripts/ci_tools/migration_lint.py Core linter — 791-line new file implementing all six rules with AST-based autocommit_block span detection, fresh-table exclusions, and per-line noqa suppression; previous-review issues are all addressed in this revision.
.github/workflows/migration-lint.yml New CI workflow triggered on PR paths, minimal read-only permissions, label-based escape hatch via MIGRATION_UNSAFE_ACK env var, full fetch-depth for reliable merge-base diff.
agentex/tests/unit/scripts/test_migration_lint.py Comprehensive 812-line test suite covering all rules, fresh-table exclusions, AST-based autocommit_block detection (including the PEP 617 parenthesized form and column-0 multiline SQL), commented-out call guards, and CLI escape hatch.
uv.lock Adds pyyaml as a required project dependency in requires-dist (implying a pyproject.toml change not visible in the diff) and downgrades aiohttp from 3.13.5 to 3.13.4 — both changes appear unrelated to the migration linter.

Comments Outside Diff (4)

  1. agentex/scripts/lint_migrations.py, line 231-243 (link)

    P1 op.execute(text(...)) silently bypasses all SQL checks

    _string_arg only handles bare string literals (ast.Constant) and f-strings (ast.JoinedStr). When the first argument is a text(...) call — op.execute(text("UPDATE spans SET ...")) — the function returns None and every downstream check (in-band-backfill, forbidden-set, raw CREATE INDEX, raw ADD CONSTRAINT FOREIGN KEY) is silently skipped. This is the SQLAlchemy 2.0-recommended calling convention for raw SQL in Alembic, so it's the pattern most likely to appear in new migrations going forward, and is not covered by any of the 12 tests.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/scripts/lint_migrations.py
    Line: 231-243
    
    Comment:
    **`op.execute(text(...))` silently bypasses all SQL checks**
    
    `_string_arg` only handles bare string literals (`ast.Constant`) and f-strings (`ast.JoinedStr`). When the first argument is a `text(...)` call — `op.execute(text("UPDATE spans SET ..."))` — the function returns `None` and every downstream check (`in-band-backfill`, `forbidden-set`, raw `CREATE INDEX`, raw `ADD CONSTRAINT FOREIGN KEY`) is silently skipped. This is the SQLAlchemy 2.0-recommended calling convention for raw SQL in Alembic, so it's the pattern most likely to appear in new migrations going forward, and is not covered by any of the 12 tests.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

  2. agentex/scripts/lint_migrations.py, line 200-209 (link)

    P2 File directive alone suppresses violations; PR label is not required

    The docstring (line 22–24) and the PR description both state "Two signals required, both deliberate: (1) file directive AND (2) PR label." In practice, when has_ack is True, _maybe_violation returns None for every check, check_file returns an empty list, and the process exits 0 — entirely independently of whether --warn-only was passed. A developer can merge a file containing # migration-unsafe-ack: reason without the PR ever receiving the label, bypassing the intended two-person visibility mechanism.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/scripts/lint_migrations.py
    Line: 200-209
    
    Comment:
    **File directive alone suppresses violations; PR label is not required**
    
    The docstring (line 22–24) and the PR description both state "Two signals required, both deliberate: (1) file directive AND (2) PR label." In practice, when `has_ack` is `True`, `_maybe_violation` returns `None` for every check, `check_file` returns an empty list, and the process exits 0 — entirely independently of whether `--warn-only` was passed. A developer can merge a file containing `# migration-unsafe-ack: reason` without the PR ever receiving the label, bypassing the intended two-person visibility mechanism.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

  3. agentex/tests/unit/scripts/test_lint_migrations.py, line 25-34 (link)

    P2 os.chdir in _check_relative is not thread-safe under parallel pytest

    os.chdir changes the process-wide working directory. If tests run with pytest-xdist or any -n parallelism, concurrent invocations of _check_relative will race on the CWD, causing one test to open a relative path while another test has already changed the directory, producing spurious FileNotFoundError failures. Prefer constructing an absolute path via path.resolve() before calling check_file, or passing the absolute path directly, to eliminate the CWD dependency.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/tests/unit/scripts/test_lint_migrations.py
    Line: 25-34
    
    Comment:
    **`os.chdir` in `_check_relative` is not thread-safe under parallel pytest**
    
    `os.chdir` changes the process-wide working directory. If tests run with `pytest-xdist` or any `-n` parallelism, concurrent invocations of `_check_relative` will race on the CWD, causing one test to open a relative path while another test has already changed the directory, producing spurious `FileNotFoundError` failures. Prefer constructing an absolute path via `path.resolve()` before calling `check_file`, or passing the absolute path directly, to eliminate the CWD dependency.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

  4. agentex/scripts/ci_tools/migration_lint.py, line 284-301 (link)

    P1 _check_transaction_nesting silently clears all findings when autocommit_block appears anywhere in the file

    The guard at line 288 short-circuits with no findings as soon as autocommit_block( appears anywhere in the source. A migration with two concurrent index creates — one correctly wrapped in autocommit_block and one accidentally outside — passes the linter entirely. At runtime the unwrapped index fails with ActiveSQLError: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. The rule needs to track which specific postgresql_concurrently=True occurrences fall inside the block, not just whether the block exists somewhere in the file.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/scripts/ci_tools/migration_lint.py
    Line: 284-301
    
    Comment:
    **`_check_transaction_nesting` silently clears all findings when `autocommit_block` appears anywhere in the file**
    
    The guard at line 288 short-circuits with no findings as soon as `autocommit_block(` appears *anywhere* in the source. A migration with two concurrent index creates — one correctly wrapped in `autocommit_block` and one accidentally outside — passes the linter entirely. At runtime the unwrapped index fails with `ActiveSQLError: CREATE INDEX CONCURRENTLY cannot run inside a transaction block`. The rule needs to track which specific `postgresql_concurrently=True` occurrences fall inside the block, not just whether the block exists somewhere in the file.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
uv.lock:138
**Unrelated `pyyaml` dependency and `aiohttp` downgrade in lock file**

`pyyaml` appears in `requires-dist` (the section derived from `pyproject.toml`), which means `pyproject.toml` was edited on this branch to add the dependency — but that file is not in the diff. The linter is documented as stdlib-only with no external deps, so `pyyaml` is not needed for this PR. Separately, `aiohttp` is downgraded from `3.13.5` to `3.13.4` with no stated motivation. Both changes appear to be stale branch noise; please confirm they are intentional or rebase/regenerate the lock file against main.

Reviews (9): Last reviewed commit: "fix(migrations): comment-guard every ful..." | Re-trigger Greptile

@declan-scale declan-scale requested a review from a team as a code owner May 6, 2026 16:10
@declan-scale declan-scale force-pushed the declan-scale/sgp-5785-migration-linter branch from 083c735 to 7dd2b66 Compare May 6, 2026 16:23
@socket-security
Copy link
Copy Markdown

socket-security Bot commented May 6, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedpypi/​aiohttp@​3.13.5 ⏵ 3.13.497100100100100

View full report

declan-scale added a commit that referenced this pull request May 6, 2026
Address greptile review on PR #225:

- transaction-nesting now scopes per call site (indent-based span detection)
  rather than short-circuiting on a file-level autocommit_block presence
  check. A migration that wraps one CREATE INDEX CONCURRENTLY in
  autocommit_block() but leaves a sibling call outside no longer slips
  past the linter.
- in-band-backfill rule restored: flags op.execute() calls whose SQL
  contains UPDATE / DELETE FROM. The PR description claimed this rule
  existed but the implementation didn't; now it does, and it's added to
  _RULES.
- Drop unused ENV_PY constant.

Adds three unit tests: mixed concurrent-index scenario, UPDATE backfill,
and DELETE FROM backfill. SELECT remains unflagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread agentex/scripts/ci_tools/migration_lint.py
declan-scale added a commit that referenced this pull request May 6, 2026
Address greptile review on PR #225:

- transaction-nesting now scopes per call site (indent-based span detection)
  rather than short-circuiting on a file-level autocommit_block presence
  check. A migration that wraps one CREATE INDEX CONCURRENTLY in
  autocommit_block() but leaves a sibling call outside no longer slips
  past the linter.
- in-band-backfill rule restored: flags op.execute() calls whose SQL
  contains UPDATE / DELETE FROM. The PR description claimed this rule
  existed but the implementation didn't; now it does, and it's added to
  _RULES.
- Drop unused ENV_PY constant.

Adds three unit tests: mixed concurrent-index scenario, UPDATE backfill,
and DELETE FROM backfill. SELECT remains unflagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale force-pushed the declan-scale/sgp-5785-migration-linter branch from 982ee9f to 6b2b138 Compare May 6, 2026 19:19
Comment thread agentex/scripts/ci_tools/migration_lint.py
Comment thread agentex/scripts/ci_tools/migration_lint.py
declan-scale and others added 4 commits May 6, 2026 15:42
Adds a custom Python AST-based linter that fails CI on PRs introducing
the four anti-patterns the migration runtime defaults cannot reliably
catch on a quiet hour:

* op.create_index (and raw CREATE INDEX) without CONCURRENTLY,
  or with CONCURRENTLY but outside an autocommit_block().
* op.create_foreign_key (and raw ADD CONSTRAINT FOREIGN KEY)
  without NOT VALID.
* op.execute("UPDATE ...") / DELETE in-band data backfills.
* SET / RESET of lock_timeout, statement_timeout, or
  idle_in_transaction_session_timeout inside a migration file.

Each rule maps to a documented anti-pattern in CLAUDE.md.

Why custom AST instead of squawk: squawk operates on rendered SQL,
which loses the surrounding Alembic context (autocommit_block wrapping,
kwarg options on op.create_index/op.create_foreign_key) — exactly the
signals needed to distinguish safe and unsafe shapes here. AST
inspection of the migration source is tighter, has zero external deps,
runs in milliseconds, and lets each rule produce a message that points
at the correct fix.

Escape hatch: a `# migration-unsafe-ack: <one-line reason>` directive
at the top of the migration file suppresses all rules for that file.
The GitHub Actions workflow additionally checks for a
`migration-unsafe-ack` PR label; when present it runs the linter in
--warn-only mode so violations surface in logs without blocking merge.
Both signals are required to deploy a migration that intentionally
overrides the runner defaults.

The CI step lints only files changed in the PR (via
`--base-ref origin/<base>`), so existing migrations are not
retro-flagged.

Includes 12 unit tests covering each rule, the ack-directive escape
hatch (including that bare directives without a reason do not
suppress), and the autocommit_block detection.
Address greptile review on PR #225:

- transaction-nesting now scopes per call site (indent-based span detection)
  rather than short-circuiting on a file-level autocommit_block presence
  check. A migration that wraps one CREATE INDEX CONCURRENTLY in
  autocommit_block() but leaves a sibling call outside no longer slips
  past the linter.
- in-band-backfill rule restored: flags op.execute() calls whose SQL
  contains UPDATE / DELETE FROM. The PR description claimed this rule
  existed but the implementation didn't; now it does, and it's added to
  _RULES.
- Drop unused ENV_PY constant.

Adds three unit tests: mixed concurrent-index scenario, UPDATE backfill,
and DELETE FROM backfill. SELECT remains unflagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- prefer-robust-stmts now also inspects raw SQL inside op.execute(...):
  flags op.execute("CREATE INDEX ...") without CONCURRENTLY and
  op.execute("ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY ...") without
  NOT VALID. Closes the documented escape hatch where copying SQL from a
  DBA runbook into op.execute bypassed every helper-level check (P1).
- in-band-backfill no longer false-positives on
  INSERT ... ON CONFLICT DO UPDATE SET upserts (legitimate schema-init
  shape). Tightened _DATA_BACKFILL with a (?!SET\b) negative lookahead (P2).
- no-timeout-overrides skips matches that fall after a # on the same line
  so a Python comment quoting an old SET command is no longer flagged (P2).
- Module docstring's Rules section now lists in-band-backfill alongside
  the other five (developer searching the docstring for a CI failure
  rule will find it).

Adds five unit tests: raw CREATE INDEX in op.execute, raw CREATE INDEX
CONCURRENTLY passes, raw ADD CONSTRAINT FK in op.execute, raw FK with
NOT VALID passes, upsert clause not flagged, and SET in Python comment
not flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address two P1 Greptile findings on PR #225 and bring the agentex linter
into parity with the egp-api-backend version (scaleapi#142283):

- Issue 1: `op.execute("CREATE INDEX CONCURRENTLY ...")` outside an
  autocommit_block was silently accepted (the raw-SQL prefer-robust-stmts
  check excludes CONCURRENTLY because it's the safe form). Add a second
  pass in `_check_transaction_nesting` that catches this shape so it
  fails lint instead of failing at runtime with "CREATE INDEX
  CONCURRENTLY cannot run inside a transaction block".

- Issue 2: indentation-based `_autocommit_spans` truncated the block on
  any non-blank line at column 0, so a triple-quoted SQL string starting
  at the margin would push trailing `postgresql_concurrently=True` out
  of the span and produce a false `transaction-nesting` finding on
  correctly-written migrations. Switch to AST-based span detection,
  which also handles the PEP 617 parenthesized `with` form for free.

Parity with sgp linter:
- `SET statement_timeout` is now exempted inside autocommit_block (the
  documented escape valve for long CIC builds), with `RESET` still
  flagged because RESET reverts to the role/server default rather than
  restoring the runner's 30s ceiling.
- `_changed_migrations` resolves paths via `git rev-parse --show-toplevel`
  instead of `REPO_ROOT.parent` so it keeps working if the package is
  ever moved within the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale force-pushed the declan-scale/sgp-5785-migration-linter branch from 6b2b138 to 3308632 Compare May 6, 2026 19:45
Comment thread agentex/scripts/ci_tools/migration_lint.py
Comment thread agentex/scripts/ci_tools/migration_lint.py
declan-scale and others added 3 commits May 6, 2026 16:58
- Raw-SQL `op.execute("CREATE INDEX ...")` /
  `op.execute("ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY ...")`
  on a freshly-created table now skips `prefer-robust-stmts`, matching
  the helper-call paths. Pattern shows up when DBA-runbook SQL gets
  pasted directly into a migration that also creates the target table.

- `_check_in_band_backfill` and the raw-SQL blocks in
  `_check_prefer_robust_stmts` and `_check_transaction_nesting` now
  call `_is_in_python_comment` before flagging, the same guard the
  `no-timeout-overrides` rule already uses. A commented-out
  `# op.execute("UPDATE ...")` no longer produces a finding on a line
  that isn't executed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing-required-field

Two more false-positive gaps in the migration linter:

- Kwarg detection used literal substring matches
  (``"postgresql_concurrently=True" in call``) that silently failed when
  the migration spaced the kwarg
  (``postgresql_concurrently = True``). The lint then flagged a valid
  CIC inside an autocommit_block. Replace with module-level regexes
  (``_KWARG_CONCURRENTLY_TRUE``, ``_KWARG_NOT_VALID_TRUE``,
  ``_KWARG_NULLABLE_FALSE``, ``_KWARG_SERVER_DEFAULT``) that allow
  surrounding whitespace, mirroring the existing
  ``_check_transaction_nesting`` regex (which is now refactored to use
  the same constant).

- ``_check_adding_required_field`` was the only structural rule that
  did not skip operations on freshly-created tables, so an
  ``op.create_table("foo", ...)`` immediately followed by
  ``op.add_column("foo", sa.Column(..., nullable=False,
  server_default=...))`` would be flagged on a safe migration. Mirror
  the ``_tables_created`` exclusion already used by every other rule;
  for ``op.add_column`` the *first* quoted name is the target table
  (the second is the column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ackfills

Reviewer caught a missing `_is_in_python_comment` guard on the
`_KWARG_CONCURRENTLY_TRUE` sweep in `_check_transaction_nesting`. An
audit of every other full-source regex scan turned up the same gap on
five more spots:

- `_tables_created` (`_OP_CREATE_TABLE.finditer`) — false NEGATIVE risk:
  a commented-out `# op.create_table("foo", ...)` would have polluted
  fresh_tables and silently masked findings on the real `foo` table.
- `_check_prefer_robust_stmts` (`_OP_CREATE_INDEX`, `_OP_CREATE_FK`),
  `_check_disallowed_unique_constraint` (`_OP_CREATE_UNIQUE`),
  `_check_adding_required_field` (`_OP_ADD_COLUMN`),
  `_check_transaction_nesting` (`_KWARG_CONCURRENTLY_TRUE`) — all false
  POSITIVES on commented-out call sites.

OP_EXECUTE-based loops and the SET/RESET timeout loops already had the
guard.

Also adds the fresh-tables exclusion to `_check_in_band_backfill` for
parity with every other rule. New `_RAW_BACKFILL_TARGET` regex extracts
the target from `UPDATE tbl …` / `DELETE FROM tbl …` so seed-data and
staging-cleanup patterns on tables created in the same migration no
longer require a `# noqa`.

Tests: extended the existing commented-out test to cover all six
shapes plus the kwarg sweep, and added an in-band-backfill-on-fresh-
table positive case. 44 -> 46.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale merged commit d9d9543 into main May 6, 2026
32 checks passed
@declan-scale declan-scale deleted the declan-scale/sgp-5785-migration-linter branch May 6, 2026 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants