
Add hypothesmith-based fuzz testing for Python front-end#984

Open
tautschnig wants to merge 19 commits into main2 from tautschnig/hypothesmith

Conversation

@tautschnig
Contributor

Adds fuzz testing infrastructure for Strata's Python front-end, inspired by CBMC's use of CSmith and by the hypothesmith project for random Python program generation.

Test modes

Syntax mode — Uses hypothesmith to generate syntactically valid Python from the grammar, then runs each program through the parse → Ion → round-trip pipeline. Tests that the parser doesn't crash or lose information on arbitrary valid Python.

Semantic mode — Uses a custom generator to produce typed Python programs with assertions whose expected values are computed by CPython at generation time (the CSmith checksum analogy). Each program is validated with CPython first, then verified with pyAnalyzeLaurel. A verification failure on a CPython-passing program indicates a semantic modelling bug in the Python → Laurel → Core pipeline.
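The CSmith-style checksum idea can be sketched in a few lines, assuming a hypothetical gen_semantic_program helper (the real generator lives in Tools/Python/scripts/gen_random_python.py):

```python
import random

def gen_semantic_program(seed: int) -> str:
    # Hypothetical sketch of the semantic-mode idea: build a random
    # integer expression, let CPython compute its value at generation
    # time, and emit a program that asserts that value.
    rng = random.Random(seed)
    a, b = rng.randint(-10, 10), rng.randint(1, 10)
    op = rng.choice(["+", "-", "*", "//", "%"])
    expr = f"({a}) {op} ({b})"
    expected = eval(expr)  # the CSmith-style reference "checksum"
    return f"x: int = {expr}\nassert x == {expected}\n"

# By construction, every generated program passes under CPython; a
# pyAnalyzeLaurel failure on such a program therefore points at a
# semantic modelling bug rather than a generator bug.
exec(compile(gen_semantic_program(42), "<fuzz>", "exec"))
```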

--unrestricted option

By default, only constructs known to work in Strata are generated. With --unrestricted, ~60 additional generators are included, covering classes, generators, comprehensions, try/except, match, async, decorators, the walrus operator, *args/**kwargs, and more. Programs hitting unsupported constructs are reported as SKIP, not as failures.

CI integration

A new workflow (.github/workflows/python-fuzz.yml) runs on every PR:

Pick random seed
        │
        ▼
Run 10 fuzz tests (both modes, --unrestricted)
        │
    ┌───┴───┐
  PASS    FAIL
    │       │
    ▼       ▼
  Done   Checkout merge base, rebuild, re-run same seed
            │
        ┌───┴───────┐
    Base FAILS    Base PASSES
        │              │
        ▼              ▼
  Create issue    CI FAILS
  (pre-existing   (regression
   bug, labeled    from this PR)
   'fuzz-bug')
  • Pre-existing bugs create a GitHub issue but don't block the PR
  • Regressions fail CI with reproduction instructions
  • The seed is always reported for deterministic reproduction
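The decision flow above amounts to a small classification function. A minimal sketch (hypothetical names; the real logic is shell and JavaScript inside python-fuzz.yml):

```python
def classify(pr_passes: bool, base_builds: bool, base_passes: bool) -> str:
    """Classify a fuzz failure by re-running the same seed on the merge base."""
    if pr_passes:
        return "done"
    if not base_builds:
        return "ci-fails (inconclusive)"
    if base_passes:
        return "ci-fails (regression from this PR)"
    return "create issue 'fuzz-bug' (pre-existing, PR not blocked)"
```

Only the regression and inconclusive outcomes block the PR; pre-existing bugs are routed to a GitHub issue instead.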

Local usage

# Set up (one-time)
cd Tools/Python
python3 -m venv .venv
.venv/bin/pip install -e . hypothesmith

# Run 10 tests with both modes
./scripts/hypothesmith.sh 10

# Reproduce a specific failure
./scripts/hypothesmith.sh 10 1745165432 both --unrestricted

# Only semantic mode, restricted to known-working constructs
./scripts/hypothesmith.sh 20 42 semantic

Files

File Purpose
Tools/Python/scripts/hypothesmith.sh Main driver script (analogous to CBMC's csmith.sh)
Tools/Python/scripts/gen_random_python.py Program generator with syntax and semantic modes
Tools/Python/scripts/gen_unrestricted.py ~60 extended generators for --unrestricted mode
.github/workflows/python-fuzz.yml CI workflow for PR-triggered fuzz testing

Testing

Tested locally with multiple seeds and program counts. All generated programs pass CPython validation (500+ tested). The restricted semantic mode achieves a 100% pass rate through pyAnalyzeLaurel. The unrestricted mode correctly reports unsupported constructs as SKIP.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

tautschnig and others added 4 commits April 20, 2026 19:57
Two test modes:

1. Syntax mode: Uses hypothesmith to generate syntactically valid Python
   programs, then runs them through the parse -> Ion -> round-trip pipeline.
   Tests that the parser doesn't crash on arbitrary valid Python.

2. Semantic mode: Custom generator produces typed programs with int
   arithmetic, comparisons, boolean logic, string concatenation, if/else,
   floor division, modulo, and negation — all with assertions whose
   expected values are computed by CPython. Programs are first validated
   with CPython, then run through pyAnalyzeLaurel. A verification failure
   on a CPython-passing program indicates a semantic modelling bug.

Size control:
- --max-nodes filters syntax-mode programs by AST node count
- --max-examples controls how many programs to generate
- --seed enables reproducible runs
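The --max-nodes metric can be approximated with the standard ast module (a sketch; the actual filter is in gen_random_python.py):

```python
import ast

def node_count(source: str) -> int:
    # Count every node in the parsed tree; this is the size metric
    # that --max-nodes filters syntax-mode programs on.
    return sum(1 for _ in ast.walk(ast.parse(source)))
```

A candidate program is kept only if node_count(program) stays under the limit, which bounds test-case size independently of line count.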

Usage: cd Tools/Python && ./scripts/hypothesmith.sh [N [SEED [MODE]]]

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Adds gen_unrestricted.py with ~60 generators covering Python constructs
well beyond what Strata supports today:

- Bitwise ops (&, |, ^, ~, <<, >>), power (**)
- Augmented assignment (+=, -=, *=)
- List ops (index, slice, append, concat, in, comprehensions)
- Dict ops (create, access, in, comprehensions)
- String methods (upper, lower, strip, len, index, multiply, format)
- For loops (over lists, range), while loops, break, nested loops
- Nested if, elif chains, ternary expressions
- Tuple unpacking, star unpacking, chained comparisons
- None checks, Optional types
- Classes (fields, methods), nested function calls
- Lambda, list/dict comprehensions
- Try/except/else/finally, multiple except clauses, raise
- With statement (context managers)
- Nonlocal, walrus operator (:=), f-strings
- Type conversions (int, str, bool, float, list)
- isinstance, recursion, default args, *args, **kwargs
- Generators (yield), decorators, async/await
- Match statement (Python 3.10+)

Usage: ./scripts/hypothesmith.sh 20 42 semantic --unrestricted

Programs that hit unsupported Strata constructs are reported as SKIP.
Only true internal errors (panics) or semantic mismatches (Strata
rejects a CPython-passing assertion) count as failures.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Runs on every PR (not merge queue or push to main). For each run:

1. Picks a fresh random seed (reported in step name and logs)
2. Generates 10 programs in both syntax and semantic modes with
   --unrestricted (covering constructs beyond current Strata support)
3. On failure, checks out the merge base and re-runs with the same seed:
   - If merge base also fails: creates a GitHub issue labeled 'fuzz-bug'
     (pre-existing bug, PR is not blocked)
   - If merge base passes: CI fails with reproduction instructions
     (the PR introduced a regression)
   - If merge base can't build: CI fails as inconclusive

The seed is always reported so failures can be reproduced locally:
  cd Tools/Python
  ./scripts/hypothesmith.sh 10 <SEED> both --unrestricted

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
All four files now include thorough documentation:

- python-fuzz.yml: Header block explaining the trigger policy, test modes,
  failure analysis flow (with ASCII diagram), reproducibility, and file
  inventory. Each step has inline comments explaining its purpose.

- hypothesmith.sh: Full header documenting background/motivation, both
  test modes, result classification (OK/SKIP/TIMEOUT/FAIL with subtypes),
  reproducibility, usage, prerequisites, and exit codes. Inline comments
  on each pipeline step.

- gen_random_python.py: Module docstring covering both modes, generator
  categories (base vs unrestricted), reproducibility mechanism, usage
  examples, and all CLI options. Section headers and docstrings on all
  functions.

- gen_unrestricted.py: Module docstring with categorised inventory of all
  ~60 generators, instructions for adding new generators, and section
  headers grouping generators by category.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig tautschnig requested review from a team and Copilot April 20, 2026 19:59
@tautschnig tautschnig self-assigned this Apr 20, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds hypothesmith-based fuzzing infrastructure to exercise Strata’s Python front-end (parser round-trips + semantic verification) and wires it into PR CI to catch regressions and file issues for pre-existing bugs.

Changes:

  • Add a bash driver to generate random Python programs and run syntax/semantic fuzz pipelines.
  • Add Python generators for both restricted and --unrestricted fuzz modes.
  • Add a GitHub Actions workflow to run fuzzing on PRs and re-run on merge base for regression classification.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
Tools/Python/scripts/hypothesmith.sh Driver script for syntax + semantic fuzz runs and result classification.
Tools/Python/scripts/gen_random_python.py Core generator for syntax-mode (hypothesmith) and semantic-mode (typed asserts).
Tools/Python/scripts/gen_unrestricted.py Extended generator blocks for unsupported/edge Python constructs under --unrestricted.
.github/workflows/python-fuzz.yml PR CI job to run fuzzing, analyze failures vs merge base, and optionally file issues.


tautschnig and others added 2 commits April 22, 2026 14:34
- Parse RESULT: line for non-zero pyAnalyzeLaurel exits to correctly
  classify verification failures and internal errors instead of
  blanket-skipping them
- Use $PYTHON instead of python3 for CPython validation so
  generation/validation/parsing all use the same interpreter
- Fix gen_fstring() to declare a variable and use single-brace
  interpolation (triple braces produced literal braces, mismatching
  the expected string)
- Guard issue creation on fork PRs where GITHUB_TOKEN is read-only,
  and add continue-on-error as safety net
- Add clarifying comment on gen_dict_create() f-string evaluation
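The gen_fstring() fix hinges on f-string brace escaping, which is easy to check directly in CPython:

```python
v = 7
# In an f-string, doubled braces are escapes for literal braces, so
# triple braces wrap the interpolated value in literal '{'...'}':
assert f"{{{v}}}" == "{7}"   # '{{' -> '{', '{v}' -> '7', '}}' -> '}'
# Single braces interpolate with no surrounding braces:
assert f"{v}" == "7"
```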

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@github-actions github-actions Bot added Waiting-For-Review Python github_actions Pull requests that update GitHub Actions code labels Apr 22, 2026
lean-action runs lake test by default when lake check-test succeeds.
The fuzz workflow only needs the strata executable, not the full test
suite. Running lake test fails because z3 is not installed in the fuzz
runner (T19_InvokeOn requires z3).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@robin-aws
Contributor

robin-aws commented Apr 22, 2026

High level feedback (without having looked at the details) - can you ensure these tests are run somehow as part of lake test, rather than having to add this as yet another custom GH workflow step that developers will not think of running?

I see at least two options:

  1. Include a *.lean file that runs the shell script in an #eval
  2. Avoid the shell script and implement the test harness in Lean directly.

@shigoel shigoel enabled auto-merge April 22, 2026 15:36
Contributor

@shigoel shigoel left a comment


One inline comment: semantic mode parse failure classification.

tautschnig and others added 3 commits April 23, 2026 10:20
Parse failures in semantic mode are now SKIP (parse), matching syntax
mode behavior and the documented result classification. The generator
may produce constructs the Strata parser doesn't support yet, which
are expected limitations, not regressions.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
- hypothesmith.sh now respects PYTHON env var (falls back to .venv)
- New StrataTest/Fuzz/HypotheSmith.lean runs 3 deterministic fuzz
  tests via withPython when Python + hypothesmith are available,
  skips gracefully otherwise
- Uses the established findPython3/withPython/pythonCheckModule
  pattern from StrataTest/Util/Python.lean
- Full random-seed fuzzing remains in the CI workflow

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig
Contributor Author

High level feedback (without having looked at the details) - can you ensure these tests are run somehow as part of lake test, rather than having to add this as yet another custom GH workflow step that developers will not think of running?

I see at least two options:

1. Include a `*.lean` file that runs the shell script in an `#eval`

2. Avoid the shell script and implement the test harness in Lean directly.

Addressed in 7f9aaa5, closer to option 1: doing everything in Lean would mean re-implementing hypothesmith in Lean, as far as I can tell.

@tautschnig tautschnig assigned shigoel and robin-aws and unassigned tautschnig Apr 23, 2026
shigoel
shigoel previously approved these changes Apr 24, 2026
@tautschnig tautschnig self-assigned this May 4, 2026
tautschnig and others added 2 commits May 4, 2026 11:15
- Remove warnOnSkip parameter from withPython call (parameter removed on main)
- Fix hypothesmith.sh: use 'command -v' instead of '[ -x ]' to check
  PYTHON, so bare command names (e.g. 'python3') work alongside full paths
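Python's shutil.which mirrors the 'command -v' behaviour the fix relies on: it resolves bare command names through PATH, whereas an '[ -x NAME ]' style test only inspects the literal path. A sketch, assuming a POSIX system with 'sh' on PATH:

```python
import os
import shutil

name = "sh"  # a bare command name, not a filesystem path
# '[ -x sh ]' checks for an executable file literally named 'sh' in
# the current directory, which normally does not exist:
literal_check = os.access(name, os.X_OK)
# 'command -v sh' resolves the name through PATH, like shutil.which:
resolved = shutil.which(name)
```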

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Contributor

@olivier-aws olivier-aws left a comment


Two comments mostly because I'm unsure how the Python code generation works from the list of generators.

tautschnig and others added 3 commits May 4, 2026 19:08
The HypotheSmith.lean test used withPython which throws when strata.gen
is not installed. The 'Build and test Lean' CI job doesn't install Python
Strata libraries, and --exclude Languages.Python doesn't cover
StrataTest.Fuzz.*. Replace withPython with try/catch on findPython3 and
explicit pythonCheckModule guards so the test skips with a warning.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When filing GitHub issues for pre-existing fuzz bugs, extract the first
failing program's source code from the fuzz log and include it in the
issue body. This lets agents group issues by root cause without having
to re-generate the code.

Addresses feedback from olivier-aws in PR #984 thread 6.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extend _gen_if_else() to randomly produce compound conditions using
'and', 'or', 'not', and None identity checks ('None is None',
'None is not None') in addition to simple boolean literals. This
exercises more of the Python front-end's condition handling.
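The extended condition generator can be sketched as follows (hypothetical helper name; the real change is in _gen_if_else() in gen_random_python.py):

```python
import random

def gen_condition(rng: random.Random) -> str:
    # Combine boolean literals and None identity checks with
    # 'and', 'or', and 'not', as in the extended _gen_if_else().
    atoms = ["True", "False", "None is None", "None is not None"]
    a, b = rng.choice(atoms), rng.choice(atoms)
    return rng.choice([a, f"not {a}", f"({a}) and ({b})", f"({a}) or ({b})"])

# Every generated condition is valid Python and evaluates to a bool.
conds = [gen_condition(random.Random(s)) for s in range(10)]
```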

Addresses feedback from olivier-aws in PR #984 thread 7.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Contributor


This is a lot of complexity in a GH actions workflow, and that has a lot of downsides since you can't run the logic locally (there are tools that let you execute actions locally in a container but I've never had much success with them).

I like the automatic cutting of issues. I feel that the sophistication around determining whether it's new or a regression is unnecessary though. If anything it seems unfair for someone to cut a PR and then for new random tests to appear and fail the CI, when it passed for them before cutting the PR. :) I'd vote for always cutting an issue and not blocking the PR. And given that, perhaps it would be much simpler to make this a scheduled workflow that runs once a day instead?

Contributor Author


Actually it's pretty easy to run locally; most of the complexity is just to avoid unduly blocking a PR. Running locally can be done via ./scripts/hypothesmith.sh. Most of the remaining logic is to determine the cause of a failure: if the same test would have passed before the changes in the PR, then the PR broke something, and it is up to the PR author to fix it. If the test failed before the changes as well, an issue is filed, as surely the PR should not be blocked.

Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.

Contributor


Fair, but it's still a lot of other shell script content (and javascript) that could be moved into separate files. My philosophy is to minimize how much code is only in workflow files and therefore not locally testable.

Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.

It will cut an issue if it fails though, and we have to ensure we don't ignore those entirely anyway. Agreed we would lose some visibility about which PR introduced a regression.

tautschnig and others added 2 commits May 5, 2026 11:15
'N failed' in the DETAIL line means the solver proved the assertion
does not hold (found a counterexample), not merely 'failed to prove'.
Inconclusive results are handled separately. Also note pyInterpret
(#869) as a future improvement for concrete comparison.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@olivier-aws
Contributor

LGTM. This PR will provide nice automation for bugs in our Python support.

olivier-aws
olivier-aws previously approved these changes May 5, 2026
Contributor

@robin-aws robin-aws left a comment


Did not review the actual generation logic super thoroughly, partially because it is only generating test input and there are no major consequences if it is wrong. It would be good to have a bit more meta-testing to have higher confidence it's at least mostly right.

Comment on lines +111 to +113
# Generate a seed from the current nanosecond timestamp. This is
# reported in the step name and all log output so that any failure
# can be reproduced deterministically.
Contributor


Just an observation, not sure there's anything we can do about it: in many cases just pushing an empty commit to re-run the CI with a different seed could easily result in a green run. :)

Contributor Author


Yes, I acknowledge that, and it's down to developer discipline to (not) do that. This is fuzz testing, our real defense is proof. My goal is to establish infrastructure that helps us build a better system. Once we have it, we could have a farm of hosts run those tests all day long.

#
# ## Failure analysis
#
# When fuzz tests fail on the PR, the workflow performs a bisection-style
Contributor


"bisection-style" makes it sound much more expensive than it is, since it's just re-running on a single other commit. :)

name: Python Fuzz

on:
pull_request:
Contributor


Good that it's not running on merge_group, since at that point we don't know which PR caused the regression if it failed. But perhaps add a comment that this is intentional and not an omission?

chmod +x cvc5-Linux-${ARCH_NAME}-static/bin/cvc5
echo "$GITHUB_WORKSPACE/cvc5-Linux-${ARCH_NAME}-static/bin/" >> $GITHUB_PATH

# Try to restore a cached Lean build to avoid rebuilding from scratch.
Contributor


Not sound in general, but okay since this is only a best-effort random testing attempt. Might be worth a comment though.


run: |
echo "::error::Fuzz test regression detected (seed=${{ steps.seed.outputs.value }}). The merge base passes but this PR fails."
echo ""
echo "Reproduce locally:"
Contributor


Should we also provide instructions to explicitly add the failing tests to the actual test suite somewhere, once the PR addresses them?

Comment on lines +178 to +179
# Parse failures are acceptable — hypothesmith may generate constructs
# that the Strata parser doesn't support yet.
Contributor


Never failing on parse errors is a bit unfortunate, since we could potentially find cases we think are supported but don't work. Any way to be more precise?

Comment on lines +240 to +241
# that computes the reference checksum). If CPython fails, the bug
# is in our generator, not in Strata.
Contributor


Any way to cut an issue when this happens? We're definitely never going to notice otherwise.

Comment on lines +320 to +321
# TODO: Once pyInterpret (#869) is available, use it for concrete
# pass/fail comparison with CPython (no unknown middle ground).
Contributor


It is now! :) But this can be added later.

@tautschnig tautschnig changed the base branch from main to main2 May 15, 2026 21:05
@tautschnig tautschnig dismissed olivier-aws’s stale review May 15, 2026 21:05

The base branch was changed.

@tautschnig tautschnig assigned tautschnig and unassigned shigoel and robin-aws May 15, 2026
tautschnig added a commit that referenced this pull request May 15, 2026
Replace stale python-fuzz.yml references with anchors to #984 (which
introduces python-fuzz.yml) and tighten the restore-lake-cache key-prefix
input description so it no longer suggests the action can serve sub-
projects with their own toolchain/manifest (the action hardcodes
hashFiles('lean-toolchain') and hashFiles('lake-manifest.json') from the
repo root).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Labels

github_actions (Pull requests that update GitHub Actions code), Has 1 approval, Python
