Add hypothesmith-based fuzz testing for Python front-end #984

tautschnig wants to merge 19 commits into
Conversation
Two test modes:

1. Syntax mode: Uses hypothesmith to generate syntactically valid Python programs, then runs them through the parse -> Ion -> round-trip pipeline. Tests that the parser doesn't crash on arbitrary valid Python.
2. Semantic mode: Custom generator produces typed programs with int arithmetic, comparisons, boolean logic, string concatenation, if/else, floor division, modulo, and negation — all with assertions whose expected values are computed by CPython. Programs are first validated with CPython, then run through pyAnalyzeLaurel. A verification failure on a CPython-passing program indicates a semantic modelling bug.

Size control:

- --max-nodes filters syntax-mode programs by AST node count
- --max-examples controls how many programs to generate
- --seed enables reproducible runs

Usage: cd Tools/Python && ./scripts/hypothesmith.sh [N [SEED [MODE]]]

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
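A semantic-mode program might look like the following hypothetical sketch (the variable names and values are illustrative, not taken from the generator): the expected results baked into the assertions are whatever CPython computes at generation time, so any verification failure points at the semantic model rather than the test.

```python
# Hypothetical example of a semantic-mode generated program.
# The constants on the right-hand side of each assertion were
# computed by CPython when the program was generated.
a: int = 7
b: int = -3

# Python floor division rounds toward negative infinity: 7 // -3 == -3
assert a // b == -3

# The remainder takes the sign of the divisor: 7 % -3 == -2
assert (a % b) + 1 == -1
```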
Adds gen_unrestricted.py with ~60 generators covering Python constructs well beyond what Strata supports today:

- Bitwise ops (&, |, ^, ~, <<, >>), power (**)
- Augmented assignment (+=, -=, *=)
- List ops (index, slice, append, concat, in, comprehensions)
- Dict ops (create, access, in, comprehensions)
- String methods (upper, lower, strip, len, index, multiply, format)
- For loops (over lists, range), while loops, break, nested loops
- Nested if, elif chains, ternary expressions
- Tuple unpacking, star unpacking, chained comparisons
- None checks, Optional types
- Classes (fields, methods), nested function calls
- Lambda, list/dict comprehensions
- Try/except/else/finally, multiple except clauses, raise
- With statement (context managers)
- Nonlocal, walrus operator (:=), f-strings
- Type conversions (int, str, bool, float, list)
- isinstance, recursion, default args, *args, **kwargs
- Generators (yield), decorators, async/await
- Match statement (Python 3.10+)

Usage: ./scripts/hypothesmith.sh 20 42 semantic --unrestricted

Programs that hit unsupported Strata constructs are reported as SKIP. Only true internal errors (panics) or semantic mismatches (Strata rejects a CPython-passing assertion) count as failures.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
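An individual generator in this style might look like the sketch below (the function name and emitted program are hypothetical, not copied from gen_unrestricted.py): each generator takes a seeded RNG, computes the reference value itself — i.e. by CPython, CSmith-checksum style — and returns a self-checking program.

```python
import random

def gen_walrus(rng: random.Random) -> str:
    """Hypothetical generator: emits a program exercising the walrus
    operator (:=). The expected value in the assertion is computed
    here, at generation time, by the generating CPython process."""
    n = rng.randint(1, 9)          # always > 0, so the branch is taken
    expected = n * 2               # reference value, checksum-style
    return (
        f"if (x := {n}) > 0:\n"
        f"    y = x * 2\n"
        f"assert y == {expected}\n"
    )
```

Seeding the RNG is what makes runs reproducible: `gen_walrus(random.Random(42))` always returns the same program.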
Runs on every PR (not merge queue or push to main). For each run:
1. Picks a fresh random seed (reported in step name and logs)
2. Generates 10 programs in both syntax and semantic modes with
--unrestricted (covering constructs beyond current Strata support)
3. On failure, checks out the merge base and re-runs with the same seed:
- If merge base also fails: creates a GitHub issue labeled 'fuzz-bug'
(pre-existing bug, PR is not blocked)
- If merge base passes: CI fails with reproduction instructions
(the PR introduced a regression)
- If merge base can't build: CI fails as inconclusive
The seed is always reported so failures can be reproduced locally:
cd Tools/Python
./scripts/hypothesmith.sh 10 <SEED> both --unrestricted
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
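The failure-analysis flow above can be sketched as a small decision function (a simplification for illustration; the actual workflow implements this in YAML/shell, and the label strings here are assumed):

```python
def classify(pr_failed: bool, base_built: bool, base_failed: bool) -> str:
    """Classify a fuzz run against the PR's merge base."""
    if not pr_failed:
        return "pass"            # nothing to analyze
    if not base_built:
        return "inconclusive"    # CI fails: can't tell old bug from new
    if base_failed:
        return "pre-existing"    # file a 'fuzz-bug' issue; don't block PR
    return "regression"          # the PR introduced the failure: CI fails
```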
All four files now include thorough documentation:

- python-fuzz.yml: Header block explaining the trigger policy, test modes, failure analysis flow (with ASCII diagram), reproducibility, and file inventory. Each step has inline comments explaining its purpose.
- hypothesmith.sh: Full header documenting background/motivation, both test modes, result classification (OK/SKIP/TIMEOUT/FAIL with subtypes), reproducibility, usage, prerequisites, and exit codes. Inline comments on each pipeline step.
- gen_random_python.py: Module docstring covering both modes, generator categories (base vs unrestricted), reproducibility mechanism, usage examples, and all CLI options. Section headers and docstrings on all functions.
- gen_unrestricted.py: Module docstring with categorised inventory of all ~60 generators, instructions for adding new generators, and section headers grouping generators by category.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pull request overview
Adds hypothesmith-based fuzzing infrastructure to exercise Strata’s Python front-end (parser round-trips + semantic verification) and wires it into PR CI to catch regressions and file issues for pre-existing bugs.
Changes:
- Add a bash driver to generate random Python programs and run syntax/semantic fuzz pipelines.
- Add Python generators for both restricted and --unrestricted fuzz modes.
- Add a GitHub Actions workflow to run fuzzing on PRs and re-run on merge base for regression classification.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Tools/Python/scripts/hypothesmith.sh | Driver script for syntax + semantic fuzz runs and result classification. |
| Tools/Python/scripts/gen_random_python.py | Core generator for syntax-mode (hypothesmith) and semantic-mode (typed asserts). |
| Tools/Python/scripts/gen_unrestricted.py | Extended generator blocks for unsupported/edge Python constructs under --unrestricted. |
| .github/workflows/python-fuzz.yml | PR CI job to run fuzzing, analyze failures vs merge base, and optionally file issues. |
- Parse the RESULT: line for non-zero pyAnalyzeLaurel exits to correctly classify verification failures and internal errors instead of blanket-skipping them
- Use $PYTHON instead of python3 for CPython validation so generation/validation/parsing all use the same interpreter
- Fix gen_fstring() to declare a variable and use single-brace interpolation (triple braces produced literal braces, mismatching the expected string)
- Guard issue creation on fork PRs where GITHUB_TOKEN is read-only, and add continue-on-error as a safety net
- Add clarifying comment on gen_dict_create() f-string evaluation

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
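The first fix amounts to logic like the following sketch (the RESULT: line format and classification labels are assumptions for illustration; the real driver does this in shell): a non-zero exit only counts as an internal error when no RESULT: line explains it.

```python
def classify_run(exit_code: int, output: str) -> str:
    """Hypothetical classification of one pyAnalyzeLaurel run."""
    result = next((line for line in output.splitlines()
                   if line.startswith("RESULT:")), None)
    if exit_code == 0:
        return "OK"
    if result is None:
        return "FAIL (internal error)"   # crashed before reporting anything
    if "failed" in result:
        return "FAIL (verification)"     # a reported assertion failure
    return "SKIP"                        # e.g. an unsupported construct
```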
lean-action runs lake test by default when lake check-test succeeds. The fuzz workflow only needs the strata executable, not the full test suite. Running lake test fails because z3 is not installed in the fuzz runner (T19_InvokeOn requires z3). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
High level feedback (without having looked at the details) - can you ensure these tests are run somehow as part of I see at least two options:
shigoel
left a comment
One inline comment: semantic mode parse failure classification.
Parse failures in semantic mode are now SKIP (parse), matching syntax mode behavior and the documented result classification. The generator may produce constructs the Strata parser doesn't support yet, which are expected limitations, not regressions. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
- hypothesmith.sh now respects PYTHON env var (falls back to .venv) - New StrataTest/Fuzz/HypotheSmith.lean runs 3 deterministic fuzz tests via withPython when Python + hypothesmith are available, skips gracefully otherwise - Uses the established findPython3/withPython/pythonCheckModule pattern from StrataTest/Util/Python.lean - Full random-seed fuzzing remains in the CI workflow Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Addressed in 7f9aaa5, closer to option 1 - doing everything in Lean would mean re-implementing hypothesmith in Lean (as far as I can tell).
- Remove warnOnSkip parameter from withPython call (parameter removed on main) - Fix hypothesmith.sh: use 'command -v' instead of '[ -x ]' to check PYTHON, so bare command names (e.g. 'python3') work alongside full paths Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
olivier-aws
left a comment
Two comments mostly because I'm unsure how the Python code generation works from the list of generators.
The HypotheSmith.lean test used withPython which throws when strata.gen is not installed. The 'Build and test Lean' CI job doesn't install Python Strata libraries, and --exclude Languages.Python doesn't cover StrataTest.Fuzz.*. Replace withPython with try/catch on findPython3 and explicit pythonCheckModule guards so the test skips with a warning. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When filing GitHub issues for pre-existing fuzz bugs, extract the first failing program's source code from the fuzz log and include it in the issue body. This lets agents group issues by root cause without having to re-generate the code. Addresses feedback from olivier-aws in PR #984 thread 6. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extend _gen_if_else() to randomly produce compound conditions using
'and', 'or', 'not', and None identity checks ('None is None',
'None is not None') in addition to simple boolean literals. This
exercises more of the Python front-end's condition handling.
Addresses feedback from olivier-aws in PR #984 thread 7.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
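A condition generator of this kind might be sketched as follows (function name and structure are hypothetical, not taken from the PR): it returns both the condition's source text and its truth value, so the caller can compute which branch CPython will take.

```python
import random

def _gen_condition(rng: random.Random) -> tuple[str, bool]:
    """Return (condition source, its truth value), covering 'and', 'or',
    'not', and None identity checks alongside plain boolean literals."""
    kind = rng.choice(["lit", "not", "and", "or", "none"])
    if kind == "lit":
        v = rng.choice([True, False])
        return str(v), v
    if kind == "not":
        v = rng.choice([True, False])
        return f"not {v}", not v
    if kind == "and":
        a, b = rng.choice([True, False]), rng.choice([True, False])
        return f"{a} and {b}", a and b
    if kind == "or":
        a, b = rng.choice([True, False]), rng.choice([True, False])
        return f"{a} or {b}", a or b
    # None identity checks have a fixed truth value by construction
    negated = rng.choice([True, False])
    return ("None is not None", False) if negated else ("None is None", True)
```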
This is a lot of complexity in a GH actions workflow, and that has a lot of downsides since you can't run the logic locally (there are tools that let you execute actions locally in a container but I've never had much success with them).
I like the automatic cutting of issues. I feel that the sophistication around determining whether it's new or a regression is unnecessary though. If anything it seems unfair for someone to cut a PR and then for new random tests to appear and fail the CI, when it passed for them before cutting the PR. :) I'd vote for always cutting an issue and not blocking the PR. And given that, perhaps it would be much simpler to make this a scheduled workflow that runs once a day instead?
Actually it's pretty easy to run locally; most of the complexity is just to avoid unduly blocking a PR. Running locally can be done via ./scripts/hypothesmith.sh. Most of the remaining logic is to determine the cause of a failure: if the same test would have passed before the changes in the PR, then the PR broke something, and it is on the PR author to fix it. If the test failed equally before the changes in the PR, an issue is filed, as surely the PR should not be blocked.
Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.
Fair, but it's still a lot of other shell script content (and javascript) that could be moved into separate files. My philosophy is to minimize how much code is only in workflow files and therefore not locally testable.
Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.
It will cut an issue if it fails though, and we have to ensure we don't ignore those entirely anyway. Agreed we would lose some visibility about which PR introduced a regression.
'N failed' in the DETAIL line means the solver proved the assertion does not hold (found a counterexample), not merely 'failed to prove'. Inconclusive results are handled separately. Also note pyInterpret (#869) as a future improvement for concrete comparison. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
LGTM. This PR will provide nice automation for bugs in our Python support.
robin-aws
left a comment
There was a problem hiding this comment.
Did not review the actual generation logic super thoroughly, partially because it is only generating test input and there are no major consequences if it is wrong. It would be good to have a bit more meta-testing to have higher confidence it's at least mostly right.
# Generate a seed from the current nanosecond timestamp. This is
# reported in the step name and all log output so that any failure
# can be reproduced deterministically.
Just an observation, not sure there's anything we can do about it: in many cases just pushing an empty commit to re-run the CI with a different seed could easily result in a green run. :)
Yes, I acknowledge that, and it's down to developer discipline to (not) do that. This is fuzz testing, our real defense is proof. My goal is to establish infrastructure that helps us build a better system. Once we have it, we could have a farm of hosts run those tests all day long.
#
# ## Failure analysis
#
# When fuzz tests fail on the PR, the workflow performs a bisection-style
"bisection-style" makes it sound much more expensive than it is, since it's just re-running on a single other commit. :)
name: Python Fuzz

on:
  pull_request:
Good that it's not running on merge_group, since at that point we don't know which PR caused the regression if it failed. But perhaps add a comment that this is intentional and not an omission?
chmod +x cvc5-Linux-${ARCH_NAME}-static/bin/cvc5
echo "$GITHUB_WORKSPACE/cvc5-Linux-${ARCH_NAME}-static/bin/" >> $GITHUB_PATH

# Try to restore a cached Lean build to avoid rebuilding from scratch.
Not sound in general, but okay since this is only a best-effort random testing attempt. Might be worth a comment though.
run: |
  echo "::error::Fuzz test regression detected (seed=${{ steps.seed.outputs.value }}). The merge base passes but this PR fails."
  echo ""
  echo "Reproduce locally:"
Should we also provide instructions to explicitly add the failing tests to the actual test suite somewhere, once the PR addresses them?
# Parse failures are acceptable — hypothesmith may generate constructs
# that the Strata parser doesn't support yet.
Never failing on parse errors is a bit unfortunate, since we could potentially find cases we think are supported but don't work. Any way to be more precise?
# that computes the reference checksum). If CPython fails, the bug
# is in our generator, not in Strata.
Any way to cut an issue when this happens? We're definitely never going to notice otherwise.
# TODO: Once pyInterpret (#869) is available, use it for concrete
# pass/fail comparison with CPython (no unknown middle ground).
It is now! :) But this can be added later.
Replace stale python-fuzz.yml references with anchors to #984 (which introduces python-fuzz.yml) and tighten the restore-lake-cache key-prefix input description so it no longer suggests the action can serve sub-projects with their own toolchain/manifest (the action hardcodes hashFiles('lean-toolchain') and hashFiles('lake-manifest.json') from the repo root). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Adds fuzz testing infrastructure for Strata's Python front-end, inspired by CBMC's use of CSmith and the hypothesmith project for random Python program generation.
Test modes
Syntax mode — Uses hypothesmith to generate syntactically valid Python from the grammar, then runs each program through the parse → Ion → round-trip pipeline. Tests that the parser doesn't crash or lose information on arbitrary valid Python.
Semantic mode — Uses a custom generator to produce typed Python programs with assertions whose expected values are computed by CPython at generation time (the CSmith checksum analogy). Each program is validated with CPython first, then verified with pyAnalyzeLaurel. A verification failure on a CPython-passing program indicates a semantic modelling bug in the Python → Laurel → Core pipeline.

The --unrestricted option

By default, only constructs known to work in Strata are generated. With --unrestricted, ~60 additional generators are included covering classes, generators, comprehensions, try/except, match, async, decorators, walrus operator, *args/**kwargs, and more. Programs hitting unsupported constructs are reported as SKIP, not failures.

CI integration
A new workflow (.github/workflows/python-fuzz.yml) runs on every PR:

Local usage
Files
- Tools/Python/scripts/hypothesmith.sh (the counterpart to CBMC's csmith.sh)
- Tools/Python/scripts/gen_random_python.py
- Tools/Python/scripts/gen_unrestricted.py (--unrestricted mode)
- .github/workflows/python-fuzz.yml
Tested locally with multiple seeds and program counts. All generated programs pass CPython validation (500+ tested). The restricted semantic mode produces a 100% pass rate through pyAnalyzeLaurel. The unrestricted mode correctly identifies unsupported constructs as SKIP.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.