Add hypothesmith-based fuzz testing for Python front-end #984

tautschnig wants to merge 19 commits into
Conversation
Two test modes:

1. Syntax mode: Uses hypothesmith to generate syntactically valid Python programs, then runs them through the parse -> Ion -> round-trip pipeline. Tests that the parser doesn't crash on arbitrary valid Python.
2. Semantic mode: Custom generator produces typed programs with int arithmetic, comparisons, boolean logic, string concatenation, if/else, floor division, modulo, and negation — all with assertions whose expected values are computed by CPython. Programs are first validated with CPython, then run through pyAnalyzeLaurel. A verification failure on a CPython-passing program indicates a semantic modelling bug.

Size control:

- --max-nodes filters syntax-mode programs by AST node count
- --max-examples controls how many programs to generate
- --seed enables reproducible runs

Usage: cd Tools/Python && ./scripts/hypothesmith.sh [N [SEED [MODE]]]

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
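A semantic-mode program might look like the following hypothetical sketch (the variable names and values are illustrative, not taken from the generator): the expected results baked into the assertions are whatever CPython computes at generation time, so any verification failure points at the semantic model rather than the test.

```python
# Hypothetical example of a semantic-mode generated program.
# The constants on the right-hand side of each assertion were
# computed by CPython when the program was generated.
a: int = 7
b: int = -3

# Python floor division rounds toward negative infinity: 7 // -3 == -3
assert a // b == -3

# The remainder takes the sign of the divisor: 7 % -3 == -2
assert (a % b) + 1 == -1
```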
Adds gen_unrestricted.py with ~60 generators covering Python constructs well beyond what Strata supports today:

- Bitwise ops (&, |, ^, ~, <<, >>), power (**)
- Augmented assignment (+=, -=, *=)
- List ops (index, slice, append, concat, in, comprehensions)
- Dict ops (create, access, in, comprehensions)
- String methods (upper, lower, strip, len, index, multiply, format)
- For loops (over lists, range), while loops, break, nested loops
- Nested if, elif chains, ternary expressions
- Tuple unpacking, star unpacking, chained comparisons
- None checks, Optional types
- Classes (fields, methods), nested function calls
- Lambda, list/dict comprehensions
- Try/except/else/finally, multiple except clauses, raise
- With statement (context managers)
- Nonlocal, walrus operator (:=), f-strings
- Type conversions (int, str, bool, float, list)
- isinstance, recursion, default args, *args, **kwargs
- Generators (yield), decorators, async/await
- Match statement (Python 3.10+)

Usage: ./scripts/hypothesmith.sh 20 42 semantic --unrestricted

Programs that hit unsupported Strata constructs are reported as SKIP. Only true internal errors (panics) or semantic mismatches (Strata rejects a CPython-passing assertion) count as failures.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
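An individual generator in this style might look like the sketch below (the function name and emitted program are hypothetical, not copied from gen_unrestricted.py): each generator takes a seeded RNG, computes the reference value itself — i.e. by CPython, CSmith-checksum style — and returns a self-checking program.

```python
import random

def gen_walrus(rng: random.Random) -> str:
    """Hypothetical generator: emits a program exercising the walrus
    operator (:=). The expected value in the assertion is computed
    here, at generation time, by the generating CPython process."""
    n = rng.randint(1, 9)          # always > 0, so the branch is taken
    expected = n * 2               # reference value, checksum-style
    return (
        f"if (x := {n}) > 0:\n"
        f"    y = x * 2\n"
        f"assert y == {expected}\n"
    )
```

Seeding the RNG is what makes runs reproducible: `gen_walrus(random.Random(42))` always returns the same program.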
Runs on every PR (not merge queue or push to main). For each run:
1. Picks a fresh random seed (reported in step name and logs)
2. Generates 10 programs in both syntax and semantic modes with
--unrestricted (covering constructs beyond current Strata support)
3. On failure, checks out the merge base and re-runs with the same seed:
- If merge base also fails: creates a GitHub issue labeled 'fuzz-bug'
(pre-existing bug, PR is not blocked)
- If merge base passes: CI fails with reproduction instructions
(the PR introduced a regression)
- If merge base can't build: CI fails as inconclusive
The seed is always reported so failures can be reproduced locally:
cd Tools/Python
./scripts/hypothesmith.sh 10 <SEED> both --unrestricted
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
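The failure-analysis flow above can be sketched as a small decision function (a simplification for illustration; the actual workflow implements this in YAML/shell, and the label strings here are assumed):

```python
def classify(pr_failed: bool, base_built: bool, base_failed: bool) -> str:
    """Classify a fuzz run against the PR's merge base."""
    if not pr_failed:
        return "pass"            # nothing to analyze
    if not base_built:
        return "inconclusive"    # CI fails: can't tell old bug from new
    if base_failed:
        return "pre-existing"    # file a 'fuzz-bug' issue; don't block PR
    return "regression"          # the PR introduced the failure: CI fails
```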
All four files now include thorough documentation:

- python-fuzz.yml: Header block explaining the trigger policy, test modes, failure analysis flow (with ASCII diagram), reproducibility, and file inventory. Each step has inline comments explaining its purpose.
- hypothesmith.sh: Full header documenting background/motivation, both test modes, result classification (OK/SKIP/TIMEOUT/FAIL with subtypes), reproducibility, usage, prerequisites, and exit codes. Inline comments on each pipeline step.
- gen_random_python.py: Module docstring covering both modes, generator categories (base vs unrestricted), reproducibility mechanism, usage examples, and all CLI options. Section headers and docstrings on all functions.
- gen_unrestricted.py: Module docstring with categorised inventory of all ~60 generators, instructions for adding new generators, and section headers grouping generators by category.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pull request overview
Adds hypothesmith-based fuzzing infrastructure to exercise Strata’s Python front-end (parser round-trips + semantic verification) and wires it into PR CI to catch regressions and file issues for pre-existing bugs.
Changes:
- Add a bash driver to generate random Python programs and run syntax/semantic fuzz pipelines.
- Add Python generators for both restricted and --unrestricted fuzz modes.
- Add a GitHub Actions workflow to run fuzzing on PRs and re-run on merge base for regression classification.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Tools/Python/scripts/hypothesmith.sh | Driver script for syntax + semantic fuzz runs and result classification. |
| Tools/Python/scripts/gen_random_python.py | Core generator for syntax-mode (hypothesmith) and semantic-mode (typed asserts). |
| Tools/Python/scripts/gen_unrestricted.py | Extended generator blocks for unsupported/edge Python constructs under --unrestricted. |
| .github/workflows/python-fuzz.yml | PR CI job to run fuzzing, analyze failures vs merge base, and optionally file issues. |
- Parse the RESULT: line for non-zero pyAnalyzeLaurel exits to correctly classify verification failures and internal errors instead of blanket-skipping them
- Use $PYTHON instead of python3 for CPython validation so generation/validation/parsing all use the same interpreter
- Fix gen_fstring() to declare a variable and use single-brace interpolation (triple braces produced literal braces, mismatching the expected string)
- Guard issue creation on fork PRs where GITHUB_TOKEN is read-only, and add continue-on-error as a safety net
- Add clarifying comment on gen_dict_create() f-string evaluation

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
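The first fix amounts to logic like the following sketch (the RESULT: line format and classification labels are assumptions for illustration; the real driver does this in shell): a non-zero exit only counts as an internal error when no RESULT: line explains it.

```python
def classify_run(exit_code: int, output: str) -> str:
    """Hypothetical classification of one pyAnalyzeLaurel run."""
    result = next((line for line in output.splitlines()
                   if line.startswith("RESULT:")), None)
    if exit_code == 0:
        return "OK"
    if result is None:
        return "FAIL (internal error)"   # crashed before reporting anything
    if "failed" in result:
        return "FAIL (verification)"     # a reported assertion failure
    return "SKIP"                        # e.g. an unsupported construct
```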
lean-action runs lake test by default when lake check-test succeeds. The fuzz workflow only needs the strata executable, not the full test suite. Running lake test fails because z3 is not installed in the fuzz runner (T19_InvokeOn requires z3). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
High level feedback (without having looked at the details) - can you ensure these tests are run somehow as part of I see at least two options:
shigoel
left a comment
One inline comment: semantic mode parse failure classification.
Parse failures in semantic mode are now SKIP (parse), matching syntax mode behavior and the documented result classification. The generator may produce constructs the Strata parser doesn't support yet, which are expected limitations, not regressions. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
- hypothesmith.sh now respects PYTHON env var (falls back to .venv) - New StrataTest/Fuzz/HypotheSmith.lean runs 3 deterministic fuzz tests via withPython when Python + hypothesmith are available, skips gracefully otherwise - Uses the established findPython3/withPython/pythonCheckModule pattern from StrataTest/Util/Python.lean - Full random-seed fuzzing remains in the CI workflow Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Addressed in 7f9aaa5, closer to option 1 - doing everything in Lean would mean re-implementing hypothesmith in Lean (as far as I can tell).
- Remove warnOnSkip parameter from withPython call (parameter removed on main) - Fix hypothesmith.sh: use 'command -v' instead of '[ -x ]' to check PYTHON, so bare command names (e.g. 'python3') work alongside full paths Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
olivier-aws
left a comment
Two comments mostly because I'm unsure how the Python code generation works from the list of generators.
The HypotheSmith.lean test used withPython which throws when strata.gen is not installed. The 'Build and test Lean' CI job doesn't install Python Strata libraries, and --exclude Languages.Python doesn't cover StrataTest.Fuzz.*. Replace withPython with try/catch on findPython3 and explicit pythonCheckModule guards so the test skips with a warning. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When filing GitHub issues for pre-existing fuzz bugs, extract the first failing program's source code from the fuzz log and include it in the issue body. This lets agents group issues by root cause without having to re-generate the code. Addresses feedback from olivier-aws in PR #984 thread 6. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extend _gen_if_else() to randomly produce compound conditions using
'and', 'or', 'not', and None identity checks ('None is None',
'None is not None') in addition to simple boolean literals. This
exercises more of the Python front-end's condition handling.
Addresses feedback from olivier-aws in PR #984 thread 7.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
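A condition generator of this kind might be sketched as follows (function name and structure are hypothetical, not taken from the PR): it returns both the condition's source text and its truth value, so the caller can compute which branch CPython will take.

```python
import random

def _gen_condition(rng: random.Random) -> tuple[str, bool]:
    """Return (condition source, its truth value), covering 'and', 'or',
    'not', and None identity checks alongside plain boolean literals."""
    kind = rng.choice(["lit", "not", "and", "or", "none"])
    if kind == "lit":
        v = rng.choice([True, False])
        return str(v), v
    if kind == "not":
        v = rng.choice([True, False])
        return f"not {v}", not v
    if kind == "and":
        a, b = rng.choice([True, False]), rng.choice([True, False])
        return f"{a} and {b}", a and b
    if kind == "or":
        a, b = rng.choice([True, False]), rng.choice([True, False])
        return f"{a} or {b}", a or b
    # None identity checks have a fixed truth value by construction
    negated = rng.choice([True, False])
    return ("None is not None", False) if negated else ("None is None", True)
```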
This is a lot of complexity in a GH actions workflow, and that has a lot of downsides since you can't run the logic locally (there are tools that let you execute actions locally in a container but I've never had much success with them).
I like the automatic cutting of issues. I feel that the sophistication around determining whether it's new or a regression is unnecessary though. If anything it seems unfair for someone to cut a PR and then for new random tests to appear and fail the CI, when it passed for them before cutting the PR. :) I'd vote for always cutting an issue and not blocking the PR. And given that, perhaps it would be much simpler to make this a scheduled workflow that runs once a day instead?
Actually it's pretty easy to run locally; most of the complexity is just to avoid unduly blocking a PR. Running locally can be done via ./scripts/hypothesmith.sh. Most of the remaining logic is to determine the cause of a failure: if the same test would have passed before the changes in the PR, then the PR broke something, and it is on the PR author to fix it. If the test failed equally before the changes in the PR, an issue is filed, as surely the PR should not be blocked.
Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.
Fair, but it's still a lot of other shell script content (and javascript) that could be moved into separate files. My philosophy is to minimize how much code is only in workflow files and therefore not locally testable.
Running once a day means we cannot tie regressions to root causes and we don't have a clear owner for who is to deal with any failing outcomes. Even worse, nobody might notice that the action failed in the first place.
It will cut an issue if it fails though, and we have to ensure we don't ignore those entirely anyway. Agreed we would lose some visibility about which PR introduced a regression.
'N failed' in the DETAIL line means the solver proved the assertion does not hold (found a counterexample), not merely 'failed to prove'. Inconclusive results are handled separately. Also note pyInterpret (#869) as a future improvement for concrete comparison. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
LGTM. This PR will provide nice automation for bugs in our Python support.
robin-aws
left a comment
There was a problem hiding this comment.
Did not review the actual generation logic super thoroughly, partially because it is only generating test input and there are no major consequences if it is wrong. It would be good to have a bit more meta-testing to have higher confidence it's at least mostly right.
# Generate a seed from the current nanosecond timestamp. This is
# reported in the step name and all log output so that any failure
# can be reproduced deterministically.
Just an observation, not sure there's anything we can do about it: in many cases just pushing an empty commit to re-run the CI with a different seed could easily result in a green run. :)
Yes, I acknowledge that, and it's down to developer discipline to (not) do that. This is fuzz testing, our real defense is proof. My goal is to establish infrastructure that helps us build a better system. Once we have it, we could have a farm of hosts run those tests all day long.
#
# ## Failure analysis
#
# When fuzz tests fail on the PR, the workflow performs a bisection-style
"bisection-style" makes it sound much more expensive than it is, since it's just re-running on a single other commit. :)
name: Python Fuzz

on:
  pull_request:
Good that it's not running on merge_group, since at that point we don't know which PR caused the regression if it failed. But perhaps add a comment that this is intentional and not an omission?
chmod +x cvc5-Linux-${ARCH_NAME}-static/bin/cvc5
echo "$GITHUB_WORKSPACE/cvc5-Linux-${ARCH_NAME}-static/bin/" >> $GITHUB_PATH

# Try to restore a cached Lean build to avoid rebuilding from scratch.
Not sound in general, but okay since this is only a best-effort random testing attempt. Might be worth a comment though.
run: |
  echo "::error::Fuzz test regression detected (seed=${{ steps.seed.outputs.value }}). The merge base passes but this PR fails."
  echo ""
  echo "Reproduce locally:"
Should we also provide instructions to explicitly add the failing tests to the actual test suite somewhere, once the PR addresses them?
# Parse failures are acceptable — hypothesmith may generate constructs
# that the Strata parser doesn't support yet.
Never failing on parse errors is a bit unfortunate, since we could potentially find cases we think are supported but don't work. Any way to be more precise?
# that computes the reference checksum). If CPython fails, the bug
# is in our generator, not in Strata.
Any way to cut an issue when this happens? We're definitely never going to notice otherwise.
# TODO: Once pyInterpret (#869) is available, use it for concrete
# pass/fail comparison with CPython (no unknown middle ground).
It is now! :) But this can be added later.
Replace stale python-fuzz.yml references with anchors to #984 (which introduces python-fuzz.yml) and tighten the restore-lake-cache key-prefix input description so it no longer suggests the action can serve sub-projects with their own toolchain/manifest (the action hardcodes hashFiles('lean-toolchain') and hashFiles('lake-manifest.json') from the repo root). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Adds fuzz testing infrastructure for Strata's Python front-end, inspired by CBMC's use of CSmith and the hypothesmith project for random Python program generation.
Test modes
Syntax mode — Uses hypothesmith to generate syntactically valid Python from the grammar, then runs each program through the parse → Ion → round-trip pipeline. Tests that the parser doesn't crash or lose information on arbitrary valid Python.
Semantic mode — Uses a custom generator to produce typed Python programs with assertions whose expected values are computed by CPython at generation time (the CSmith checksum analogy). Each program is validated with CPython first, then verified with pyAnalyzeLaurel. A verification failure on a CPython-passing program indicates a semantic modelling bug in the Python → Laurel → Core pipeline.

The --unrestricted option

By default, only constructs known to work in Strata are generated. With --unrestricted, ~60 additional generators are included covering classes, generators, comprehensions, try/except, match, async, decorators, walrus operator, *args/**kwargs, and more. Programs hitting unsupported constructs are reported as SKIP, not failures.

CI integration
A new workflow (.github/workflows/python-fuzz.yml) runs on every PR:

Local usage
Files
- Tools/Python/scripts/hypothesmith.sh (the counterpart to CBMC's csmith.sh)
- Tools/Python/scripts/gen_random_python.py
- Tools/Python/scripts/gen_unrestricted.py (--unrestricted mode)
- .github/workflows/python-fuzz.yml
Tested locally with multiple seeds and program counts. All generated programs pass CPython validation (500+ tested). The restricted semantic mode produces a 100% pass rate through pyAnalyzeLaurel. The unrestricted mode correctly identifies unsupported constructs as SKIP.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.