Benchmarking Claude Opus 4.6 Vulnerability Detection

This repository benchmarks Claude Opus 4.6's ability to detect real-world C/C++ vulnerabilities under four prompting and agent strategies. We evaluate on the PrimeVul paired test set (435 vulnerability/fix pairs from open-source projects) and measure pair-correct precision (P-C), CVE recall, and per-side finding rates to understand how structured reasoning, justification depth, and verification agents affect detection quality.

Key Finding

Requiring the model to produce increasingly rigorous justifications (execution traces, state proofs) improves pair-correct precision (P-C) from 13.6% with no justification to 20.3% with extensive justification, and P-C Rigorous nearly doubles from 8.7% to 15.8%. Adding a verification agent pushes P-C to 23.3% and CVE recall to 28.9%.

Experiments

Each experiment uses Claude Opus 4.6 as the analyzer and runs 3 times for consistency. All experiments share the same three-phase pipeline but differ in the structured output the model must produce.

#	Experiment	P-C	P-C Rigorous	P-C Flexible	CVE Recall	Vuln Findings	Benign Findings	Benign-Only Findings
1	No Justification	13.6%	8.7%	13.6%	27.6%	63.4%	52.4%	27.4%
2	Limited Justification	19.3%	14.5%	17.7%	25.5%	52.2%	36.6%	15.6%
3	Extensive Justification	20.3%	15.8%	17.7%	28.5%	54.6%	37.7%	18.6%
4	Verification Agent	23.3%	16.2%	18.5%	28.9%	57.2%	43.2%	24.7%

All values are medians across 3 runs. Reference baseline: GPT-4 CoT = 12.94% P-C.

Cross-Run Consistency

This table shows how stable each result is across the 3 runs. "All 3" is the % of pairs for which the result held in every run; "Any" is the % of pairs for which it held in at least one run. A large gap between the two columns indicates high run-to-run variance.

#	Experiment	CVE Recall All 3	CVE Recall Any	P-C Rigorous All 3	P-C Rigorous Any	P-C Flexible All 3	P-C Flexible Any
1	No Justification	22.0%	33.1%	5.9%	11.1%	7.6%	19.6%
2	Limited Justification	21.7%	30.0%	12.5%	17.5%	14.2%	22.2%
3	Extensive Justification	23.6%	33.3%	11.6%	20.1%	12.8%	24.3%
4	Verification Agent	19.4%	37.6%	9.9%	23.4%	10.6%	27.2%
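The "All 3" and "Any" columns can be read as a set intersection and a set union over per-run results. The sketch below is a minimal illustration under that reading; the function name `consistency` and the pair IDs are hypothetical and are not taken from the repository.

```python
def consistency(per_run_hits: list[set[str]], total_pairs: int) -> dict[str, float]:
    """Summarize cross-run stability for one metric.

    per_run_hits[i] is the set of pair IDs that satisfied the metric in run i
    (hypothetical representation; the repository's stats format may differ).
    """
    held_in_all = set.intersection(*per_run_hits)  # pairs that passed in every run
    held_in_any = set.union(*per_run_hits)         # pairs that passed at least once
    return {
        "all": 100.0 * len(held_in_all) / total_pairs,
        "any": 100.0 * len(held_in_any) / total_pairs,
    }


# Toy data: three runs over a 10-pair dataset.
runs = [{"p1", "p2", "p3"}, {"p2", "p3"}, {"p3", "p4"}]
result = consistency(runs, total_pairs=10)
print(result)
```

Only `p3` passes in every run, while four distinct pairs pass in at least one, which is the kind of gap the table reports for each experiment.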

1. No Justification

Simple vulnerability analysis. The model reports CWEs, code snippets, and descriptions with no structured reasoning required.

2. Limited Justification

Requires a Justification with an UndesiredOperation (code + CWEs) and step_by_step_execution tracking variable state through ProgramSteps. The model must demonstrate a concrete execution path from function entry to the undesired operation.
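As a rough illustration of this output shape, the sketch below models the three named types with standard-library dataclasses. The field names other than `step_by_step_execution` and the `ends_at_operation` check are assumptions made for the example; the repository's actual schema (presumably defined through pydantic-ai structured outputs) may differ.

```python
from dataclasses import dataclass, field


@dataclass
class UndesiredOperation:
    code: str        # statement where the undesired behavior occurs
    cwes: list[str]  # CWE identifiers, e.g. ["CWE-787"]


@dataclass
class ProgramStep:
    code: str                       # statement executed at this step
    variable_state: dict[str, str]  # relevant variable values after the step


@dataclass
class Justification:
    undesired_operation: UndesiredOperation
    step_by_step_execution: list[ProgramStep] = field(default_factory=list)

    def ends_at_operation(self) -> bool:
        """Hypothetical sanity check: the trace must finish on the flagged code."""
        steps = self.step_by_step_execution
        return bool(steps) and steps[-1].code == self.undesired_operation.code


# Toy example: a fixed-size buffer copied into with a caller-controlled length.
example = Justification(
    undesired_operation=UndesiredOperation(
        code="memcpy(buf, src, len);", cwes=["CWE-787"]
    ),
    step_by_step_execution=[
        ProgramStep("char buf[16];", {"len": "64 (caller-controlled)"}),
        ProgramStep("memcpy(buf, src, len);", {"len": "64", "sizeof(buf)": "16"}),
    ],
)
print(example.ends_at_operation())
```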

3. Extensive Justification

Full proof of reachability. The model must provide:

UndesiredOperation: description, code, CWEs, impact, and the variable states required to trigger it
Justification: initial variable state at function entry, then a trace of DataTransformation steps (in_state -> out_state) and ConditionalStep steps (prove each branch is taken given current state)

4. Verification Agent

Same structured reasoning as experiment 3, plus a Claude Sonnet 4.6 verifier agent that checks each finding before inclusion. The verifier validates: is the undesired operation real, is the initial state correct, do steps follow logically, are conditionals justified, and does the final state match the preconditions. Findings get up to 2 verification attempts; unverified findings are discarded.
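The filtering behavior described above can be sketched as a bounded retry loop. This is a minimal sketch, not the repository's implementation: the verifier is replaced by a stub callable, the finding dictionaries and `keep_verified` are hypothetical, and it assumes the second attempt simply re-checks the same finding (the source does not say whether the finding is revised between attempts).

```python
from typing import Callable

MAX_ATTEMPTS = 2  # "up to 2 verification attempts"


def keep_verified(findings: list[dict], verify: Callable[[dict], bool]) -> list[dict]:
    """Keep a finding if the verifier accepts it within MAX_ATTEMPTS tries."""
    kept = []
    for finding in findings:
        # any() short-circuits, so a finding accepted on attempt 1 is not re-checked.
        if any(verify(finding) for _ in range(MAX_ATTEMPTS)):
            kept.append(finding)
    return kept


# Stub verifier standing in for the model call: accepts a finding once it has
# been checked `passes_on_attempt` times.
calls: dict[str, int] = {}


def stub_verify(finding: dict) -> bool:
    calls[finding["id"]] = calls.get(finding["id"], 0) + 1
    return calls[finding["id"]] >= finding["passes_on_attempt"]


findings = [
    {"id": "a", "passes_on_attempt": 1},
    {"id": "b", "passes_on_attempt": 2},
    {"id": "c", "passes_on_attempt": 3},  # never passes within the budget
]
kept_ids = [f["id"] for f in keep_verified(findings, stub_verify)]
print(kept_ids)
```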

Pipeline

Each experiment follows the same three-phase pipeline:

analyze.py -> diff_judge.py -> judge.py

Analyze (analyze.py): Claude Opus 4.6 analyzes each of the 870 functions independently, producing structured vulnerability findings.
Diff Judge (diff_judge.py): For each commit pair (vulnerable + fixed), matches findings across versions and categorizes them as vuln_only, benign_only, or shared.
Judge (judge.py): Evaluates vuln_only findings against ground-truth CVE data to determine correctness.
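The categorization step can be sketched as simple set bookkeeping once the matching is known. How diff_judge.py decides that two findings match is not described here, so the sketch below takes the matches as input; the function `categorize` and the finding IDs are hypothetical.

```python
def categorize(
    vuln_ids: list[str],
    benign_ids: list[str],
    links: list[tuple[str, str]],
) -> dict[str, list]:
    """Split one pair's findings into vuln_only, benign_only, and shared.

    links holds (vulnerable_finding_id, benign_finding_id) matches, assumed
    to come from an upstream matching step.
    """
    linked_vuln = {v for v, _ in links}
    linked_benign = {b for _, b in links}
    return {
        "vuln_only": sorted(set(vuln_ids) - linked_vuln),        # gone after the fix
        "benign_only": sorted(set(benign_ids) - linked_benign),  # appear only after the fix
        "shared": sorted(links),                                 # reported on both versions
    }


categories = categorize(
    vuln_ids=["v1", "v2"],
    benign_ids=["b1", "b2"],
    links=[("v2", "b1")],
)
print(categories)
```

Only the `vuln_only` bucket is passed on to the judge, since a finding that survives the fix cannot be the issue the patch addressed.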

Metrics

All metrics are computed over the 435 vulnerability/fix pairs. After analysis, the diff judge categorizes each finding as vuln_only (unique to vulnerable version), benign_only (unique to fixed version), or shared (present in both). The judge then evaluates whether each vuln_only finding correctly identifies the ground-truth CVE.

Metric	Definition
P-C	% of pairs where the vulnerable side has at least one finding and the benign side has zero findings (no `benign_only`, no `shared`). Measures raw discrimination: can the model tell vulnerable code from fixed code?
P-C Rigorous	P-C with the additional requirement that all `vuln_only` findings are judged as related to the ground-truth CVE. The strictest metric — the model must flag only the real vulnerability and nothing else, with a clean benign side.
P-C Flexible	% of pairs where all `vuln_only` findings are CVE-correct (at least one exists) and there are no `benign_only` findings. `shared` findings are permitted — these represent underlying issues not addressed by the patch. Every benign-side finding must have a corresponding linked vulnerable-side finding.
CVE Recall	% of pairs where the vulnerable side has at least one finding judged as related to the ground-truth CVE, regardless of what appears on the benign side. Measures the model's ability to detect the actual vulnerability.
Vuln Findings	% of vulnerable functions that have at least one finding (any category).
Benign Findings	% of benign functions that have at least one finding (any category).
Benign-Only Findings	% of benign functions that have at least one `benign_only` finding, i.e., a finding with no linked counterpart on the vulnerable side.
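The four pair-level metrics can be expressed as predicates over one record per pair. The sketch below is an interpretation of the definitions in the table, not the repository's stats code: `PairResult` and its fields are hypothetical, and it assumes that only `vuln_only` findings carry a CVE-correctness verdict, as the pipeline description suggests.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    vuln_only_correct: list[bool]  # one judge verdict per vuln_only finding
    benign_only: int               # number of benign_only findings
    shared: int                    # number of shared findings


def compute_metrics(pairs: list[PairResult]) -> dict[str, float]:
    counts = {"P-C": 0, "P-C Rigorous": 0, "P-C Flexible": 0, "CVE Recall": 0}
    for p in pairs:
        has_vuln_only = len(p.vuln_only_correct) > 0
        all_correct = has_vuln_only and all(p.vuln_only_correct)
        clean_benign = p.benign_only == 0 and p.shared == 0

        # With no shared findings, every vulnerable-side finding is vuln_only.
        pair_correct = has_vuln_only and clean_benign
        counts["P-C"] += pair_correct
        counts["P-C Rigorous"] += pair_correct and all_correct
        # Flexible tolerates shared findings but not benign_only ones.
        counts["P-C Flexible"] += all_correct and p.benign_only == 0
        # Recall ignores the benign side entirely.
        counts["CVE Recall"] += any(p.vuln_only_correct)
    return {name: 100.0 * n / len(pairs) for name, n in counts.items()}


pairs = [
    PairResult([True], benign_only=0, shared=0),         # counts toward every metric
    PairResult([True, False], benign_only=0, shared=0),  # P-C and recall only
    PairResult([True], benign_only=0, shared=1),         # flexible and recall only
    PairResult([], benign_only=1, shared=0),             # counts toward nothing
]
metrics = compute_metrics(pairs)
print(metrics)
```

Under this reading, P-C Rigorous is always a subset of both P-C and P-C Flexible, which is consistent with the reported numbers.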

Dataset

The PrimeVul paired test set contains 435 pairs (870 functions) from real security fixes across open-source C/C++ projects including Linux, TensorFlow, ImageMagick, FFmpeg, OpenSSL, mruby, and others. Each pair consists of:

A vulnerable function (before the fix, target=1)
A benign function (after the fix, target=0)

Ground truth includes CVE ID, CWE classification, NVD URL, and commit message.

Project Structure

src/
  experiments/
    no-justification/         # Experiment 1
    limited-justification/    # Experiment 2
    extensive-justification/  # Experiment 3
    verification-agent/       # Experiment 4
  common/
    primevul.duckdb           # Dataset in DuckDB format

data/
  experiments/
    */experiment.json         # Experiment metadata
    */runs/{1,2,3}/           # Per-run outputs (analysis, diffed, judged, stats)
  experiment_comparison.json  # Cross-experiment metrics comparison

Setup

Requires Python 3.12+ and uv.

uv sync

Set your Anthropic API key:

export ANTHROPIC_API_KEY=sk-...

Running an experiment

Each experiment has a run_experiment.sh script:

cd src/experiments/extensive-justification
bash run_experiment.sh 1  # run number

Dependencies

pydantic-ai - Claude agent framework with structured outputs
duckdb - Dataset storage and querying
datasets - HuggingFace dataset loading
