Guarddog v3: Custom correlation engine by sobregosodd · Pull Request #706 · DataDog/guarddog

sobregosodd · 2026-04-06T20:57:01Z

This PR introduces guarddog v3.
This version replaces GuardDog's independent-alert model with a risk correlation engine.

Highlights:

Risk score: Packages are now scored 0-10 based on attack chain completeness rather than individual pattern matches.
Capability + Threat = Risk: Findings only form a risk when a code capability (e.g., network access) pairs with a threat indicator (e.g., suspicious domain) in the same category.
MITRE ATT&CK mapping: Each risk is mapped to the ATT&CK matrix, and the risk score is assigned based on which stages of the attack chain the risks cover.
Semgrep → YARA: All source code rules were migrated to language-agnostic YARA rules (34 rules, 4 reusable .meta files).
Metadata in the risk model: Metadata rules now carry severity grading and a MITRE mapping, so they play a similar role in risk scoring.
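
The "Capability + Threat = Risk" rule from the highlights can be illustrated with a minimal sketch. The `Finding` class, the rule names, and the category labels below are hypothetical and are not taken from the GuardDog code base; the sketch covers only the basic same-category pairing, not the standalone or cross-category risks added in later commits.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Finding:
    rule: str      # hypothetical rule identifier
    kind: str      # "capability" or "threat"
    category: str  # hypothetical category label, e.g. "network"


def form_risks(findings):
    """Pair every capability with every threat that shares its category."""
    capabilities = [f for f in findings if f.kind == "capability"]
    threats = [f for f in findings if f.kind == "threat"]
    return [
        (cap.rule, threat.rule)
        for cap, threat in product(capabilities, threats)
        if cap.category == threat.category
    ]


findings = [
    Finding("capability-network-http", "capability", "network"),
    Finding("threat-network-suspicious-domain", "threat", "network"),
    Finding("capability-process-spawn", "capability", "process"),
]
risks = form_risks(findings)
```

In this example only the network capability and the network threat combine into a risk; the process capability has no matching threat, so on its own it produces nothing.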

Other changes:

Metadata rules that are loosely related to maliciousness were dropped to reduce noise.

Copilot

Copilot reviewed 143 out of 144 changed files in this pull request and generated 4 comments.


tesnim5hamdouni

letsgoooo

When scanning a local directory, metadata detectors (typosquatting, deceptive author, compromised email, etc.) previously could not run because no registry metadata was available. The new --metadata flag accepts a path to a package metadata JSON file (matching the PyPI JSON API or npm registry format), enabling the full detection pipeline for local scans. The recall benchmark worker now automatically passes package_info-*.json files from the malicious-software-packages-dataset ZIPs to guarddog via this flag, so metadata rules contribute to recall measurement.
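
As a rough illustration of what accepting either metadata format involves, the sketch below tells the two apart by their top-level keys. It assumes that a PyPI JSON API document nests package fields under `info` and that an npm registry document keeps `name` and `versions` at the top level; the function names are hypothetical and this is not how guarddog itself parses the file.

```python
import json
from pathlib import Path


def detect_metadata_format(metadata: dict) -> str:
    """Guess which registry a metadata document came from."""
    if isinstance(metadata.get("info"), dict):
        return "pypi"  # PyPI JSON API nests package fields under "info"
    if "name" in metadata and "versions" in metadata:
        return "npm"   # npm registry documents keep these at the top level
    raise ValueError("unrecognized package metadata format")


def load_metadata(path: str) -> tuple[str, dict]:
    """Read a metadata JSON file and return (format, parsed document)."""
    metadata = json.loads(Path(path).read_text(encoding="utf-8"))
    return detect_metadata_format(metadata), metadata
```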

cluster.py now identifies ZIPs with zero source files and records them in cluster_index.json under "empty_packages". recall.py filters these out during regenerate_samples so benchmarks don't waste budget on packages that have no code to analyze.

Packages like litellm ship as a ZIP containing another ZIP. These are not empty; they just need double extraction. Count nested archives (.zip, .whl, .tar.gz) as having content.
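
A minimal sketch of the emptiness check described above, assuming a ZIP counts as non-empty when it holds at least one source file or nested archive. The list of source extensions and the helper names are assumptions made for illustration; only the nested-archive suffixes come from the commit message.

```python
import io
import zipfile

SOURCE_SUFFIXES = (".py", ".js", ".ts", ".rb", ".go")    # assumed list
NESTED_ARCHIVE_SUFFIXES = (".zip", ".whl", ".tar.gz")    # from the commit message


def has_content(zip_bytes: bytes) -> bool:
    """Return True if the ZIP holds source files or nested archives."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Skip directory entries, which end with a slash.
        names = [n.lower() for n in archive.namelist() if not n.endswith("/")]
    return any(n.endswith(SOURCE_SUFFIXES + NESTED_ARCHIVE_SUFFIXES) for n in names)


def make_zip(names):
    """Build an in-memory ZIP with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    return buffer.getvalue()
```

With this check, a ZIP that contains only `inner.whl` is treated as having content, while one that contains only a README is treated as empty.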

Add 12 new YARA threat rules targeting common malware patterns that were previously undetected: download-and-execute chains, chr/hex obfuscation, PowerShell encoded commands, dynamic import+exec, reverse shells, Telegram/Discord exfil, DNS exfil, npm preinstall hooks, dependency confusion indicators, setup.py suspicious imports, and system info exfiltration.

Tighten 6 existing rules to reduce false positives: threat-process-hooks (inline meta rules, exclude prepare/prepack), threat-process-injection-dll (remove overly broad .dll/.exe string matches), threat-runtime-system-info (require 3+ calls instead of 1), threat-process-spawn-silent (require both stdout+stderr suppressed), threat-runtime-obfuscation-general (raise hex threshold to 50+, remove bracket notation), threat-runtime-obfuscation-base64exec (tighten JS Buffer.from pattern, require explicit base64 encoding).

Update risk engine: add "setup" and "npm" to valid categories (was silently dropping findings), make HIGH-specificity threats form standalone risks, add cross-category risk formation, add specificity gate (LOW-specificity-only capped at 4.9 unless MEDIUM+ specificity present), bump single-stage chain value from 0.3 to 0.4.

Benchmark results (threshold 5.0, 1000 benign + 745 malicious packages):

| | Baseline | Final | Change |
|----------|----------|--------|---------|
| Recall | 79.3% | 87.0% | +7.7pp |
| Precision| 75.5% | 80.2% | +4.7pp |
| F1 | 77.3% | 83.5% | +6.2pp |
| MCC | 0.600 | 0.704 | +0.104 |
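
The specificity gate mentioned in the risk engine update can be sketched as follows. The 4.9 cap and the 5.0 threshold come from the commit message; the function name and the way specificities are passed in are assumptions, so treat this as an illustration of the idea and not as the engine's code.

```python
LOW_ONLY_CAP = 4.9         # cap stated in the commit message
DETECTION_THRESHOLD = 5.0  # threshold used in the benchmark


def apply_specificity_gate(raw_score: float, specificities: list[str]) -> float:
    """Cap the score when every contributing threat is LOW specificity."""
    if specificities and all(s == "LOW" for s in specificities):
        return min(raw_score, LOW_ONLY_CAP)
    return raw_score  # at least one MEDIUM+ threat: leave the score untouched
```

Because the cap sits just under the detection threshold, a package whose findings are all LOW specificity cannot be flagged, no matter how many such findings accumulate.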

…lusters Removed 95 packages that had no source files (empty placeholders, dep confusion probes with no payload, nested-archive dataset bugs). Backfilled 94 replacements from previously unrepresented clusters for better diversity. Recall on cleaned dataset: 88.8% (was 87.0% on dirty dataset with empty packages dragging it down). PyPI recall 98.9%, compromised_lib 100%.

Add threat-runtime-obfuscation-log-suppress rule for console.log suppression combined with hex arrays/fromCharCode (common npm malware evasion). Extend threat-process-download-exec to catch Node.js child_process + fetch patterns. Recall now at 90.6% (threshold 5.0), up from 88.8%.

extension_scanner.scan_local was missing the info parameter added to the PackageScanner superclass, causing a mypy override error. test_single_runtime_threat expected 8.6 but single-stage chain value was changed from 0.3 to 0.4, making the correct score 8.8.

Replace per-package GitHub Contents API calls (~2 per package) with bulk Git Trees API (~10 calls total). Fixes sampling failures from rate limiting when resolving 1000+ packages. Add ASCII pipeline diagram to evals/README.md showing the full cluster -> sample -> scan -> report workflow.
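
A minimal sketch of the bulk approach, assuming GitHub's "Get a tree" endpoint with `recursive=1` and its documented response shape (a `tree` list of entries with `path` and `type`). The helper names and the example paths are hypothetical; the actual eval code may organize this differently.

```python
def trees_url(owner: str, repo: str, ref: str) -> str:
    """URL for GitHub's "Get a tree" endpoint with recursive listing."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def index_zip_paths(tree_response: dict) -> dict[str, list[str]]:
    """Group ZIP blob paths by their parent directory."""
    index: dict[str, list[str]] = {}
    for entry in tree_response.get("tree", []):
        path = entry["path"]
        if entry["type"] == "blob" and path.endswith(".zip"):
            parent = path.rsplit("/", 1)[0] if "/" in path else ""
            index.setdefault(parent, []).append(path)
    return index


# Hypothetical response fragment, shaped like the Git Trees API output.
fake_response = {
    "tree": [
        {"path": "samples/pypi/foo", "type": "tree"},
        {"path": "samples/pypi/foo/1.0/foo.zip", "type": "blob"},
        {"path": "README.md", "type": "blob"},
    ],
    "truncated": False,
}
zip_index = index_zip_paths(fake_response)
```

A recursive tree response can come back with `truncated` set to true for very large repositories, in which case subtrees have to be fetched separately; that would be consistent with needing several calls instead of one.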

Build the ZIP index upfront and only sample from packages that actually have ZIPs in the dataset. Fixes resolution failures when the manifest lists packages that don't have archived samples yet. Resampled: 1251 packages (251 pypi + 1000 npm), 1 per cluster, max diversity.


The dataset stores scoped npm packages with @ as separator (@0xengine@meow) but the manifest uses / (@0xengine/meow). Convert when parsing tree paths so scoped packages are found during sampling. This adds ~1359 previously invisible packages to the available pool.
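
The separator conversion can be sketched as below, using the example from the commit message. The function name is hypothetical.

```python
def dataset_name_to_manifest_name(name: str) -> str:
    """Convert '@scope@package' (dataset form) to '@scope/package' (manifest form)."""
    if name.startswith("@") and "@" in name[1:]:
        # Split on the first "@" after the leading scope marker.
        scope, _, package = name[1:].partition("@")
        return f"@{scope}/{package}"
    return name  # unscoped names are left unchanged
```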

christophetd · 2026-04-14T16:21:34Z

Detection Benchmark Results

Benchmark comparing detection quality before (e2a21fd) and after the latest rule + risk engine changes. Evaluated on 1,225 malicious packages (1 per cluster, from the malicious-software-packages-dataset) and 1,000 top legitimate packages, at detection threshold 5.0.

Aggregate

Metric	Before	After	Change
Recall	88.5%	90.2%	+1.7pp
Precision	70.2%	86.4%	+16.2pp
F1	78.3%	88.3%	+10.0pp
MCC	0.453	0.733	+0.280
FPs (benign >= 5.0)	467	174	-293
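
For readers who want to check how the aggregate metrics relate to each other, here is a small sketch using the standard definitions of precision, recall, F1, and MCC. Only the false positive count (174) and the dataset sizes (1,225 malicious, 1,000 benign) come from the results above; the true positive count of 1105 is an assumption back-derived from the reported 90.2% recall, not a published figure.

```python
from math import sqrt


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary classification metrics from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# tp is an assumed value (about 90.2% of 1,225); fn, tn follow from the totals.
metrics = detection_metrics(tp=1105, fp=174, fn=120, tn=826)
```

Under that assumption, the computed values round to the "After" column of the aggregate table.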

By ecosystem

Metric	PyPI (before)	PyPI (after)	npm (before)	npm (after)
Recall	79.8%	85.7%	90.7%	91.3%
Precision	38.3%	61.2%	85.9%	94.9%
F1	51.8%	71.4%	88.2%	93.0%
MCC	0.163	0.567	0.630	0.801
FPs	319	125	148	49

By category

Category	Before	After
`malicious_intent`	84.8%	87.4%
`compromised_lib`	95.1%	95.1%

What changed

12 new YARA threat rules targeting previously undetected patterns: download-and-execute chains, chr/hex obfuscation, PowerShell encoded commands, dynamic import+exec, reverse shells, Telegram/Discord exfil, DNS exfil, npm preinstall hooks, setup.py suspicious imports, system info exfiltration, and log suppression + obfuscation.

6 tightened existing rules to reduce FP rate: threat-process-hooks (inlined), threat-process-injection-dll, threat-runtime-system-info, threat-process-spawn-silent, threat-runtime-obfuscation-general, threat-runtime-obfuscation-base64exec.

Risk engine improvements: added "setup"/"npm" categories (was silently dropping findings), standalone risk formation for HIGH-specificity threats, cross-category risk pairing, specificity gate (LOW-specificity-only capped below threshold), single-stage chain value bump.

Eval infrastructure: cluster-aware sampling with empty-package filtering, scoped npm package support, bulk Git Trees API for ZIP resolution.

sobregosodd added 17 commits December 30, 2025 16:12

initial rewrite

c83a054

rules improvement

1832820

rules improve

80c35b6

fixing core rules and docs

5132dcc

update readme

4c51500

Merge branch 'main' into s.obregoso/v3

47abed5

Merge branch 'main' into s.obregoso/v3

0f7d58a

Merge branch 'main' into s.obregoso/v3

8fdfd06

exclude comments from matches

0d14fe4

Merge branch 'main' into s.obregoso/v3

4d592a9

Merge branch 'main' into s.obregoso/v3

4f4db2e

migrate ruby rules

2a14e06

improving lolbas

9d58c65

Merge branch 'main' into s.obregoso/v3

3b0c290

including pyarmor

a9285cf

incoroporing metadata rules into framework, fixing risk ranges

1275c8c

add correlation risk downgrading

aceb4b9

sobregosodd requested a review from a team as a code owner April 6, 2026 20:57

code quallity

3d732ac

sobregosodd requested a review from Copilot April 6, 2026 21:24

Copilot started reviewing on behalf of sobregosodd April 6, 2026 21:24

more code quality

e2a21fd

Copilot AI reviewed Apr 6, 2026

tesnim5hamdouni previously approved these changes Apr 7, 2026

ikretz previously approved these changes Apr 7, 2026

Comment thread guarddog/analyzer/risk_engine.py

Add sandboxing via nono-py for the scanning process (#712)

acf7b4e

christophetd dismissed stale reviews from ikretz and tesnim5hamdouni via acf7b4e April 10, 2026 20:11

christophetd and others added 2 commits April 10, 2026 22:49

Reduce false positive noise, improve recall, add evaluation suite (#713)

6db2031

fixing code quality and tests

6f4c8f5

sobregosodd and others added 12 commits April 10, 2026 18:02

fixing sandbox unittest

58cbcbb

Add clustering of eval dataset

8a2ed02

Fix empty-package detection to not flag nested archives

2f06841


christophetd and others added 2 commits April 14, 2026 18:34

Fix formatting (black)

f6868fe

fix reverse-shell rule

8956ece

sobregosodd wants to merge 36 commits into main from s.obregoso/v3


5 participants
