Guarddog v3: Custom correlation engine #706
Copilot reviewed 143 out of 144 changed files in this pull request and generated 4 comments.
acf7b4e
When scanning a local directory, metadata detectors (typosquatting, deceptive author, compromised email, etc.) previously could not run because no registry metadata was available. The new --metadata flag accepts a path to a package metadata JSON file (matching the PyPI JSON API or npm registry format), enabling the full detection pipeline for local scans. The recall benchmark worker now automatically passes package_info-*.json files from the malicious-software-packages-dataset ZIPs to guarddog via this flag, so metadata rules contribute to recall measurement.
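As a rough illustration of the two accepted shapes, the sketch below loads a metadata file and guesses its registry format. It relies only on the documented top-level keys of each format (`info` for the PyPI JSON API, `dist-tags`/`versions` for an npm registry document). The function names are hypothetical and are not GuardDog's actual loader.

```python
import json
from pathlib import Path


def detect_metadata_format(metadata: dict) -> str:
    """Guess which registry a metadata document came from (hypothetical helper)."""
    # PyPI JSON API responses carry package details under a top-level "info" key.
    if "info" in metadata:
        return "pypi"
    # npm registry documents carry "dist-tags" and a "versions" mapping.
    if "dist-tags" in metadata or "versions" in metadata:
        return "npm"
    return "unknown"


def load_metadata(path: str) -> tuple[str, dict]:
    """Read a metadata JSON file and return (format, parsed document)."""
    with Path(path).open(encoding="utf-8") as handle:
        metadata = json.load(handle)
    return detect_metadata_format(metadata), metadata
```

A caller could reject `"unknown"` documents up front so that metadata detectors never run against a file in neither format.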
cluster.py now identifies ZIPs with zero source files and records them in cluster_index.json under "empty_packages". recall.py filters these out during regenerate_samples so benchmarks don't waste budget on packages that have no code to analyze.
Packages like litellm ship as a ZIP containing another ZIP. These are not empty; they just need double extraction. Count nested archives (.zip, .whl, .tar.gz) as having content.
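The emptiness check described in the two commits above could look roughly like the sketch below. The list of source-file suffixes is an assumption made for illustration; only the nested-archive suffixes (.zip, .whl, .tar.gz) come from the commit message, and the function names are hypothetical rather than taken from cluster.py.

```python
import zipfile

# Assumption: which extensions count as "source files" is not stated in the PR.
SOURCE_SUFFIXES = (".py", ".js", ".ts", ".mjs", ".cjs", ".go", ".rb")
# From the commit message: nested archives count as content.
NESTED_ARCHIVE_SUFFIXES = (".zip", ".whl", ".tar.gz")


def has_analyzable_content(member_names) -> bool:
    """Return True if any archive member is a source file or a nested archive."""
    for name in member_names:
        lowered = name.lower()
        if lowered.endswith("/"):
            continue  # directory entry, not a file
        if lowered.endswith(SOURCE_SUFFIXES) or lowered.endswith(NESTED_ARCHIVE_SUFFIXES):
            return True
    return False


def is_empty_package(zip_path) -> bool:
    """Return True if the ZIP has no source files and no nested archives."""
    with zipfile.ZipFile(zip_path) as archive:
        return not has_analyzable_content(archive.namelist())
```

Under these assumptions, a ZIP that holds only a metadata JSON file is treated as empty, while one that wraps a second archive is not.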
Add 12 new YARA threat rules targeting common malware patterns that were previously undetected: download-and-execute chains, chr/hex obfuscation, PowerShell encoded commands, dynamic import+exec, reverse shells, Telegram/Discord exfil, DNS exfil, npm preinstall hooks, dependency confusion indicators, setup.py suspicious imports, and system info exfiltration.

Tighten 6 existing rules to reduce false positives:
- threat-process-hooks: inline meta rules, exclude prepare/prepack
- threat-process-injection-dll: remove overly broad .dll/.exe string matches
- threat-runtime-system-info: require 3+ calls instead of 1
- threat-process-spawn-silent: require both stdout+stderr suppressed
- threat-runtime-obfuscation-general: raise hex threshold to 50+, remove bracket notation
- threat-runtime-obfuscation-base64exec: tighten JS Buffer.from pattern, require explicit base64 encoding

Update risk engine:
- add "setup" and "npm" to valid categories (was silently dropping findings)
- make HIGH-specificity threats form standalone risks
- add cross-category risk formation
- add specificity gate (LOW-specificity-only capped at 4.9 unless MEDIUM+ specificity present)
- bump single-stage chain value from 0.3 to 0.4

Benchmark results (threshold 5.0, 1000 benign + 745 malicious packages):

| | Baseline | Final | Change |
|----------|----------|--------|---------|
| Recall | 79.3% | 87.0% | +7.7pp |
| Precision| 75.5% | 80.2% | +4.7pp |
| F1 | 77.3% | 83.5% | +6.2pp |
| MCC | 0.600 | 0.704 | +0.104 |
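To make the specificity gate concrete, here is a minimal sketch of the capping behaviour as the commit message describes it. The function name, the string labels, and the way a raw score is supplied are all assumptions; the real risk engine may structure this differently.

```python
# From the commit message: LOW-specificity-only risks are capped at 4.9,
# which keeps them below the 5.0 detection threshold.
LOW_ONLY_CAP = 4.9

# Assumption: specificity is represented by these three labels.
SPECIFICITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def apply_specificity_gate(raw_score: float, specificities: list[str]) -> float:
    """Cap the score unless at least one finding is MEDIUM specificity or higher."""
    has_medium_or_higher = any(
        SPECIFICITY_RANK[level] >= SPECIFICITY_RANK["MEDIUM"] for level in specificities
    )
    if has_medium_or_higher:
        return raw_score
    return min(raw_score, LOW_ONLY_CAP)
```

With this shape, several LOW-specificity findings cannot push a package over the 5.0 threshold on their own, while a single MEDIUM or HIGH finding leaves the score untouched.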
…lusters Removed 95 packages that had no source files (empty placeholders, dep confusion probes with no payload, nested-archive dataset bugs). Backfilled 94 replacements from previously unrepresented clusters for better diversity. Recall on cleaned dataset: 88.8% (was 87.0% on dirty dataset with empty packages dragging it down). PyPI recall 98.9%, compromised_lib 100%.
Add threat-runtime-obfuscation-log-suppress rule for console.log suppression combined with hex arrays/fromCharCode (common npm malware evasion). Extend threat-process-download-exec to catch Node.js child_process + fetch patterns. Recall now at 90.6% (threshold 5.0), up from 88.8%.
extension_scanner.scan_local was missing the info parameter added to the PackageScanner superclass, causing a mypy override error. test_single_runtime_threat expected 8.6 but single-stage chain value was changed from 0.3 to 0.4, making the correct score 8.8.
Replace per-package GitHub Contents API calls (~2 per package) with bulk Git Trees API (~10 calls total). Fixes sampling failures from rate limiting when resolving 1000+ packages. Add ASCII pipeline diagram to evals/README.md showing the full cluster -> sample -> scan -> report workflow.
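The bulk approach can be sketched as follows, assuming the dataset is browsed through GitHub's Git Trees endpoint with `recursive=1`. The helper names are hypothetical, the network call itself is left out, and the handling of truncated responses is a simplification: the ~10 calls mentioned above suggest the real code walks several subtrees.

```python
def trees_url(owner: str, repo: str, ref: str) -> str:
    """Build the Git Trees API URL that lists a whole tree in one request."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def zip_paths(tree_response: dict) -> list[str]:
    """Extract the paths of all .zip blobs from a Git Trees API response."""
    # GitHub sets "truncated" when the tree is too large for a single response;
    # a caller would then need to request subtrees individually.
    if tree_response.get("truncated"):
        raise ValueError("tree listing truncated; fetch subtrees separately")
    return [
        entry["path"]
        for entry in tree_response.get("tree", [])
        if entry.get("type") == "blob" and entry["path"].endswith(".zip")
    ]
```

One listing per subtree replaces a pair of Contents API calls per package, which is what keeps the request count independent of sample size.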
Build the ZIP index upfront and only sample from packages that actually have ZIPs in the dataset. Fixes resolution failures when the manifest lists packages that don't have archived samples yet. Resampled: 1251 packages (251 pypi + 1000 npm), 1 per cluster, max diversity.
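A minimal sketch of sampling one package per cluster, restricted to packages that appear in a prebuilt ZIP index. The data shapes (a cluster-to-packages mapping and a set of package names with ZIPs) and the function name are assumptions for illustration, not the structures used in recall.py.

```python
import random


def sample_one_per_cluster(
    clusters: dict[str, list[str]], zip_index: set[str], seed: int = 0
) -> list[str]:
    """Pick one package per cluster, considering only packages that have a ZIP."""
    rng = random.Random(seed)  # seeded so a resample is reproducible
    picked = []
    for cluster_id in sorted(clusters):
        available = [pkg for pkg in clusters[cluster_id] if pkg in zip_index]
        if available:  # clusters with no archived sample are skipped, not failed
            picked.append(rng.choice(available))
    return picked
```

Filtering against the index before choosing means a manifest entry without an archived sample never reaches the resolution step.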
The dataset stores scoped npm packages with @ as separator (@0xengine@meow) but the manifest uses / (@0xengine/meow). Convert when parsing tree paths so scoped packages are found during sampling. This adds ~1359 previously invisible packages to the available pool.
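The separator conversion can be sketched as below. The function name is hypothetical, and the sketch assumes the dataset name carries no version suffix after the package name.

```python
def dataset_name_to_manifest(name: str) -> str:
    """Convert a dataset-style scoped name (@scope@pkg) to manifest style (@scope/pkg)."""
    # Only scoped names start with "@" and contain a second "@" as the separator.
    if name.startswith("@") and "@" in name[1:]:
        scope, _, package = name[1:].partition("@")
        return f"@{scope}/{package}"
    # Unscoped names and already-converted names pass through unchanged.
    return name
```

For example, `dataset_name_to_manifest("@0xengine@meow")` yields `@0xengine/meow`, while an unscoped name such as `lodash` is returned as-is.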
Detection Benchmark Results
Benchmark comparing detection quality before (
Aggregate
By ecosystem
By category
What changed
- 12 new YARA threat rules targeting previously undetected patterns: download-and-execute chains, chr/hex obfuscation, PowerShell encoded commands, dynamic import+exec, reverse shells, Telegram/Discord exfil, DNS exfil, npm preinstall hooks, setup.py suspicious imports, system info exfiltration, and log suppression + obfuscation.
- 6 tightened existing rules to reduce FP rate: threat-process-hooks (inlined), threat-process-injection-dll, threat-runtime-system-info, threat-process-spawn-silent, threat-runtime-obfuscation-general, threat-runtime-obfuscation-base64exec.
- Risk engine improvements: added "setup"/"npm" categories (was silently dropping findings), standalone risk formation for HIGH-specificity threats, cross-category risk pairing, specificity gate (LOW-specificity-only capped below threshold), single-stage chain value bump.
- Eval infrastructure: cluster-aware sampling with empty-package filtering, scoped npm package support, bulk Git Trees API for ZIP resolution.
This PR introduces GuardDog v3.
This version replaces GuardDog's independent-alert model with a risk correlation engine.
Highlights:
Other changes: