OMEN evaluation is inspired by the deterministic, multi-dimensional assessment methodology of DaScient ARES-E. The core principle: mission software must be continuously and verifiably trustworthy, not just tested once before release.
| # | Dimension | Key Questions |
|---|---|---|
| 1 | Functional Correctness | Does the system do what it is supposed to do? |
| 2 | Mission Utility | Does it improve aircrew decision-making under operational conditions? |
| 3 | Performance | Does it meet startup, render, and response time targets? |
| 4 | DDIL Resilience | Does it remain useful when connectivity is degraded or absent? |
| 5 | Usability | Can aircrew operate it under realistic cognitive load? |
| 6 | Security | Does it resist adversarial inputs, unauthorized access, and tamper? |
| 7 | Interoperability | Does it correctly ingest and emit data across all supported protocols? |
| 8 | Maintainability | Can it be updated, debugged, and extended by a small team? |
| 9 | Auditability | Can every action, decision, and data transformation be traced? |
| 10 | Energy Efficiency | Does it stay within compute and thermal budgets on constrained hardware? |
Location: evaluation/harnesses/
Deterministic, reproducible test harnesses for each system component.
- Fixed random seeds for reproducibility
- Synthetic data only in automated harnesses (no operational data)
- Each harness produces a structured JSON report
- Reports include pass/fail verdict, dimension scores, and evidence artifacts
- Harnesses run offline (no external service dependencies)
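The common harness pattern can be sketched as follows. This is illustrative only: the seed value, dimension names, and file paths are assumptions, not the actual harness interface.

```python
# Illustrative sketch of the common harness pattern described above.
# SEED, score names, and paths are assumptions, not the real interface.
import json
import random

SEED = 1337  # fixed seed: every run generates identical synthetic data

def run_harness() -> dict:
    random.seed(SEED)
    # ... exercise the component under test with synthetic inputs ...
    scores = {"functional_correctness": 1.0, "performance": 0.92}
    report = {
        "harness": "example_harness",
        "verdict": "pass" if min(scores.values()) >= 0.9 else "fail",
        "dimension_scores": scores,
        "evidence": ["artifacts/example_trace.json"],
    }
    with open("example_report.json", "w") as f:
        json.dump(report, f, indent=2)  # structured JSON report
    return report

if __name__ == "__main__":
    print(run_harness()["verdict"])
```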
| Harness | Target | Location |
|---|---|---|
| engine_harness | Plugin runtime and event bus | harnesses/engine_harness.py |
| cal_harness | Normalization pipeline and adapters | harnesses/cal_harness.py |
| map_harness | Map rendering and overlay correctness | harnesses/map_harness.py |
| adapter_harness | Individual adapter contract compliance | harnesses/adapter_harness.py |
| ai_harness | AI/agentic service output validation | harnesses/ai_harness.py |
| security_harness | Input validation and policy enforcement | harnesses/security_harness.py |
Location: evaluation/ddil/
| Profile | Bandwidth | Latency | Packet Loss | Description |
|---|---|---|---|---|
| full | Unlimited | < 5 ms | 0% | Full connectivity baseline |
| degraded | 256 kbps | 200 ms | 2% | Degraded tactical link |
| intermittent | 64 kbps | 500 ms | 15% | Intermittent SATCOM |
| near-offline | 8 kbps | 2000 ms | 40% | Near-disconnected |
| offline | 0 | N/A | 100% | Fully disconnected |
Impairment is applied using tc netem (Linux traffic control) in the test environment. A containerized network shim is provided for CI use.
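For example, the degraded profile from the table above might be applied along these lines (eth0 is a placeholder for the test environment's interface):

```bash
# Apply the "degraded" profile: 256 kbps, 200 ms latency, 2% loss.
tc qdisc add dev eth0 root netem rate 256kbit delay 200ms loss 2%

# Remove the impairment to return to the full-connectivity baseline.
tc qdisc del dev eth0 root
```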
- Graceful degradation — transition from full → degraded → offline; verify display remains usable
- Link recovery — transition from offline → degraded → full; verify data reconciliation
- Partial source loss — one of three data sources goes offline; verify fallback behavior
- Mission package integrity — load mission package in offline mode; verify all layers render
- Cache eviction under resource pressure — force TTL eviction; verify staleness indicators appear
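The degradation and recovery scenarios above might be expressed as a single parametrized test along these lines. The network_shim and map_client fixtures and their methods are hypothetical stand-ins for the real harness interfaces:

```python
# Hypothetical sketch of the first two scenarios above; fixture names
# (network_shim, map_client) and their methods are assumptions.
import pytest

TRANSITIONS = [
    ("full", "degraded", "offline"),   # graceful degradation
    ("offline", "degraded", "full"),   # link recovery
]

@pytest.mark.parametrize("profiles", TRANSITIONS)
def test_display_survives_transition(network_shim, map_client, profiles):
    for profile in profiles:
        network_shim.apply(profile)          # tc netem under the hood
        snapshot = map_client.render_snapshot()
        assert snapshot.usable               # display stays usable throughout
    if profiles[-1] == "full":
        assert map_client.reconciled         # recovered data was reconciled
```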
Location: evaluation/red-team/
Adversarial validation tests that probe the system's boundaries.
| Category | Examples |
|---|---|
| Malformed inputs | Invalid XML, truncated CoT messages, out-of-bounds coordinates |
| Data poisoning | Spoofed track positions, forged threat overlays |
| Resource exhaustion | High-rate message injection, large payload flooding |
| Protocol abuse | Replay attacks, out-of-sequence messages |
| Schema drift | Future-version schemas the current adapter has not seen |
| AI adversarial | Adversarial prompts to AI summarization services |
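As an illustration of the malformed-input category, a probe might feed an adapter deliberately broken CoT messages and assert that each is rejected and audited. The cot_adapter fixture and its result fields are assumptions, not the actual red-team interface:

```python
# Hypothetical malformed-input probe; cot_adapter and its result
# object are illustrative, not the actual red-team interface.
import pytest

MALFORMED_COT = [
    b"<event ver",                                    # truncated message
    b"<event><point lat='999.9' lon='0'/></event>",   # out-of-bounds coordinate
    b"\xff\xfe\x00 not xml",                          # invalid XML entirely
]

@pytest.mark.parametrize("payload", MALFORMED_COT)
def test_adapter_rejects_malformed_cot(cot_adapter, payload):
    result = cot_adapter.ingest(payload)
    assert result.rejected        # never partially applied to canonical state
    assert result.audit_logged    # every rejection leaves an audit trail
```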
Red-team scenarios are run in isolated environments and produce:
- A structured findings report
- Pass/fail verdict per scenario
- Evidence artifacts (logs, traces, captures)
- Recommended mitigations for any failures
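A findings report might take a shape like the following; the exact field names are illustrative:

```json
{
  "scenario_id": "redteam-replay-03",
  "category": "protocol_abuse",
  "verdict": "fail",
  "evidence": ["logs/replay-03.log", "captures/replay-03.pcap"],
  "recommended_mitigations": [
    "Reject messages whose sequence number precedes the last accepted one"
  ]
}
```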
Location: evaluation/scenarios/
HITL review gates are triggered for:
- AI-generated route recommendations
- Threat summarization outputs
- Conflict resolution suggestions from the CAL
- Any AI action with mission-level consequence
```
AI Service → [Generate Recommendation] → [HITL Queue]
                                               │
                                Reviewer receives recommendation
                                with provenance, confidence, and
                                      supporting evidence
                                               │
                                   ┌───────────┴───────────┐
                                   ▼                       ▼
                               [Approve]           [Reject / Modify]
                                   │                       │
                            Action proceeds        Action blocked;
                            with HITL stamp        feedback logged
```
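In code, the gate reduces to a simple invariant: no AI action executes until a reviewer verdict is recorded. A minimal sketch, with all class and field names illustrative:

```python
# Minimal sketch of the HITL gate invariant; names are illustrative,
# not the actual service interface.
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    summary: str
    provenance: list
    confidence: float
    hitl_stamp: str = ""          # empty until a reviewer approves

@dataclass
class HitlQueue:
    pending: list = field(default_factory=list)
    feedback_log: list = field(default_factory=list)

    def submit(self, rec: Recommendation) -> None:
        self.pending.append(rec)  # action is parked, never auto-executed

    def review(self, rec: Recommendation, approve: bool, reviewer: str) -> bool:
        self.pending.remove(rec)
        if approve:
            rec.hitl_stamp = reviewer   # action proceeds with HITL stamp
            return True
        self.feedback_log.append(rec)   # action blocked; feedback logged
        return False
```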
Location: evaluation/scenarios/
Operational scenarios can be recorded and replayed deterministically for:
- Regression testing after code changes
- Evaluating new adapter or plugin behavior against known-good scenarios
- Training and demonstration
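Scenarios are stored as JSON event logs like the example below. A deterministic replay loop over that format might be sketched as follows; dispatch() and capture_state() are hypothetical stand-ins for the real engine hooks:

```python
# Sketch of a deterministic replay loop for the scenario format below;
# dispatch() and capture_state() are hypothetical engine hooks.
import json

def replay(scenario_path, dispatch, capture_state):
    with open(scenario_path) as f:
        scenario = json.load(f)
    # Events are replayed in timestamp order against a virtual clock,
    # so two runs over the same scenario produce identical state.
    for event in sorted(scenario["events"], key=lambda e: e["t"]):
        dispatch(event)
    state = capture_state()
    return all(state.get(criterion) for criterion in scenario["pass_criteria"])
```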
```json
{
  "scenario_id": "alpha-01",
  "description": "Single aircraft route with one NOTAM and two threat updates",
  "events": [
    {"t": 0, "type": "track_update", "source": "cot", "payload": "..."},
    {"t": 5, "type": "notam_received", "source": "notam", "payload": "..."},
    {"t": 12, "type": "threat_update", "source": "cot", "payload": "..."}
  ],
  "expected_state": { "..." : "..." },
  "pass_criteria": ["track_visible", "notam_overlay_active", "threat_corridor_rendered"]
}
```

A build is considered evaluation-passing when:
| Criterion | Threshold |
|---|---|
| No critical mission workflow breaks | 0 failures in engine_harness, cal_harness, map_harness |
| Stable offline operation | All offline DDIL scenarios pass within resource budget |
| Data provenance preserved | 100% of canonical entities carry complete provenance chain |
| UI usable under constrained conditions | Render time SLO met in degraded and intermittent profiles |
| No unapproved AI action | 0 AI actions executed without HITL approval in governance-gated scenarios |
| Logging and telemetry intact | 100% of auditable events produce a log entry |
```bash
cd evaluation
pip install -r requirements.txt

# Run all harnesses
pytest harnesses/ -v

# Run DDIL scenarios (requires Docker for network shim)
pytest ddil/ -v --docker

# Run red-team tests (isolated environment)
pytest red-team/ -v --isolated

# Generate evaluation report
python report.py --output evaluation-report.json
```