
# Evaluation Framework

## Philosophy

OMEN evaluation is inspired by the deterministic, multi-dimensional assessment methodology of DaScient ARES-E. The core principle: mission software must be continuously and verifiably trustworthy, not just tested once before release.


## Evaluation Dimensions

| # | Dimension | Key Question |
|---|-----------|--------------|
| 1 | Functional Correctness | Does the system do what it is supposed to do? |
| 2 | Mission Utility | Does it improve aircrew decision-making under operational conditions? |
| 3 | Performance | Does it meet startup, render, and response-time targets? |
| 4 | DDIL Resilience | Does it remain useful when connectivity is degraded or absent? |
| 5 | Usability | Can aircrew operate it under realistic cognitive load? |
| 6 | Security | Does it resist adversarial inputs, unauthorized access, and tampering? |
| 7 | Interoperability | Does it correctly ingest and emit data across all supported protocols? |
| 8 | Maintainability | Can it be updated, debugged, and extended by a small team? |
| 9 | Auditability | Can every action, decision, and data transformation be traced? |
| 10 | Energy Efficiency | Does it stay within compute and thermal budgets on constrained hardware? |

## Test Harnesses

**Location:** `evaluation/harnesses/`

Deterministic, reproducible test harnesses for each system component.

### Harness Design Principles

- Fixed random seeds for reproducibility
- Synthetic data only in automated harnesses (no operational data)
- Each harness produces a structured JSON report
- Reports include pass/fail verdict, dimension scores, and evidence artifacts
- Harnesses run offline (no external service dependencies)
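To make these principles concrete, here is a minimal sketch of what a harness and its report could look like. The report fields and the `normalize_lat` target are illustrative stand-ins, not the actual harness schema:

```python
# Illustrative sketch of the harness contract described above; the report
# schema and the normalize_lat target are hypothetical, not the real ones.
import json
import random


def normalize_lat(lat: float) -> float:
    """Stand-in for a component under test: clamp latitude to [-90, 90]."""
    return max(-90.0, min(90.0, lat))


def run_example_harness(seed: int = 42) -> dict:
    rng = random.Random(seed)  # fixed seed: identical inputs on every run
    inputs = [rng.uniform(-200.0, 200.0) for _ in range(100)]  # synthetic data only

    failures = [x for x in inputs if not -90.0 <= normalize_lat(x) <= 90.0]
    return {
        "harness": "example_harness",
        "seed": seed,
        "verdict": "pass" if not failures else "fail",
        "dimension_scores": {"functional_correctness": 1.0 - len(failures) / len(inputs)},
        "evidence": {"input_count": len(inputs), "failing_inputs": failures},
    }


if __name__ == "__main__":
    print(json.dumps(run_example_harness(), indent=2))  # structured JSON report
```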

### Available Harnesses

| Harness | Target | Location |
|---------|--------|----------|
| `engine_harness` | Plugin runtime and event bus | `harnesses/engine_harness.py` |
| `cal_harness` | Normalization pipeline and adapters | `harnesses/cal_harness.py` |
| `map_harness` | Map rendering and overlay correctness | `harnesses/map_harness.py` |
| `adapter_harness` | Individual adapter contract compliance | `harnesses/adapter_harness.py` |
| `ai_harness` | AI/agentic service output validation | `harnesses/ai_harness.py` |
| `security_harness` | Input validation and policy enforcement | `harnesses/security_harness.py` |

## DDIL Simulation

**Location:** `evaluation/ddil/`

### Network Impairment Profiles

| Profile | Bandwidth | Latency | Packet Loss | Description |
|---------|-----------|---------|-------------|-------------|
| `full` | Unlimited | < 5 ms | 0% | Full-connectivity baseline |
| `degraded` | 256 kbps | 200 ms | 2% | Degraded tactical link |
| `intermittent` | 64 kbps | 500 ms | 15% | Intermittent SATCOM |
| `near-offline` | 8 kbps | 2000 ms | 40% | Near-disconnected |
| `offline` | 0 | N/A | 100% | Fully disconnected |

Impairment is applied using `tc netem` (Linux traffic control) in the test environment. A containerized network shim is provided for CI use.
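As an illustration, a profile from the table above can be applied through a thin wrapper around `tc`. The interface name and helper functions below are assumptions for this sketch (root privileges required), not part of the shipped tooling:

```python
# Illustrative wrapper around tc netem for the profiles above; the
# interface name and helper functions are assumptions, not project code.
import subprocess


def apply_impairment(interface: str, latency_ms: int, loss_pct: float, rate_kbps: int) -> None:
    """Apply a netem impairment profile to a network interface (requires root)."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", interface, "root", "netem",
         "delay", f"{latency_ms}ms",
         "loss", f"{loss_pct}%",
         "rate", f"{rate_kbps}kbit"],
        check=True,
    )


def clear_impairment(interface: str) -> None:
    """Remove any active impairment from the interface."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


# Example: the degraded profile (256 kbps, 200 ms, 2% loss)
# apply_impairment("eth0", latency_ms=200, loss_pct=2.0, rate_kbps=256)
```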

### DDIL Test Scenarios

1. **Graceful degradation** — transition from full → degraded → offline; verify display remains usable
2. **Link recovery** — transition from offline → degraded → full; verify data reconciliation
3. **Partial source loss** — one of three data sources goes offline; verify fallback behavior
4. **Mission package integrity** — load mission package in offline mode; verify all layers render
5. **Cache eviction under resource pressure** — force TTL eviction; verify staleness indicators appear
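Scenario 1, for example, might be driven along these lines; the sketch below is a hypothetical pytest-style test in which `apply_impairment` and `display_is_usable` are placeholders for the real hooks:

```python
# Hypothetical pytest-style sketch of scenario 1 (graceful degradation);
# profile values come from the table above, both checks are placeholders.
PROFILES = [
    ("full", {"latency_ms": 5, "loss_pct": 0.0, "rate_kbps": 1_000_000}),
    ("degraded", {"latency_ms": 200, "loss_pct": 2.0, "rate_kbps": 256}),
    ("offline", {"latency_ms": 0, "loss_pct": 100.0, "rate_kbps": 8}),
]


def apply_impairment(interface: str, latency_ms: int, loss_pct: float, rate_kbps: int):
    """Placeholder; see the tc netem wrapper sketched earlier."""


def display_is_usable() -> bool:
    """Placeholder for the real render- and response-time assertions."""
    return True


def test_graceful_degradation():
    # Walk full -> degraded -> offline; the display must stay usable throughout.
    for name, params in PROFILES:
        apply_impairment("eth0", **params)
        assert display_is_usable(), f"display unusable under {name!r} profile"
```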

## Red-Team and Fault Injection

**Location:** `evaluation/red-team/`

Adversarial validation tests that probe the system's boundaries.

### Fault Categories

| Category | Examples |
|----------|----------|
| Malformed inputs | Invalid XML, truncated CoT messages, out-of-bounds coordinates |
| Data poisoning | Spoofed track positions, forged threat overlays |
| Resource exhaustion | High-rate message injection, large-payload flooding |
| Protocol abuse | Replay attacks, out-of-sequence messages |
| Schema drift | Future-version schemas the current adapter has not seen |
| AI adversarial | Adversarial prompts to AI summarization services |
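As one example of a malformed-input probe, the hypothetical test below feeds deterministic truncations of a valid CoT message to an adapter and requires a clean rejection; `parse_cot` is a stand-in for the real adapter entry point:

```python
# Hypothetical malformed-input probe; parse_cot stands in for the real
# CoT adapter entry point, which is not shown in this document.
import random
import xml.etree.ElementTree as ET

VALID_COT = (
    '<event version="2.0" uid="a1" type="a-f-A">'
    '<point lat="34.0" lon="-117.0" hae="0" ce="10" le="10"/></event>'
)


def parse_cot(message: str) -> ET.Element:
    """Stand-in adapter: reject anything that is not well-formed XML."""
    try:
        return ET.fromstring(message)
    except ET.ParseError as exc:
        raise ValueError("malformed CoT message") from exc


def test_truncated_cot_is_rejected():
    rng = random.Random(7)  # fixed seed for reproducibility
    for _ in range(20):
        variant = VALID_COT[: rng.randrange(1, len(VALID_COT))]
        try:
            parse_cot(variant)
        except ValueError:
            continue  # clean rejection is the expected outcome
        raise AssertionError(f"adapter accepted truncated input: {variant!r}")
```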

### Red-Team Execution

Red-team scenarios are run in isolated environments and produce:

- A structured findings report
- Pass/fail verdict per scenario
- Evidence artifacts (logs, traces, captures)
- Recommended mitigations for any failures

## Human-in-the-Loop (HITL) Review

**Location:** `evaluation/scenarios/`

### Review Gates

HITL review gates are triggered for:

- AI-generated route recommendations
- Threat summarization outputs
- Conflict resolution suggestions from the CAL
- Any AI action with mission-level consequence

### HITL Workflow

```
AI Service → [Generate Recommendation] → [HITL Queue]
                                              │
                                    Reviewer receives recommendation
                                    with provenance, confidence, and
                                    supporting evidence
                                              │
                                    ┌─────────┴───────────┐
                                    ▼                     ▼
                               [Approve]             [Reject / Modify]
                                    │                     │
                            Action proceeds         Action blocked;
                            with HITL stamp          feedback logged
```
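A minimal sketch of this gate, assuming a simple in-memory model; the `Recommendation` fields and `review` function are illustrative, not the framework's actual API:

```python
# Minimal in-memory sketch of the HITL gate above; field names and the
# review interface are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    action: str
    provenance: str                  # where the supporting inputs came from
    confidence: float                # model-reported confidence
    evidence: list[str] = field(default_factory=list)
    hitl_stamp: str | None = None    # set only on approval


def review(rec: Recommendation, approved: bool, reviewer: str, log: list) -> bool:
    """Apply the reviewer's decision; blocked actions are logged, never executed."""
    log.append({"action": rec.action, "reviewer": reviewer, "approved": approved})
    if approved:
        rec.hitl_stamp = reviewer    # action proceeds with the HITL stamp
        return True
    return False                     # action blocked; feedback stays in the log
```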

## Scenario Replay

**Location:** `evaluation/scenarios/`

Operational scenarios can be recorded and replayed deterministically for:

- Regression testing after code changes
- Evaluating new adapter or plugin behavior against known-good scenarios
- Training and demonstration

### Scenario Format

```json
{
  "scenario_id": "alpha-01",
  "description": "Single aircraft route with one NOTAM and two threat updates",
  "events": [
    {"t": 0, "type": "track_update", "source": "cot", "payload": "..."},
    {"t": 5, "type": "notam_received", "source": "notam", "payload": "..."},
    {"t": 12, "type": "threat_update", "source": "cot", "payload": "..."}
  ],
  "expected_state": { "..." : "..." },
  "pass_criteria": ["track_visible", "notam_overlay_active", "threat_corridor_rendered"]
}
```
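A replay driver can then walk the event list in recorded order and evaluate the pass criteria at the end. The sketch below assumes hypothetical `dispatch` and `check_criterion` hooks into the engine:

```python
# Hypothetical replay driver for the scenario format above; dispatch
# and check_criterion stand in for the real engine hooks.
import json


def replay(path: str, dispatch, check_criterion) -> dict:
    """Replay a recorded scenario and return a verdict per pass criterion."""
    with open(path) as f:
        scenario = json.load(f)

    # Dispatch events in recorded order; simulated time is compressed
    # rather than slept through, so replays stay fast and deterministic.
    for event in sorted(scenario["events"], key=lambda e: e["t"]):
        dispatch(event)

    results = {c: check_criterion(c) for c in scenario["pass_criteria"]}
    return {
        "scenario_id": scenario["scenario_id"],
        "verdict": "pass" if all(results.values()) else "fail",
        "criteria": results,
    }
```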

## Pass/Fail Criteria

A build is considered evaluation-passing when all of the following hold:

| Criterion | Threshold |
|-----------|-----------|
| No critical mission workflow breaks | 0 failures in `engine_harness`, `cal_harness`, `map_harness` |
| Stable offline operation | All offline DDIL scenarios pass within resource budget |
| Data provenance preserved | 100% of canonical entities carry a complete provenance chain |
| UI usable under constrained conditions | Render-time SLO met in `degraded` and `intermittent` profiles |
| No unapproved AI action | 0 AI actions executed without HITL approval in governance-gated scenarios |
| Logging and telemetry intact | 100% of auditable events produce a log entry |
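Expressed as code, the gate is a conjunction over all six rows. Every field name in the sketch below is a hypothetical aggregation of harness output, not a confirmed report schema:

```python
# Illustrative gate over aggregated evaluation results; every key below
# is a hypothetical field name, not a confirmed report schema.
CRITICAL_HARNESSES = ("engine_harness", "cal_harness", "map_harness")


def build_passes(results: dict) -> bool:
    """A build is evaluation-passing only if every criterion holds."""
    return all([
        all(results[h]["failures"] == 0 for h in CRITICAL_HARNESSES),
        results["ddil"]["offline_scenarios_all_pass"],
        results["provenance"]["coverage"] == 1.0,    # 100% provenance chains
        results["ui"]["render_slo_met_degraded"],
        results["ai"]["unapproved_actions"] == 0,    # HITL gate held
        results["telemetry"]["audit_log_coverage"] == 1.0,
    ])
```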

## Running the Evaluation Suite

```bash
cd evaluation
pip install -r requirements.txt

# Run all harnesses
pytest harnesses/ -v

# Run DDIL scenarios (requires Docker for network shim)
pytest ddil/ -v --docker

# Run red-team tests (isolated environment)
pytest red-team/ -v --isolated

# Generate evaluation report
python report.py --output evaluation-report.json
```

## Related Documents