[Feature] Hardware-aware model recommendation + optional model performance tracking #453

@jkrauska

Outcome

A user picking a local model can do it with confidence: Cotabby inspects their actual Mac (chip, total and free RAM, GPU/Metal budget), recommends the best-fit model, clearly warns when a model won't fit, and — optionally — measures real generation speed so the user can see the concrete performance impact of changing models. All local, all offline.

Problem / Motivation

  • Model selection today is raw Hugging Face browsing (HuggingFaceSearchService) with no guidance on what fits or runs well. A user can pick a model too large for their RAM (swap thrash / OOM / crawl) or needlessly small (leaving quality on the table).
  • After switching models there's no feedback on whether it got faster or slower.

Proposed Solution

Part A — Hardware-aware recommendation

Detection (extend Support/DeviceInfo.swift, still side-effect-free):

  • Chip class (Apple Silicon vs Intel), P/E core counts (hw.perflevel0.physicalcpu, hw.perflevel1.physicalcpu), GPU core count where available.
  • Free/available RAM: host_statistics64 + vm_statistics64 ((free + inactive + purgeable) × page size) and memory-pressure state; account for unified memory on Apple Silicon.
  • Metal budget: MTLDevice.recommendedMaxWorkingSetSize, the approximate amount of memory the GPU can allocate without degrading performance, used here as the practical ceiling for the llama Metal backend.
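
The detection calls listed above could be sketched roughly as follows. This is a minimal sketch, not the actual DeviceInfo.swift: the function names (availableBytes, sysctlInt, readAvailableMemoryBytes, metalBudgetBytes) are hypothetical, and treating free + inactive + purgeable pages as "available" is the heuristic proposed in this issue rather than an official definition.

```swift
#if canImport(Darwin)
import Darwin
#endif
#if canImport(Metal)
import Metal
#endif

/// Pure helper: (free + inactive + purgeable) pages × page size, in bytes.
func availableBytes(freePages: UInt64, inactivePages: UInt64,
                    purgeablePages: UInt64, pageSize: UInt64) -> UInt64 {
    (freePages + inactivePages + purgeablePages) * pageSize
}

#if canImport(Darwin)
/// Reads an integer sysctl such as "hw.perflevel0.physicalcpu".
/// Returns nil when the key does not exist (for example on Intel Macs).
func sysctlInt(_ name: String) -> Int? {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

/// Queries the Mach host for VM statistics and converts page counts to bytes.
func readAvailableMemoryBytes() -> UInt64? {
    var stats = vm_statistics64_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<vm_statistics64_data_t>.size / MemoryLayout<integer_t>.size)
    let status = withUnsafeMutablePointer(to: &stats) { pointer in
        pointer.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            host_statistics64(mach_host_self(), HOST_VM_INFO64, $0, &count)
        }
    }
    guard status == KERN_SUCCESS else { return nil }
    return availableBytes(freePages: UInt64(stats.free_count),
                          inactivePages: UInt64(stats.inactive_count),
                          purgeablePages: UInt64(stats.purgeable_count),
                          pageSize: UInt64(vm_kernel_page_size))
}
#endif

#if canImport(Metal)
/// Metal's recommended working-set size for the default GPU, if one exists.
func metalBudgetBytes() -> UInt64? {
    MTLCreateSystemDefaultDevice()?.recommendedMaxWorkingSetSize
}
#endif
```

On Apple Silicon, sysctlInt("hw.perflevel0.physicalcpu") and sysctlInt("hw.perflevel1.physicalcpu") would give the P and E core counts. Keeping the byte arithmetic in a pure helper leaves it unit-testable without touching the host.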

Fit logic (new pure Support/ModelFitEvaluator.swift + Models/CuratedModelCatalog.swift):

  • Each catalog entry carries an estimated footprint (params × quant bytes/param + KV-cache for the configured context).
  • Classify per model for this host: Recommended / Will run (tight) / Too large, against free RAM and Metal budget with a headroom margin.
  • Default recommendation = largest model that comfortably fits and clears a quality floor.
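
The fit logic above could look roughly like this. It is a sketch under stated assumptions, not the proposed ModelFitEvaluator itself: the type and function names are hypothetical, the 25% headroom default is an illustrative value, the KV-cache estimate uses the common 2 × layers × KV heads × head dimension × context × bytes-per-element approximation, and a minimum parameter count stands in for the "quality floor".

```swift
/// Hypothetical catalog entry; field names are assumptions for illustration.
struct CatalogEntry {
    let name: String
    let parameterCount: Double     // e.g. 4.0e9
    let bytesPerParameter: Double  // depends on quantization, e.g. about 0.5 for 4-bit
    let layers: Int
    let kvHeads: Int
    let headDim: Int
}

enum Fit { case recommended, tight, tooLarge }

/// KV-cache estimate: keys and values (factor 2) for every layer and position.
func kvCacheBytes(_ m: CatalogEntry, context: Int, bytesPerElement: Int = 2) -> Double {
    Double(2 * m.layers * m.kvHeads * m.headDim * context * bytesPerElement)
}

/// Footprint = params × quant bytes/param + KV-cache for the configured context.
func footprintBytes(_ m: CatalogEntry, context: Int) -> Double {
    m.parameterCount * m.bytesPerParameter + kvCacheBytes(m, context: context)
}

/// Classifies against the smaller of free RAM and the Metal budget.
/// `headroom` is the fraction of the budget kept in reserve (assumed value).
func classify(footprint: Double, freeRAM: Double, metalBudget: Double,
              headroom: Double = 0.25) -> Fit {
    let budget = min(freeRAM, metalBudget)
    if footprint > budget { return .tooLarge }
    if footprint > budget * (1 - headroom) { return .tight }
    return .recommended
}

/// Largest model that fits comfortably and clears the (assumed) quality floor.
func defaultRecommendation(catalog: [CatalogEntry], context: Int,
                           freeRAM: Double, metalBudget: Double,
                           minParameters: Double) -> CatalogEntry? {
    catalog
        .filter { $0.parameterCount >= minParameters }
        .filter {
            classify(footprint: footprintBytes($0, context: context),
                     freeRAM: freeRAM, metalBudget: metalBudget) == .recommended
        }
        .max { $0.parameterCount < $1.parameterCount }
}
```

Because every function is pure, the evaluator can be tested with synthetic hardware numbers, and the headroom margin can be tuned once real footprints have been measured.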

UI (EngineAndModelPaneView):

  • Detected-hardware summary (chip, total + free RAM, GPU).
  • Per-model badge: ✅ Recommended / ⚠️ Tight / ⛔ Won't fit, with a live free-RAM readout.
  • Confirm/warn before downloading or loading an over-budget model.

Part B — Model performance tracking (opt-in)

Builds on existing infra: LLMIOFileHandler already writes latency_ms + request_id to llm-io.jsonl, and LlamaRuntimeCore owns generation.

  • Per-generation metrics: time-to-first-token (TTFT), total latency, decode tokens/sec, prompt token count, eval count, engine + model id + quant + context length.
  • Aggregation (new Services/Metrics/ModelPerformanceTracker.swift, opt-in, local-only): rolling per-model median/p90 TTFT and tok/s + sample count; persisted locally (UserDefaults rolling aggregate now, or a future encrypted store).
  • UI: a "Performance" section showing the current model's median TTFT & tok/s and a comparison vs the previously used model (e.g. "Gemma 3 4B: 180 ms TTFT, 28 tok/s — 2.1× slower than Llama-3.2 1B").
  • Privacy: strictly local, opt-in, wipeable, never networked.
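
The per-generation metrics and rolling aggregation could be sketched as below. All names here (GenerationSample, RollingWindow, percentile) are hypothetical and not part of the existing codebase. Two choices are assumptions rather than anything this issue specifies: decode tokens/sec excludes the first token and the time before it, and percentiles use the nearest-rank method.

```swift
/// One generation's timings; field names are assumptions for illustration.
struct GenerationSample {
    let ttftMs: Double          // time to first token
    let totalLatencyMs: Double  // request start to last token
    let evalCount: Int          // number of generated tokens

    /// Decode rate: tokens after the first, divided by time after the first token.
    var decodeTokensPerSecond: Double? {
        let decodeMs = totalLatencyMs - ttftMs
        guard evalCount > 1, decodeMs > 0 else { return nil }
        return Double(evalCount - 1) / (decodeMs / 1000)
    }
}

/// Nearest-rank percentile (p in 0...1); for an even count the median
/// is the lower of the two middle values.
func percentile(_ values: [Double], _ p: Double) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let rank = Int((p * Double(sorted.count)).rounded(.up))
    return sorted[min(max(rank, 1), sorted.count) - 1]
}

/// Keeps only the most recent `capacity` samples for one model.
struct RollingWindow {
    let capacity: Int
    private(set) var samples: [GenerationSample] = []

    init(capacity: Int) { self.capacity = capacity }

    mutating func add(_ sample: GenerationSample) {
        samples.append(sample)
        if samples.count > capacity {
            samples.removeFirst(samples.count - capacity)
        }
    }

    var medianTTFTMs: Double? { percentile(samples.map { $0.ttftMs }, 0.5) }
    var p90TTFTMs: Double? { percentile(samples.map { $0.ttftMs }, 0.9) }
    var medianTokensPerSecond: Double? {
        percentile(samples.compactMap { $0.decodeTokensPerSecond }, 0.5)
    }
}
```

A tracker could hold one such window per model id and compare the medians of the current and previous model to produce the "N× slower/faster" line in the UI.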

Acceptance Criteria

  • DeviceInfo reports free + total RAM, chip class, core/GPU counts, and Metal budget.
  • Model pane shows fit badges + live free RAM and warns/blocks on over-budget load.
  • With tracking ON, after a few generations the pane shows TTFT and tok/s, and switching models shows a side-by-side comparison.
  • Works fully offline; no new network calls; tracking is opt-in and clearable.

Out of Scope

  • Auto-downloading the recommended model without consent.
  • Any cloud benchmark sharing / telemetry.

Open Questions

  • Footprint estimation: static per-model table vs measured-after-load calibration?
  • Aggregate storage: UserDefaults vs a future encrypted local store?
  • Should the recommendation factor thermal state / battery (on battery → bias smaller)?
