Outcome
A user picking a local model can do it with confidence: Cotabby inspects their actual Mac (chip, total and free RAM, GPU/Metal budget), recommends the best-fit model, clearly warns when a model won't fit, and — optionally — measures real generation speed so the user can see the concrete performance impact of changing models. All local, all offline.
Problem / Motivation
- Model selection today is raw Hugging Face browsing (HuggingFaceSearchService) with no guidance on what fits or runs well. A user can pick a model too large for their RAM (swap thrash / OOM / crawl) or needlessly small (leaving quality on the table).
- After switching models there's no feedback on whether it got faster or slower.
Proposed Solution
Part A — Hardware-aware recommendation
Detection (extend Support/DeviceInfo.swift, still side-effect-free):
- Chip class (Apple Silicon vs Intel), P/E core counts (hw.perflevel0.physicalcpu, hw.perflevel1.physicalcpu), GPU core count where available.
- Free/available RAM: host_statistics64 + vm_statistics64 ((free + inactive + purgeable pages) × page size) and memory-pressure state; account for unified memory on Apple Silicon.
- Metal budget: MTLDevice.recommendedMaxWorkingSetSize, the practical GPU memory ceiling for the llama Metal backend.
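The detection bullets above could be sketched roughly as follows. This is a minimal illustration, not the existing DeviceInfo code: the function names sysctlInt and availableMemoryBytes are hypothetical, and the non-Darwin stubs exist only so the snippet compiles elsewhere. The Metal budget is omitted here because it needs the Metal framework.

```swift
#if canImport(Darwin)
import Darwin

/// Reads a 32-bit integer sysctl such as "hw.perflevel0.physicalcpu".
/// Returns nil when the key is unavailable on this machine.
/// (Hypothetical helper name; not part of the existing DeviceInfo.)
func sysctlInt(_ name: String) -> Int? {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

/// Estimates reclaimable memory as (free + inactive + purgeable pages) x page size.
/// Returns nil if the Mach call fails.
func availableMemoryBytes() -> UInt64? {
    var stats = vm_statistics64()
    var count = mach_msg_type_number_t(
        MemoryLayout<vm_statistics64_data_t>.stride / MemoryLayout<integer_t>.stride
    )
    let result = withUnsafeMutablePointer(to: &stats) { pointer in
        pointer.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            host_statistics64(mach_host_self(), HOST_VM_INFO64, $0, &count)
        }
    }
    guard result == KERN_SUCCESS else { return nil }

    var pageSize: vm_size_t = 0
    guard host_page_size(mach_host_self(), &pageSize) == KERN_SUCCESS else { return nil }

    let pages = UInt64(stats.free_count)
        + UInt64(stats.inactive_count)
        + UInt64(stats.purgeable_count)
    return pages * UInt64(pageSize)
}
#else
// Stubs for non-Apple platforms so the sketch still compiles.
func sysctlInt(_ name: String) -> Int? { nil }
func availableMemoryBytes() -> UInt64? { nil }
#endif

// Usage: both calls are read-only queries, which keeps detection side-effect-free.
let performanceCores = sysctlInt("hw.perflevel0.physicalcpu")
let efficiencyCores = sysctlInt("hw.perflevel1.physicalcpu")
let freeBytes = availableMemoryBytes()
```

Returning optionals rather than zero lets the UI distinguish "unknown" from "none", which matters on machines where the perflevel keys are not present.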
Fit logic (new pure Support/ModelFitEvaluator.swift + Models/CuratedModelCatalog.swift):
- Each catalog entry carries an estimated footprint (params × quant bytes/param + KV-cache for the configured context).
- Classify per model for this host: Recommended / Will run (tight) / Too large, against free RAM and Metal budget with a headroom margin.
- Default recommendation = largest model that comfortably fits and clears a quality floor.
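One way the fit logic above might look as pure functions is sketched below. All type and function names (CatalogEntry, HostBudget, FitClass, classify, and so on) are hypothetical, and several choices are assumptions rather than decisions from this proposal: the effective budget is taken as the smaller of free RAM and the Metal budget, the headroom margin is 20%, "largest" is measured by parameter count, and the KV-cache estimate uses the common 2 × layers × KV heads × head dimension × context × bytes-per-element formula.

```swift
/// Hypothetical catalog entry; field names are illustrative only.
struct CatalogEntry {
    let name: String
    let parameterCount: Double     // e.g. 4e9 for a 4B model
    let bytesPerParameter: Double  // depends on quantization, e.g. 0.5 for ~4-bit
    let kvCacheBytes: Double       // for the configured context length
    let qualityScore: Double       // assumed 0...1 score used for the quality floor
}

/// Hypothetical view of the host's memory limits, in bytes.
struct HostBudget {
    let freeRAMBytes: Double
    let metalBudgetBytes: Double
}

enum FitClass { case recommended, tight, tooLarge }

/// Common KV-cache estimate: keys and values (factor 2) per layer, head, and position.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  context: Int, bytesPerElement: Int) -> Double {
    2.0 * Double(layers) * Double(kvHeads) * Double(headDim)
        * Double(context) * Double(bytesPerElement)
}

/// params x quant bytes/param + KV-cache, as described in the fit-logic bullets.
func estimatedFootprint(_ entry: CatalogEntry) -> Double {
    entry.parameterCount * entry.bytesPerParameter + entry.kvCacheBytes
}

/// Classifies a model against the tighter of the two budgets.
/// `headroom` is the fraction of the budget a "comfortable" fit may use (assumed 0.8).
func classify(_ entry: CatalogEntry, host: HostBudget, headroom: Double = 0.8) -> FitClass {
    let budget = min(host.freeRAMBytes, host.metalBudgetBytes)
    let footprint = estimatedFootprint(entry)
    if footprint <= budget * headroom { return .recommended }
    if footprint <= budget { return .tight }
    return .tooLarge
}

/// Largest comfortably fitting model that clears the quality floor, if any.
func defaultRecommendation(from catalog: [CatalogEntry], host: HostBudget,
                           qualityFloor: Double) -> CatalogEntry? {
    catalog
        .filter { classify($0, host: host) == .recommended && $0.qualityScore >= qualityFloor }
        .max { $0.parameterCount < $1.parameterCount }
}
```

Keeping these as free functions over plain values means the evaluator can be unit-tested with synthetic hosts and catalogs, without touching real hardware detection.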
UI (EngineAndModelPaneView):
- Detected-hardware summary (chip, total + free RAM, GPU).
- Per-model badge: ✅ Recommended / ⚠️ Tight / ⛔ Won't fit, with a live free-RAM readout.
- Confirm/warn before downloading or loading an over-budget model.
Part B — Model performance tracking (opt-in)
Builds on existing infra: LLMIOFileHandler already writes latency_ms + request_id to llm-io.jsonl, and LlamaRuntimeCore owns generation.
- Per-generation metrics: time-to-first-token (TTFT), total latency, decode tokens/sec, prompt token count, eval count, engine + model id + quant + context length.
- Aggregation (new Services/Metrics/ModelPerformanceTracker.swift, opt-in, local-only): rolling per-model median/p90 TTFT and tok/s + sample count; persisted locally (UserDefaults rolling aggregate now, or a future encrypted store).
- UI: a "Performance" section showing the current model's median TTFT & tok/s and a comparison vs the previously used model (e.g. "Gemma 3 4B: 180 ms TTFT, 28 tok/s — 2.1× slower than Llama-3.2 1B").
- Privacy: strictly local, opt-in, wipeable, never networked.
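The per-generation metrics and rolling aggregation described above might be sketched like this. The names GenerationSample, RollingStats, percentile, and slowdownFactor are hypothetical, and two choices are assumptions: decode tokens/sec is computed over the window after the first token, and percentiles use the nearest-rank method. Persistence and the opt-in switch are left out.

```swift
/// Hypothetical per-generation sample; field names are illustrative only.
struct GenerationSample {
    let ttftMs: Double          // time to first token
    let totalLatencyMs: Double  // request start to last token
    let evalCount: Int          // number of generated tokens

    /// Decode rate after the first token; nil when it cannot be computed.
    var decodeTokensPerSecond: Double? {
        let decodeMs = totalLatencyMs - ttftMs
        guard evalCount > 1, decodeMs > 0 else { return nil }
        return Double(evalCount - 1) / (decodeMs / 1000)
    }
}

/// Nearest-rank percentile (p in 0...100); nil for an empty input.
func percentile(_ values: [Double], _ p: Double) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[min(max(rank, 1), sorted.count) - 1]
}

/// Rolling window of the most recent samples for one model.
struct RollingStats {
    private(set) var samples: [GenerationSample] = []
    let capacity: Int

    init(capacity: Int = 200) { self.capacity = capacity }

    /// Appends a sample and drops the oldest ones beyond `capacity`.
    mutating func record(_ sample: GenerationSample) {
        samples.append(sample)
        if samples.count > capacity {
            samples.removeFirst(samples.count - capacity)
        }
    }

    var sampleCount: Int { samples.count }
    var medianTTFTMs: Double? { percentile(samples.map { $0.ttftMs }, 50) }
    var p90TTFTMs: Double? { percentile(samples.map { $0.ttftMs }, 90) }
    var medianTokensPerSecond: Double? {
        percentile(samples.compactMap { $0.decodeTokensPerSecond }, 50)
    }
}

/// How many times slower the current model decodes than the previous one
/// (values below 1 mean it is faster); nil if either rate is not positive.
func slowdownFactor(currentTokPerSec: Double, previousTokPerSec: Double) -> Double? {
    guard currentTokPerSec > 0, previousTokPerSec > 0 else { return nil }
    return previousTokPerSec / currentTokPerSec
}
```

A bounded window keeps the stored aggregate small enough for UserDefaults and lets the numbers track recent conditions rather than the model's entire history.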
Acceptance Criteria
- DeviceInfo reports free + total RAM, chip class, core/GPU counts, and Metal budget.
Out of Scope
- Auto-downloading the recommended model without consent.
- Any cloud benchmark sharing / telemetry.
Open Questions
- Footprint estimation: static per-model table vs measured-after-load calibration?
- Aggregate storage: UserDefaults vs a future encrypted local store?
- Should the recommendation factor in thermal state / battery (on battery → bias smaller)?