[Feature] Add MLX inference backend #457

@FuJacob

Summary

Add an MLX-based inference backend as an alternative to the existing llama.cpp path, targeting Apple Silicon Macs.

Problem

The current local inference path uses llama.cpp, which runs on the CPU and can offload work to the GPU through ggml's Metal backend. MLX is Apple's own machine learning framework, optimized for Apple Silicon's unified memory architecture. On M-series chips, MLX can deliver higher throughput and lower latency than llama.cpp for the same model because it is designed from the ground up for the hardware.

Users with Apple Silicon Macs would get faster completions and lower energy draw from local inference without switching to the Apple Intelligence engine.

Proposed direction

  • Add a new SuggestionEngineKind case (e.g. .llamaMLX or .mlx) alongside the existing .llamaOpenSource and .appleIntelligence cases.
  • Implement an MLXSuggestionEngine conforming to the existing SuggestionEngineProtocol contract in SuggestionSubsystemContracts.swift.
  • Route through SuggestionEngineRouter the same way the llama path does today.
  • Support GGUF or MLX-native quantized weights (e.g. via mlx-community models on Hugging Face). Reuse the existing ModelDownloadManager / BundledRuntimeLocator where possible.
  • Gate the engine option on Apple Silicon availability so it never appears on Intel Macs.
  • Surface the new engine in the Engine picker in both the menu bar popup and Settings > Engine & Model.
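
The enum case and the Apple Silicon gate could be sketched roughly as follows. This is only an illustration under assumptions: the enum below is a hypothetical stand-in for the real SuggestionEngineKind (only the case names .llamaOpenSource and .appleIntelligence come from this issue), and `Platform` and `availableEngineKinds` are made-up names, not existing code in the repo.

```swift
// Hypothetical stand-in for the app's real enum; the actual declaration
// may carry raw values, associated data, or other conformances.
enum SuggestionEngineKind: CaseIterable {
    case llamaOpenSource
    case appleIntelligence
    case mlx  // proposed new case (could also be named .llamaMLX)
}

// Hypothetical helper. The check is compile-time: in a universal binary
// the arm64 slice reports true and the x86_64 slice reports false.
enum Platform {
    static var isAppleSilicon: Bool {
        #if arch(arm64) && os(macOS)
        return true
        #else
        return false
        #endif
    }
}

// Returns the engine kinds the pickers should offer. The parameter
// defaults to the real platform check but can be overridden in tests.
func availableEngineKinds(
    appleSilicon: Bool = Platform.isAppleSilicon
) -> [SuggestionEngineKind] {
    SuggestionEngineKind.allCases.filter { kind in
        // Hide the MLX engine everywhere except Apple Silicon.
        kind != .mlx || appleSilicon
    }
}
```

Both the menu bar popup and Settings > Engine & Model could read from one such list, so the MLX option never shows up on Intel Macs. The Apple Intelligence case likely needs its own availability check, which is left out here.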

Labels: enhancement

Status: Backlog