Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision

Models

Available Models

| Friendly Name | HuggingFace Repo | Parameters | Speed | Notes |
|---|---|---|---|---|
| tiny | mlx-community/whisper-tiny-mlx | 39M | Fastest | English-only variant available |
| tiny.en | mlx-community/whisper-tiny.en-mlx | 39M | Fastest | English-only |
| base | mlx-community/whisper-base-mlx | 74M | Fast | English-only variant available |
| base.en | mlx-community/whisper-base.en-mlx | 74M | Fast | English-only |
| small | mlx-community/whisper-small-mlx | 244M | Medium | Good accuracy/speed tradeoff |
| small.en | mlx-community/whisper-small.en-mlx | 244M | Medium | English-only |
| medium | mlx-community/whisper-medium-mlx | 769M | Slow | High accuracy |
| medium.en | mlx-community/whisper-medium.en-mlx | 769M | Slow | English-only |
| large-v3 | mlx-community/whisper-large-v3-mlx | 1.5B | Slowest | Highest accuracy |
| turbo | mlx-community/whisper-turbo | 809M | Fast | Speed-optimized large model |
| distil-large-v3 | mlx-community/distil-whisper-large-v3 | 756M | Fast | Distilled; best speed/accuracy balance |

Model Selection Guide

| Use Case | Recommended Model | Batch Size |
|---|---|---|
| Quick drafts, testing | tiny or base | 24-32 |
| General transcription | distil-large-v3 | 12-16 |
| Speed-critical production | turbo | 4-8 |
| Maximum accuracy | large-v3 | 4-8 |
| Memory-constrained | Any model + quant="4bit" | Adjust down |
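The selection table can be encoded as a small lookup helper. This is a sketch, not part of the Vayu API; the `pick_model` name and the use-case keys are hypothetical:

```python
# Encodes the selection table above as (model_name, suggested_batch_size).
# Hypothetical helper; not part of the Vayu API.
RECOMMENDATIONS = {
    "draft": ("base", 32),
    "general": ("distil-large-v3", 12),
    "speed": ("turbo", 8),
    "accuracy": ("large-v3", 4),
}

def pick_model(use_case: str) -> tuple[str, int]:
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}")

print(pick_model("general"))  # ('distil-large-v3', 12)
```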

Quantized Models

Vayu supports 4-bit and 8-bit quantized models for reduced memory usage:

```python
from whisper_mlx import LightningWhisperMLX

# 4-bit quantization
whisper = LightningWhisperMLX(model="distil-large-v3", quant="4bit")

# 8-bit quantization
whisper = LightningWhisperMLX(model="large-v3", quant="8bit")
```

Available quantized variants:

| Model | 4-bit | 8-bit |
|---|---|---|
| tiny | Yes | Yes |
| small | Yes | Yes |
| medium | Yes | Yes |
| large-v3 | Yes | Yes |
| distil-large-v3 | Yes | Yes |
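A rough way to see the savings: weight memory is roughly parameter count times bits per weight (16 for unquantized fp16, 8 or 4 for the quantized variants), ignoring activations and quantization overhead. A back-of-the-envelope sketch:

```python
def approx_weight_gb(params_millions: float, bits: int) -> float:
    """Rough weight memory in GB: parameter count x bits per weight.
    Ignores activations, KV cache, and quantization overhead."""
    bytes_total = params_millions * 1e6 * bits / 8
    return bytes_total / 1e9

# large-v3 (~1,550M parameters): fp16 vs 4-bit
print(approx_weight_gb(1550, 16))  # 3.1
print(approx_weight_gb(1550, 4))   # 0.775
```

So 4-bit quantization cuts the weight footprint of large-v3 from roughly 3 GB to under 1 GB, which is why it is the recommended option on memory-constrained Macs.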

Batch Size Recommendations

Batch size controls how many 30-second audio segments are processed in a single forward pass. Higher batch sizes are faster but use more memory.

| Model Size | Recommended batch_size | Memory Usage |
|---|---|---|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |

Start with the recommended values and adjust based on your Mac's available memory. If you encounter out-of-memory errors, reduce the batch size.
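One simple way to automate that adjustment is to halve the batch size on an out-of-memory error and retry. A minimal sketch, assuming a `transcribe_fn(audio_path, batch_size)` callable that raises `MemoryError` when the batch does not fit; the retry loop is not part of Vayu, so adapt it to your setup:

```python
def transcribe_with_backoff(transcribe_fn, audio_path, batch_size=12, min_batch=1):
    """Retry transcription, halving batch_size on memory errors.

    `transcribe_fn(audio_path, batch_size)` is an assumed callable that
    raises MemoryError when the batch does not fit in memory.
    """
    while batch_size >= min_batch:
        try:
            return transcribe_fn(audio_path, batch_size)
        except MemoryError:
            batch_size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("could not fit even the minimum batch size")
```

The halving schedule converges quickly from any starting point in the tables above (e.g. 32 → 16 → 8 → 4), so at most a handful of retries are wasted before a workable batch size is found.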

Using Custom Models

You can use any Whisper model hosted on HuggingFace:

```python
from whisper_mlx import LightningWhisperMLX, transcribe

# By HuggingFace repo ID
whisper = LightningWhisperMLX(model="mlx-community/whisper-turbo")

# Or via the transcribe function
result = transcribe("audio.mp3", path_or_hf_repo="mlx-community/whisper-turbo")
```

Models are automatically downloaded and cached via huggingface_hub.

Model Architecture

All models follow the Whisper encoder-decoder architecture:

  • Encoder: convolutional stem (two conv layers, the second with stride 2) + Transformer blocks
  • Decoder: Transformer blocks with cross-attention to the encoder output
  • Audio input: 30-second chunks → mel spectrograms with 80 channels (128 for large-v3)
  • Text output: up to 448 tokens per segment
  • Vocabulary: 51,865 tokens (multilingual models); 51,864 (English-only)
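These numbers determine the encoder's sequence length: 30 s of 16 kHz audio with Whisper's 160-sample (10 ms) hop yields 3,000 mel frames, and the stride-2 convolution halves that to 1,500 encoder positions. A quick check:

```python
SAMPLE_RATE = 16_000   # Hz; Whisper resamples all audio to this rate
CHUNK_SECONDS = 30     # fixed chunk length
HOP_LENGTH = 160       # samples between mel frames (10 ms)
CONV_STRIDE = 2        # second encoder conv downsamples by 2

mel_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
encoder_positions = mel_frames // CONV_STRIDE
print(mel_frames, encoder_positions)  # 3000 1500
```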
