Models
Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision
| Friendly Name | HuggingFace Repo | Parameters | Speed | Notes |
|---|---|---|---|---|
| `tiny` | `mlx-community/whisper-tiny-mlx` | 39M | Fastest | English-optimized variant available |
| `tiny.en` | `mlx-community/whisper-tiny-mlx-8bit` | 39M | Fastest | English-only |
| `base` | `mlx-community/whisper-base-mlx` | 74M | Fast | English-optimized variant available |
| `base.en` | `mlx-community/whisper-base-mlx-8bit` | 74M | Fast | English-only |
| `small` | `mlx-community/whisper-small-mlx` | 244M | Medium | Good accuracy/speed tradeoff |
| `small.en` | `mlx-community/whisper-small-mlx-8bit` | 244M | Medium | English-only |
| `medium` | `mlx-community/whisper-medium-mlx` | 769M | Slow | High accuracy |
| `medium.en` | `mlx-community/whisper-medium-mlx-8bit` | 769M | Slow | English-only |
| `large-v3` | `mlx-community/whisper-large-v3-mlx` | 1.5B | Slowest | Highest accuracy |
| `turbo` | `mlx-community/whisper-turbo` | 809M | Fast | Speed-optimized large model |
| `distil-large-v3` | `mlx-community/distil-whisper-large-v3` | 756M | Fast | Distilled — best speed/accuracy balance |
| Use Case | Recommended Model | Batch Size |
|---|---|---|
| Quick drafts, testing | `tiny` or `base` | 24-32 |
| General transcription | `distil-large-v3` | 12-16 |
| Speed-critical production | `turbo` | 4-8 |
| Maximum accuracy | `large-v3` | 4-8 |
| Memory-constrained | Any model + `quant="4bit"` | Adjust down |
Vayu supports 4-bit and 8-bit quantized models for reduced memory usage:

```python
from whisper_mlx import LightningWhisperMLX

# 4-bit quantization
whisper = LightningWhisperMLX(model="distil-large-v3", quant="4bit")

# 8-bit quantization
whisper = LightningWhisperMLX(model="large-v3", quant="8bit")
```

Available quantized variants:
| Model | 4-bit | 8-bit |
|---|---|---|
| `tiny` | Yes | Yes |
| `small` | Yes | Yes |
| `medium` | Yes | Yes |
| `large-v3` | Yes | Yes |
| `distil-large-v3` | Yes | Yes |
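As a rough sanity check on the savings, weight memory scales with bits per parameter. The sketch below is illustrative arithmetic only, not a measurement of Vayu's actual runtime footprint (which also includes activations and the KV cache):

```python
def weight_memory_mb(n_params: int, bits: int) -> float:
    """Approximate weight storage for a model at the given bit width."""
    return n_params * bits / 8 / 1e6  # bits -> bytes -> MB

# distil-large-v3 (~756M parameters) at fp16, 8-bit, and 4-bit:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_mb(756_000_000, bits):.0f} MB")
```

This is why `quant="4bit"` is the suggestion for memory-constrained Macs: weights shrink to a quarter of their fp16 size.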
Batch size controls how many 30-second audio segments are processed in a single forward pass. Higher batch sizes are faster but use more memory.
| Model Size | Recommended `batch_size` | Memory Usage |
|---|---|---|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |
Start with the recommended values and adjust based on your Mac's available memory. If you encounter out-of-memory errors, reduce the batch size.
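That tuning loop can be sketched as a small helper. This is a hypothetical function (not part of Vayu's API) that encodes the table's upper bounds and halves the batch size after each out-of-memory retry:

```python
# Hypothetical helper; the mapping mirrors the table above.
RECOMMENDED_BATCH = {
    "tiny": 32, "base": 32,
    "small": 24,
    "medium": 12,
    "large-v3": 8, "turbo": 8,
    "distil-large-v3": 16,
}

def pick_batch_size(model: str, halvings: int = 0) -> int:
    """Start from the table's upper bound; halve once per OOM retry."""
    size = RECOMMENDED_BATCH.get(model, 8)  # conservative default
    return max(1, size >> halvings)
```

For example, `pick_batch_size("distil-large-v3")` starts at 16, and two OOM retries bring it down to 4.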
You can use any Whisper model hosted on HuggingFace:

```python
from whisper_mlx import LightningWhisperMLX, transcribe

# By HuggingFace repo ID
whisper = LightningWhisperMLX(model="mlx-community/whisper-turbo")

# Or via the transcribe function
result = transcribe("audio.mp3", path_or_hf_repo="mlx-community/whisper-turbo")
```

Models are automatically downloaded and cached via huggingface_hub.
All models follow the Whisper encoder-decoder architecture:
- Encoder: Convolutional layers (stride 2) + Transformer blocks
- Decoder: Transformer blocks with cross-attention to encoder output
- Audio input: 30-second chunks → 80/128-channel mel spectrograms
- Text output: Up to 448 tokens per segment
- Vocabulary: 51,865 tokens (multilingual models)
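The fixed 30-second chunking gives the encoder a constant input shape. The arithmetic below uses the standard Whisper front-end constants (16 kHz sample rate, mel hop length of 160 samples, one stride-2 convolution in the encoder stem) — these match upstream Whisper and are assumed, not Vayu-specific:

```python
SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz
HOP_LENGTH = 160       # 10 ms between mel frames
CHUNK_SECONDS = 30

samples = CHUNK_SECONDS * SAMPLE_RATE   # samples per chunk
mel_frames = samples // HOP_LENGTH      # mel spectrogram frames per chunk
encoder_frames = mel_frames // 2        # stride-2 conv halves the frame count

print(samples, mel_frames, encoder_frames)  # 480000 3000 1500
```

So each 30-second chunk becomes a 3,000-frame mel spectrogram (80 or 128 channels depending on the model), which the encoder compresses to 1,500 positions for the decoder's cross-attention.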