Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision

Models

Available Models

| Friendly Name | HuggingFace Repo | Parameters | Speed | Notes |
|---|---|---|---|---|
| tiny | mlx-community/whisper-tiny-mlx | 39M | Fastest | English-only variant available |
| tiny.en | mlx-community/whisper-tiny.en-mlx | 39M | Fastest | English-only |
| base | mlx-community/whisper-base-mlx | 74M | Fast | English-only variant available |
| base.en | mlx-community/whisper-base.en-mlx | 74M | Fast | English-only |
| small | mlx-community/whisper-small-mlx | 244M | Medium | Good accuracy/speed tradeoff |
| small.en | mlx-community/whisper-small.en-mlx | 244M | Medium | English-only |
| medium | mlx-community/whisper-medium-mlx | 769M | Slow | High accuracy |
| medium.en | mlx-community/whisper-medium.en-mlx | 769M | Slow | English-only |
| large-v3 | mlx-community/whisper-large-v3-mlx | 1.5B | Slowest | Highest accuracy |
| turbo | mlx-community/whisper-turbo | 809M | Fast | Speed-optimized large model |
| distil-large-v3 | mlx-community/distil-whisper-large-v3 | 756M | Fast | Distilled; best speed/accuracy balance |

Model Selection Guide

| Use Case | Recommended Model | Batch Size |
|---|---|---|
| Quick drafts, testing | tiny or base | 24-32 |
| General transcription | distil-large-v3 | 12-16 |
| Speed-critical production | turbo | 4-8 |
| Maximum accuracy | large-v3 | 4-8 |
| Memory-constrained | Any model + quant="4bit" | Adjust down |
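The selection table can be encoded as a small lookup helper. This is a sketch, not part of the Vayu API; the `pick_model` name and the use-case keys are hypothetical:

```python
# Encodes the selection table above as (model_name, suggested_batch_size).
# Hypothetical helper; not part of the Vayu API.
RECOMMENDATIONS = {
    "draft": ("base", 32),
    "general": ("distil-large-v3", 12),
    "speed": ("turbo", 8),
    "accuracy": ("large-v3", 4),
}

def pick_model(use_case: str) -> tuple[str, int]:
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}")

print(pick_model("general"))  # ('distil-large-v3', 12)
```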

Quantized Models

Vayu supports 4-bit and 8-bit quantized models for reduced memory usage:

```python
from whisper_mlx import LightningWhisperMLX

# 4-bit quantization
whisper = LightningWhisperMLX(model="distil-large-v3", quant="4bit")

# 8-bit quantization
whisper = LightningWhisperMLX(model="large-v3", quant="8bit")
```

Available quantized variants:

| Model | 4-bit | 8-bit |
|---|---|---|
| tiny | Yes | Yes |
| small | Yes | Yes |
| medium | Yes | Yes |
| large-v3 | Yes | Yes |
| distil-large-v3 | Yes | Yes |
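A rough way to see the savings: weight memory is roughly parameter count times bits per weight (16 for unquantized fp16, 8 or 4 for the quantized variants), ignoring activations and quantization overhead. A back-of-the-envelope sketch:

```python
def approx_weight_gb(params_millions: float, bits: int) -> float:
    """Rough weight memory in GB: parameter count x bits per weight.
    Ignores activations, KV cache, and quantization overhead."""
    bytes_total = params_millions * 1e6 * bits / 8
    return bytes_total / 1e9

# large-v3 (~1,550M parameters): fp16 vs 4-bit
print(approx_weight_gb(1550, 16))  # 3.1
print(approx_weight_gb(1550, 4))   # 0.775
```

So 4-bit quantization cuts the weight footprint of large-v3 from roughly 3 GB to under 1 GB, which is why it is the recommended option on memory-constrained Macs.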

Batch Size Recommendations

Batch size controls how many 30-second audio segments are processed in a single forward pass. Higher batch sizes are faster but use more memory.

| Model Size | Recommended batch_size | Memory Usage |
|---|---|---|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |

Start with the recommended values and adjust based on your Mac's available memory. If you encounter out-of-memory errors, reduce the batch size.
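One simple way to automate that adjustment is to halve the batch size on an out-of-memory error and retry. A minimal sketch, assuming a `transcribe_fn(audio_path, batch_size)` callable that raises `MemoryError` when the batch does not fit; the retry loop is not part of Vayu, so adapt it to your setup:

```python
def transcribe_with_backoff(transcribe_fn, audio_path, batch_size=12, min_batch=1):
    """Retry transcription, halving batch_size on memory errors.

    `transcribe_fn(audio_path, batch_size)` is an assumed callable that
    raises MemoryError when the batch does not fit in memory.
    """
    while batch_size >= min_batch:
        try:
            return transcribe_fn(audio_path, batch_size)
        except MemoryError:
            batch_size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("could not fit even the minimum batch size")
```

The halving schedule converges quickly from any starting point in the tables above (e.g. 32 → 16 → 8 → 4), so at most a handful of retries are wasted before a workable batch size is found.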

Using Custom Models

You can use any Whisper model hosted on HuggingFace:

```python
from whisper_mlx import LightningWhisperMLX, transcribe

# By HuggingFace repo ID
whisper = LightningWhisperMLX(model="mlx-community/whisper-turbo")

# Or via the transcribe function
result = transcribe("audio.mp3", path_or_hf_repo="mlx-community/whisper-turbo")
```

Models are automatically downloaded and cached via huggingface_hub.

Model Architecture

All models follow the Whisper encoder-decoder architecture:

  • Encoder: convolutional stem (two conv layers, the second with stride 2) + Transformer blocks
  • Decoder: Transformer blocks with cross-attention to the encoder output
  • Audio input: 30-second chunks → mel spectrograms with 80 channels (128 for large-v3)
  • Text output: up to 448 tokens per segment
  • Vocabulary: 51,865 tokens (multilingual models); 51,864 (English-only)
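These numbers determine the encoder's sequence length: 30 s of 16 kHz audio with Whisper's 160-sample (10 ms) hop yields 3,000 mel frames, and the stride-2 convolution halves that to 1,500 encoder positions. A quick check:

```python
SAMPLE_RATE = 16_000   # Hz; Whisper resamples all audio to this rate
CHUNK_SECONDS = 30     # fixed chunk length
HOP_LENGTH = 160       # samples between mel frames (10 ms)
CONV_STRIDE = 2        # second encoder conv downsamples by 2

mel_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
encoder_positions = mel_frames // CONV_STRIDE
print(mel_frames, encoder_positions)  # 3000 1500
```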
