Performance

Behnam Ebrahimi edited this page Mar 29, 2026
Vayu's signature feature is batched decoding, which processes multiple audio segments in a single neural network forward pass instead of sequentially.
Sequential (standard Whisper):

```
Segment 1 → Forward Pass → Result 1
Segment 2 → Forward Pass → Result 2
Segment 3 → Forward Pass → Result 3
Segment 4 → Forward Pass → Result 4
Total: 4 forward passes
```

Batched (Vayu, batch_size=4):

```
Segment 1 ─┐
Segment 2 ─┼→ Single Forward Pass → Results 1-4
Segment 3 ─┤
Segment 4 ─┘
Total: 1 forward pass
```
This yields 3-5x faster transcription on Apple Silicon.
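The diagrams above reduce to simple arithmetic: the number of decoder forward passes is the segment count divided by the batch size, rounded up. A minimal sketch (the function name is illustrative, not part of Vayu's API):

```python
import math

def forward_passes(num_segments: int, batch_size: int) -> int:
    # Each forward pass decodes up to batch_size segments at once.
    return math.ceil(num_segments / batch_size)

# Sequential decoding (batch_size=1): one pass per segment.
print(forward_passes(4, 1))    # 4
# Batched decoding with batch_size=4: one pass for all four segments.
print(forward_passes(4, 4))    # 1
# A long recording split into 60 segments, batch_size=12:
print(forward_passes(60, 12))  # 5
```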
Batch size controls the speed/memory tradeoff:
| Model | Recommended batch_size | Memory |
|---|---|---|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |
Tips:
- Start with the recommended value and increase until you hit memory limits
- If you get a `MemoryError`, reduce the batch size
- `batch_size=1` falls back to sequential decoding (no speedup)
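The "reduce on `MemoryError`" tip can be automated by halving the batch size until decoding fits. A minimal sketch, where `decode` stands in for the real batched decoder (the function names here are illustrative, not Vayu API):

```python
def transcribe_with_fallback(decode, batch_size):
    """Retry with a halved batch size until decoding fits in memory."""
    while batch_size >= 1:
        try:
            return decode(batch_size), batch_size
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("even batch_size=1 does not fit")

# Simulated decoder that only fits batches of 8 or fewer.
def fake_decode(bs):
    if bs > 8:
        raise MemoryError
    return f"decoded with batch_size={bs}"

result, bs = transcribe_with_fallback(fake_decode, 32)
print(bs)  # 8
```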
Reduce memory usage with 4-bit or 8-bit quantized models:

```python
# ~4x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="4bit", batch_size=8)

# ~2x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="8bit", batch_size=8)
```

Quantization enables larger batch sizes on memory-constrained systems, which can offset any accuracy loss with increased throughput.
An advanced optimization that uses a small draft model to propose tokens, which the main model then verifies in parallel:

```python
from whisper_mlx import SpeculativeDecoder

decoder = SpeculativeDecoder(
    draft_model_path="mlx-community/whisper-tiny-mlx",
    target_model_path="mlx-community/whisper-large-v3-mlx",
)
```

This can provide an additional 2-3x speedup on top of batched decoding.
Vayu uses a temperature fallback strategy to maintain quality without sacrificing speed:
- First attempt: greedy decoding (temperature=0.0) — fastest
- If quality checks fail (high compression ratio or low confidence): retry with temperature=0.2
- Continue escalating through [0.4, 0.6, 0.8, 1.0] until quality passes
Quality thresholds:
- `compression_ratio_threshold=2.4` — rejects repetitive/hallucinated text
- `logprob_threshold=-1.0` — rejects low-confidence segments
- `no_speech_threshold=0.6` — detects silence
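The fallback loop and the first two quality checks can be sketched in a few lines of plain Python (the silence check via `no_speech_threshold` is omitted for brevity; all names here are illustrative, not Vayu's internals):

```python
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def passes_quality(seg,
                   compression_ratio_threshold=2.4,
                   logprob_threshold=-1.0):
    """Reject repetitive (high compression ratio) or low-confidence segments."""
    if seg["compression_ratio"] > compression_ratio_threshold:
        return False  # text compresses too well: likely repetition/hallucination
    if seg["avg_logprob"] < logprob_threshold:
        return False  # average token log-probability too low
    return True

def decode_with_fallback(decode_fn):
    """Greedy decoding first; escalate temperature only if quality checks fail."""
    for t in TEMPERATURES:
        seg = decode_fn(temperature=t)
        if passes_quality(seg):
            return seg
    return seg  # every attempt failed; keep the last one

# Stand-in decoder: hallucinates at temperature 0.0, recovers at 0.2.
def fake_decode(temperature):
    if temperature == 0.0:
        return {"compression_ratio": 3.1, "avg_logprob": -0.4}
    return {"compression_ratio": 1.8, "avg_logprob": -0.4}

best = decode_with_fallback(fake_decode)
```

Because most segments pass on the first greedy attempt, the common case stays fast; only problematic segments pay for retries.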
| Optimization | Benefit | How to Enable |
|---|---|---|
| Batched decoding | 3-5x faster | `batch_size > 1` |
| Quantization | Lower memory, fits larger batches | `quant="4bit"` or `"8bit"` |
| Speculative decoding | 2-3x additional | `SpeculativeDecoder` class |
| FP16 inference | Faster compute, less memory | `--fp16 True` (default) |
| Model caching | No reload overhead | Automatic via `ModelHolder` |
| Numba JIT | 2-3x faster word timestamps | Automatic |
| Temperature fallback | Quality without re-encoding | Automatic |
- Always set `batch_size` — the default is 1 (no batching). Set it to 12+ for significant speedups.
- Use `distil-large-v3` for most tasks — best balance of speed and accuracy.
- Use `turbo` for speed-critical applications.
- Use quantization when running large models on limited memory.
- Avoid `word_timestamps` unless needed — it adds processing overhead via DTW alignment.