Performance

Behnam Ebrahimi edited this page Mar 29, 2026
Vayu's signature feature is batched decoding, which processes multiple audio segments in a single neural network forward pass instead of sequentially.
Sequential (standard Whisper):

```
Segment 1 → Forward Pass → Result 1
Segment 2 → Forward Pass → Result 2
Segment 3 → Forward Pass → Result 3
Segment 4 → Forward Pass → Result 4
Total: 4 forward passes
```

Batched (Vayu, batch_size=4):

```
Segment 1 ─┐
Segment 2 ─┼→ Single Forward Pass → Results 1-4
Segment 3 ─┤
Segment 4 ─┘
Total: 1 forward pass
```
This yields 3-5x faster transcription on Apple Silicon.
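The diagrams above reduce to simple arithmetic: the number of decoder forward passes is the segment count divided by the batch size, rounded up. A minimal sketch (the function name is illustrative, not part of Vayu's API):

```python
import math

def forward_passes(num_segments: int, batch_size: int) -> int:
    # Each forward pass decodes up to batch_size segments at once.
    return math.ceil(num_segments / batch_size)

# Sequential decoding (batch_size=1): one pass per segment.
print(forward_passes(4, 1))    # 4
# Batched decoding with batch_size=4: one pass for all four segments.
print(forward_passes(4, 4))    # 1
# A long recording split into 60 segments, batch_size=12:
print(forward_passes(60, 12))  # 5
```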
Batch size controls the speed/memory tradeoff:
| Model | Recommended batch_size | Memory |
|---|---|---|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |
Tips:
- Start with the recommended value and increase until you hit memory limits
- If you get a `MemoryError`, reduce the batch size
- `batch_size=1` falls back to sequential decoding (no speedup)
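The "reduce on `MemoryError`" tip can be automated by halving the batch size until decoding fits. A minimal sketch, where `decode` stands in for the real batched decoder (the function names here are illustrative, not Vayu API):

```python
def transcribe_with_fallback(decode, batch_size):
    """Retry with a halved batch size until decoding fits in memory."""
    while batch_size >= 1:
        try:
            return decode(batch_size), batch_size
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("even batch_size=1 does not fit")

# Simulated decoder that only fits batches of 8 or fewer.
def fake_decode(bs):
    if bs > 8:
        raise MemoryError
    return f"decoded with batch_size={bs}"

result, bs = transcribe_with_fallback(fake_decode, 32)
print(bs)  # 8
```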
Reduce memory usage with 4-bit or 8-bit quantized models:

```python
# ~4x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="4bit", batch_size=8)

# ~2x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="8bit", batch_size=8)
```

Quantization enables larger batch sizes on memory-constrained systems, which can offset any accuracy loss with increased throughput.
An advanced optimization that uses a small draft model to propose tokens, which the main model then verifies in parallel:

```python
from whisper_mlx import SpeculativeDecoder

decoder = SpeculativeDecoder(
    draft_model_path="mlx-community/whisper-tiny-mlx",
    target_model_path="mlx-community/whisper-large-v3-mlx",
)
```

This can provide an additional 2-3x speedup on top of batched decoding.
Vayu uses a temperature fallback strategy to maintain quality without sacrificing speed:
- First attempt: greedy decoding (temperature=0.0) — fastest
- If quality checks fail (high compression ratio or low confidence): retry with temperature=0.2
- Continue escalating through [0.4, 0.6, 0.8, 1.0] until quality passes
Quality thresholds:
- `compression_ratio_threshold=2.4` — rejects repetitive/hallucinated text
- `logprob_threshold=-1.0` — rejects low-confidence segments
- `no_speech_threshold=0.6` — detects silence
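The fallback loop and the first two quality checks can be sketched in a few lines of plain Python (the silence check via `no_speech_threshold` is omitted for brevity; all names here are illustrative, not Vayu's internals):

```python
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def passes_quality(seg,
                   compression_ratio_threshold=2.4,
                   logprob_threshold=-1.0):
    """Reject repetitive (high compression ratio) or low-confidence segments."""
    if seg["compression_ratio"] > compression_ratio_threshold:
        return False  # text compresses too well: likely repetition/hallucination
    if seg["avg_logprob"] < logprob_threshold:
        return False  # average token log-probability too low
    return True

def decode_with_fallback(decode_fn):
    """Greedy decoding first; escalate temperature only if quality checks fail."""
    for t in TEMPERATURES:
        seg = decode_fn(temperature=t)
        if passes_quality(seg):
            return seg
    return seg  # every attempt failed; keep the last one

# Stand-in decoder: hallucinates at temperature 0.0, recovers at 0.2.
def fake_decode(temperature):
    if temperature == 0.0:
        return {"compression_ratio": 3.1, "avg_logprob": -0.4}
    return {"compression_ratio": 1.8, "avg_logprob": -0.4}

best = decode_with_fallback(fake_decode)
```

Because most segments pass on the first greedy attempt, the common case stays fast; only problematic segments pay for retries.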
| Optimization | Benefit | How to Enable |
|---|---|---|
| Batched decoding | 3-5x faster | `batch_size > 1` |
| Quantization | Lower memory, fits larger batches | `quant="4bit"` or `"8bit"` |
| Speculative decoding | 2-3x additional | `SpeculativeDecoder` class |
| FP16 inference | Faster compute, less memory | `--fp16 True` (default) |
| Model caching | No reload overhead | Automatic via `ModelHolder` |
| Numba JIT | 2-3x faster word timestamps | Automatic |
| Temperature fallback | Quality without re-encoding | Automatic |
- Always set `batch_size` — the default is 1 (no batching). Set it to 12+ for significant speedups.
- Use `distil-large-v3` for most tasks — best balance of speed and accuracy.
- Use `turbo` for speed-critical applications.
- Use quantization when running large models on limited memory.
- Avoid `word_timestamps` unless needed — it adds processing overhead via DTW alignment.