Performance

Behnam Ebrahimi edited this page Mar 29, 2026 · 1 revision

Batched Decoding — The Core Innovation

Vayu's signature feature is batched decoding, which processes multiple audio segments in a single neural network forward pass instead of sequentially.

Sequential vs. Batched

Sequential (standard Whisper):

```
Segment 1 → Forward Pass → Result 1
Segment 2 → Forward Pass → Result 2
Segment 3 → Forward Pass → Result 3
Segment 4 → Forward Pass → Result 4
Total: 4 forward passes
```

Batched (Vayu, batch_size=4):

```
Segment 1 ─┐
Segment 2 ─┼→ Single Forward Pass → Results 1-4
Segment 3 ─┤
Segment 4 ─┘
Total: 1 forward pass
```

This yields 3-5x faster transcription on Apple Silicon.
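To make the idea concrete, here is a minimal, model-free sketch of the batching step (not Vayu's actual code): variable-length segments are padded to a common length and grouped, so the model sees one batch instead of N separate inputs. `pad_and_batch` is a hypothetical helper name.

```python
# Illustrative sketch only: pad variable-length feature sequences to the
# longest one and group them into a single batch, so the encoder can run one
# forward pass over all of them instead of one pass per segment.
def pad_and_batch(segments, pad_value=0.0):
    max_len = max(len(s) for s in segments)
    return [s + [pad_value] * (max_len - len(s)) for s in segments]

# Four segments of different lengths become one 4 x 150 batch.
segments = [[0.1] * 100, [0.2] * 150, [0.3] * 120, [0.4] * 90]
batch = pad_and_batch(segments)
```

Padding wastes a little compute on the shorter segments, but a single large forward pass keeps the GPU saturated, which is where the 3-5x speedup comes from.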

Tuning Batch Size

Batch size controls the speed/memory tradeoff:

| Model | Recommended `batch_size` | Memory |
|-------|--------------------------|--------|
| tiny / base | 24-32 | Low |
| small | 16-24 | Medium |
| medium | 8-12 | High |
| large / turbo | 4-8 | High |
| distil-large-v3 | 12-16 | Medium |

Tips:

  • Start with the recommended value and increase until you hit memory limits
  • If you get MemoryError, reduce batch size
  • batch_size=1 falls back to sequential decoding (no speedup)
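The first two tips can be sketched as a simple retry loop. This is a hedged illustration, not part of Vayu's API: `transcribe_fn` stands in for whatever callable runs the batched transcription.

```python
# Start at the recommended batch size and halve it on MemoryError.
# batch_size=1 is the sequential fallback described above.
def transcribe_with_fallback(transcribe_fn, audio, batch_size=8):
    while batch_size >= 1:
        try:
            return transcribe_fn(audio, batch_size=batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("even sequential decoding ran out of memory")

# Usage sketch with a fake transcriber that only fits batches of 4 or less:
def fake_transcribe(audio, batch_size):
    if batch_size > 4:
        raise MemoryError
    return f"ok at batch_size={batch_size}"

result = transcribe_with_fallback(fake_transcribe, audio=None, batch_size=16)
```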

Quantization

Reduce memory usage with 4-bit or 8-bit quantized models:

```python
# ~4x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="4bit", batch_size=8)

# ~2x less memory than full precision
whisper = LightningWhisperMLX(model="large-v3", quant="8bit", batch_size=8)
```

Quantization trades a small amount of accuracy for memory headroom. On memory-constrained systems, that headroom allows larger batch sizes, so overall throughput often improves enough to offset the accuracy loss.
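The "~4x" and "~2x" figures follow directly from bytes per weight. A back-of-envelope sketch (the parameter count below is approximate and used only for illustration):

```python
# fp16 stores each weight in 16 bits; 8-bit and 4-bit quantization store it
# in 8 and 4 bits, so the weight footprint shrinks ~2x and ~4x respectively.
def weight_memory_gb(n_params, bits):
    return n_params * bits / 8 / 1e9

n_params = 1.55e9  # roughly the parameter count of large-v3
fp16 = weight_memory_gb(n_params, 16)
int4 = weight_memory_gb(n_params, 4)
print(f"fp16: {fp16:.2f} GB, 4-bit: {int4:.2f} GB")
```

Actual savings are slightly lower because activations, the KV cache, and quantization scales are not reduced by the same factor.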

Speculative Decoding (Experimental)

An advanced optimization in which a small draft model proposes tokens that the main model then verifies in parallel:

```python
from whisper_mlx import SpeculativeDecoder

decoder = SpeculativeDecoder(
    draft_model_path="mlx-community/whisper-tiny-mlx",
    target_model_path="mlx-community/whisper-large-v3-mlx",
)
```

This can provide an additional 2-3x speedup on top of batched decoding.
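The mechanism can be shown with a minimal, model-free sketch: the draft proposes k tokens, the target checks them, and the longest agreeing prefix is accepted. Both "models" below are toy stand-in callables, not Vayu's `SpeculativeDecoder` internals.

```python
# One speculative decoding step: draft proposes k tokens cheaply, target
# verifies them (in a real system, in a single parallel forward pass) and
# keeps the longest prefix it agrees with.
def speculative_step(draft_next, target_next, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        if target_next(ctx) != tok:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)
        ctx.append(tok)
    return accepted

draft = lambda ctx: len(ctx) % 3                            # toy draft model
target = lambda ctx: len(ctx) % 3 if len(ctx) < 5 else 99   # diverges later
accepted = speculative_step(draft, target, prefix=[0, 1, 2], k=4)
```

When the draft agrees with the target most of the time, each expensive target pass yields several tokens instead of one, which is the source of the extra speedup.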

Quality Fallback

Vayu uses a temperature fallback strategy to maintain quality without sacrificing speed:

  1. First attempt: greedy decoding (temperature=0.0) — fastest
  2. If quality checks fail (high compression ratio or low confidence): retry with temperature=0.2
  3. Continue escalating through [0.4, 0.6, 0.8, 1.0] until quality passes

Quality thresholds:

  • compression_ratio_threshold=2.4 — rejects repetitive/hallucinated text
  • logprob_threshold=-1.0 — rejects low-confidence segments
  • no_speech_threshold=0.6 — detects silence
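The escalation strategy above can be sketched as a small loop. This is a simplified illustration, not Vayu's implementation: `decode_fn` is a placeholder for a real decode pass, and the quality checks are reduced to the two thresholds that trigger a retry.

```python
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

# Retry decoding at increasing temperatures until the result passes the
# compression-ratio and log-probability checks; keep the last attempt if
# every temperature fails.
def decode_with_fallback(decode_fn,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    for temperature in TEMPERATURES:
        result = decode_fn(temperature)
        if (result["compression_ratio"] <= compression_ratio_threshold
                and result["avg_logprob"] >= logprob_threshold):
            return temperature, result
    return TEMPERATURES[-1], result

# Toy decode that produces repetitive output until temperature reaches 0.4:
def toy_decode(temperature):
    return {"compression_ratio": 3.0 if temperature < 0.4 else 1.8,
            "avg_logprob": -0.5}

chosen, _ = decode_with_fallback(toy_decode)
```

Because only the decoder is re-run, the audio is never re-encoded, which is why the fallback costs little when the first greedy attempt succeeds.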

Optimization Summary

| Optimization | Benefit | How to Enable |
|--------------|---------|---------------|
| Batched decoding | 3-5x faster | `batch_size > 1` |
| Quantization | Lower memory, fits larger batches | `quant="4bit"` or `"8bit"` |
| Speculative decoding | 2-3x additional | `SpeculativeDecoder` class |
| FP16 inference | Faster compute, less memory | `--fp16 True` (default) |
| Model caching | No reload overhead | Automatic via `ModelHolder` |
| Numba JIT | 2-3x faster word timestamps | Automatic |
| Temperature fallback | Quality without re-encoding | Automatic |

Best Practices

  1. Always set batch_size — The default is 1 (no batching). Set it to 12+ for significant speedups.
  2. Use distil-large-v3 for most tasks — Best balance of speed and accuracy.
  3. Use turbo for speed-critical applications.
  4. Use quantization when running large models on limited memory.
  5. Avoid word_timestamps unless needed — It adds processing overhead via DTW alignment.
