Skip to content

Add Qwen3-ASR-1.7B contrib model#162

Open
jimburtoft wants to merge 2 commits into
aws-neuron:mainfrom
jimburtoft:contrib/qwen3-asr
Open

Add Qwen3-ASR-1.7B contrib model#162
jimburtoft wants to merge 2 commits into
aws-neuron:mainfrom
jimburtoft:contrib/qwen3-asr

Conversation

@jimburtoft
Copy link
Copy Markdown
Contributor

Summary

  • Adds NxDI implementation of Qwen3-ASR-1.7B (speech-to-text) using decomposed encoder-decoder pipeline
  • Traced audio encoder (3 bucket NEFFs: 5s/10s/30s) + NxDI text decoder (Qwen3-VL scatter pattern)
  • Full E2E pipeline validated with EXACT MATCH to CPU reference on trn2.3xlarge

Architecture

  • Audio Encoder: 24-layer Whisper-like transformer, Conv2D frontend, d_model=1024, output 2048-dim
  • Text Decoder: 28-layer Qwen3, hidden_size=2048, GQA 16/8, QK-norm, mRoPE [24,20,20]
  • Pipeline: mel spectrogram -> traced encoder -> scatter embeddings -> NxDI autoregressive decode

Performance (trn2.3xlarge, TP=4, LNC=2, SDK 2.29)

Metric Value
TTFT (5s audio) 27.5ms
TPOT 4.9ms
RTF (30s audio) 0.020x (50x real-time)
Throughput 194 tok/s
WER (LibriSpeech test-clean) 3.06%
Audio throughput (DP=2, TP=2x2) ~46 audio-sec/wall-sec

Validation

  • SDK: 2.29 (DLAMI 20260410, NxDI 0.9.17334)
  • Instance: trn2.3xlarge (LNC=2)
  • Accuracy: EXACT MATCH with CPU greedy decode
  • Edge cases: silence, short audio (<1s), long audio (30s), non-speech all handled correctly

Files

contrib/models/Qwen3-ASR-1.7B/
├── README.md                         # Usage, benchmarks, compatibility
├── src/
│   ├── __init__.py
│   ├── modeling_qwen3_asr.py         # NxDI model class (text decoder)
│   └── audio_encoder.py             # Encoder tracing wrapper
└── test/
    ├── integration/
    │   └── test_model.py            # E2E accuracy + performance tests
    └── unit/

Key Implementation Notes

  • Text decoder reuses NeuronQwen3VLForCausalLM with scatter_by_index_put for audio embedding injection
  • Audio encoder uses a static wrapper (StaticQwen3ASREncoder) that replaces dynamic cu_seqlens with pre-computed block-diagonal attention mask for trace compatibility
  • rope_scaling must use "rope_type": "default" (mRoPE applied externally via rotary_position_ids)
  • inline_weights_to_neff=True causes accuracy regression for the encoder (cosine 0.724 vs CPU)

Maintainer

Jim Burtoft (@jimburtoft)

jimburtoft added 2 commits May 8, 2026 23:05
Implements Qwen3-ASR-1.7B (speech-to-text) on NeuronX Distributed Inference
using a decomposed pipeline: traced audio encoder + NxDI text decoder.

Key features:
- Audio encoder traced with StaticQwen3ASREncoder (3 bucket NEFFs: 5s/10s/30s)
- Text decoder reuses NeuronQwen3VLForCausalLM scatter mechanism
- Full E2E pipeline with EXACT MATCH to CPU reference
- WER: 3.06% on LibriSpeech test-clean (50 samples)
- Performance: 4.9ms TPOT, 27.5ms TTFT, 50x real-time (30s audio)
- Validated on trn2.3xlarge TP=4, SDK 2.29

Includes integration tests and comprehensive README.
Adds vllm/ directory with:
- README.md: Patch instructions for vllm-neuron (constants, model loader,
  model runner, platform, utils)
- neuron_qwen3_asr_vllm.py: Full NeuronQwen3ASRForCausalLM class
- start-vllm-server.sh: Server launch script
- test_transcription.py: API test script

Validated: 140ms E2E latency for 2.9s audio via OpenAI chat completions API
on trn2.3xlarge (TP=4, SDK 2.29, vLLM 0.16.0 + vllm-neuron 0.5.0).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant