Add Qwen3-ASR-1.7B contrib model by jimburtoft · Pull Request #162 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-05-09T03:08:31Z

Summary

Adds an NxDI implementation of Qwen3-ASR-1.7B (speech-to-text) using a decomposed encoder-decoder pipeline
Traced audio encoder (3 bucket NEFFs: 5s/10s/30s) + NxDI text decoder (Qwen3-VL scatter pattern)
Full E2E pipeline validated with EXACT MATCH to CPU reference on trn2.3xlarge
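
The three encoder buckets imply that each clip is routed to a fixed-length NEFF. The PR does not say how that routing is done, so the following is only a minimal sketch, assuming the smallest bucket that fits the clip is chosen and the remainder is padded. `pick_bucket` and `padding_sec` are hypothetical helper names, not functions from this contribution.

```python
from bisect import bisect_left

# Bucket lengths in seconds, taken from the PR summary (5s/10s/30s NEFFs).
BUCKETS_SEC = (5.0, 10.0, 30.0)


def pick_bucket(duration_sec: float) -> float:
    """Return the smallest bucket that can hold the clip (hypothetical helper)."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    # bisect_left finds the first bucket whose length is >= duration_sec.
    idx = bisect_left(BUCKETS_SEC, duration_sec)
    if idx == len(BUCKETS_SEC):
        raise ValueError(f"{duration_sec}s exceeds the largest bucket")
    return BUCKETS_SEC[idx]


def padding_sec(duration_sec: float) -> float:
    """Seconds of padding needed to fill the chosen bucket (assumed behavior)."""
    return pick_bucket(duration_sec) - duration_sec
```

Under this assumption a 2.9s clip lands in the 5s bucket and a 7.2s clip in the 10s bucket. Whether audio longer than 30s is chunked or rejected is not stated in the PR, so the sketch simply raises.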

Architecture

Audio Encoder: 24-layer Whisper-like transformer, Conv2D frontend, d_model=1024, output 2048-dim
Text Decoder: 28-layer Qwen3, hidden_size=2048, GQA 16/8, QK-norm, mRoPE [24,20,20]
Pipeline: mel spectrogram -> traced encoder -> scatter embeddings -> NxDI autoregressive decode
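
The "scatter embeddings" step in the pipeline can be pictured as overwriting placeholder rows of the text embedding sequence with the encoder output. The sketch below illustrates this in pure Python on nested lists. It is not the NxDI code path, which operates on tensors, and the convention that audio placeholder positions are passed in as a list of indices is an assumption made for the example.

```python
from typing import List, Sequence


def scatter_audio_embeddings(
    text_embeds: List[List[float]],
    audio_embeds: Sequence[Sequence[float]],
    positions: Sequence[int],
) -> List[List[float]]:
    """Return a copy of text_embeds with rows at `positions` replaced by audio rows."""
    if len(positions) != len(audio_embeds):
        raise ValueError("need exactly one audio embedding per placeholder position")
    out = [list(row) for row in text_embeds]  # copy so the input is not mutated
    for pos, emb in zip(positions, audio_embeds):
        if len(emb) != len(out[pos]):
            raise ValueError("audio and text embedding widths must match")
        out[pos] = list(emb)
    return out
```

The width check reflects the architecture note above: the encoder output is 2048-dimensional, which matches the decoder's hidden_size of 2048, so encoder rows can replace decoder input rows directly.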

Performance (trn2.3xlarge, TP=4, LNC=2, SDK 2.29)

Metric	Value
TTFT (5s audio)	27.5ms
TPOT	4.9ms
RTF (30s audio)	0.020x (50x real-time)
Throughput	194 tok/s
WER (LibriSpeech test-clean)	3.06%
Audio throughput (DP=2, TP=2x2)	~46 audio-sec/wall-sec
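
The derived figures in the table follow from simple ratios. As a worked check, assuming RTF is defined as processing time divided by audio duration (the usual convention, though the PR does not spell it out), the numbers relate as follows. The function names are illustrative only.

```python
def real_time_factor(processing_sec: float, audio_sec: float) -> float:
    """RTF = wall-clock processing time / audio duration (assumed definition)."""
    return processing_sec / audio_sec


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are processed per wall-clock second."""
    return 1.0 / rtf


def tokens_per_sec_from_tpot(tpot_ms: float) -> float:
    """Upper bound on decode throughput implied by time per output token."""
    return 1000.0 / tpot_ms


# Values from the table above.
rtf_30s = 0.020
processing_sec = rtf_30s * 30.0            # about 0.6 s of compute for 30 s of audio
speedup = speedup_over_real_time(rtf_30s)  # 50x real-time
decode_bound = tokens_per_sec_from_tpot(4.9)
```

An RTF of 0.020 on 30s audio corresponds to roughly 0.6s of processing, which is the 50x real-time figure. A TPOT of 4.9ms implies about 204 tok/s for decode alone, a little above the reported 194 tok/s; the PR does not say how the throughput figure was measured, so the gap is left unexplained here.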

Validation

SDK: 2.29 (DLAMI 20260410, NxDI 0.9.17334)
Instance: trn2.3xlarge (LNC=2)
Accuracy: EXACT MATCH with CPU greedy decode
Edge cases: silence, short audio (<1s), long audio (30s), non-speech all handled correctly
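
For readers who want to reproduce the accuracy checks, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and an exact match is simply a WER of zero on identical token sequences. The sketch below is a generic implementation, not the scoring script used for this PR. The text normalization (lowercasing and whitespace splitting) is an assumption, since the PR does not state which normalization was applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6. Corpus-level WER is normally computed by summing edit distances and reference lengths across utterances rather than averaging per-utterance rates.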

Files

contrib/models/Qwen3-ASR-1.7B/
├── README.md                         # Usage, benchmarks, compatibility
├── src/
│   ├── __init__.py
│   ├── modeling_qwen3_asr.py         # NxDI model class (text decoder)
│   └── audio_encoder.py             # Encoder tracing wrapper
└── test/
    ├── integration/
    │   └── test_model.py            # E2E accuracy + performance tests
    └── unit/

Key Implementation Notes

Text decoder reuses NeuronQwen3VLForCausalLM with scatter_by_index_put for audio embedding injection
Audio encoder uses a static wrapper (StaticQwen3ASREncoder) that replaces the dynamic cu_seqlens with a pre-computed block-diagonal attention mask, so the graph has fixed shapes and can be traced
rope_scaling must use "rope_type": "default" (mRoPE applied externally via rotary_position_ids)
inline_weights_to_neff=True causes an accuracy regression for the encoder (cosine similarity 0.724 against the CPU output)
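
To illustrate the cu_seqlens replacement described above: cumulative sequence lengths mark block boundaries, and a block-diagonal mask encodes the same restriction (each token attends only within its own block) as a fixed-shape matrix. The sketch below builds such a mask in pure Python. It is not the code in StaticQwen3ASREncoder, and the treatment of padding positions is an assumption.

```python
from typing import List, Sequence


def block_diagonal_mask(cu_seqlens: Sequence[int], total_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True where query i may attend to key j.

    cu_seqlens holds cumulative block boundaries, e.g. [0, 2, 5] describes
    blocks covering positions 0-1 and 2-4. Positions at or beyond
    cu_seqlens[-1] are treated as padding and attend to nothing (assumption).
    """
    bounds = list(cu_seqlens)
    if not bounds or bounds[0] != 0 or bounds != sorted(bounds):
        raise ValueError("cu_seqlens must start at 0 and be non-decreasing")
    if bounds[-1] > total_len:
        raise ValueError("cu_seqlens exceeds total_len")
    block_id = [-1] * total_len  # -1 marks padding
    for block, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
        for pos in range(start, end):
            block_id[pos] = block
    return [
        [block_id[i] != -1 and block_id[i] == block_id[j] for j in range(total_len)]
        for i in range(total_len)
    ]
```

Because each bucket has a fixed length, a mask like this can be computed once per bucket ahead of time. Note that the padding rows here are fully masked, which a real attention implementation has to handle (for example by letting padding attend to itself) to avoid an undefined softmax.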

Maintainer

Jim Burtoft (@jimburtoft)

Implements Qwen3-ASR-1.7B (speech-to-text) on NeuronX Distributed Inference using a decomposed pipeline: traced audio encoder + NxDI text decoder.

Key features:
- Audio encoder traced with StaticQwen3ASREncoder (3 bucket NEFFs: 5s/10s/30s)
- Text decoder reuses NeuronQwen3VLForCausalLM scatter mechanism
- Full E2E pipeline with EXACT MATCH to CPU reference
- WER: 3.06% on LibriSpeech test-clean (50 samples)
- Performance: 4.9ms TPOT, 27.5ms TTFT, 50x real-time (30s audio)
- Validated on trn2.3xlarge TP=4, SDK 2.29

Includes integration tests and comprehensive README.

Adds vllm/ directory with:
- README.md: Patch instructions for vllm-neuron (constants, model loader, model runner, platform, utils)
- neuron_qwen3_asr_vllm.py: Full NeuronQwen3ASRForCausalLM class
- start-vllm-server.sh: Server launch script
- test_transcription.py: API test script

Validated: 140ms E2E latency for 2.9s audio via OpenAI chat completions API on trn2.3xlarge (TP=4, SDK 2.29, vLLM 0.16.0 + vllm-neuron 0.5.0).

jimburtoft added 2 commits May 8, 2026 23:05

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen3-asr
