Add YOLO26 object detection contrib model #151
Open
jimburtoft wants to merge 8 commits into
Ultralytics YOLO26 (n/s/m/l/x) on Trainium2 via torch_neuronx.trace(). All 5 detection variants plus pose and OBB task heads compile and run with high accuracy (CosSim 0.987-0.997).

Peak throughput on trn2.3xlarge (LNC=1, DP=8):
- YOLO26s: 1,523 img/s (1.43x vs A10G compiled)
- YOLO26m: 1,267 img/s (2.67x vs A10G compiled)
- YOLO26l: 1,093 img/s (2.95x vs A10G compiled)
- YOLO26x: 876 img/s (4.49x vs A10G compiled)

Includes modeling module, 13 integration tests (all passing), Jupyter notebook, and README with benchmarks.
Tested all 4 combinations:
- trn2.3xlarge SDK 2.28: 13/13 pytest passed
- trn2.3xlarge SDK 2.29: 13/13 pytest passed
- inf2.xlarge SDK 2.28: 6/6 standalone tests passed
- inf2.xlarge SDK 2.29: 6/6 standalone tests passed

inf2 single-core throughput: yolo26n 60-70 img/s, yolo26s 64-77 img/s.

Updated compatibility matrix and notebook prerequisites.
tejasamx-aws
approved these changes
May 10, 2026
| This contrib uses `torch_neuronx.trace()` rather than NxDI model classes because: (1) all variants fit trivially on a single NeuronCore (<180 MB NEFF), (2) there is no KV cache or token generation, and (3) the Conv2d-dominant architecture does not benefit from NxDI's attention infrastructure. Data Parallelism across NeuronCores provides throughput scaling. | ||
| Key Neuron porting challenges: | ||
| - **`topk`/`sort` unsupported:** End-to-end postprocessing requires `torch.topk` which fails with `NCC_EVRF029`. Solution: trace with `end2end=False` for raw output, run postprocessing on CPU. |
The topk/sort limitation requiring end2end=False means users get raw [B, 84, 8400] output and must implement their own NMS postprocessing. We should include a CPU postprocessing util function (even if simple) so users have a complete detection pipeline.
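A minimal sketch of the kind of CPU postprocessing this comment asks for, written in plain Python for one image. It assumes the 84 channels are 4 box values (cx, cy, w, h) followed by 80 class scores, which is an assumption about the raw head layout and not something confirmed in this thread. The names `decode_candidates` and `nms` are hypothetical and are not the PR's `postprocess_detections()`.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2
Detection = Tuple[Box, float, int]       # box, score, class index


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decode_candidates(pred: Sequence[Sequence[float]],
                      conf_thres: float = 0.25) -> List[Detection]:
    """Turn one image's raw [4 + num_classes, num_anchors] output into detections.

    Assumed layout: rows 0-3 are cx, cy, w, h; the remaining rows are class scores.
    """
    detections = []
    for i in range(len(pred[0])):
        scores = [row[i] for row in pred[4:]]
        cls = max(range(len(scores)), key=scores.__getitem__)
        if scores[cls] < conf_thres:
            continue
        cx, cy, w, h = (pred[k][i] for k in range(4))
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        detections.append((box, scores[cls], cls))
    return detections


def nms(detections: List[Detection], iou_thres: float = 0.45) -> List[Detection]:
    """Greedy per-class non-maximum suppression, highest score first."""
    kept: List[Detection] = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box unless it overlaps an already kept box of the same class.
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_thres for k in kept):
            kept.append(det)
    return kept
```

For a batch, the same two calls would run once per image on `output[b]` converted to nested lists; a vectorized version over tensors would be the natural choice for real workloads.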
| data = json.loads(result.stdout) | ||
| total = sum(dev.get("nc_count", 0) for dev in data) | ||
| return total | ||
| except Exception: |
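One way the bare `except Exception` in this fragment could be narrowed, as a sketch only. The diff does not show which command produces `result.stdout`; the `neuron-ls --json-output` invocation below is an assumption, and `count_cores` is a hypothetical helper split out so the parsing can be checked without Neuron hardware.

```python
import json
import subprocess


def count_cores(neuron_ls_json: str) -> int:
    """Sum the per-device "nc_count" fields from a JSON list of devices."""
    data = json.loads(neuron_ls_json)
    return sum(dev.get("nc_count", 0) for dev in data)


def get_neuron_core_count(default: int = 0) -> int:
    """Return the NeuronCore count, or `default` if it cannot be determined."""
    try:
        # Assumed command; the original source line is not visible in the diff.
        result = subprocess.run(
            ["neuron-ls", "--json-output"],
            capture_output=True, text=True, check=True, timeout=30,
        )
        return count_cores(result.stdout)
    except (OSError,                        # tool missing or not executable
            subprocess.CalledProcessError,  # non-zero exit status
            subprocess.TimeoutExpired,      # tool hung
            json.JSONDecodeError,           # output was not valid JSON
            AttributeError, TypeError):     # JSON had an unexpected shape
        return default
```

Listing the exception types keeps unrelated failures such as `KeyboardInterrupt` handling bugs or programming errors visible instead of silently returning a default.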
… document C2PSA issue

Review feedback from tejasamx-aws:
- Add postprocess_detections() CPU utility with NMS for complete detection pipeline from raw [B, 84, 8400] output
- Replace bare 'except Exception' with specific exception types in get_neuron_core_count()

C2PSA known issue (confirmed on SDK 2.29.1, inf2.xlarge):
- C2PSA attention module (layer 10) produces numerically incorrect output on NCv2 at all resolutions (CosSim ~0.46 vs CPU reference)
- Full-model CosSim of 0.99+ masks this because C2PSA output is diluted by correct backbone/neck/head outputs
- Backbone compilation fails at batch size >= 4 with non-square input (NCC_IPCC901 compiler error)
- Documented in Known Issues with workaround guidance
Root cause: neuronx-cc produces incorrect output for torch.Tensor.split() with unequal split sizes on dim=2 of a 4D tensor after a .view() reshape. The C2PSA Attention module's .split([key_dim, key_dim, head_dim], dim=2) triggers this, causing CosSim ~0.45 vs CPU (should be >0.99).

Fix: Patch Attention.forward in prepare_yolo26() to use explicit tensor slicing ([:, :, :key_dim, :]) instead of .split(). This compiles correctly and produces CosSim 0.9999.

Verified on inf2.xlarge, SDK 2.29.1:
- Before fix: C2PSA CosSim 0.45, full-model detection accuracy broken
- After fix: C2PSA CosSim 0.9999, matching CPU reference

The underlying compiler issue is tracked in aws-neuron-sdk#1323.
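To illustrate why the slicing patch is a safe substitute, the sketch below checks on CPU that explicit slices return the same tensors as the unequal `.split()`. The shapes and the helper name `split_qkv_by_slicing` are hypothetical and chosen only for illustration; this is not the patched `Attention.forward` from the PR.

```python
import torch


def split_qkv_by_slicing(qkv: torch.Tensor, key_dim: int, head_dim: int):
    """Slice q, k, v out of dim=2 instead of calling .split() with unequal sizes."""
    q = qkv[:, :, :key_dim, :]
    k = qkv[:, :, key_dim:2 * key_dim, :]
    v = qkv[:, :, 2 * key_dim:2 * key_dim + head_dim, :]
    return q, k, v


torch.manual_seed(0)
# Hypothetical sizes: batch, heads, key dim, value dim, flattened spatial positions.
B, num_heads, key_dim, head_dim, N = 2, 4, 16, 32, 25

x = torch.randn(B, num_heads * (2 * key_dim + head_dim), N)
qkv = x.view(B, num_heads, 2 * key_dim + head_dim, N)  # 4D tensor after .view()

reference = qkv.split([key_dim, key_dim, head_dim], dim=2)
sliced = split_qkv_by_slicing(qkv, key_dim, head_dim)

# On CPU both forms select the same elements, so the results are bit-identical.
matches = all(torch.equal(a, b) for a, b in zip(reference, sliced))
```

The equivalence holds in eager PyTorch; whether a given compiler lowers both forms correctly is a separate question, which is what the commit above reports.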
…around)

The neuronx-cc compiler crashes (exit code 70) when .split((c, c), dim=1) is used in C2f blocks at batch_size=4 with small spatial dimensions (H*W < ~264). Using .chunk(2, 1) -- semantically identical -- compiles correctly at all batch sizes and spatial dimensions.

Root cause investigation: only bs=4 triggers the failure (bs=1-3,5-16 all pass). The boundary is ~264 total spatial pixels at 256 channels. This is the same underlying .split() compiler bug as the C2PSA issue.

Verified: Layer 8 (C3k2) at [4, 256, 12, 20] now compiles with chunk.

See: aws-neuron/aws-neuron-sdk#1323
…ound

Root cause identified: C2PSA.forward uses .split((c,c), dim=1) which, when combined with downstream attention, corrupts all batch elements except element 0 (CosSim ~0.08-0.23 vs CPU reference). This is the same neuronx-cc .split() bug as Issues 1 and 2, manifesting as silent numerical corruption instead of a compilation failure.

Fix: Patch C2PSA.forward to use .chunk(2, 1) which is semantically identical but produces correct HLO.

Verified: full YOLO26l model at bs=2 now produces CosSim > 0.999 for all batch elements.
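Because the corruption described here spared batch element 0, a single similarity over the whole batch can hide it. The following is a rough sketch of a per-element check in plain Python, with each element's output flattened to a list of floats. The helper names are hypothetical, and the PR's own tests may compute CosSim differently.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def per_element_cossim(reference: Sequence[Sequence[float]],
                       output: Sequence[Sequence[float]]) -> List[float]:
    """One similarity value per batch element, comparing device output to CPU reference."""
    return [cosine_similarity(r, o) for r, o in zip(reference, output)]


def all_elements_match(reference, output, threshold: float = 0.999) -> bool:
    """True only if every batch element clears the threshold, not just the average."""
    return min(per_element_cossim(reference, output)) >= threshold
```

Checking the minimum across elements means a batch passes only when its worst element does, which is the property the verification line above relies on.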
Summary
torch_neuronx.trace()
Validation
Validated on 4 configurations: trn2.3xlarge × {SDK 2.28, 2.29} and inf2.xlarge × {SDK 2.28, 2.29}.
Peak Throughput (trn2.3xlarge, LNC=1, DP=8)
Files
Key Design Decisions
- `torch_neuronx.trace()` (not NxDI model classes): YOLO26 is a CNN with no KV cache, no attention matrices, no token generation. All variants fit on a single NeuronCore (<180 MB NEFF). Data Parallelism provides throughput scaling.
- `end2end=False`: `topk`/`sort` operations are not supported on Neuron (`NCC_EVRF029`). Raw `[B, 84, 8400]` output with CPU postprocessing (~0.1 ms overhead).
- … `NCC_IGCA030`). n/s use FP32.
- `--auto-cast` flags: `matmult` autocast produces NaN for Conv2d-dominant models.
- `--lnc 1` compiler flag required when running on LNC=1 mode.

Target
`aws-neuron/neuronx-distributed-inference` main branch.