contrib: add SmolVLA-Libero (HuggingFaceVLA/smolvla_libero) #157
Open
KevGomes1403 wants to merge 2 commits into
Description
Adds a NeuronX Distributed Inference port of HuggingFaceVLA/smolvla_libero, a SmolVLM2-backed flow-matching vision-language-action (VLA) policy fine-tuned on the LIBERO benchmark, under contrib/models/SmolVLA-Libero/.
The port compiles to three NEFFs (SigLIP vision encoder, SmolLM2-style 32-layer VLM prefix decoder returning a full KV cache, and a 32-layer action-expert denoiser with self/cross-attn alternation). The 10-step Euler denoising loop runs on CPU because static-shape compilation cannot host the Python for loop; each step is one call into the compiled denoiser NEFF.
Validated end-to-end against the upstream lerobot.SmolVLAPolicy reference on CPU using shared seeded initial noise: cos_sim = 0.999921, mean abs diff = 0.0015. Closed-loop LIBERO libero_object task 0 succeeds. Warm p50 latency for one full action chunk (vision + prefix + 10 Euler steps) is 62.5 ms on trn3pd98.3xlarge.
A working closed-loop LIBERO mujoco demo (with mp4 + GIF output) lives in my development repo: https://github.com/KevGomes1403/trainium-model-translation/tree/main/models/smol_vla
The demo is intentionally not part of this PR because it pulls in lerobot, mujoco, and the LIBERO suite at runtime.
Model Information
Model Name: SmolVLA-Libero (HuggingFaceVLA/smolvla_libero)
Model Architecture: Flow-matching vision-language-action policy. SigLIP vision encoder (12 layers, hidden 768) + pixel-shuffle 4× connector + SmolLM2-style VLM text decoder (32 layers, hidden 960, 15 q-heads / 5 KV-heads GQA, RoPE, RMSNorm) + action-expert denoiser (32 layers, expert hidden 480, even layers self-attn over concatenated past-VLM-KV + suffix KV, odd layers cross-attn from suffix Q to past VLM KV, sinusoidal timestep embedding).
Purpose: Robotic manipulation. Given two camera views, a natural-language instruction, and the current robot state, the policy produces a 50-step action chunk ([B, 50, 32]), of which the first 7 dims drive a Franka arm in the LIBERO mujoco simulation.
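The host-side Euler loop mentioned in the Description, and the slicing of the first 7 action dims, can be sketched as follows. This is a minimal illustration rather than the code in this PR: euler_denoise and make_toy_denoiser are hypothetical names, the toy denoiser only stands in for the compiled denoiser NEFF, and the time convention (integrating from t = 1 at pure noise down to t = 0) is an assumption about the flow-matching schedule, not something this PR states.

```python
import numpy as np

NUM_STEPS = 10             # Euler steps per chunk (from the PR description)
CHUNK_SHAPE = (1, 50, 32)  # [B, 50, 32] action chunk
ARM_DIMS = 7               # leading dims that drive the Franka arm


def euler_denoise(denoise_step, noise, num_steps=NUM_STEPS):
    """Integrate dx/dt = v(x, t) with fixed-step Euler from t=1 to t=0.

    denoise_step(x_t, t) is assumed to return the velocity v(x_t, t);
    in the port, each call would be one invocation of the compiled denoiser.
    """
    dt = -1.0 / num_steps
    x_t = noise.astype(np.float32).copy()
    for i in range(num_steps):
        t = 1.0 + i * dt          # t = 1.0, 0.9, ..., 0.1
        v_t = denoise_step(x_t, t)
        x_t = x_t + dt * v_t
    return x_t


def make_toy_denoiser(noise, target, calls):
    """Hypothetical stand-in: velocity of a straight line from target to noise."""
    def step(x_t, t):
        calls.append(t)           # record each call so the step count can be checked
        return noise - target
    return step


rng = np.random.default_rng(0)
noise = rng.standard_normal(CHUNK_SHAPE).astype(np.float32)
target = np.zeros(CHUNK_SHAPE, dtype=np.float32)
calls = []

chunk = euler_denoise(make_toy_denoiser(noise, target, calls), noise)
arm_actions = chunk[:, :, :ARM_DIMS]  # the 7 dims that would drive the arm
```

With a constant straight-line velocity, the ten Euler steps land on the target up to float32 rounding, which makes the toy setup a convenient check that the loop calls the denoiser exactly once per step.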
Checklist
Required Components
Accuracy Test (contrib/models/SmolVLA-Libero/test/integration/test_model.py)
test_lerobot_cpu_neuron_parity: loads lerobot.SmolVLAPolicy.from_pretrained(...) on CPU and compares the action chunk against the Neuron output with shared seeded initial noise. Asserts cos_sim ≥ 0.99 and mean abs diff < 0.05.
test_smoke_synthetic_chunk: full pipeline returns a finite [B, 50, 32] action chunk with non-zero variance.
test_warm_latency: sanity bound on warm p50 latency.
README.md with the following sections:
run_inference.py --action compile|run) and programmatic (from modeling_smolvla import SmolVLAPolicy).
tested (the 15/5 head split makes TP > 1 unproductive, but the code is portable).
HuggingFaceVLA/smolvla_libero.
pytest contrib/models/SmolVLA-Libero/test/integration/test_model.py gated by SMOLVLA_CKPT and SMOLVLA_NEFF env vars.
Source Code (src/)
config_constants.py: all architecture constants from the checkpoint.
weight_mapping.py: HF safetensors → 3 per-subgraph state dicts.
modeling_smolvla_vision.py: SigLIP-12L + connector (NEFF 1).
modeling_smolvla_text.py: VLM prefix + action-expert denoiser (NEFFs 2 + 3).
modeling_smolvla.py: SmolVLAPolicy orchestrator (compile / load / generate).
neuron_action_head_base.py: minimal NeuronDenoisingConfig shim (SmolVLA isn't a CausalLM, so it cannot reuse InferenceConfig).
run_inference.py: CLI (compile / run / benchmark).
ColumnParallelLinear, RowParallelLinear, ParallelEmbedding) so the code is portable to instances with friendlier head counts; on this hardware they no-op (TP=1). See "Notable architectural decisions" below.
Optional Components
Folder Structure
Testing
How did you test this change?
Tests run on trn3pd98.3xlarge under /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference (Neuron 2.29 / neuronx_distributed_inference 0.9.0):
run_inference.py --action run): cold + 10 warm iters.
accuracy check.
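For reference, the two metrics used by the accuracy check (cosine similarity and mean absolute difference, with the thresholds cos_sim ≥ 0.99 and mean abs diff < 0.05 quoted in the checklist) can be computed along these lines. This is a sketch under assumptions: the helper names are hypothetical and not taken from test_model.py, and both action chunks are assumed to be flattened into plain lists of floats first.

```python
import math

COS_SIM_MIN = 0.99        # threshold quoted in the checklist
MEAN_ABS_DIFF_MAX = 0.05  # threshold quoted in the checklist


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_abs_diff(a, b):
    """Average elementwise absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def check_parity(reference, candidate):
    """Return (cos_sim, mean_abs_diff, passed) for two flattened chunks."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    cos = cosine_similarity(reference, candidate)
    mad = mean_abs_diff(reference, candidate)
    return cos, mad, (cos >= COS_SIM_MIN and mad < MEAN_ABS_DIFF_MAX)


# Illustrative values only, not measurements from this PR.
reference = [0.5, -0.25, 0.75, 0.1, 1.0]
candidate = [x + 0.001 for x in reference]
cos, mad, passed = check_parity(reference, candidate)
```

Using both metrics together is a reasonable guard: cosine similarity ignores a uniform scale error, while mean absolute difference catches it.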
Test Results:
CLI bench (10 warm iters):
Closed-loop LIBERO libero_object task 0 (validated locally; demo scripts not part of this PR, see "Additional Information" below): success.
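A cold-plus-warm timing loop of the kind behind the CLI bench figures might look like the sketch below. It is an illustration only: bench is a hypothetical helper, the lambda is a placeholder workload rather than the policy's generate call, and the real benchmark in run_inference.py may measure things differently.

```python
import statistics
import time


def bench(fn, warm_iters=10):
    """Time one cold call, then warm_iters warm calls; report warm p50 in ms."""
    start = time.perf_counter()
    fn()
    cold_ms = (time.perf_counter() - start) * 1000.0

    warm_ms = []
    for _ in range(warm_iters):
        start = time.perf_counter()
        fn()
        warm_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "cold_ms": cold_ms,
        "warm_ms": warm_ms,
        "warm_p50_ms": statistics.median(warm_ms),  # p50 is the median
    }


# Placeholder workload; in practice this would be one full action-chunk call.
result = bench(lambda: sum(range(10_000)), warm_iters=10)
```

Reporting the median of the warm iterations keeps a single slow outlier from skewing the headline number.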
Compatibility
Tested with:
/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference, neuronx_distributed_inference 0.9.0)
trn3pd98.3xlarge)
Additional Information
Notable architectural decisions:
tp_degree = 1 because num_attention_heads = 15 and num_kv_heads = 5 do not divide cleanly into the 4 cores on trn3pd98.3xlarge. NxDI parallel primitives are still used throughout, so the code is portable to instances with divisor-friendly head counts; on this hardware sharding effectively no-ops.
Static-shape compilation cannot host the Python for loop as a single graph; the loop body is the compiled NEFF #3, called 10 times per chunk.
NeuronDenoisingConfig (in src/neuron_action_head_base.py) is used instead of InferenceConfig because SmolVLA is a flow-matching VLA, not a CausalLM. InferenceConfig's LLM-specific fields (KV-cache layout, sequence buckets, vocab size, etc.) have no meaning here. The shim exposes only the fields ModelWrapper.__init__() actually reads, plus the action-head-specific fields used by each subgraph wrapper.
Numerical-fidelity notes:
To match the lerobot reference output, the port replicates four lerobot-specific quirks not in the SmolVLM2 HF config:
resize_with_pad pads top+left only (image lands in the bottom-right corner of the 512×512 frame).
max_wavelength = 10000 (lerobot hardcodes this in apply_rope; SmolVLM2 HF config says 100000).
libero_processor in lerobot.processor.env_processor.
Related Issues
Link the issue you opened before this PR.
vLLM Integration
By submitting this PR, I confirm that:
to officially-supported models