contrib: add SmolVLA-Libero (HuggingFaceVLA/smolvla_libero) #157

Open
KevGomes1403 wants to merge 2 commits into aws-neuron:main from KevGomes1403:contrib/SmolVLA-Libero

Description

Adds a NeuronX Distributed Inference port of HuggingFaceVLA/smolvla_libero — a SmolVLM2-backed flow-matching vision-language-action (VLA) policy fine-tuned on the LIBERO benchmark — under contrib/models/SmolVLA-Libero/.

The port compiles to three NEFFs (SigLIP vision encoder, SmolLM2-style 32-layer VLM prefix decoder returning a full KV cache, and a 32-layer action-expert denoiser with self/cross-attn alternation). The 10-step Euler denoising loop runs on CPU because static-shape compilation cannot host the Python for loop; each step is one call into the compiled denoiser NEFF.
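
The host-side loop described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `denoise_step` is a hypothetical stand-in for one call into the compiled denoiser NEFF, and the integration direction (from t = 1 down to t = 0 with dt = -1/num_steps) is an assumption about the flow-matching convention.

```python
from typing import Callable, List

Vector = List[float]


def euler_denoise(
    noise: Vector,
    denoise_step: Callable[[Vector, float], Vector],
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=1 to t=0 with fixed-size Euler steps."""
    x = list(noise)
    dt = -1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 + i * dt
        # One fixed-shape call per step; in the port this would be the
        # compiled denoiser NEFF (hypothetical callable here).
        v = denoise_step(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field so the sketch runs without Neuron hardware:
# v(x, t) = x makes each step multiply x by (1 + dt) = 0.9.
def toy_velocity(x: Vector, t: float) -> Vector:
    return list(x)


result = euler_denoise([1.0, -2.0], toy_velocity)
```

Keeping the loop on the host means every call into the compiled graph sees identical tensor shapes, at the cost of one host-to-device round trip per step.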

Validated end-to-end against the upstream lerobot.SmolVLAPolicy reference on CPU using shared seeded initial noise: cos_sim = 0.999921, mean abs diff = 0.0015. Closed-loop LIBERO libero_object task 0 succeeds. Warm p50 latency for one full action chunk (vision + prefix + 10 Euler steps) is 62.5 ms on trn3pd98.3xlarge.
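
For readers unfamiliar with the parity metrics quoted above, the sketch below shows one way to compute them over a flattened action chunk. It assumes the metrics are taken over all elements of the chunk at once; the function names are hypothetical, and the thresholds are the ones the accuracy test is described as asserting (cos_sim ≥ 0.99, mean abs diff < 0.05).

```python
import math
from typing import Dict, Sequence


def parity_metrics(ref: Sequence[float], out: Sequence[float]) -> Dict[str, float]:
    """Cosine similarity plus mean/max absolute difference of two flat vectors."""
    if len(ref) != len(out) or len(ref) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    dot = sum(r * o for r, o in zip(ref, out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    if norm_ref == 0.0 or norm_out == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    diffs = [abs(r - o) for r, o in zip(ref, out)]
    return {
        "cos_sim": dot / (norm_ref * norm_out),
        "mean_abs": sum(diffs) / len(diffs),
        "max_abs": max(diffs),
    }


def passes_parity(m: Dict[str, float], min_cos: float = 0.99,
                  max_mean_abs: float = 0.05) -> bool:
    """Apply the two thresholds quoted for the accuracy test."""
    return m["cos_sim"] >= min_cos and m["mean_abs"] < max_mean_abs
```

Cosine similarity alone is insensitive to a uniform scale error, which is why it is paired with an absolute-difference bound.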

A working closed-loop LIBERO mujoco demo (with mp4 + GIF output) lives in my development repo: https://github.com/KevGomes1403/trainium-model-translation/tree/main/models/smol_vla
The demo is intentionally not part of this PR because it pulls in lerobot, mujoco, and the LIBERO suite at runtime.

Model Information

Model Name: SmolVLA-Libero (HuggingFaceVLA/smolvla_libero)

Model Architecture: Flow-matching vision-language-action policy. SigLIP vision encoder (12 layers, hidden 768) + pixel-shuffle 4× connector + SmolLM2-style VLM text decoder (32 layers, hidden 960, 15 q-heads / 5 KV-heads GQA, RoPE, RMSNorm) + action-expert denoiser (32 layers, expert hidden 480, even layers self-attn over concatenated past-VLM-KV + suffix KV, odd layers cross-attn from suffix Q to past VLM KV, sinusoidal timestep embedding).
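
The attention shape and layer alternation above can be made concrete with a small sketch. The class and function names are hypothetical, and deriving head_dim as hidden size divided by the number of query heads is an assumption rather than something read from the checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VLMAttentionShape:
    """Numbers quoted for the VLM text decoder; names are illustrative."""
    hidden_size: int = 960
    num_q_heads: int = 15
    num_kv_heads: int = 5

    @property
    def head_dim(self) -> int:
        # Assumption: hidden_size == num_q_heads * head_dim.
        return self.hidden_size // self.num_q_heads

    @property
    def q_heads_per_kv_head(self) -> int:
        # GQA group size: how many query heads share one KV head.
        return self.num_q_heads // self.num_kv_heads


def expert_layer_kind(layer_idx: int) -> str:
    """Even expert layers use self-attention, odd layers cross-attention."""
    return "self" if layer_idx % 2 == 0 else "cross"


shape = VLMAttentionShape()
schedule = [expert_layer_kind(i) for i in range(32)]
```

Under these assumptions each KV head serves three query heads, and the 32 expert layers split evenly into 16 self-attention and 16 cross-attention layers.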

Purpose: Robotic manipulation — given two camera views, a natural-language instruction, and the current robot state, produces a 50-step action chunk ([B, 50, 32]) of which the first 7 dims drive a Franka arm in the LIBERO
mujoco simulation.

Checklist

Required Components

  • Accuracy Test (contrib/models/SmolVLA-Libero/test/integration/test_model.py)

    • test_lerobot_cpu_neuron_parity — loads lerobot.SmolVLAPolicy.from_pretrained(...) on CPU and compares the action chunk against the Neuron output with shared seeded initial noise. Asserts cos_sim ≥ 0.99 and mean abs diff < 0.05.
    • test_smoke_synthetic_chunk — full pipeline returns a finite [B, 50, 32] action chunk with non-zero variance.
    • test_warm_latency — sanity bound on warm p50 latency.
    • Test compiles the three NEFFs from scratch (~90 s) on first run, then reuses them.
  • README.md with the following sections:

    • Usage Example: Both CLI (run_inference.py --action compile|run)
      and programmatic (from modeling_smolvla import SmolVLAPolicy).
    • Compatibility Matrix: Trn3 ✅ at Neuron 2.29; Trn2 / Trn1 / Inf2 not
      tested (the 15/5 head split makes TP > 1 unproductive, but the code
      is portable).
    • Example Checkpoints: HuggingFaceVLA/smolvla_libero.
    • Testing Instructions: pytest contrib/models/SmolVLA-Libero/test/integration/test_model.py
      gated by SMOLVLA_CKPT and SMOLVLA_NEFF env vars.
  • Source Code (src/)

    • config_constants.py — all architecture constants from the checkpoint.
    • weight_mapping.py — HF safetensors → 3 per-subgraph state dicts.
    • modeling_smolvla_vision.py — SigLIP-12L + connector (NEFF 1).
    • modeling_smolvla_text.py — VLM prefix + action-expert denoiser (NEFFs 2 + 3).
    • modeling_smolvla.py — SmolVLAPolicy orchestrator (compile / load / generate).
    • neuron_action_head_base.py — minimal NeuronDenoisingConfig shim (SmolVLA isn't a CausalLM, so it cannot reuse InferenceConfig).
    • run_inference.py — CLI (compile / run / benchmark).
    • All modules use NxDI parallel primitives (ColumnParallelLinear, RowParallelLinear, ParallelEmbedding), so the code stays portable to setups where the TP degree divides the head counts evenly. On this hardware they no-op (TP=1); see "Notable architectural decisions" below.
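
To illustrate the role described for weight_mapping.py (one checkpoint split into three per-subgraph state dicts), here is a minimal sketch. The key prefixes are hypothetical placeholders, not the actual key names in the checkpoint or in this PR, and plain Python values stand in for tensors.

```python
from typing import Any, Dict, List, Mapping, Tuple

# Hypothetical prefixes for illustration; real checkpoint keys may differ.
SUBGRAPH_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "vision": ("vision_model.", "connector."),
    "vlm_prefix": ("text_model.",),
    "action_expert": ("lm_expert.", "action_"),
}


def split_state_dict(full: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Route every key to exactly one subgraph; fail loudly on leftovers."""
    out: Dict[str, Dict[str, Any]] = {name: {} for name in SUBGRAPH_PREFIXES}
    unmatched: List[str] = []
    for key, value in full.items():
        for name, prefixes in SUBGRAPH_PREFIXES.items():
            if key.startswith(prefixes):  # str.startswith accepts a tuple
                out[name][key] = value
                break
        else:
            unmatched.append(key)
    if unmatched:
        raise KeyError(f"keys not assigned to any subgraph: {unmatched}")
    return out
```

Raising on unmatched keys is a deliberate choice in this sketch: a silently dropped weight would otherwise only show up later as a parity failure.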

Optional Components

  • Unit Tests — not included in this PR. Integration accuracy test is sufficient for v1; can be added in a follow-up if reviewers want finer-grained coverage of individual modules.

Folder Structure

contrib/models/SmolVLA-Libero/
├── README.md
├── src/
│   ├── __init__.py
│   ├── config_constants.py
│   ├── modeling_smolvla.py
│   ├── modeling_smolvla_text.py
│   ├── modeling_smolvla_vision.py
│   ├── neuron_action_head_base.py
│   ├── run_inference.py
│   └── weight_mapping.py
└── test/
    ├── __init__.py
    ├── integration/
    │   ├── __init__.py
    │   └── test_model.py
    └── unit/
        └── __init__.py

Testing

How did you test this change?

Tests run on trn3pd98.3xlarge under
/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference (Neuron 2.29 / neuronx_distributed_inference 0.9.0):

  1. Fresh compile of all three NEFFs from the contrib copy, ~91 s.
  2. CLI inference (run_inference.py --action run) — cold + 10 warm iters.
  3. Pytest integration suite (three tests, including the lerobot-CPU parity accuracy check), run with the commands below:
export SMOLVLA_CKPT=/home/ubuntu/.cache/huggingface/hub/models--HuggingFaceVLA--smolvla_libero/snapshots/<hash>/
export SMOLVLA_NEFF=/path/to/neff_output_dir
pytest contrib/models/SmolVLA-Libero/test/integration/test_model.py --capture=tee-sys
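
Since the tests are gated on the two environment variables exported above, a gate of that kind could look roughly like the sketch below. The helper name is hypothetical and this is not necessarily how test_model.py implements it.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED_ENV = ("SMOLVLA_CKPT", "SMOLVLA_NEFF")


def missing_env(
    names: Sequence[str] = REQUIRED_ENV,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in names if not environ.get(name)]
```

In a pytest module the result could feed a module-level marker such as `pytestmark = pytest.mark.skipif(bool(missing_env()), reason="SMOLVLA_CKPT / SMOLVLA_NEFF not set")`, so machines without the checkpoint skip the suite instead of failing it.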

Test Results:

contrib/models/SmolVLA-Libero/test/integration/test_model.py::test_smoke_synthetic_chunk
    [smolvla] load: 8.0s
    PASSED [ 33%]
contrib/models/SmolVLA-Libero/test/integration/test_model.py::test_warm_latency
    [smolvla] warm p50 latency: 62.4 ms over 5 iters
    PASSED [ 66%]
contrib/models/SmolVLA-Libero/test/integration/test_model.py::test_lerobot_cpu_neuron_parity
    [smolvla] Neuron vs lerobot CPU parity: cos_sim=0.999921 max_abs=0.0373 mean_abs=0.0015
    PASSED [100%]

======================= 3 passed, 17 warnings in 24.05s ========================

CLI bench (10 warm iters):

Loaded in 7.2s
Cold:  75.1 ms  shape=(1, 50, 32)  hasNaN=False  mean=-0.0123  std=0.2718
Warm:  p50=62.5 ms  p99=65.4 ms  min=60.7 ms  max=65.4 ms
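
A warm-latency summary like the one above can be produced with a few lines of timing code. The sketch below is an assumption about the general approach, not the implementation in run_inference.py; `fn` is a hypothetical zero-argument callable wrapping one full action-chunk inference, and p99 uses the nearest-rank definition.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 1,
                       iters: int = 10) -> List[float]:
    """Run fn a few times untimed, then collect per-call wall time in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """p50 as the median; p99 by nearest rank over the sorted samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
    }
```

With only 10 samples a nearest-rank p99 coincides with the maximum, which is consistent with p99 and max both reading 65.4 ms in the output above; a larger iteration count would be needed for a meaningful tail estimate.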

Closed-loop LIBERO libero_object task 0 (validated locally; the demo scripts are not part of this PR, see the development repo linked in the Description): success.

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference,
    neuronx_distributed_inference 0.9.0)
  • Instance Type(s): Trn3 (trn3pd98.3xlarge)
  • PyTorch Version: 2.9.1
  • Python Version: 3.12.3

Additional Information

Notable architectural decisions:

  • tp_degree = 1 because num_attention_heads = 15 and num_kv_heads = 5 are not divisible by the 4 cores on trn3pd98.3xlarge (the only common divisors of 15 and 5 are 1 and 5). NxDI parallel primitives are still used throughout, so the code is portable to setups where the core count divides both head counts; on this hardware sharding effectively no-ops.
  • 10-step Euler denoising loop runs on CPU. Static-shape compilation cannot host a Python for loop as a single graph; the loop body is the compiled NEFF #3 (the action-expert denoiser), called 10 times per chunk.
  • Custom NeuronDenoisingConfig (in src/neuron_action_head_base.py) is used instead of InferenceConfig because SmolVLA is a flow-matching VLA, not a CausalLM — InferenceConfig's LLM-specific fields (KV-cache layout, sequence buckets, vocab size, etc.) have no meaning here. The shim exposes only the fields ModelWrapper.__init__() actually reads, plus the action-head-specific fields used by each subgraph wrapper.
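
The tp_degree reasoning in the first bullet can be checked with a short calculation. The sketch assumes heads must be sharded evenly across ranks, and it ignores any head padding or KV-head replication a framework might offer; the function name is hypothetical.

```python
from typing import List


def even_tp_degrees(num_q_heads: int, num_kv_heads: int, num_cores: int) -> List[int]:
    """TP degrees up to num_cores that divide both head counts evenly."""
    return [
        tp
        for tp in range(1, num_cores + 1)
        if num_q_heads % tp == 0 and num_kv_heads % tp == 0
    ]


smolvla_options = even_tp_degrees(num_q_heads=15, num_kv_heads=5, num_cores=4)
friendlier_options = even_tp_degrees(num_q_heads=16, num_kv_heads=4, num_cores=4)
```

Under that assumption the 15/5 split leaves only TP = 1 on four cores, whereas a 16/4 split would allow 1, 2, or 4.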

Numerical-fidelity notes:

To match the lerobot reference output, the port replicates four
lerobot-specific quirks not in the SmolVLM2 HF config:

  1. resize_with_pad pads top+left only (image lands in the bottom-right corner of the 512×512 frame).
  2. Pad-aware attention: dynamic 2D mask + cumsum-based position_ids that skip padding tokens.
  3. RoPE max_wavelength = 10000 (lerobot hardcodes this in apply_rope; SmolVLM2 HF config says 100000).
  4. 180° image flip for the LIBERO env per libero_processor in lerobot.processor.env_processor.
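
The four quirks above are easier to follow as code. The helpers below are simplified, framework-free sketches with hypothetical names, operating on plain Python lists rather than tensors; they illustrate the behavior described, not the implementation in this PR or in lerobot.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def pad_top_left(src_h: int, src_w: int, dst_h: int = 512,
                 dst_w: int = 512) -> Tuple[int, int, int, int]:
    """Aspect-preserving resize; returns (new_h, new_w, pad_top, pad_left).

    All padding goes on the top and left, so the image ends up in the
    bottom-right corner of the destination frame.
    """
    ratio = max(src_h / dst_h, src_w / dst_w)
    new_h, new_w = int(src_h / ratio), int(src_w / ratio)
    return new_h, new_w, max(0, dst_h - new_h), max(0, dst_w - new_w)


def position_ids_from_mask(mask: Sequence[int]) -> List[int]:
    """cumsum(mask) - 1: padding tokens (mask 0) do not advance the position."""
    return [c - 1 for c in accumulate(int(m) for m in mask)]


def rope_timescales(head_dim: int, max_wavelength: float = 10_000.0) -> List[float]:
    """Per-pair RoPE timescales; the base changes every rotation frequency."""
    return [max_wavelength ** (2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate_180(image: List[List[int]]) -> List[List[int]]:
    """Flip rows and columns, equivalent to a 180 degree rotation."""
    return [row[::-1] for row in image[::-1]]
```

For example, a 480×640 input resized into a 512×512 frame becomes 384×512 with 128 rows of padding on top, and swapping the RoPE base from 10000 to 100000 changes every timescale except the first, which is why that mismatch alone would break parity with the reference.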

Related Issues

Link the issue you opened before this PR.

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared
    to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included
