
Add FPAN-based version of fltflt addition with reduced latency#1172

Open
tbensonatl wants to merge 2 commits into main from add-fpan-based-fltflt-addition
Conversation

@tbensonatl
Collaborator

Switch to a floating-point accumulation network (FPAN) based approach for fltflt addition, as presented in "High-Performance Branch-Free Algorithms for Extended-Precision Floating-Point Arithmetic" by Zhang and Aiken. This formulation uses the same number of floating-point operations (20) but has a shorter critical path (10 ops vs 13) and thus higher ILP.
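For reference, the FPAN dataflow can be sketched in plain C++ (a minimal CPU-side sketch assuming the standard Knuth TwoSum and Dekker FastTwoSum building blocks; the struct and helper names are illustrative, not the library's actual API, and the code assumes strict IEEE float evaluation, i.e. no -ffast-math):

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for the library's double-float type.
struct fltflt { float hi, lo; };

// TwoSum (Knuth): exact sum of two floats as hi + lo; 6 flops.
static fltflt two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// FastTwoSum (Dekker): exact sum when |a| >= |b|; 3 flops.
static fltflt fast_two_sum(float a, float b) {
    float s = a + b;
    float e = b - (s - a);
    return {s, e};
}

// FPAN-form addition: 20 flops total. The two TwoSums are
// independent, so they can issue in parallel, shortening the
// serial dependency chain relative to the Thall formulation.
static fltflt fltflt_add(fltflt a, fltflt b) {
    fltflt s = two_sum(a.hi, b.hi);       // high parts
    fltflt t = two_sum(a.lo, b.lo);       // low parts (independent of s)
    fltflt q = fast_two_sum(s.hi, t.hi);  // merge high sums
    float st_lo  = s.lo + t.lo;
    float stq_lo = st_lo + q.lo;
    return fast_two_sum(q.hi, stq_lo);    // renormalize
}
```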

Added a new fltflt_add latency benchmark. The throughput of fltflt_add() did not change because the number of operations remains the same, but latency was reduced by ~17-18% in the most latency-exposed benchmark.

Also removed the separate fltflt and sarbp benchmark scripts and replaced them with a shorter benchmark script that consumes nvbench's structured JSON output rather than parsing its standard output.


Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl self-assigned this May 6, 2026

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR replaces the fltflt_add implementation with the Floating-Point Accumulation Network (FPAN) formulation from Zhang & Aiken (SC'25), reducing the critical-path depth from 13 to 10 fp32 ops while keeping the same 20-op total (no throughput regression). A new latency benchmark is added to expose and measure this improvement, and the two separate benchmark scripts are replaced by a unified run_benchmarks.py that parses nvbench JSON output.

  • include/matx/kernels/fltflt.h: fltflt_add(fltflt, fltflt) is rewritten to run the FastTwoSum(s.hi, t.hi) concurrently with the lo-side add, shortening the serial dependency chain; the fltflt_add(fltflt, float) overload is unchanged.
  • bench/00_misc/fltflt_arithmetic.cu: Adds warmup_gpu_once(), bump_ulp() helpers, renames the existing add benchmark to _throughput, and introduces fltflt_bench_add_latency / chain_add_kernel to sweep the latency/throughput trade-off across block counts.
  • bench/scripts/: Removes run_fltflt_benchmarks.py and run_sarbp_benchmarks.py, replacing them with run_benchmarks.py that uses nvbench JSON (stable API) instead of parsing ANSI-colored terminal output, and preserves both profiles' original timeouts.
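The latency benchmark's core idea is a serial chain of dependent adds: each step must wait on the previous result, so elapsed time tracks per-add latency rather than throughput. A CPU-side analogue can illustrate it (the real chain_add_kernel is a CUDA kernel; the code below, including the inlined fltflt helpers, is a sketch rather than the PR's code):

```cpp
#include <cassert>
#include <cmath>

struct fltflt { float hi, lo; };

static fltflt two_sum(float a, float b) {
    float s = a + b, bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}
static fltflt fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}
static fltflt fltflt_add(fltflt a, fltflt b) {
    fltflt s = two_sum(a.hi, b.hi);
    fltflt t = two_sum(a.lo, b.lo);
    fltflt q = fast_two_sum(s.hi, t.hi);
    return fast_two_sum(q.hi, (s.lo + t.lo) + q.lo);
}

// Serial dependency chain: iteration i cannot begin until i-1
// finishes, exposing per-add latency instead of throughput.
static fltflt chain_add(fltflt acc, fltflt inc, int n) {
    for (int i = 0; i < n; ++i) acc = fltflt_add(acc, inc);
    return acc;
}
```

As a side benefit, the chain also shows why the extended precision matters: adding 1e-8 to 1.0 a thousand times is a no-op in plain fp32 (1e-8 is below half an ULP of 1.0), while the fltflt accumulator retains the contribution.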

Confidence Score: 5/5

Safe to merge — the algorithmic change is a well-documented drop-in replacement from a peer-reviewed paper, op count is identical, and the benchmark additions are self-contained.

The fltflt_add rewrite directly implements Figure 2 from Zhang & Aiken SC'25 and the comment includes a detailed critical-path analysis. Both previous review concerns (missing exec.sync() in the latency benchmark and missing subprocess timeout) are addressed in this revision. No correctness regressions are visible in the changed paths.

No files require special attention.

Important Files Changed

  • include/matx/kernels/fltflt.h: Core algorithmic change: fltflt_add(fltflt, fltflt) rewritten using FPAN from Zhang & Aiken SC'25 Fig 2, reducing critical-path depth from 13 to 10 fp32 ops with no op-count change. Float overload and all other functions untouched.
  • bench/00_misc/fltflt_arithmetic.cu: Adds warmup_gpu_once(), bump_ulp() helpers, renames the add benchmark to _throughput, and adds a new fltflt_bench_add_latency with a serial chain_add_kernel; all benchmark functions now call warmup_gpu_once() followed by exec.sync() before the timing window.
  • bench/scripts/run_benchmarks.py: New unified benchmark runner replacing two deleted scripts; reads nvbench JSON (stable API) for result parsing, preserves original per-profile timeouts (300 s fltflt / 600 s sarbp), and handles timeouts and non-zero exit codes cleanly.
  • bench/scripts/run_fltflt_benchmarks.py: Deleted — functionality absorbed into run_benchmarks.py with improved JSON-based parsing.
  • bench/scripts/run_sarbp_benchmarks.py: Deleted — functionality absorbed into run_benchmarks.py with improved JSON-based parsing.
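The bump_ulp() helper named above is presumably a one-ULP perturbation so that repeated benchmark passes do not reuse bit-identical inputs (which compilers or hardware could otherwise special-case). A guess at its shape, hypothetical and possibly differing from the PR's actual helper in direction or edge-case handling:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical sketch: advance x by one unit in the last place toward +inf.
inline float bump_ulp(float x) {
    return std::nextafterf(x, std::numeric_limits<float>::infinity());
}
```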

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph FPAN["fltflt_add — FPAN form (new, depth 10)"]
        A1["s = TwoSum(a.hi, b.hi)"] --> A2["q = FastTwoSum(s.hi, t.hi)"]
        B1["t = TwoSum(a.lo, b.lo)"] --> A2
        B1 --> B2["st_lo = s.lo + t.lo"]
        A2 --> B3["stq_lo = st_lo + q.lo"]
        B2 --> B3
        B3 --> A4["return FastTwoSum(q.hi, stq_lo)"]
    end

    subgraph THALL["fltflt_add — Thall form (old, depth 13)"]
        C1["s = TwoSum(a.hi, b.hi)"] --> C3["s.lo = s.lo + t.hi"]
        C2["t = TwoSum(a.lo, b.lo)"] --> C3
        C3 --> C4["s = FastTwoSum(s.hi, s.lo)"]
        C4 --> C5["s.lo = s.lo + t.lo"]
        C2 --> C5
        C5 --> C6["s = FastTwoSum(s.hi, s.lo)"]
    end

Reviews (2): Last reviewed commit: "Add sync to fltflt benchmark and timeout..."

@tbensonatl
Collaborator Author

/build

@coveralls

Coverage Status

Coverage is 94.308% when pulling add-fpan-based-fltflt-addition into main. No base build found for main.

constexpr int block_size = 256;
int grid_size = static_cast<int>((size + block_size - 1) / block_size);

warmup_gpu_once();
Collaborator

Isn't this the job of nvbench?

Collaborator Author

It is. I was seeing high variability in the first benchmark run (add_throughput with floats). Adding the warmup anecdotally stabilized the first run, but I can dig into it a bit more. nvbench will definitely handle single-run warmup (loading/copying the kernel, etc.), but it didn't seem to be handling clock ramp-up.
