
Add FPAN-based version of fltflt addition with reduced latency#1172

Open
tbensonatl wants to merge 2 commits into main from add-fpan-based-fltflt-addition
Conversation

@tbensonatl
Collaborator

Switch to a floating-point accumulation network (FPAN) based approach for fltflt addition, as presented in "High-Performance Branch-Free Algorithms for Extended-Precision Floating-Point Arithmetic" by Zhang and Aiken. This formulation uses the same number of floating-point operations (20) but has a shorter critical path (10 ops vs 13) and thus higher ILP.
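For reference, the FPAN dataflow can be sketched in plain C++ (a minimal CPU-side sketch assuming the standard Knuth TwoSum and Dekker FastTwoSum building blocks; the struct and helper names are illustrative, not the library's actual API, and the code assumes strict IEEE float evaluation, i.e. no -ffast-math):

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for the library's double-float type.
struct fltflt { float hi, lo; };

// TwoSum (Knuth): exact sum of two floats as hi + lo; 6 flops.
static fltflt two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// FastTwoSum (Dekker): exact sum when |a| >= |b|; 3 flops.
static fltflt fast_two_sum(float a, float b) {
    float s = a + b;
    float e = b - (s - a);
    return {s, e};
}

// FPAN-form addition: 20 flops total. The two TwoSums are
// independent, so they can issue in parallel, shortening the
// serial dependency chain relative to the Thall formulation.
static fltflt fltflt_add(fltflt a, fltflt b) {
    fltflt s = two_sum(a.hi, b.hi);       // high parts
    fltflt t = two_sum(a.lo, b.lo);       // low parts (independent of s)
    fltflt q = fast_two_sum(s.hi, t.hi);  // merge high sums
    float st_lo  = s.lo + t.lo;
    float stq_lo = st_lo + q.lo;
    return fast_two_sum(q.hi, stq_lo);    // renormalize
}
```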

Added a new fltflt_add latency benchmark. The throughput of fltflt_add() did not change because the number of operations remains the same, but latency was reduced by ~17-18% in the most latency-exposed benchmark.

Also removed the separate fltflt and sarbp benchmark scripts and replaced them with a shorter benchmark script that consumes nvbench's structured JSON output rather than parsing its standard output.


Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl self-assigned this May 6, 2026

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR replaces the fltflt_add implementation with the Floating-Point Accumulation Network (FPAN) formulation from Zhang & Aiken (SC'25), reducing the critical-path depth from 13 to 10 fp32 ops while keeping the same 20-op total (no throughput regression). A new latency benchmark is added to expose and measure this improvement, and the two separate benchmark scripts are replaced by a unified run_benchmarks.py that parses nvbench JSON output.

  • include/matx/kernels/fltflt.h: fltflt_add(fltflt, fltflt) is rewritten to run the FastTwoSum(s.hi, t.hi) concurrently with the lo-side add, shortening the serial dependency chain; the fltflt_add(fltflt, float) overload is unchanged.
  • bench/00_misc/fltflt_arithmetic.cu: Adds warmup_gpu_once(), bump_ulp() helpers, renames the existing add benchmark to _throughput, and introduces fltflt_bench_add_latency / chain_add_kernel to sweep the latency/throughput trade-off across block counts.
  • bench/scripts/: Removes run_fltflt_benchmarks.py and run_sarbp_benchmarks.py, replacing them with run_benchmarks.py that uses nvbench JSON (stable API) instead of parsing ANSI-colored terminal output, and preserves both profiles' original timeouts.
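The latency benchmark's core idea is a serial chain of dependent adds: each step must wait on the previous result, so elapsed time tracks per-add latency rather than throughput. A CPU-side analogue can illustrate it (the real chain_add_kernel is a CUDA kernel; the code below, including the inlined fltflt helpers, is a sketch rather than the PR's code):

```cpp
#include <cassert>
#include <cmath>

struct fltflt { float hi, lo; };

static fltflt two_sum(float a, float b) {
    float s = a + b, bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}
static fltflt fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}
static fltflt fltflt_add(fltflt a, fltflt b) {
    fltflt s = two_sum(a.hi, b.hi);
    fltflt t = two_sum(a.lo, b.lo);
    fltflt q = fast_two_sum(s.hi, t.hi);
    return fast_two_sum(q.hi, (s.lo + t.lo) + q.lo);
}

// Serial dependency chain: iteration i cannot begin until i-1
// finishes, exposing per-add latency instead of throughput.
static fltflt chain_add(fltflt acc, fltflt inc, int n) {
    for (int i = 0; i < n; ++i) acc = fltflt_add(acc, inc);
    return acc;
}
```

As a side benefit, the chain also shows why the extended precision matters: adding 1e-8 to 1.0 a thousand times is a no-op in plain fp32 (1e-8 is below half an ULP of 1.0), while the fltflt accumulator retains the contribution.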

Confidence Score: 5/5

Safe to merge — the algorithmic change is a well-documented drop-in replacement from a peer-reviewed paper, op count is identical, and the benchmark additions are self-contained.

The fltflt_add rewrite directly implements Figure 2 from Zhang & Aiken SC'25 and the comment includes a detailed critical-path analysis. Both previous review concerns (missing exec.sync() in the latency benchmark and missing subprocess timeout) are addressed in this revision. No correctness regressions are visible in the changed paths.

No files require special attention.

Important Files Changed

  • include/matx/kernels/fltflt.h: Core algorithmic change: fltflt_add(fltflt, fltflt) rewritten using FPAN from Zhang & Aiken SC'25 Fig 2, reducing critical-path depth from 13 to 10 fp32 ops with no op-count change. Float overload and all other functions untouched.
  • bench/00_misc/fltflt_arithmetic.cu: Adds warmup_gpu_once(), bump_ulp() helpers, renames the add benchmark to _throughput, and adds a new fltflt_bench_add_latency with a serial chain_add_kernel; all benchmark functions now call warmup_gpu_once() followed by exec.sync() before the timing window.
  • bench/scripts/run_benchmarks.py: New unified benchmark runner replacing two deleted scripts; reads nvbench JSON (stable API) for result parsing, preserves original per-profile timeouts (300 s fltflt / 600 s sarbp), and handles timeouts and non-zero exit codes cleanly.
  • bench/scripts/run_fltflt_benchmarks.py: Deleted — functionality absorbed into run_benchmarks.py with improved JSON-based parsing.
  • bench/scripts/run_sarbp_benchmarks.py: Deleted — functionality absorbed into run_benchmarks.py with improved JSON-based parsing.
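The bump_ulp() helper named above is presumably a one-ULP perturbation so that repeated benchmark passes do not reuse bit-identical inputs (which compilers or hardware could otherwise special-case). A guess at its shape, hypothetical and possibly differing from the PR's actual helper in direction or edge-case handling:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical sketch: advance x by one unit in the last place toward +inf.
inline float bump_ulp(float x) {
    return std::nextafterf(x, std::numeric_limits<float>::infinity());
}
```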

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph FPAN["fltflt_add — FPAN form (new, depth 10)"]
        A1["s = TwoSum(a.hi, b.hi)"] --> A2["q = FastTwoSum(s.hi, t.hi)"]
        B1["t = TwoSum(a.lo, b.lo)"] --> A2
        B1 --> B2["st_lo = s.lo + t.lo"]
        A2 --> B3["stq_lo = st_lo + q.lo"]
        B2 --> B3
        B3 --> A4["return FastTwoSum(q.hi, stq_lo)"]
    end

    subgraph THALL["fltflt_add — Thall form (old, depth 13)"]
        C1["s = TwoSum(a.hi, b.hi)"] --> C3["s.lo = s.lo + t.hi"]
        C2["t = TwoSum(a.lo, b.lo)"] --> C3
        C3 --> C4["s = FastTwoSum(s.hi, s.lo)"]
        C4 --> C5["s.lo = s.lo + t.lo"]
        C2 --> C5
        C5 --> C6["s = FastTwoSum(s.hi, s.lo)"]
    end

Reviews (2): Last reviewed commit: "Add sync to fltflt benchmark and timeout..."

@tbensonatl
Collaborator Author

/build

@coveralls

Coverage Status

Coverage is 94.308% when pulling add-fpan-based-fltflt-addition into main. No base build found for main.

constexpr int block_size = 256;
int grid_size = static_cast<int>((size + block_size - 1) / block_size);

warmup_gpu_once();
Collaborator

Isn't this the job of nvbench?

Collaborator Author

It is. I was seeing high variability in the first benchmark run (add_throughput with floats). Adding the warmup anecdotally stabilized the first run, but I can dig into it a bit more. nvbench will definitely handle single-run warmup (loading/copying the kernel, etc.), but it didn't seem to be handling clock ramp-up.
