Add FPAN-based version of fltflt addition with reduced latency #1172
tbensonatl wants to merge 2 commits into main from
Conversation
Switch to a floating-point accumulation network (FPAN) based approach for fltflt addition, as presented in "High-Performance Branch-Free Algorithms for Extended-Precision Floating-Point Arithmetic" by Zhang and Aiken. This formulation uses the same number of floating-point operations (20), but has a shorter critical path (10 vs 13) and thus higher ILP.

Added a new fltflt_add latency benchmark. The throughput of fltflt_add() did not change because the number of operations remains the same, but the latency dropped by ~17-18% in the most latency-exposed benchmark.

Also removed the separate fltflt and sarbp benchmark scripts and replaced them with a shorter benchmark script that leverages structured output from nvbench (JSON) rather than parsing the nvbench standard output.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
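For context, the TwoSum and FastTwoSum nodes in the flowchart below are the standard error-free transformations that both formulations are built from. A minimal host-side C++ sketch (illustrative names and signatures — the repo's device-side versions may differ):

```cpp
// Double-double value: hi holds the leading part, lo the rounding error.
struct dd {
    double hi;
    double lo;
};

// Knuth's TwoSum: computes a + b exactly as hi + lo with no assumption
// on the relative magnitudes of a and b (6 flops, branch-free).
dd two_sum(double a, double b) {
    double s  = a + b;
    double bp = s - a;       // b as actually represented in s
    double ap = s - bp;      // a as actually represented in s
    return dd{s, (a - ap) + (b - bp)};
}

// Dekker's FastTwoSum: computes a + b exactly as hi + lo,
// assuming |a| >= |b| (3 flops, branch-free).
dd fast_two_sum(double a, double b) {
    double s = a + b;
    return dd{s, b - (s - a)};
}
```

Both return the rounded sum in `hi` and the exact rounding error in `lo`, so `hi + lo` equals the infinitely precise sum of the inputs.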
Greptile Summary: This PR replaces the Thall-form fltflt addition with an FPAN-based formulation from Zhang and Aiken that keeps the same operation count (20) but shortens the critical path from 13 to 10, adds a fltflt_add latency benchmark, and consolidates the separate benchmark scripts into one that consumes nvbench's JSON output.
Confidence Score: 5/5 — Safe to merge. The algorithmic change is a well-documented drop-in replacement from a peer-reviewed paper, the op count is identical, and the benchmark additions are self-contained. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph FPAN["fltflt_add — FPAN form (new, depth 10)"]
        A1["s = TwoSum(a.hi, b.hi)"] --> A2["q = FastTwoSum(s.hi, t.hi)"]
        B1["t = TwoSum(a.lo, b.lo)"] --> A2
        A1 --> B2["st_lo = s.lo + t.lo"]
        B1 --> B2
        A2 --> B3["stq_lo = st_lo + q.lo"]
        B2 --> B3
        B3 --> A4["return FastTwoSum(q.hi, stq_lo)"]
    end
    subgraph THALL["fltflt_add — Thall form (old, depth 13)"]
        C1["s = TwoSum(a.hi, b.hi)"] --> C3["s.lo = s.lo + t.hi"]
        C2["t = TwoSum(a.lo, b.lo)"] --> C3
        C3 --> C4["s = FastTwoSum(s.hi, s.lo)"]
        C4 --> C5["s.lo = s.lo + t.lo"]
        C2 --> C5
        C5 --> C6["s = FastTwoSum(s.hi, s.lo)"]
    end
```
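The FPAN dataflow above can be sketched directly in C++. This is a hypothetical reconstruction from the flowchart, not the PR's actual code: `fltflt_add_fpan` and the helper names are illustrative, and the repo operates on device-side floats rather than host doubles. Note that `s` and `t` have no dependence on each other, which is where the extra ILP comes from:

```cpp
struct dd {
    double hi;
    double lo;
};

// Knuth's TwoSum (6 flops): exact a + b, no magnitude assumption.
dd two_sum(double a, double b) {
    double s  = a + b;
    double bp = s - a;
    double ap = s - bp;
    return dd{s, (a - ap) + (b - bp)};
}

// Dekker's FastTwoSum (3 flops): exact a + b, assuming |a| >= |b|.
dd fast_two_sum(double a, double b) {
    double s = a + b;
    return dd{s, b - (s - a)};
}

// FPAN-form double-double addition per the flowchart (20 flops total):
// s and t are independent and can issue in parallel, and st_lo overlaps
// with the FastTwoSum producing q, shortening the critical path.
dd fltflt_add_fpan(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);          // high parts
    dd t = two_sum(a.lo, b.lo);          // low parts, independent of s
    dd q = fast_two_sum(s.hi, t.hi);
    double st_lo  = s.lo + t.lo;         // overlaps with the line above
    double stq_lo = st_lo + q.lo;
    return fast_two_sum(q.hi, stq_lo);
}
```

Counting flops: 6 + 6 for the two TwoSums, 3 + 3 for the two FastTwoSums, and 2 plain additions — 20 in total, matching the Thall form.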
Reviews (2): Last reviewed commit: "Add sync to fltflt benchmark and timeout..."
/build
```cpp
constexpr int block_size = 256;
int grid_size = static_cast<int>((size + block_size - 1) / block_size);
```

```cpp
warmup_gpu_once();
```
Isn't this the job of nvbench?
It is. I was seeing high variability in the first benchmark run (add_throughput with floats). Adding the warmup anecdotally stabilized that first run, but I can dig into it a bit more. nvbench will definitely handle single-run warmup (loading/copying the kernel, etc.), but it didn't seem to be handling clock ramp-up.