vLLM/SGLang: comparison report#904
Conversation
📝 WalkthroughWalkthroughThis PR introduces a shared LLM serving benchmark comparison report framework. It refactors ChangesLLM Serving Comparison Reports
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
bd9012d to
2596726
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 175-183: diff_comparison_values currently uses repr(v) to detect
differences which produces false positives (e.g., 1 vs 1.0 or dicts with
different insertion order); replace that with a canonicalization + equality
check: add a small helper (inside diff_comparison_values) named
canonicalize(value) that (a) for Mapping or list/tuple returns json.dumps(value,
sort_keys=True, default=str) to normalize structural content/order, (b) for
numeric types normalizes to a single numeric form (e.g., float(value) or
Decimal) so 1 and 1.0 compare equal, and (c) otherwise returns the value itself
(or str(value) as a last resort); then compute canonical_forms =
[canonicalize(v) for v in all_values] and use len(set(canonical_forms)) > 1 to
decide diffs and assign diff[key] = all_values as before (refer to
diff_comparison_values, values_by_run, all_values).
In `@src/cloudai/workloads/common/llm_serving_report.py`:
- Line 176: The calls to extract_data_as_df are duplicated in create_tables and
create_charts causing repeated parsing; modify the class to compute and cache
extracted DataFrames once and reuse them: e.g., add a lightweight cache
attribute (like self._df_cache) keyed by the test run object or an immutable id
(referencing extract_data_as_df, create_tables, create_charts, and
group.items/item.tr), populate it the first time you call extract_data_as_df for
a given tr, and have both create_tables and create_charts read from that cache
(or accept an injected precomputed list of DataFrames) instead of calling
extract_data_as_df again.
In `@src/cloudai/workloads/vllm/vllm_comparison_report.py`:
- Around line 24-25: The current import in vllm_comparison_report.py pulls
VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from
the package root (cloudai.workloads.vllm) which creates a circular init
dependency; change those to direct imports from the submodules that actually
define them (use relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.
In `@tests/workloads/common/test_llm_serving_report.py`:
- Around line 40-137: Add a second test mirroring
test_llm_comparison_report_generates_html that exercises the SGLang path
end-to-end: create two cloudai.core.TestRun instances for SGLang (use
SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6a508bb1-229e-4702-b1ad-bdb49888fef9
📒 Files selected for processing (12)
src/cloudai/registration.pysrc/cloudai/report_generator/comparison_report.pysrc/cloudai/report_generator/groups.pysrc/cloudai/report_generator/util.pysrc/cloudai/workloads/common/llm_serving_report.pysrc/cloudai/workloads/sglang/__init__.pysrc/cloudai/workloads/sglang/sglang_comparison_report.pysrc/cloudai/workloads/vllm/__init__.pysrc/cloudai/workloads/vllm/vllm_comparison_report.pytests/test_init.pytests/workloads/common/test_llm_serving.pytests/workloads/common/test_llm_serving_report.py
| def diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]]: | ||
| """Return value differences across comparable TestRun value dictionaries.""" | ||
| all_keys = set().union(*[values.keys() for values in values_by_run]) | ||
|
|
||
| diff = {} | ||
| for key in all_keys: | ||
| all_values = [d[key] for d in dicts] | ||
| if len(set(all_values)) > 1: | ||
| all_values = [values.get(key) for values in values_by_run] | ||
| if len({repr(v) for v in all_values}) > 1: | ||
| diff[key] = all_values |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
samples = [
[1, 1.0],
[{"a": 1, "b": 2}, {"b": 2, "a": 1}],
]
for vals in samples:
repr_diff = len({repr(v) for v in vals}) > 1
eq_diff = any(v != vals[0] for v in vals[1:])
print(f"values={vals} | repr_diff={repr_diff} | eq_diff={eq_diff}")
PYRepository: NVIDIA/cloudai
Length of output: 184
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '150,210p' src/cloudai/report_generator/util.py | nl -ba | sed -n '1,120p'Repository: NVIDIA/cloudai
Length of output: 101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
awk 'NR>=150 && NR<=210 {printf "%d:%s\n", NR, $0}' src/cloudai/report_generator/util.pyRepository: NVIDIA/cloudai
Length of output: 2536
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "diff_comparison_values|diff_test_runs\(" -S src/cloudai/report_generatorRepository: NVIDIA/cloudai
Length of output: 729
🏁 Script executed:
#!/bin/bash
set -euo pipefail
for f in src/cloudai/report_generator/groups.py src/cloudai/report_generator/comparison_report.py; do
echo "---- $f ----"
awk 'NR>=1 && NR<=220 {print FNR ":" $0}' "$f"
doneRepository: NVIDIA/cloudai
Length of output: 12747
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "def comparison_values\(" -S src/cloudai/report_generatorRepository: NVIDIA/cloudai
Length of output: 178
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class TestRun|def model_dump\\(|cmd_args|extra_env_vars" -S src/cloudai | head -n 200Repository: NVIDIA/cloudai
Length of output: 25850
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class TestRun|class TestDefinition|extra_env_vars|bench_cmd_args|model_dump\\(" -S src/cloudai/core src/cloudai | head -n 200Repository: NVIDIA/cloudai
Length of output: 9527
🏁 Script executed:
#!/bin/bash
set -euo pipefail
for f in src/cloudai/_core/test_scenario.py src/cloudai/models/workload.py; do
echo "---- $f (relevant sections) ----"
doneRepository: NVIDIA/cloudai
Length of output: 184
🏁 Script executed:
#!/bin/bash
set -euo pipefail
wc -l src/cloudai/_core/test_scenario.py src/cloudai/models/workload.pyRepository: NVIDIA/cloudai
Length of output: 148
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '60,220p' src/cloudai/_core/test_scenario.py | nl -ba | head -n 200Repository: NVIDIA/cloudai
Length of output: 101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
start=60; end=220
awk -v s=$start -v e=$end 'NR>=s && NR<=e {print NR ":" $0}' src/cloudai/_core/test_scenario.pyRepository: NVIDIA/cloudai
Length of output: 5757
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,180p' src/cloudai/models/workload.py | nl -ba | head -n 220Repository: NVIDIA/cloudai
Length of output: 101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
awk 'NR>=1 && NR<=190 {print NR ":" $0}' src/cloudai/models/workload.pyRepository: NVIDIA/cloudai
Length of output: 6076
Fix false diffs caused by repr-based comparison in diff_comparison_values() (src/cloudai/report_generator/util.py, lines 175-183).
diff_comparison_values() currently decides whether values differ via len({repr(v) for v in all_values}) > 1, which can mark semantically equal values as different (e.g., 1 vs 1.0, or dicts equal by contents but with different insertion order). These diffs then flow into comparison group item names.
💡 Suggested fix
def diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]]:
"""Return value differences across comparable TestRun value dictionaries."""
- all_keys = set().union(*[values.keys() for values in values_by_run])
+ all_keys = set().union(*(values.keys() for values in values_by_run))
diff = {}
for key in all_keys:
all_values = [values.get(key) for values in values_by_run]
- if len({repr(v) for v in all_values}) > 1:
+ if any(v != all_values[0] for v in all_values[1:]):
diff[key] = all_values🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/report_generator/util.py` around lines 175 - 183,
diff_comparison_values currently uses repr(v) to detect differences which
produces false positives (e.g., 1 vs 1.0 or dicts with different insertion
order); replace that with a canonicalization + equality check: add a small
helper (inside diff_comparison_values) named canonicalize(value) that (a) for
Mapping or list/tuple returns json.dumps(value, sort_keys=True, default=str) to
normalize structural content/order, (b) for numeric types normalizes to a single
numeric form (e.g., float(value) or Decimal) so 1 and 1.0 compare equal, and (c)
otherwise returns the value itself (or str(value) as a last resort); then
compute canonical_forms = [canonicalize(v) for v in all_values] and use
len(set(canonical_forms)) > 1 to decide diffs and assign diff[key] = all_values
as before (refer to diff_comparison_values, values_by_run, all_values).
| ) -> list[rich.table.Table]: | ||
| tables: list[rich.table.Table] = [] | ||
| for group in cmp_groups: | ||
| extracted_dfs = [self.extract_data_as_df(item.tr) for item in group.items] |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider caching or memoizing extract_data_as_df results.
Both create_tables (line 176) and create_charts (line 268) call extract_data_as_df for the same test runs. If the parent class calls both methods during report generation, benchmark results are parsed twice per test run. Since benchmark result files are immutable, consider caching the extraction or computing it once and passing to both methods.
This is a minor optimization and can be deferred if parsing overhead is negligible.
Also applies to: 268-268
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/common/llm_serving_report.py` at line 176, The calls to
extract_data_as_df are duplicated in create_tables and create_charts causing
repeated parsing; modify the class to compute and cache extracted DataFrames
once and reuse them: e.g., add a lightweight cache attribute (like
self._df_cache) keyed by the test run object or an immutable id (referencing
extract_data_as_df, create_tables, create_charts, and group.items/item.tr),
populate it the first time you call extract_data_as_df for a given tr, and have
both create_tables and create_charts read from that cache (or accept an injected
precomputed list of DataFrames) instead of calling extract_data_as_df again.
| from cloudai.workloads.vllm import VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, VllmTestDefinition | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify package self-import coupling and potential cycle edges.
rg -n -C2 'from cloudai\.workloads\.vllm import|from \.vllm_comparison_report import' src/cloudai/workloads/vllm/__init__.py src/cloudai/workloads/vllm/vllm_comparison_report.pyRepository: NVIDIA/cloudai
Length of output: 889
Avoid package-root self-import in vLLM comparison module.
vllm_comparison_report.py imports VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from cloudai.workloads.vllm, while src/cloudai/workloads/vllm/__init__.py imports VLLMComparisonReport from .vllm_comparison_report, creating fragile circular initialization-order coupling. Import these symbols directly from their defining submodules (use relative/submodule imports) instead of the package root.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/vllm/vllm_comparison_report.py` around lines 24 - 25,
The current import in vllm_comparison_report.py pulls VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition from the package root
(cloudai.workloads.vllm) which creates a circular init dependency; change those
to direct imports from the submodules that actually define them (use
relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.
| def test_llm_comparison_report_generates_html(slurm_system: cloudai.systems.slurm.SlurmSystem) -> None: | ||
| tr1 = cloudai.core.TestRun( | ||
| name="vllm-8", | ||
| test=VllmTestDefinition( | ||
| name="vllm", | ||
| description="vLLM benchmark", | ||
| test_template_name="vllm", | ||
| cmd_args=VllmCmdArgs(docker_image_url="nvcr.io/nvidia/vllm:latest", model="Qwen/Qwen3-0.6B"), | ||
| bench_cmd_args=VllmBenchCmdArgs(max_concurrency=8), | ||
| semantic_eval_cmd_args=VllmSemanticEvalCmdArgs(), | ||
| ), | ||
| num_nodes=1, | ||
| nodes=[], | ||
| ) | ||
| tr2 = cloudai.core.TestRun( | ||
| name="vllm-16", | ||
| test=VllmTestDefinition( | ||
| name="vllm", | ||
| description="vLLM benchmark", | ||
| test_template_name="vllm", | ||
| cmd_args=VllmCmdArgs(docker_image_url="nvcr.io/nvidia/vllm:latest", model="Qwen/Qwen3-0.6B"), | ||
| bench_cmd_args=VllmBenchCmdArgs(max_concurrency=16), | ||
| semantic_eval_cmd_args=VllmSemanticEvalCmdArgs(), | ||
| ), | ||
| num_nodes=1, | ||
| nodes=[], | ||
| ) | ||
| _write_result( | ||
| slurm_system.output_path / tr1.name / "0", | ||
| VLLM_BENCH_JSON_FILE, | ||
| json.dumps( | ||
| { | ||
| "num_prompts": 30, | ||
| "completed": 27, | ||
| "mean_ttft_ms": 100.0, | ||
| "median_ttft_ms": 90.0, | ||
| "p99_ttft_ms": 150.0, | ||
| "mean_tpot_ms": 10.0, | ||
| "median_tpot_ms": 9.0, | ||
| "p99_tpot_ms": 15.0, | ||
| "output_throughput": 1200.0, | ||
| "max_concurrency": 8, | ||
| } | ||
| ), | ||
| ) | ||
| _write_result(slurm_system.output_path / tr1.name / "0", VLLM_GSM8K_JSON_FILE, '{"accuracy": 0.81}') | ||
| _write_result( | ||
| slurm_system.output_path / tr2.name / "0", | ||
| VLLM_BENCH_JSON_FILE, | ||
| json.dumps( | ||
| { | ||
| "num_prompts": 30, | ||
| "completed": 30, | ||
| "mean_ttft_ms": 80.0, | ||
| "median_ttft_ms": 70.0, | ||
| "p99_ttft_ms": 130.0, | ||
| "mean_tpot_ms": 8.0, | ||
| "median_tpot_ms": 7.0, | ||
| "p99_tpot_ms": 13.0, | ||
| "output_throughput": 1800.0, | ||
| "max_concurrency": 16, | ||
| } | ||
| ), | ||
| ) | ||
| _write_result(slurm_system.output_path / tr2.name / "0", VLLM_GSM8K_JSON_FILE, '{"accuracy": 0.9}') | ||
|
|
||
| report = VLLMComparisonReport( | ||
| slurm_system, | ||
| cloudai.core.TestScenario(name="vllm-comparison", test_runs=[tr1, tr2]), | ||
| slurm_system.output_path, | ||
| cloudai.report_generator.comparison_report.ComparisonReportConfig(enable=True, group_by=[]), | ||
| ) | ||
|
|
||
| report.load_test_runs() | ||
| assert len(report.trs) == 2 | ||
| tables = report.create_tables(report.group_test_runs()) | ||
| latency_table = tables[0] | ||
| success_table = tables[1] | ||
| throughput_table = tables[2] | ||
| quality_table = tables[3] | ||
|
|
||
| assert "bench_cmd_args.max_concurrency=8" in str(latency_table.columns[1].header) | ||
| assert "bench_cmd_args.max_concurrency=16" in str(latency_table.columns[2].header) | ||
| assert list(latency_table.columns[0].cells)[:3] == ["Mean TTFT (ms)", "Median TTFT (ms)", "P99 TTFT (ms)"] | ||
| assert list(success_table.columns[0].cells) == ["Successful Prompts", "Successful Prompts (%)"] | ||
| assert list(success_table.columns[3].cells) == [ | ||
| cloudai.report_generator.comparison_report.ComparisonReport._format_diff_cell(27.0, 30.0), | ||
| cloudai.report_generator.comparison_report.ComparisonReport._format_diff_cell(90.0, 100.0), | ||
| ] | ||
| throughput_row = list(throughput_table.columns[0].cells).index("TPS/GPU") | ||
| assert list(throughput_table.columns[1].cells)[throughput_row] == "150.0" | ||
| assert list(quality_table.columns[0].cells) == ["Accuracy"] | ||
| assert list(quality_table.columns[1].cells) == ["0.81"] | ||
| assert list(quality_table.columns[2].cells) == ["0.9"] | ||
|
|
||
| report.generate() | ||
|
|
||
| assert (slurm_system.output_path / "vllm_comparison.html").exists() |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add a symmetric integration test for SGLangComparisonReport.
This test fully validates the vLLM path, but the new SGLang comparison implementation is only covered indirectly via registry checks. Please add a parallel end-to-end SGLang report generation test to protect the new workload-specific parsing and output path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/workloads/common/test_llm_serving_report.py` around lines 40 - 137, Add
a second test mirroring test_llm_comparison_report_generates_html that exercises
the SGLang path end-to-end: create two cloudai.core.TestRun instances for SGLang
(use SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.
Summary
Test Plan
Additional Notes