Skip to content

vLLM/SGLang: comparison report#904

Open
podkidyshev wants to merge 8 commits into
mainfrom
ipod/llm-comparison
Open

vLLM/SGLang: comparison report#904
podkidyshev wants to merge 8 commits into
mainfrom
ipod/llm-comparison

Conversation

@podkidyshev
Copy link
Copy Markdown
Contributor

Summary

  • Enable comparison reporting for vLLM and SGLang

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this May 27, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a shared LLM serving benchmark comparison report framework. It refactors ComparisonReport to handle internal TestRun grouping, extracts diff utilities, creates an abstract LLMServingComparisonReport base class for standardized metric extraction and visualization, implements vLLM and SGLang variants, registers them, and validates via integration tests.

Changes

LLM Serving Comparison Reports

Layer / File(s) Summary
Diff utility refactoring
src/cloudai/report_generator/util.py
New helpers default_test_run_comparison_values and diff_comparison_values extract and compare per-run values, replacing raw set-based diffing with repr-based comparison; diff_test_runs return type widened to dict[str, list[object]].
ComparisonReport grouping refactoring
src/cloudai/report_generator/comparison_report.py, src/cloudai/report_generator/groups.py
ComparisonReport now groups TestRuns internally via group_test_runs(), get_group_value(), group_name(), create_group() using the new diff utilities instead of TestRunsGrouper; generate() and get_bokeh_html() call the internal method to produce cmp_groups.
LLMServingComparisonReport abstract framework
src/cloudai/workloads/common/llm_serving_report.py
Abstract base class defines workload-specific contract (can_handle, parse_results, benchmark_cmd_args, parse_accuracy), implements metric extraction (extract_data_as_df for latency/success/throughput/quality), and provides Rich table (create_tables) and Bokeh chart (_create_metric_bar_chart) rendering.
vLLM comparison report implementation
src/cloudai/workloads/vllm/vllm_comparison_report.py, src/cloudai/workloads/vllm/__init__.py
VLLMComparisonReport concrete class routes vLLM test detection, accesses benchmark command args, parses results via VLLMBenchReportGenerationStrategy, computes semantic accuracy, outputs vllm_comparison.html; exported via package __all__.
SGLang comparison report implementation
src/cloudai/workloads/sglang/sglang_comparison_report.py, src/cloudai/workloads/sglang/__init__.py
SGLangComparisonReport concrete class routes SGLang test detection, accesses benchmark command args, parses results via SGLangBenchReportGenerationStrategy, computes semantic accuracy, outputs sglang_comparison.html; exported via package __all__.
Registry integration
src/cloudai/registration.py
Register "vllm_comparison" and "sglang_comparison" scenario reports with their corresponding classes and enabled ComparisonReportConfig.
Test updates and integration test
tests/test_init.py, tests/workloads/common/test_llm_serving_report.py
Registry tests verify the new comparison reports are registered; integration test synthesizes benchmark/accuracy JSON, loads TestRuns into VLLMComparisonReport, generates diff-formatted comparison tables and Bokeh charts, and asserts vllm_comparison.html output.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/cloudai#890: The new vLLM and SGLang comparison reports consume semantic-eval accuracy metrics and parsing hooks (parse_semantic_accuracy) that were introduced in that PR via LLMServingReportGenerationStrategy and workload-specific parsing/command-generation changes.

Suggested labels

feature

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

🐰 A Framework Unfolds!
TestRuns now gather in groups with grace,
LLM benchmarks side by side show pace,
vLLM, SGLang—each their own domain,
Rich tables, Bokeh charts explain the gain! ✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding comparison report functionality for vLLM and SGLang workloads.
Description check ✅ Passed The description is directly related to the changeset, explaining that it enables comparison reporting for vLLM and SGLang with a clear test plan.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/llm-comparison

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev added the enhancement New feature or request label May 27, 2026
@podkidyshev podkidyshev force-pushed the ipod/llm-comparison branch from bd9012d to 2596726 Compare May 28, 2026 15:58
@podkidyshev podkidyshev marked this pull request as ready for review May 28, 2026 16:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/report_generator/util.py`:
- Around line 175-183: diff_comparison_values currently uses repr(v) to detect
differences which produces false positives (e.g., 1 vs 1.0 or dicts with
different insertion order); replace that with a canonicalization + equality
check: add a small helper (inside diff_comparison_values) named
canonicalize(value) that (a) for Mapping or list/tuple returns json.dumps(value,
sort_keys=True, default=str) to normalize structural content/order, (b) for
numeric types normalizes to a single numeric form (e.g., float(value) or
Decimal) so 1 and 1.0 compare equal, and (c) otherwise returns the value itself
(or str(value) as a last resort); then compute canonical_forms =
[canonicalize(v) for v in all_values] and use len(set(canonical_forms)) > 1 to
decide diffs and assign diff[key] = all_values as before (refer to
diff_comparison_values, values_by_run, all_values).

In `@src/cloudai/workloads/common/llm_serving_report.py`:
- Line 176: The calls to extract_data_as_df are duplicated in create_tables and
create_charts causing repeated parsing; modify the class to compute and cache
extracted DataFrames once and reuse them: e.g., add a lightweight cache
attribute (like self._df_cache) keyed by the test run object or an immutable id
(referencing extract_data_as_df, create_tables, create_charts, and
group.items/item.tr), populate it the first time you call extract_data_as_df for
a given tr, and have both create_tables and create_charts read from that cache
(or accept an injected precomputed list of DataFrames) instead of calling
extract_data_as_df again.

In `@src/cloudai/workloads/vllm/vllm_comparison_report.py`:
- Around line 24-25: The current import in vllm_comparison_report.py pulls
VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from
the package root (cloudai.workloads.vllm) which creates a circular init
dependency; change those to direct imports from the submodules that actually
define them (use relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.

In `@tests/workloads/common/test_llm_serving_report.py`:
- Around line 40-137: Add a second test mirroring
test_llm_comparison_report_generates_html that exercises the SGLang path
end-to-end: create two cloudai.core.TestRun instances for SGLang (use
SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6a508bb1-229e-4702-b1ad-bdb49888fef9

📥 Commits

Reviewing files that changed from the base of the PR and between 0738c4b and 727552b.

📒 Files selected for processing (12)
  • src/cloudai/registration.py
  • src/cloudai/report_generator/comparison_report.py
  • src/cloudai/report_generator/groups.py
  • src/cloudai/report_generator/util.py
  • src/cloudai/workloads/common/llm_serving_report.py
  • src/cloudai/workloads/sglang/__init__.py
  • src/cloudai/workloads/sglang/sglang_comparison_report.py
  • src/cloudai/workloads/vllm/__init__.py
  • src/cloudai/workloads/vllm/vllm_comparison_report.py
  • tests/test_init.py
  • tests/workloads/common/test_llm_serving.py
  • tests/workloads/common/test_llm_serving_report.py

Comment on lines +175 to 183
def diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]]:
"""Return value differences across comparable TestRun value dictionaries."""
all_keys = set().union(*[values.keys() for values in values_by_run])

diff = {}
for key in all_keys:
all_values = [d[key] for d in dicts]
if len(set(all_values)) > 1:
all_values = [values.get(key) for values in values_by_run]
if len({repr(v) for v in all_values}) > 1:
diff[key] = all_values
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
samples = [
    [1, 1.0],
    [{"a": 1, "b": 2}, {"b": 2, "a": 1}],
]

for vals in samples:
    repr_diff = len({repr(v) for v in vals}) > 1
    eq_diff = any(v != vals[0] for v in vals[1:])
    print(f"values={vals} | repr_diff={repr_diff} | eq_diff={eq_diff}")
PY

Repository: NVIDIA/cloudai

Length of output: 184


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '150,210p' src/cloudai/report_generator/util.py | nl -ba | sed -n '1,120p'

Repository: NVIDIA/cloudai

Length of output: 101


🏁 Script executed:

#!/bin/bash
set -euo pipefail
awk 'NR>=150 && NR<=210 {printf "%d:%s\n", NR, $0}' src/cloudai/report_generator/util.py

Repository: NVIDIA/cloudai

Length of output: 2536


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "diff_comparison_values|diff_test_runs\(" -S src/cloudai/report_generator

Repository: NVIDIA/cloudai

Length of output: 729


🏁 Script executed:

#!/bin/bash
set -euo pipefail
for f in src/cloudai/report_generator/groups.py src/cloudai/report_generator/comparison_report.py; do
  echo "---- $f ----"
  awk 'NR>=1 && NR<=220 {print FNR ":" $0}' "$f"
done

Repository: NVIDIA/cloudai

Length of output: 12747


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "def comparison_values\(" -S src/cloudai/report_generator

Repository: NVIDIA/cloudai

Length of output: 178


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "class TestRun|def model_dump\\(|cmd_args|extra_env_vars" -S src/cloudai | head -n 200

Repository: NVIDIA/cloudai

Length of output: 25850


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "class TestRun|class TestDefinition|extra_env_vars|bench_cmd_args|model_dump\\(" -S src/cloudai/core src/cloudai | head -n 200

Repository: NVIDIA/cloudai

Length of output: 9527


🏁 Script executed:

#!/bin/bash
set -euo pipefail
for f in src/cloudai/_core/test_scenario.py src/cloudai/models/workload.py; do
  echo "---- $f (relevant sections) ----"
done

Repository: NVIDIA/cloudai

Length of output: 184


🏁 Script executed:

#!/bin/bash
set -euo pipefail
wc -l src/cloudai/_core/test_scenario.py src/cloudai/models/workload.py

Repository: NVIDIA/cloudai

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '60,220p' src/cloudai/_core/test_scenario.py | nl -ba | head -n 200

Repository: NVIDIA/cloudai

Length of output: 101


🏁 Script executed:

#!/bin/bash
set -euo pipefail
start=60; end=220
awk -v s=$start -v e=$end 'NR>=s && NR<=e {print NR ":" $0}' src/cloudai/_core/test_scenario.py

Repository: NVIDIA/cloudai

Length of output: 5757


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,180p' src/cloudai/models/workload.py | nl -ba | head -n 220

Repository: NVIDIA/cloudai

Length of output: 101


🏁 Script executed:

#!/bin/bash
set -euo pipefail
awk 'NR>=1 && NR<=190 {print NR ":" $0}' src/cloudai/models/workload.py

Repository: NVIDIA/cloudai

Length of output: 6076


Fix false diffs caused by repr-based comparison in diff_comparison_values() (src/cloudai/report_generator/util.py, lines 175-183).

diff_comparison_values() currently decides whether values differ via len({repr(v) for v in all_values}) > 1, which can mark semantically equal values as different (e.g., 1 vs 1.0, or dicts equal by contents but with different insertion order). These diffs then flow into comparison group item names.

💡 Suggested fix
 def diff_comparison_values(values_by_run: list[Mapping[str, object]]) -> dict[str, list[object]]:
     """Return value differences across comparable TestRun value dictionaries."""
-    all_keys = set().union(*[values.keys() for values in values_by_run])
+    all_keys = set().union(*(values.keys() for values in values_by_run))

     diff = {}
     for key in all_keys:
         all_values = [values.get(key) for values in values_by_run]
-        if len({repr(v) for v in all_values}) > 1:
+        if any(v != all_values[0] for v in all_values[1:]):
             diff[key] = all_values
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/report_generator/util.py` around lines 175 - 183,
diff_comparison_values currently uses repr(v) to detect differences which
produces false positives (e.g., 1 vs 1.0 or dicts with different insertion
order); replace that with a canonicalization + equality check: add a small
helper (inside diff_comparison_values) named canonicalize(value) that (a) for
Mapping or list/tuple returns json.dumps(value, sort_keys=True, default=str) to
normalize structural content/order, (b) for numeric types normalizes to a single
numeric form (e.g., float(value) or Decimal) so 1 and 1.0 compare equal, and (c)
otherwise returns the value itself (or str(value) as a last resort); then
compute canonical_forms = [canonicalize(v) for v in all_values] and use
len(set(canonical_forms)) > 1 to decide diffs and assign diff[key] = all_values
as before (refer to diff_comparison_values, values_by_run, all_values).

) -> list[rich.table.Table]:
tables: list[rich.table.Table] = []
for group in cmp_groups:
extracted_dfs = [self.extract_data_as_df(item.tr) for item in group.items]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider caching or memoizing extract_data_as_df results.

Both create_tables (line 176) and create_charts (line 268) call extract_data_as_df for the same test runs. If the parent class calls both methods during report generation, benchmark results are parsed twice per test run. Since benchmark result files are immutable, consider caching the extraction or computing it once and passing to both methods.

This is a minor optimization and can be deferred if parsing overhead is negligible.

Also applies to: 268-268

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/common/llm_serving_report.py` at line 176, The calls to
extract_data_as_df are duplicated in create_tables and create_charts causing
repeated parsing; modify the class to compute and cache extracted DataFrames
once and reuse them: e.g., add a lightweight cache attribute (like
self._df_cache) keyed by the test run object or an immutable id (referencing
extract_data_as_df, create_tables, create_charts, and group.items/item.tr),
populate it the first time you call extract_data_as_df for a given tr, and have
both create_tables and create_charts read from that cache (or accept an injected
precomputed list of DataFrames) instead of calling extract_data_as_df again.

Comment on lines +24 to +25
from cloudai.workloads.vllm import VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, VllmTestDefinition

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package self-import coupling and potential cycle edges.
rg -n -C2 'from cloudai\.workloads\.vllm import|from \.vllm_comparison_report import' src/cloudai/workloads/vllm/__init__.py src/cloudai/workloads/vllm/vllm_comparison_report.py

Repository: NVIDIA/cloudai

Length of output: 889


Avoid package-root self-import in vLLM comparison module.

vllm_comparison_report.py imports VllmBenchCmdArgs, VLLMBenchReportGenerationStrategy, and VllmTestDefinition from cloudai.workloads.vllm, while src/cloudai/workloads/vllm/__init__.py imports VLLMComparisonReport from .vllm_comparison_report, creating fragile circular initialization-order coupling. Import these symbols directly from their defining submodules (use relative/submodule imports) instead of the package root.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/vllm/vllm_comparison_report.py` around lines 24 - 25,
The current import in vllm_comparison_report.py pulls VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition from the package root
(cloudai.workloads.vllm) which creates a circular init dependency; change those
to direct imports from the submodules that actually define them (use
relative/submodule imports) so VllmBenchCmdArgs,
VLLMBenchReportGenerationStrategy, and VllmTestDefinition are imported from
their defining modules instead of from cloudai.workloads.vllm.

Comment on lines +40 to +137
def test_llm_comparison_report_generates_html(slurm_system: cloudai.systems.slurm.SlurmSystem) -> None:
tr1 = cloudai.core.TestRun(
name="vllm-8",
test=VllmTestDefinition(
name="vllm",
description="vLLM benchmark",
test_template_name="vllm",
cmd_args=VllmCmdArgs(docker_image_url="nvcr.io/nvidia/vllm:latest", model="Qwen/Qwen3-0.6B"),
bench_cmd_args=VllmBenchCmdArgs(max_concurrency=8),
semantic_eval_cmd_args=VllmSemanticEvalCmdArgs(),
),
num_nodes=1,
nodes=[],
)
tr2 = cloudai.core.TestRun(
name="vllm-16",
test=VllmTestDefinition(
name="vllm",
description="vLLM benchmark",
test_template_name="vllm",
cmd_args=VllmCmdArgs(docker_image_url="nvcr.io/nvidia/vllm:latest", model="Qwen/Qwen3-0.6B"),
bench_cmd_args=VllmBenchCmdArgs(max_concurrency=16),
semantic_eval_cmd_args=VllmSemanticEvalCmdArgs(),
),
num_nodes=1,
nodes=[],
)
_write_result(
slurm_system.output_path / tr1.name / "0",
VLLM_BENCH_JSON_FILE,
json.dumps(
{
"num_prompts": 30,
"completed": 27,
"mean_ttft_ms": 100.0,
"median_ttft_ms": 90.0,
"p99_ttft_ms": 150.0,
"mean_tpot_ms": 10.0,
"median_tpot_ms": 9.0,
"p99_tpot_ms": 15.0,
"output_throughput": 1200.0,
"max_concurrency": 8,
}
),
)
_write_result(slurm_system.output_path / tr1.name / "0", VLLM_GSM8K_JSON_FILE, '{"accuracy": 0.81}')
_write_result(
slurm_system.output_path / tr2.name / "0",
VLLM_BENCH_JSON_FILE,
json.dumps(
{
"num_prompts": 30,
"completed": 30,
"mean_ttft_ms": 80.0,
"median_ttft_ms": 70.0,
"p99_ttft_ms": 130.0,
"mean_tpot_ms": 8.0,
"median_tpot_ms": 7.0,
"p99_tpot_ms": 13.0,
"output_throughput": 1800.0,
"max_concurrency": 16,
}
),
)
_write_result(slurm_system.output_path / tr2.name / "0", VLLM_GSM8K_JSON_FILE, '{"accuracy": 0.9}')

report = VLLMComparisonReport(
slurm_system,
cloudai.core.TestScenario(name="vllm-comparison", test_runs=[tr1, tr2]),
slurm_system.output_path,
cloudai.report_generator.comparison_report.ComparisonReportConfig(enable=True, group_by=[]),
)

report.load_test_runs()
assert len(report.trs) == 2
tables = report.create_tables(report.group_test_runs())
latency_table = tables[0]
success_table = tables[1]
throughput_table = tables[2]
quality_table = tables[3]

assert "bench_cmd_args.max_concurrency=8" in str(latency_table.columns[1].header)
assert "bench_cmd_args.max_concurrency=16" in str(latency_table.columns[2].header)
assert list(latency_table.columns[0].cells)[:3] == ["Mean TTFT (ms)", "Median TTFT (ms)", "P99 TTFT (ms)"]
assert list(success_table.columns[0].cells) == ["Successful Prompts", "Successful Prompts (%)"]
assert list(success_table.columns[3].cells) == [
cloudai.report_generator.comparison_report.ComparisonReport._format_diff_cell(27.0, 30.0),
cloudai.report_generator.comparison_report.ComparisonReport._format_diff_cell(90.0, 100.0),
]
throughput_row = list(throughput_table.columns[0].cells).index("TPS/GPU")
assert list(throughput_table.columns[1].cells)[throughput_row] == "150.0"
assert list(quality_table.columns[0].cells) == ["Accuracy"]
assert list(quality_table.columns[1].cells) == ["0.81"]
assert list(quality_table.columns[2].cells) == ["0.9"]

report.generate()

assert (slurm_system.output_path / "vllm_comparison.html").exists()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a symmetric integration test for SGLangComparisonReport.

This test fully validates the vLLM path, but the new SGLang comparison implementation is only covered indirectly via registry checks. Please add a parallel end-to-end SGLang report generation test to protect the new workload-specific parsing and output path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/workloads/common/test_llm_serving_report.py` around lines 40 - 137, Add
a second test mirroring test_llm_comparison_report_generates_html that exercises
the SGLang path end-to-end: create two cloudai.core.TestRun instances for SGLang
(use SGLangTestDefinition, SGLangCmdArgs, SGLangBenchCmdArgs,
SGLangSemanticEvalCmdArgs), write corresponding bench and quality JSON results
to slurm_system.output_path under each test run (use the SGLANG_BENCH_JSON_FILE
and SGLANG_GSM8K_JSON_FILE constants or their equivalents), instantiate
SGLangComparisonReport with a TestScenario containing those runs, call
load_test_runs(), create_tables(group_test_runs()) and assert key table cells
(latency/throughput/quality/success) similar to the vLLM test, call
report.generate(), and finally assert that (slurm_system.output_path /
"sglang_comparison.html").exists() to ensure HTML was produced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant