
[LEADS-362] Added langfuse compatibility with the lightspeed-eval (#223)

Open
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main from bsatapat-jpg:dev

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Apr 24, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Codex-5.0
  • Generated by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Optional Langfuse integration to export evaluation traces with per‑metric scores (enable via storage config and env vars); programmatic API now accepts evaluation_data_path and an on_complete callback; run metadata (EvaluationRunContext) exposed.
  • Documentation

    • README section added with Langfuse installation, config, and usage guidance.
  • Chores

    • Added langfuse optional dependency and updated pinned dependencies.
  • Tests

    • New tests covering Langfuse reporting, storage wiring, and on_complete callback behavior.

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds optional Langfuse trace reporting: new EvaluationRunContext model, Langfuse reporter utilities, API/pipeline/runner plumbing to accept and invoke an on_complete callback (CLI/env opt-in), README and dependency updates, and unit tests covering integration and callback behavior.

Changes

Langfuse integration

  • Docs & packaging (README.md, pyproject.toml, requirements-all-extras.txt, requirements*.txt, config/system.yaml): Adds README docs for optional Langfuse usage, declares the langfuse optional extra and dev dependency, pins langfuse/wrapt in exported requirements, updates packaging pins, and adds a commented example storage entry (- type: "langfuse") in config/system.yaml.
  • Core model and exports (src/lightspeed_evaluation/core/models/evaluation_run_context.py, src/lightspeed_evaluation/core/models/__init__.py, src/lightspeed_evaluation/__init__.py): Adds an immutable EvaluationRunContext dataclass and re-exports it from core models and the package top-level lazy import registry.
  • Storage schema and parsing (src/lightspeed_evaluation/core/storage/config.py, src/lightspeed_evaluation/core/storage/factory.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/core/storage/__init__.py): Adds a LangfuseBackendConfig model with host/public_key/secret_key, includes it in the StorageBackendConfig union, adds get_langfuse_storage_config(), updates the loader to parse type: langfuse, and re-exports config/factory helpers.
  • Langfuse reporter & integrations package (src/lightspeed_evaluation/integrations/langfuse_reporter.py, src/lightspeed_evaluation/integrations/__init__.py): New optional integrations package and langfuse_reporter with LangfuseClientConfig, push_evaluation_results_to_langfuse, build_langfuse_on_complete_callback, and build_langfuse_on_complete_from_storage_configs. The reporter is defensive (no-op if the SDK is missing or init fails), builds a single trace per run, emits per-metric scores (skipping None), and flushes.
  • Public API wrappers (src/lightspeed_evaluation/api.py): Adds keyword-only evaluation_data_path and on_complete parameters to the evaluation wrapper functions; determines the effective on_complete (explicit or built from storage) and forwards it into the pipeline; updates docstrings and type imports.
  • Pipeline & runner wiring (src/lightspeed_evaluation/pipeline/evaluation/pipeline.py, src/lightspeed_evaluation/runner/evaluation.py): EvaluationPipeline.run_evaluation accepts an on_complete callback, constructs an EvaluationRunContext, and invokes the callback (exceptions are logged and swallowed). The runner builds a storage-derived on_complete and passes evaluation_data_path/on_complete into evaluate().
  • Tests (tests/unit/integrations/test_langfuse_reporter.py, tests/unit/pipeline/evaluation/test_pipeline.py, tests/unit/runner/test_evaluation.py, tests/unit/core/*, tests/unit/test_api.py): Adds new tests for the Langfuse reporter (mocked client), updates tests to assert evaluation_data_path/on_complete propagation and callback exception isolation, and extends storage/loader/factory tests for langfuse handling.
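The EvaluationRunContext model itself is not shown in this summary; a minimal sketch of what an immutable run-context dataclass of this shape might look like, with field names assumed from the walkthrough (run_name, original_data_path, metadata), not taken from the actual diff:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class EvaluationRunContext:
    """Sketch of an immutable run-context record (field names assumed)."""

    run_name: str
    original_data_path: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)


# frozen=True makes instances read-only after construction
ctx = EvaluationRunContext(run_name="eval-demo", metadata={"env": "dev"})
print(ctx)
```

A frozen dataclass raises FrozenInstanceError on attribute assignment, which matches the "immutable" description in the walkthrough.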

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as evaluate()
    participant Pipeline as EvaluationPipeline.run_evaluation
    participant Callback as on_complete
    participant Langfuse as Langfuse Client

    Client->>API: evaluate(..., evaluation_data_path, on_complete)
    API->>Pipeline: run_evaluation(..., original_data_path, on_complete)
    Pipeline->>Pipeline: perform evaluation -> results
    Pipeline->>Callback: on_complete(results, EvaluationRunContext)
    Callback->>Langfuse: init client & create trace (run_name)
    Callback->>Langfuse: emit score per metric
    Callback->>Langfuse: flush()
    Pipeline-->>API: return results
    API-->>Client: return results
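The key property of the callback hand-off in the diagram, per the walkthrough, is that a failing on_complete must not break the evaluation result. A toy illustration of that contract with all names hypothetical (the real pipeline and context types are stubbed out):

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def run_evaluation_sketch(
    results: list[dict[str, Any]],
    on_complete: Optional[Callable[[list[dict[str, Any]], dict[str, Any]], None]] = None,
) -> list[dict[str, Any]]:
    """Toy stand-in for the pipeline's callback contract: errors are isolated."""
    context = {"run_name": "demo-run", "result_count": len(results)}
    if on_complete is not None:
        try:
            on_complete(results, context)
        except Exception:  # callback errors are logged and swallowed
            logger.exception("on_complete callback failed; results still returned")
    return results


def broken_callback(results: list[dict[str, Any]], context: dict[str, Any]) -> None:
    raise RuntimeError("Langfuse unreachable")


# The raising callback does not prevent results from being returned
print(run_evaluation_sketch([{"metric": "faithfulness", "score": 0.9}], broken_callback))
```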

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Title check ✅ Passed: The title clearly identifies the main change: adding Langfuse compatibility to the lightspeed-evaluation project. It is concise, specific, and directly reflects the primary objective of the changeset across all modified files.
  • Docstring Coverage ✅ Passed: Docstring coverage is 88.41%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
src/lightspeed_evaluation/integrations/langfuse_reporter.py (2)

84-90: User context.metadata can silently overwrite reserved trace keys.

trace_meta.update(context.metadata or {}) runs after run_name, result_count, and original_data_path are set, so any of those keys passed by a caller in metadata will clobber the values the reporter is trying to guarantee. Inverting the merge (user first, reserved last) preserves the invariant.

Proposed fix
-    trace_meta: dict[str, Any] = {
-        "run_name": context.run_name,
-        "result_count": len(results),
-    }
-    if context.original_data_path is not None:
-        trace_meta["original_data_path"] = context.original_data_path
-    trace_meta.update(context.metadata or {})
+    trace_meta: dict[str, Any] = dict(context.metadata or {})
+    trace_meta["run_name"] = context.run_name
+    trace_meta["result_count"] = len(results)
+    if context.original_data_path is not None:
+        trace_meta["original_data_path"] = context.original_data_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 84
- 90, The merge order currently allows user-supplied context.metadata to
overwrite reserved keys because trace_meta is populated first then
trace_meta.update(context.metadata or {}); reverse the merge so you start with
user metadata (e.g., initialize trace_meta = dict(context.metadata or {})), then
set/override the reserved keys ("run_name", "result_count", and optionally
"original_data_path") afterwards so the reporter's guaranteed values (in the
trace_meta dict used by the Langfuse reporter) cannot be clobbered by
context.metadata; ensure you still handle context.original_data_path None-check
and preserve existing typing for trace_meta.

34-47: Docstring drift: comment format no longer includes explicit pass/fail wording.

The Args docstring says the comment contains "pass/fail status and a truncated reason", but _format_comment emits result=<r.result> (where r.result is one of PASS/FAIL/ERROR/SKIPPED). That's functionally equivalent but not what the docstring promises — a reader parsing Langfuse comments for a literal "pass"/"fail" substring will be misled for ERROR/SKIPPED rows. Suggest clarifying: "a result=<STATUS> field and optional truncated reason".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 34
- 47, Update the docstring to reflect the actual comment format emitted by
_format_comment: change the phrase that says the comment contains "pass/fail
status and a truncated reason" to indicate it emits a "result=<STATUS>" field
(where STATUS is PASS/FAIL/ERROR/SKIPPED) and an optional truncated reason;
mention that PASS/FAIL correspond to pass/fail while ERROR/SKIPPED are reported
as those literal statuses so downstream parsers should look for result=<STATUS>
rather than the words "pass"/"fail".
tests/unit/test_api.py (1)

47-49: Assertions correctly distinguish the wrapper boundary (evaluation_data_path) from the pipeline boundary (original_data_path).

The naming difference is subtle but matches src/lightspeed_evaluation/api.py where evaluate(...) forwards evaluation_data_path=... into pipeline.run_evaluation(original_data_path=...). Good coverage of the default (None) path.

One gap: there is no test covering a non-None on_complete being forwarded through evaluate/evaluate_conversation/evaluate_turn — worth adding a single assertion (e.g. on_complete=callback_sentinel) to lock in the pass-through contract the Langfuse integration relies on.

Also applies to: 116-122, 135-141, 293-300
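The pass-through assertion this comment asks for could look roughly like the following sketch, with the pipeline mocked and the wrapper reduced to a stand-in (the real evaluate signature has more parameters; everything here beyond the names quoted in the review is assumed):

```python
from unittest.mock import MagicMock


def evaluate(data, *, evaluation_data_path=None, on_complete=None, pipeline=None):
    """Stand-in for the API wrapper: forwards into the pipeline boundary,
    renaming evaluation_data_path to original_data_path as the review describes."""
    return pipeline.run_evaluation(
        data, original_data_path=evaluation_data_path, on_complete=on_complete
    )


mock_pipeline = MagicMock()
callback_sentinel = object()

evaluate(
    [],
    evaluation_data_path="data.yaml",
    on_complete=callback_sentinel,
    pipeline=mock_pipeline,
)

# Locks in the pass-through contract: the sentinel reaches the pipeline untouched
mock_pipeline.run_evaluation.assert_called_once_with(
    [], original_data_path="data.yaml", on_complete=callback_sentinel
)
```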

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_api.py` around lines 47 - 49, Add assertions to the existing
unit tests to verify that a non-None on_complete callback is forwarded through
the API wrapper into the pipeline; specifically, in tests that call evaluate,
evaluate_conversation, and evaluate_turn (which currently assert
mock_pipeline.run_evaluation.called with original_data_path=None), add a
sentinel (e.g., callback_sentinel) passed as on_complete to the evaluate* call
and assert mock_pipeline.run_evaluation.assert_called_with(...,
on_complete=callback_sentinel) or the equivalent for run_conversation/run_turn;
update the assertions at the spots mentioned (around lines 47, 116-122, 135-141,
293-300) to include the on_complete=callback_sentinel check so the tests lock in
the pass-through behavior.
tests/unit/runner/test_evaluation.py (1)

52-64: Test updates mirror the runner signature change correctly.

One follow-up worth considering: there is no test in this file that exercises _make_eval_args(langfuse=True) end-to-end to assert that on_complete becomes non-None in the evaluate(...) call. The callback wiring is arguably the most error-prone part of the runner change; a single parametrized case here (mocking build_langfuse_on_complete_callback) would catch regressions cheaply.

Also applies to: 170-176, 231-237

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Add a
parametrized unit test that exercises _make_eval_args(langfuse=True) and
verifies the runner wires the Langfuse callback into evaluate: mock
build_langfuse_on_complete_callback to return a sentinel callback, call
evaluate(...) with args from _make_eval_args(langfuse=True), and assert evaluate
was invoked with on_complete equal to the sentinel (non-None). Locate uses of
_make_eval_args, evaluate, and build_langfuse_on_complete_callback in this test
module to add the new case (or extend the existing parametrized cases around
lines where _make_eval_args is used) so regressions in callback wiring are
caught.
src/lightspeed_evaluation/integrations/__init__.py (1)

3-11: Re-exports look clean.

Note: because this package __init__.py unconditionally imports from langfuse_reporter, importing lightspeed_evaluation.integrations pulls in the reporter module. That's fine today because langfuse_reporter itself only lazy-imports langfuse inside the function body — but if anyone later moves the import langfuse to module scope, from lightspeed_evaluation.integrations import ... would start failing without the optional extra installed. Worth a short comment or a try/except around the re-export to make the "opt-in per extra" promise robust to future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/__init__.py` around lines 3 - 11, The
module-level import of build_langfuse_on_complete_callback and
push_evaluation_results_to_langfuse makes importing
lightspeed_evaluation.integrations risk raising if langfuse becomes a hard
dependency; update src/lightspeed_evaluation/integrations/__init__.py to guard
the re-exports by either wrapping the import of langfuse_reporter in a
try/except ImportError (and only add the two symbols to __all__ when import
succeeds) or add a clear comment explaining the deliberate lazy-import behavior
so future edits preserve the optional-extra contract; reference the imported
symbols build_langfuse_on_complete_callback and
push_evaluation_results_to_langfuse and the module name langfuse_reporter when
making the change.
README.md (1)

494-496: Consider adding a version note for the langfuse extra.

The runtime module docstring pins to langfuse>=2,<3, but that constraint isn't surfaced here. A one-line note (e.g. "Requires Langfuse SDK v2") would prevent users from hitting the runtime failure described in the langfuse_reporter.py comment if they already have a v3 client installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 494 - 496, Add a one-line note in the "Optional:
Langfuse" README section stating the required Langfuse SDK version (e.g.,
"Requires Langfuse SDK v2: install with lightspeed-evaluation[langfuse] and
langfuse>=2,<3") so users don't accidentally use v3; reference the runtime
module pin (langfuse>=2,<3) and the build_langfuse_on_complete_callback symbol
in lightspeed_evaluation.integrations.langfuse_reporter to align the doc with
the actual runtime constraint.
tests/unit/integrations/test_langfuse_reporter.py (1)

51-78: Minor: caplog.set_level("ERROR") is set but never asserted on.

Both test_build_callback_invokes_push and test_push_with_mock_langfuse take the caplog fixture and set a level but never inspect caplog.records / caplog.text. Either drop the fixture to keep the tests focused, or add an assertion that no ERROR-level records were emitted on the success path — that would actually protect against regressions in the callback's "catch and log" branch.
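Outside pytest, the suggested assertion amounts to checking that no ERROR-level record was captured on the success path. A plain-logging sketch of the same check, using a small handler in place of the caplog fixture:

```python
import logging


class RecordCollector(logging.Handler):
    """Minimal handler that records every log record, similar to pytest's caplog."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("langfuse_reporter_demo")
collector = RecordCollector()
logger.addHandler(collector)
logger.setLevel(logging.DEBUG)

# Simulated success path: the reporter logs at INFO, never at ERROR
logger.info("pushed 3 scores to Langfuse")

# The assertion the review suggests: no ERROR-level records on success
assert not any(r.levelno >= logging.ERROR for r in collector.records)
```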

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/integrations/test_langfuse_reporter.py` around lines 51 - 78, The
tests test_build_callback_invokes_push and test_push_with_mock_langfuse set
caplog.set_level("ERROR") but never assert against it; either remove the unused
caplog fixture from those test signatures or add an explicit assertion (e.g.,
assert not any(r.levelname == "ERROR" for r in caplog.records) or assert "ERROR"
not in caplog.text) after calling cb(...) /
push_evaluation_results_to_langfuse(...) to ensure no ERROR-level logs were
emitted; update the function signatures and references accordingly
(test_build_callback_invokes_push, test_push_with_mock_langfuse, caplog).
pyproject.toml (1)

55-66: Heads-up: langfuse<3 pins the legacy v2 SDK.

The Langfuse Python SDK v3 (OTEL-based) has been GA since June 2025 and v4 is the current major on PyPI (langfuse==4.5.0). v2 is documented as the legacy SDK and its public API (Langfuse().trace().score()) is not forward-compatible — that shape is used throughout src/lightspeed_evaluation/integrations/langfuse_reporter.py and the tests. Pinning <3.0.0 is fine for now (and your inline comment signals intent), but track this as tech debt: a future v4 bump will require rewriting the reporter to use propagate_attributes / start_observation / score.create.

No action required in this PR unless you want to target v3/v4 directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 55 - 66, The project currently pins langfuse<3
(legacy v2); if you later want to upgrade to the OTEL-based v3/v4 SDK, create a
tech-debt ticket and update pyproject.toml to a v4 pin (e.g.,
"langfuse>=4.0.0,<5.0.0"), then refactor
src/lightspeed_evaluation/integrations/langfuse_reporter.py to replace the v2
API calls (Langfuse().trace().score()) with the v3/v4 APIs (use
propagate_attributes, start_observation and score.create) and update affected
tests to the new API surface; if you prefer to keep v2 for now, add a clear TODO
comment near the langfuse entry in pyproject.toml referencing the ticket so this
migration is tracked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Line 25: Replace lint suppressions by restructuring and typing: create a small
dataclass (e.g., LangfuseClientConfig) and accept that or **client_kwargs in
push_evaluation_results_to_langfuse to eliminate the too-many-arguments disable;
perform the optional import at module top-level (or use importlib.find_spec at
import time) and set a _HAS_LANGFUSE flag instead of importing inside the
function to remove import-outside-toplevel; add a TYPE_CHECKING-gated import for
langfuse and annotate the local variable as langfuse: "Langfuse" (or import its
types) to remove the # type: ignore comments; and replace the broad except with
the concrete Langfuse exception type (e.g., langfuse.errors.Error) or
requests.RequestException where appropriate to avoid pylint:
disable=broad-exception-caught.
- Around line 102-110: The current loop always sends a numeric 0.0 when
EvaluationResult.r.score is None, conflating missing/ERROR/SKIPPED with real
zeros; update the loop that builds name/_score_name and comment/_format_comment
so that if r.score is None you do not call trace.score(name=..., value=..., ...)
with a 0.0 — instead either skip calling trace.score for that item and record
the status explicitly (r.result) in the comment/metadata, or call trace with an
explicit non-numeric indication supported by Langfuse (e.g., omit value and add
status to comment/metadata) so ERROR and SKIPPED remain distinguishable from
genuine low numeric scores.
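The fix described here, not coercing a missing score to 0.0, can be sketched with a stub trace object. The row shape and result statuses are taken from the review; the Langfuse client itself is replaced by a recorder, so nothing below is the real SDK API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRow:
    """Hypothetical shape of one evaluation result row."""

    metric: str
    score: Optional[float]
    result: str  # PASS / FAIL / ERROR / SKIPPED


class StubTrace:
    """Records score() calls instead of talking to Langfuse."""

    def __init__(self) -> None:
        self.scores: list[dict] = []

    def score(self, *, name: str, value: float, comment: str) -> None:
        self.scores.append({"name": name, "value": value, "comment": comment})


def emit_scores(trace: StubTrace, rows: list[EvalRow]) -> None:
    for r in rows:
        if r.score is None:
            # ERROR/SKIPPED rows carry no numeric score: skip rather than fake a 0.0,
            # so a missing score stays distinguishable from a genuine zero
            continue
        trace.score(name=r.metric, value=r.score, comment=f"result={r.result}")


trace = StubTrace()
emit_scores(trace, [
    EvalRow("faithfulness", 0.92, "PASS"),
    EvalRow("relevance", None, "SKIPPED"),
])
print(trace.scores)  # only the faithfulness score is emitted
```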


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa8520d4-455b-4d02-90d7-970ce2235866

📥 Commits

Reviewing files that changed from the base of the PR and between fd9ee3a and 3c5861c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • README.md
  • pyproject.toml
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py

Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py Outdated
Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py Outdated
@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from cac2b29 to 2a8a136 Compare April 24, 2026 08:08
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/runner/test_evaluation.py (1)

52-64: ⚠️ Potential issue | 🟡 Minor

Missing test coverage for the --langfuse / LIGHTSPEED_USE_LANGFUSE opt-in.

_make_eval_args now takes langfuse=False, and the two assert_called_once_with updates confirm the default path passes on_complete=None. However, there’s no test exercising the enabled path:

  • run_evaluation with langfuse=True should import build_langfuse_on_complete_callback and pass a non-None callback into lightspeed_evaluation.api.evaluate.
  • Same with LIGHTSPEED_USE_LANGFUSE=1 (and the accepted true/yes variants) in the environment.
  • main() should populate argparse.Namespace.langfuse from the new --langfuse flag.

Given this is the only public entry point for the new feature, at minimum a test that patches lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback and asserts the returned callable is forwarded would be worthwhile.

Want me to draft these test cases?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Tests are missing
for the langfuse opt-in path: add unit tests that call run_evaluation (and main
via invoking the CLI entry point) with langfuse=True and with
LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that patch
lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback
to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to
capture its on_complete arg; assert evaluate was called with on_complete is the
sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main
populate argparse.Namespace.langfuse correctly from the --langfuse flag and the
environment variable.
🧹 Nitpick comments (5)
src/lightspeed_evaluation/api.py (1)

41-49: Coding-guideline violation: # pylint: disable=too-many-arguments across public wrappers.

Per repo guidelines for **/*.py, lint suppressions should be replaced by fixing the underlying issue rather than silencing it. All six wrappers (evaluate, evaluate_with_summary, evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary) now carry this disable. Since the wrappers differ only in their payload shape, a small options object is a clean fix that removes every disable at once.

# In core.models (or alongside EvaluationRunContext)
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class EvaluateOptions:
    output_dir: Optional[str] = None
    evaluation_data_path: Optional[str] = None
    on_complete: Optional[Callable[[list[EvaluationResult], EvaluationRunContext], None]] = None
    compute_confidence_intervals: bool = False

Wrappers would then take options: EvaluateOptions = EvaluateOptions() and forward a single object, dropping every # pylint: disable.

As per coding guidelines: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead".

Also applies to Lines 86, 129, 162, 199, 245.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/api.py` around lines 41 - 49, Replace the
multiple-argument public wrapper signatures that use "# pylint:
disable=too-many-arguments" (evaluate, evaluate_with_summary,
evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn,
evaluate_turn_with_summary) with a single options object parameter: define an
immutable dataclass EvaluateOptions (with fields output_dir,
evaluation_data_path, on_complete, compute_confidence_intervals, etc.) and
change each wrapper to accept options: EvaluateOptions = EvaluateOptions() and
forward options to the underlying implementation, removing the lint disables;
update all call sites in these wrappers to read options.
src/lightspeed_evaluation/integrations/langfuse_reporter.py (1)

139-155: User-supplied context.metadata silently clobbers internal trace keys.

trace_meta.update(context.metadata or {}) is performed after setting run_name, result_count, and original_data_path, so a caller who happens to use any of those keys in context.metadata will overwrite the framework-set values on the Langfuse trace. If that is intentional (callers should be able to override), the code is fine; otherwise consider reversing the precedence or namespacing user metadata.

♻️ Suggested precedence reversal
-    trace_meta: dict[str, Any] = {
-        "run_name": context.run_name,
-        "result_count": len(results),
-    }
-    if context.original_data_path is not None:
-        trace_meta["original_data_path"] = context.original_data_path
-    trace_meta.update(context.metadata or {})
+    trace_meta: dict[str, Any] = dict(context.metadata or {})
+    trace_meta["run_name"] = context.run_name
+    trace_meta["result_count"] = len(results)
+    if context.original_data_path is not None:
+        trace_meta["original_data_path"] = context.original_data_path
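The effect of the two merge orders can be shown with plain dicts, independent of Langfuse:

```python
# Framework-set keys vs. user metadata: whichever dict is applied last wins.
framework = {"run_name": "evaluation", "result_count": 3}
user_meta = {"run_name": "custom-run", "team": "eval"}

# Current order: framework first, then user metadata -> user values clobber.
clobbered = dict(framework)
clobbered.update(user_meta)

# Reversed order: user metadata first, then framework -> framework values win.
protected = dict(user_meta)
protected.update(framework)

print(clobbered["run_name"])  # custom-run
print(protected["run_name"])  # evaluation
print(protected["team"])      # eval (user-only keys survive either way)
```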
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 139
- 155, The trace metadata merge currently lets user-supplied context.metadata
overwrite framework keys because trace_meta.update(context.metadata or {}) runs
after setting run_name/result_count/original_data_path; change the precedence so
framework keys win by applying user metadata first and then updating/overwriting
with the framework values (i.e., build trace_meta from context.metadata or {}
then set trace_meta["run_name"]=context.run_name,
trace_meta["result_count"]=len(results), and optionally set original_data_path
if present) so that the values used when creating the langfuse.trace (see
trace_meta and trace = langfuse.trace(...)) cannot be clobbered by callers.
tests/unit/pipeline/evaluation/test_pipeline.py (1)

163-260: Good coverage; consider one more edge case.

The two tests cover the happy-path invocation and exception isolation well. Consider a third that calls run_evaluation without original_data_path to lock in that EvaluationRunContext.run_name falls back to "evaluation" and original_data_path is None — this cements the fallback contract that Langfuse traces will rely on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/pipeline/evaluation/test_pipeline.py` around lines 163 - 260, Add
a new unit test (e.g.,
test_on_complete_defaults_run_name_and_original_data_path_none) that mirrors the
existing on_complete tests but calls EvaluationPipeline.run_evaluation WITHOUT
providing original_data_path; patch the same collaborators (MetricManager,
APIDataAmender, EvaluationErrorHandler, ScriptExecutionManager,
MetricsEvaluator, and ConversationProcessor returning a mock processor that
yields a mock EvaluationResult), register an on_complete callback that captures
the (results, ctx) pair, invoke pipeline.run_evaluation(sample_evaluation_data,
on_complete=on_complete) and assert the callback ran and that ctx.run_name ==
"evaluation" and ctx.original_data_path is None so the fallback behavior is
verified.
src/lightspeed_evaluation/runner/evaluation.py (1)

164-178: Minor: accepted truthy values don’t match common conventions.

("1", "true", "yes") covers the help text’s example but omits on, which some users expect (and is accepted by e.g. distutils.util.strtobool). Consider also handling on / y / t for symmetry, or document the exact accepted set in the --langfuse help.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/runner/evaluation.py` around lines 164 - 178, The
env/CLI truthiness check for enabling Langfuse currently tests only
("1","true","yes"); update the logic around use_langfuse (which uses
eval_args.langfuse and LIGHTSPEED_USE_LANGFUSE) to accept additional common
truthy tokens such as "on", "y", and "t" (or use a standard parser like
distutils.util.strtobool with proper ValueError handling) before calling
build_langfuse_on_complete_callback; ensure the normalized string comparison is
case-insensitive and applied in the same place where use_langfuse is computed so
build_langfuse_on_complete_callback() behavior remains unchanged.
requirements-all-extras.txt (1)

223-224: Consider targeting Langfuse v4.

langfuse==2.60.10 pins to a deprecated v2 release—Langfuse v2 documentation was removed in August 2025, marking its end of life. Langfuse v3 (released June 2025) introduced OpenTelemetry-based observability; however, v4.x (released March 2026) is now current with an observation-centric data model and is the recommended upgrade path. Since this integration is new, target v4 directly (new extra langfuse>=4). If v2 is retained deliberately for compatibility reasons, add a note in the reporter module docstring explaining the rationale and include a tracking issue to migrate.

This is auto-generated from pyproject.toml, so the real change belongs there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements-all-extras.txt` around lines 223 - 224, The project is pinned to
langfuse==2.60.10 (v2); update the dependency to target the current major
release (langfuse>=4) in the source of truth (pyproject.toml) so the
autogenerated requirements-all-extras.txt will reflect langfuse>=4, or if you
must keep v2 for compatibility, add a clear explanatory note and migration
tracking issue in the reporter module docstring (referencing the reporter module
name) that states why v2 is retained and links the issue to migrate to langfuse
v4.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 49-81: The docstring promises Langfuse client errors will be
logged and not re-raised, but only client creation is currently protected; wrap
the trace/score/flush interaction so exceptions are caught and logged instead of
propagated. Specifically, in push_evaluation_results_to_langfuse (or inside
_write_langfuse_trace_and_scores) add a try/except around calls that invoke
client.trace(...), trace.score(...), and client.flush() (or any langfuse API
calls) and on exception call logger.error with contextual info and return
gracefully; keep the client creation logic as-is and do not re-raise the caught
exceptions so the implementation matches the docstring.
- Around line 23-26: The code is calling Langfuse v1-style APIs
(client.trace(...), trace.score(...), client.flush()) which do not exist in
langfuse>=2.0.0; update the implementation in langfuse_reporter.py to use the v2
observation API: replace client.trace(name=..., metadata=..., tags=...) with a
context created via client.start_as_current_observation(name=..., metadata=...,
tags=...) (or the recommended observation factory), switch trace.score(name=...,
value=..., comment=...) to client.create_score(...) or use the returned
observation/span's score method per v2 (e.g., span.score_trace or
client.create_score with the observation id), and remove calls to client.flush()
(v2 has no explicit flush). Also ensure the Langfuse import/optional check still
works and update any variable names (e.g., trace -> observation/span) to match
the v2 objects so tests and runtime use the correct v2 method signatures.
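The "log, don't re-raise" guard requested in the first comment above could be sketched as follows. The `client.trace` / `trace.score` / `client.flush` names are taken from the review text, not verified against any particular Langfuse SDK version, and the plain-dict results stand in for EvaluationResult objects:

```python
import logging

logger = logging.getLogger(__name__)


def write_trace_and_scores(client, results, trace_meta):
    """Push one trace plus per-metric scores; log and return on any client error."""
    try:
        trace = client.trace(name=trace_meta.get("run_name"), metadata=trace_meta)
        for result in results:
            trace.score(name=result["metric"], value=result["score"])
        client.flush()
    except Exception:  # Langfuse failures must never break the evaluation run
        logger.exception("Failed to push evaluation results to Langfuse")
```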

---

Outside diff comments:
In `@tests/unit/runner/test_evaluation.py`:
- Around line 52-64: Tests are missing for the langfuse opt-in path: add unit
tests that call run_evaluation (and main via invoking the CLI entry point) with
langfuse=True and with LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that
patch
lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback
to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to
capture its on_complete arg; assert evaluate was called with on_complete is the
sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main
populate argparse.Namespace.langfuse correctly from the --langfuse flag and the
environment variable.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 52c593a5-6762-4b3f-8cca-c7df85afe90f

📥 Commits

Reviewing files that changed from the base of the PR and between 3c5861c and 2a8a136.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • README.md
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (9)
  • requirements-local-embeddings.txt
  • requirements.txt
  • requirements-nlp-metrics.txt
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • README.md
  • pyproject.toml
  • tests/unit/integrations/test_langfuse_reporter.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lightspeed_evaluation/__init__.py
  • tests/unit/test_api.py

Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py
Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py Outdated
@bsatapat-jpg
Collaborator Author

@coderabbitai review

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from ffd22e3 to 8bc4784 Compare April 29, 2026 11:48
@bsatapat-jpg
Collaborator Author

I was able to generate scores in Langfuse. Attaching the screenshot for your reference.
Tested for all the metrics.
image

@bsatapat-jpg bsatapat-jpg changed the title Added langfuse compatibility with the lightspeed-eval [LEADS-362] Added langfuse compatibility with the lightspeed-eval Apr 29, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

174-186: Guideline violation: # pylint: disable=broad-exception-caught at line 182.

Per coding guidelines, lint suppressions should not be used. However, this is a valid use case where the callback is user-provided code and specific exception types cannot be known in advance. The defensive catch ensures evaluation results are always returned regardless of callback failures.

Consider documenting this exception to the guideline in a code comment, or using a more specific pattern if the callback contract can be refined. The existing pattern at line 231 (_save_amended_data) uses the same approach.
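The documented broad catch described above amounts to this pattern (the helper name is illustrative; the real code lives inline in the pipeline):

```python
import logging

logger = logging.getLogger(__name__)


def notify_on_complete(on_complete, results, context):
    """Invoke the user-supplied callback without letting it break the run."""
    if on_complete is None:
        return
    # on_complete is user-provided code, so any exception type is possible;
    # catching everything guarantees the evaluation results are still returned.
    try:
        on_complete(results, context)
    except Exception:
        logger.exception("on_complete callback raised; continuing")
```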

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/pipeline.py` around lines 174 -
186, The broad-except suppression on the on_complete callback should be
documented instead of silenced: remove the inline "# pylint:
disable=broad-exception-caught" and add a short comment above the try/except
explaining that on_complete (the user-provided callback) may raise any exception
and we intentionally catch all exceptions to ensure evaluation results are
preserved (matching the existing pattern used by _save_amended_data); keep the
try/except and the logger.exception(...) call referencing the caught exception
so failures are logged with stack traces while allowing the function to
continue.
src/lightspeed_evaluation/integrations/langfuse_reporter.py (1)

46-46: Kwargs type annotation could be more precise.

The **deprecated_client_kwargs: Optional[str] annotation is technically valid (represents value type), but using Unpack[TypedDict] (Python 3.12+) or documenting the expected keys in a more type-safe way would improve clarity. Given this is for deprecated compatibility kwargs, the current approach is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` at line 46, The
type annotation for the variadic deprecated kwargs parameter
**deprecated_client_kwargs in
src/lightspeed_evaluation/integrations/langfuse_reporter.py is too imprecise
(Optional[str]); update it to a more accurate mapping type — either use
Unpack[YourTypedDict] if you can rely on Python 3.12+ and have a TypedDict
describing the expected keys, or fall back to Dict[str, Any] or Mapping[str,
Any] to indicate arbitrary keyword args; modify the function/method signature
where **deprecated_client_kwargs is declared (e.g., the Langfuse reporter
constructor/function) accordingly and add or reference the TypedDict if using
Unpack.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 912418b5-a0f2-4a8c-907d-5c59e8707985

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8a136 and 8bc4784.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • README.md
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (7)
  • requirements.txt
  • requirements-nlp-metrics.txt
  • src/lightspeed_evaluation/core/models/__init__.py
  • README.md
  • src/lightspeed_evaluation/__init__.py
  • pyproject.toml
  • requirements-local-embeddings.txt
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/test_api.py
  • tests/unit/runner/test_evaluation.py

@asamal4
Collaborator

asamal4 commented May 10, 2026

@bsatapat-jpg Could you please share the Analytics tab details?
Also, what additional benefit will we get from this? For storage we already provide SQLite/PostgreSQL integration, and for a dashboard the current implementation is very basic; the screenshot is not very intuitive.
Do you have any plans for future enhancement on this? With the current PR we are adding additional code without a proportional benefit.

@bsatapat-jpg
Collaborator Author

@asamal4 Langfuse’s Analytics tab gives mean/std, time trends, score distributions, and optional two-score comparison (correlation, MAE/RMSE, heatmap). For teams already on Langfuse, that avoids a separate eval DB and duplicate dashboards; our SQLite/Postgres path remains for local/basic UI — Langfuse is the optional path where analysis and org-wide tooling already live.
image

@asamal4
Collaborator

asamal4 commented May 11, 2026

@bsatapat-jpg Probably I misunderstood the requirement here. I thought we were integrating Langfuse for complete tracking, but it seems the goal is just to push results. If that is the case, why aren't we using the storage config? That was the plan when we added it. Why are we adding a langfuse CLI arg?

@bsatapat-jpg
Collaborator Author

bsatapat-jpg commented May 11, 2026

@asamal4 - We need to provide the Langfuse public and secret keys through the terminal using export commands. That's why I thought of configuring this through the CLI.
However, we can also handle this through the storage configuration itself by simply providing:

storage:
  - type: 'langfuse'

Let me know if I need to follow this structure.
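The export commands mentioned above would typically use the Langfuse SDK's standard environment variables (names assumed from Langfuse's usual configuration; values are placeholders):

```shell
export LANGFUSE_PUBLIC_KEY="pk-lf-placeholder"
export LANGFUSE_SECRET_KEY="sk-lf-placeholder"
export LANGFUSE_HOST="https://cloud.langfuse.com"
```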

@asamal4
Collaborator

asamal4 commented May 11, 2026

Yes, I would prefer a consistent approach. We can still use environment variables.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unit/runner/test_evaluation.py (1)

228-231: ⚡ Quick win

Strengthen callback wiring assertion with call count check.

This test should also assert evaluate is called exactly once before reading call_args, so regressions with multiple calls don’t slip through.

[recommendation]

Suggested tweak
         run_evaluation(_make_eval_args())
 
+        mock_evaluate.assert_called_once()
         assert mock_evaluate.call_args.kwargs["on_complete"] is not None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/runner/test_evaluation.py` around lines 228 - 231, The test
currently reads mock_evaluate.call_args without ensuring evaluate was invoked
only once; update the assertion after calling run_evaluation(_make_eval_args())
to assert the mock was called exactly once (e.g., use
mock_evaluate.assert_called_once() or assert mock_evaluate.call_count == 1)
before accessing mock_evaluate.call_args.kwargs["on_complete"], so multiple
calls cannot silently break the callback wiring; keep references to
run_evaluation and mock_evaluate when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 138-159: The current _suppress_conflicting_langfuse_host_env
context manager mutates os.environ to hide _LANGFUSE_HOST_ENV which can cause
process-wide races; instead, avoid touching global env by ensuring the Langfuse
client receives an explicit base URL when "host" is provided: in code paths that
call Langfuse(**langfuse_init_kwargs) (and in the helper
_suppress_conflicting_langfuse_host_env), remove any os.environ modifications
and set langfuse_init_kwargs["base_url"] = langfuse_init_kwargs.pop("host") (or
otherwise add the explicit client arg the Langfuse constructor expects) so the
client uses that value and won't read _LANGFUSE_HOST_ENV; also delete usages of
the context manager (including the block around lines referencing
_suppress_conflicting_langfuse_host_env) and keep the _LANGFUSE_HOST_ENV symbol
unchanged elsewhere.
- Around line 226-239: The docstring for the public function
build_langfuse_on_complete_callback is missing Google-style Args and Returns
sections; update its docstring to include an "Args:" block describing each
parameter (including types and defaults) and a "Returns:" block that specifies
the callback signature (what the returned on_complete callable accepts and
returns), using Google-style formatting consistent with other public APIs (e.g.,
describe the on_complete callback parameters, return type, and any side effects
like reporting to Langfuse).
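A Google-style docstring along the requested lines could read as follows (the parameter list is an assumption for illustration, not the function's actual signature):

```python
def build_langfuse_on_complete_callback(host=None, public_key=None, secret_key=None):
    """Build an on_complete callback that reports evaluation results to Langfuse.

    Args:
        host: Optional Langfuse base URL; environment configuration is used
            when omitted.
        public_key: Optional Langfuse public API key.
        secret_key: Optional Langfuse secret API key.

    Returns:
        A callable accepting (results, context) that pushes one score per
        metric to Langfuse and returns None; client errors are logged, not
        raised.
    """
```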


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bd48c325-ce0d-44af-954b-cc29ffe04fba

📥 Commits

Reviewing files that changed from the base of the PR and between 8bc4784 and e020b82.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • README.md
  • config/system.yaml
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/storage/__init__.py
  • src/lightspeed_evaluation/core/storage/config.py
  • src/lightspeed_evaluation/core/storage/factory.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/storage/test_composite_and_factory.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (7)
  • config/system.yaml
  • requirements-local-embeddings.txt
  • src/lightspeed_evaluation/core/models/system.py
  • requirements-nlp-metrics.txt
  • README.md
  • requirements-all-extras.txt
  • src/lightspeed_evaluation/core/models/__init__.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • pyproject.toml
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/__init__.py
  • tests/unit/test_api.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • requirements.txt
  • src/lightspeed_evaluation/api.py

Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py Outdated
Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py
@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from fb60d49 to e5a6a80 Compare May 12, 2026 08:18
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/integrations/__init__.py`:
- Around line 3-7: The public integration imports
(build_langfuse_on_complete_callback,
build_langfuse_on_complete_from_storage_configs,
push_evaluation_results_to_langfuse) in integrations/__init__.py must be guarded
so importing the top-level API won't raise when the Langfuse extra is not
installed; wrap the import in a try/except ImportError (or
ImportError/Exception) and set fallback names (e.g., None or dummy callables) so
callers can detect absence, or alternatively move the import into the caller
function _resolve_on_complete in api.py as a local import; ensure any code that
calls build_langfuse_on_complete_from_storage_configs checks for the fallback
and raises a clear error only when the feature is actually used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d0c0fbc9-adac-4ce5-8192-358c84011455

📥 Commits

Reviewing files that changed from the base of the PR and between e020b82 and e5a6a80.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • README.md
  • config/system.yaml
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/storage/__init__.py
  • src/lightspeed_evaluation/core/storage/config.py
  • src/lightspeed_evaluation/core/storage/factory.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/storage/test_composite_and_factory.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (5)
  • requirements-nlp-metrics.txt
  • README.md
  • requirements.txt
  • config/system.yaml
  • src/lightspeed_evaluation/core/models/system.py
🚧 Files skipped from review as they are similar to previous changes (20)
  • requirements-local-embeddings.txt
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/core/storage/factory.py
  • tests/unit/test_api.py
  • tests/unit/core/system/test_loader.py
  • src/lightspeed_evaluation/core/storage/__init__.py
  • pyproject.toml
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • tests/unit/core/storage/test_composite_and_factory.py
  • src/lightspeed_evaluation/core/system/loader.py
  • tests/unit/runner/test_evaluation.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • requirements-all-extras.txt
  • src/lightspeed_evaluation/api.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • src/lightspeed_evaluation/core/storage/config.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • tests/unit/integrations/test_langfuse_reporter.py

Comment thread src/lightspeed_evaluation/integrations/__init__.py Outdated
@bsatapat-jpg
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

