feat: implement loglikelihood and loglikelihood_rolling for LiteLLMClient (closes #1093) #1244

Open
ALI-AL-MARJANI wants to merge 2 commits into huggingface:main from ALI-AL-MARJANI:feature/litellm-loglikelihood-support

Conversation

@ALI-AL-MARJANI

Summary

Implements loglikelihood() and loglikelihood_rolling() for LiteLLMClient,
enabling deterministic MCQ benchmarks (MMLU, ARC, HellaSwag) and perplexity
evaluation over any LiteLLM-supported provider.

Previously both methods raised NotImplementedError.

How it works

Each request is sent through litellm.atext_completion (the /v1/completions endpoint) with echo=True, logprobs=1, max_tokens=1, and temperature=0.0. A two-layer token alignment engine then separates the continuation log-probabilities from the echoed prompt:

  1. Layer 1: character-offset alignment via text_offset (exact; used when the provider returns offsets, as OpenAI does)
  2. Layer 2: tiktoken token-count fallback for providers that do not return offsets
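
To make the two layers concrete, here is a minimal sketch of the alignment idea. It is not the code in this PR: the function name `continuation_logprobs` and its parameters are hypothetical, and the fallback takes precomputed token counts as plain integers (in the PR these would come from tiktoken) so the example stays dependency-free. It also assumes that a token starting before the end of the context is attributed to the context.

```python
from typing import List, Optional, Sequence


def continuation_logprobs(
    token_logprobs: Sequence[Optional[float]],
    text_offsets: Optional[Sequence[int]],
    context: str,
    continuation: str,
    context_token_count: Optional[int] = None,       # hypothetical fallback input
    continuation_token_count: Optional[int] = None,  # hypothetical fallback input
) -> List[Optional[float]]:
    """Return the log-probabilities of the continuation tokens only."""
    if text_offsets:
        # Layer 1: keep tokens whose starting character offset falls inside
        # the continuation span. This also drops the single token generated
        # because of max_tokens=1, since it starts after the span ends.
        start_char = len(context)
        end_char = start_char + len(continuation)
        return [
            lp
            for lp, offset in zip(token_logprobs, text_offsets)
            if start_char <= offset < end_char
        ]

    # Layer 2: no offsets available, so slice by token counts instead.
    if context_token_count is None or continuation_token_count is None:
        raise ValueError("token counts are required when text_offsets is missing")
    start = context_token_count
    return list(token_logprobs[start : start + continuation_token_count])


# Echoed tokens: "The", " cat", " sat", " down", plus one generated token " on".
logprobs = [None, -1.0, -0.5, -0.25, -2.0]
offsets = [0, 3, 7, 11, 16]
scores = continuation_logprobs(logprobs, offsets, "The cat", " sat down")
total = sum(scores)  # summed log-likelihood of " sat down"
```

The summed value is what an MCQ benchmark would compare across answer choices. The count-based fallback is only as good as the agreement between the local tokenizer and the provider's tokenizer, which is presumably why offsets are tried first.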

Concurrency is managed with asyncio.Semaphore + asyncio.gather, matching
the approach used by other async-capable backends.
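
The concurrency pattern named above can be sketched as follows. This is an illustration only: `bounded_gather`, `worker`, and `max_concurrency` are hypothetical names, and the real pipeline would call the LiteLLM API where this sketch calls a placeholder coroutine.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_gather(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 8,
) -> List[R]:
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: T) -> R:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_score(doc_id: int) -> float:
    # Placeholder for one API call; returns a made-up score.
    await asyncio.sleep(0.01)
    return -float(doc_id)


results = asyncio.run(bounded_gather([1, 2, 3, 4], fake_score, max_concurrency=2))
```

Because asyncio.gather preserves input order, results can be matched back to their documents by position even though requests complete out of order.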

Provider requirement

The /v1/completions endpoint with echo support is required:

Provider                                               Supported
OpenAI gpt-3.5-turbo-instruct                          Yes
Any OpenAI-compatible local server (vLLM, llama.cpp)   Yes
OpenAI chat-only models (gpt-4o, gpt-4-turbo)          No
Anthropic, Gemini, Cohere                              No

A warning is emitted at runtime if the model is registered as mode=chat.
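
A guard of this kind might look like the sketch below. The function name `warn_if_chat_mode` and the warning text are hypothetical, and the `mode` string is assumed to have been read from LiteLLM's model metadata beforehand; how the PR performs that lookup is not shown here.

```python
import warnings
from typing import Optional


def warn_if_chat_mode(model_name: str, mode: Optional[str]) -> bool:
    """Warn when a model is registered as chat-only; return True if a warning was issued."""
    if mode == "chat":
        warnings.warn(
            f"{model_name} is registered as mode=chat; loglikelihood evaluation "
            "needs a /v1/completions endpoint with echo support.",
            UserWarning,
            stacklevel=2,
        )
        return True
    # Unknown or completion-mode models pass through silently.
    return False
```

Warning instead of raising leaves room for OpenAI-compatible servers whose models are registered as chat but still expose a completions endpoint.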

Changes

  • litellm_model.py — 9 new methods: loglikelihood, loglikelihood_rolling,
    async pipeline, token alignment engine, argmax check, provider guard, length guard
  • model_input.py — new to_litellm_text_completion_dict() method; also fixes
    a bug where presence_penalty was silently dropped from to_litellm_dict()
  • inference_providers_model.py — informative NotImplementedError messages
    explaining why the HF Inference Providers backend cannot support this
  • use-litellm-as-backend.mdx — full documentation of both evaluation modes
  • installation.mdx, models.mdx — updated to reflect new capabilities
  • tests.yaml — adds --extra litellm to CI
  • litellm_completion_model.yaml — example config for MCQ/perplexity benchmarks

Tests

80 new unit tests, all passing, no new regressions:

  • Token alignment engine (6 tests)
  • Argmax checker (7 tests)
  • Async API caller with retry/backoff (7 tests)
  • Single-doc loglikelihood processor (7 tests)
  • End-to-end loglikelihood integration (4 tests)
  • Provider guard (12 tests)
  • Rolling loglikelihood (6 tests + 2 integration)
  • Length guard (7 tests)
  • to_litellm_text_completion_dict (9 tests)
  • to_litellm_dict presence_penalty regression (4 tests)
  • greedy_until split iteration regression (2 tests)
