Add trace groundedness evaluator #44
base: main
Conversation
Pull request overview
This PR introduces a trace-level groundedness evaluator for validating agent outputs against tool evidence, along with supporting infrastructure for LLM-based evaluation. The evaluator helps detect hallucinations by verifying that claims in candidate outputs are supported by tool observations in the trace.
Changes:
- Added `create_trace_groundedness_evaluator` factory function with configurable tool filtering and evidence context building (usage sketch below)
- Implemented an LLM-as-a-judge evaluator factory with structured response schemas
- Updated OpenAI client initialization to use Langfuse-wrapped `AsyncOpenAI` with SDK retries disabled in favor of centralized tenacity-based retry logic
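For context, a hedged sketch of how the factory might be wired up. The parameter names (`allowed_tools`, `model`) and the call signature are assumptions for illustration, not this PR's actual API:

```python
# Hypothetical usage sketch -- parameter names and call signature are assumed,
# not taken from this PR's public API.
from aieng.agent_evals.evaluation.graders import create_trace_groundedness_evaluator

# Build an evaluator that extracts claims from the agent's final answer and
# checks each one against observations produced by the listed tools.
groundedness = create_trace_groundedness_evaluator(
    allowed_tools=["web_search", "read_file"],  # assumed tool-filtering knob
    model="gpt-4o-mini",                        # assumed judge model
)

# In an evaluation loop, the grader would score one trace at a time
# (awaitable call and result shape are likewise assumed):
# result = await groundedness(trace)
# print(result.score, [c.text for c in result.claims if not c.supported])
```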
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| aieng/agent_evals/evaluation/graders/trace_groundedness.py | Core implementation of trace groundedness evaluator with claim extraction and evidence verification |
| aieng/agent_evals/evaluation/graders/llm_judge.py | Generic LLM judge evaluator factory for item-level evaluation |
| aieng/agent_evals/evaluation/graders/_utils.py | Shared utilities for structured parsing, retry logic, and prompt rendering |
| aieng/agent_evals/evaluation/graders/config.py | Configuration dataclass for LLM request parameters and retry settings |
| aieng/agent_evals/evaluation/graders/__init__.py | Public API exports for grader factories and response models |
| aieng/agent_evals/async_client_manager.py | Updated to use Langfuse OpenAI wrapper and disable SDK retries |
| tests/aieng/agent_evals/evaluation/graders/test_trace_groundedness.py | Comprehensive tests for groundedness evaluator including error paths |
| tests/aieng/agent_evals/evaluation/graders/test_llm_judge.py | Tests for LLM judge evaluator with validation scenarios |
| tests/aieng/agent_evals/evaluation/graders/__init__.py | Test package marker |
| tests/aieng/agent_evals/evaluation/__init__.py | Test package marker |
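As context for the "structured response schemas" noted above, a minimal Pydantic sketch of what a groundedness verdict could look like; the class and field names here are illustrative assumptions, not the PR's exported models:

```python
from pydantic import BaseModel, Field

class Claim(BaseModel):
    """One factual claim extracted from the candidate output (illustrative)."""
    text: str = Field(description="The claim as stated in the candidate output.")
    supported: bool = Field(description="Whether a tool observation backs the claim.")
    evidence: str | None = Field(
        default=None, description="Quoted tool observation supporting the claim, if any."
    )

class GroundednessVerdict(BaseModel):
    """Aggregate judge verdict over all extracted claims (illustrative)."""
    claims: list[Claim]
    score: float = Field(ge=0.0, le=1.0, description="Fraction of supported claims.")
    reasoning: str
```

Schemas like these can be handed to the OpenAI SDK's structured-output parsing (e.g. `client.beta.chat.completions.parse(..., response_format=GroundednessVerdict)`), so the judge's answer is validated before any score is computed.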
Summary
This PR adds a reusable trace-level groundedness evaluator (claims supported by tool evidence).
It also updates OpenAI client initialization to use `langfuse.openai.AsyncOpenAI` to enable tracing of the LLM judge, with SDK retries disabled (`max_retries=0`) so that retry behavior is controlled centrally via tenacity in the grader utilities.
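A minimal sketch of that wiring, assuming the design described above; the attempt count, backoff values, and the `call_judge` helper are illustrative, not the repo's actual grader utilities:

```python
from langfuse.openai import AsyncOpenAI  # Langfuse drop-in wrapper; traces each call
from tenacity import retry, stop_after_attempt, wait_exponential

# SDK-level retries are disabled so a single, centralized retry policy
# (the decorator below) governs all judge calls.
client = AsyncOpenAI(max_retries=0)

@retry(
    stop=stop_after_attempt(3),                   # assumed attempt count
    wait=wait_exponential(multiplier=1, max=10),  # assumed backoff
    reraise=True,
)
async def call_judge(prompt: str) -> str:
    """Illustrative judge call; model name and message shape are assumptions."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""
```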
Clickup Ticket(s): N/A

Type of Change
Changes Made
- `create_trace_groundedness_evaluator` + public schemas/exports for grader responses and claims
- Updated `aieng/agent_evals/async_client_manager.py` to use `langfuse.openai.AsyncOpenAI` and disable SDK retries
- `tests/.../graders/test_trace_groundedness.py`

Testing
- Tests pass (`uv run pytest tests/`)
- Type checks pass (`uv run mypy <src_dir>`)
- Lint checks pass (`uv run ruff check src_dir/`)

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
No deployment or migration steps required.
Checklist