Add trace groundedness evaluator #44
base: main
Conversation
Pull request overview
This PR introduces a trace-level groundedness evaluator for validating agent outputs against tool evidence, along with supporting infrastructure for LLM-based evaluation. The evaluator helps detect hallucinations by verifying that claims in candidate outputs are supported by tool observations in the trace.
Changes:
- Added `create_trace_groundedness_evaluator` factory function with configurable tool filtering and evidence context building (usage sketch below)
- Implemented an LLM-as-a-judge evaluator factory with structured response schemas
- Updated OpenAI client initialization to use Langfuse-wrapped `AsyncOpenAI` with SDK retries disabled in favor of centralized tenacity-based retry logic
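For context, a hedged sketch of how the factory might be wired up. The parameter names (`allowed_tools`, `model`) and the call signature are assumptions for illustration, not this PR's actual API:

```python
# Hypothetical usage sketch -- parameter names and call signature are assumed,
# not taken from this PR's public API.
from aieng.agent_evals.evaluation.graders import create_trace_groundedness_evaluator

# Build an evaluator that extracts claims from the agent's final answer and
# checks each one against observations produced by the listed tools.
groundedness = create_trace_groundedness_evaluator(
    allowed_tools=["web_search", "read_file"],  # assumed tool-filtering knob
    model="gpt-4o-mini",                        # assumed judge model
)

# In an evaluation loop, the grader would score one trace at a time
# (awaitable call and result shape are likewise assumed):
# result = await groundedness(trace)
# print(result.score, [c.text for c in result.claims if not c.supported])
```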
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| aieng/agent_evals/evaluation/graders/trace_groundedness.py | Core implementation of trace groundedness evaluator with claim extraction and evidence verification |
| aieng/agent_evals/evaluation/graders/llm_judge.py | Generic LLM judge evaluator factory for item-level evaluation |
| aieng/agent_evals/evaluation/graders/_utils.py | Shared utilities for structured parsing, retry logic, and prompt rendering |
| aieng/agent_evals/evaluation/graders/config.py | Configuration dataclass for LLM request parameters and retry settings |
| aieng/agent_evals/evaluation/graders/__init__.py | Public API exports for grader factories and response models |
| aieng/agent_evals/async_client_manager.py | Updated to use Langfuse OpenAI wrapper and disable SDK retries |
| tests/aieng/agent_evals/evaluation/graders/test_trace_groundedness.py | Comprehensive tests for groundedness evaluator including error paths |
| tests/aieng/agent_evals/evaluation/graders/test_llm_judge.py | Tests for LLM judge evaluator with validation scenarios |
| tests/aieng/agent_evals/evaluation/graders/__init__.py | Test package marker |
| tests/aieng/agent_evals/evaluation/__init__.py | Test package marker |
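As context for the "structured response schemas" noted above, a minimal Pydantic sketch of what a groundedness verdict could look like; the class and field names here are illustrative assumptions, not the PR's exported models:

```python
from pydantic import BaseModel, Field

class Claim(BaseModel):
    """One factual claim extracted from the candidate output (illustrative)."""
    text: str = Field(description="The claim as stated in the candidate output.")
    supported: bool = Field(description="Whether a tool observation backs the claim.")
    evidence: str | None = Field(
        default=None, description="Quoted tool observation supporting the claim, if any."
    )

class GroundednessVerdict(BaseModel):
    """Aggregate judge verdict over all extracted claims (illustrative)."""
    claims: list[Claim]
    score: float = Field(ge=0.0, le=1.0, description="Fraction of supported claims.")
    reasoning: str
```

Schemas like these can be handed to the OpenAI SDK's structured-output parsing (e.g. `client.beta.chat.completions.parse(..., response_format=GroundednessVerdict)`), so the judge's answer is validated before any score is computed.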
Summary
This PR adds a reusable trace-level groundedness evaluator (claims supported by tool evidence).
It also updates OpenAI client initialization to use `langfuse.openai.AsyncOpenAI` to enable tracing of the LLM judge, with SDK retries disabled (`max_retries=0`) so that retry behavior is controlled centrally via tenacity in the grader utilities.
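A minimal sketch of that wiring, assuming the design described above; the attempt count, backoff values, and the `call_judge` helper are illustrative, not the repo's actual grader utilities:

```python
from langfuse.openai import AsyncOpenAI  # Langfuse drop-in wrapper; traces each call
from tenacity import retry, stop_after_attempt, wait_exponential

# SDK-level retries are disabled so a single, centralized retry policy
# (the decorator below) governs all judge calls.
client = AsyncOpenAI(max_retries=0)

@retry(
    stop=stop_after_attempt(3),                   # assumed attempt count
    wait=wait_exponential(multiplier=1, max=10),  # assumed backoff
    reraise=True,
)
async def call_judge(prompt: str) -> str:
    """Illustrative judge call; model name and message shape are assumptions."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""
```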
Clickup Ticket(s): N/A

Type of Change
Changes Made
- `create_trace_groundedness_evaluator` + public schemas/exports for grader responses and claims
- Updated `aieng/agent_evals/async_client_manager.py` to use `langfuse.openai.AsyncOpenAI` and disable SDK retries
- `tests/.../graders/test_trace_groundedness.py`

Testing
- Tests pass (`uv run pytest tests/`)
- Type checks pass (`uv run mypy <src_dir>`)
- Lint checks pass (`uv run ruff check src_dir/`)

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
No deployment or migration steps required.
Checklist