The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. Contact Axiom to get early access and join a focused group of teams shaping these tools.
The Eval function
Coming soon
The primary tool for the Measure stage is the Eval function, which will be available in the axiom/ai package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
An Eval is structured around a few key parameters:
- data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
- task: The function that executes your AI capability, taking an input and producing an output.
- scorers: An array of grader functions that score the output against the expected value.
- threshold: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
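Because the Eval function is not yet released, the sketch below does not import anything from axiom/ai. It is a minimal local stand-in, under stated assumptions, showing how the four parameters could fit together. The names runEval, EvalSpec, and sentimentEval are hypothetical, the scorer is assumed to take a single object argument, and averaging all scores before comparing against the threshold is an assumption rather than documented behavior.

```typescript
// Hypothetical stand-in for the upcoming Eval function. NOT the axiom/ai API.
type Case<I, E> = { input: I; expected: E };

// Assumed scorer shape: receives input, output, and expected; returns a score in [0, 1].
type Scorer<I, O, E> = (args: { input: I; output: O; expected: E }) => number | Promise<number>;

interface EvalSpec<I, O, E> {
  data: () => Promise<Case<I, E>[]>;   // ground-truth pairs
  task: (input: I) => O | Promise<O>;  // the capability under test
  scorers: Scorer<I, O, E>[];          // grader functions
  threshold: number;                   // pass/fail cutoff between 0 and 1
}

interface EvalResult {
  score: number;
  passed: boolean;
}

async function runEval<I, O, E>(spec: EvalSpec<I, O, E>): Promise<EvalResult> {
  const cases = await spec.data();
  const scores: number[] = [];
  for (const { input, expected } of cases) {
    const output = await spec.task(input);
    for (const scorer of spec.scorers) {
      scores.push(await scorer({ input, output, expected }));
    }
  }
  // Assumption: the overall score is the mean of every scorer result.
  const score = scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
  return { score, passed: score >= spec.threshold };
}

// A toy capability: keyword-based sentiment classification.
const sentimentEval: EvalSpec<string, string, string> = {
  data: async () => [
    { input: "I love this", expected: "positive" },
    { input: "This is awful", expected: "negative" },
  ],
  task: (input) => (input.includes("love") ? "positive" : "negative"),
  scorers: [({ output, expected }) => (output === expected ? 1 : 0)],
  threshold: 0.8,
};
```

Calling runEval(sentimentEval) resolves to an object with the aggregate score and a passed flag. The released Eval function may aggregate and report results differently.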
Grading with scorers
Coming soon
A scorer is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the input, the generated output, and the expected value, and must return a score.
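To make the scorer contract concrete, here is a sketch of three custom scorers. The argument shape (one object with input, output, and expected), the 0 to 1 score range, and the names exactMatch, validJson, and tokenRecall are all assumptions for illustration. They are not Axiom's built-in scorers.

```typescript
// Assumed scorer argument shape; the released API may differ.
type ScorerArgs = { input: string; output: string; expected: string };

// Returns 1 when output equals expected, ignoring case and surrounding whitespace.
function exactMatch({ output, expected }: ScorerArgs): number {
  return output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Returns 1 when output parses as JSON, otherwise 0.
function validJson({ output }: ScorerArgs): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

// Returns the fraction of expected tokens that also appear in output.
function tokenRecall({ output, expected }: ScorerArgs): number {
  const want = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (want.length === 0) return 1; // nothing to recall
  const have = new Set(output.toLowerCase().split(/\s+/).filter(Boolean));
  return want.filter((token) => have.has(token)).length / want.length;
}
```

Binary scorers such as exactMatch suit classification tasks, while graded scorers such as tokenRecall give partial credit, which interacts more smoothly with a threshold below 1.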
Running evaluations
Coming soon
You will run your evaluation suites from your terminal using the axiom CLI. The CLI will execute your suites using vitest in the background. Note that vitest will be a peer dependency for this functionality.
Analyzing results in the console
Coming soon
When you run an Eval, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with eval.* attributes, allowing you to analyze results in depth in the Axiom Console.
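The span hierarchy described above can be pictured with plain objects. The sketch below does not use the OpenTelemetry SDK and does not reflect Axiom's actual schema: the SpanSketch type, the helper names, and every attribute key (eval.name, eval.case.index, eval.score.name, eval.score.value) are hypothetical placeholders for whatever the eval.* namespace ends up containing.

```typescript
// Illustrative model only: plain objects, not real OpenTelemetry spans.
interface SpanSketch {
  name: string;
  attributes: Record<string, string | number | boolean>;
  children: SpanSketch[];
}

// One test case: a child span for the task and one for the scorer result.
// Attribute keys are hypothetical examples of the eval.* namespace.
function caseSpan(index: number, scorerName: string, score: number): SpanSketch {
  return {
    name: `case ${index}`,
    attributes: { "eval.case.index": index },
    children: [
      { name: "task", attributes: {}, children: [] },
      {
        name: `scorer ${scorerName}`,
        attributes: { "eval.score.name": scorerName, "eval.score.value": score },
        children: [],
      },
    ],
  };
}

// The parent span for the whole evaluation suite.
function suiteSpan(evalName: string, scores: number[]): SpanSketch {
  return {
    name: `eval ${evalName}`,
    attributes: { "eval.name": evalName },
    children: scores.map((score, i) => caseSpan(i, "exactMatch", score)),
  };
}

// Walks the tree and gathers every scorer value, as a query over the trace might.
function collectScores(span: SpanSketch): number[] {
  const found: number[] = [];
  const value = span.attributes["eval.score.value"];
  if (typeof value === "number") found.push(value);
  for (const child of span.children) found.push(...collectScores(child));
  return found;
}
```

Keeping scores as attributes on scorer spans, rather than only on the suite span, is what would let a query break results down per test case.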
The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.