LLMs are probabilistic: the same prompt can produce different results each time. Running a test once tells you that it can work, not that it reliably works. Evals run the same test many times and measure statistical accuracy.

Why Run Evals?

A single test pass can be misleading:
  • The LLM might get lucky on one attempt
  • Temperature introduces randomness
  • Different phrasings might fail where others succeed
Running 30+ iterations gives you confidence in real-world performance (see the sketch after this list), letting you make claims like:
  • “This tool is called correctly 97% of the time”
  • “Arguments are correct in 90% of cases”
  • “Average latency is 1.2 seconds”
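
The statistics behind those claims are easy to sanity-check. The sketch below is plain TypeScript with no SDK dependency; it uses the standard normal-approximation confidence interval, which is our illustration rather than anything @mcpjam/sdk computes for you:

// Rough 95% confidence interval for an observed pass rate,
// using the normal approximation (reasonable for n >= ~30).
function passRateInterval(successes: number, iterations: number) {
  const p = successes / iterations;
  const halfWidth = 1.96 * Math.sqrt((p * (1 - p)) / iterations);
  return { low: Math.max(0, p - halfWidth), high: Math.min(1, p + halfWidth) };
}

console.log(passRateInterval(1, 1));   // one lucky pass: interval degenerates to [1, 1]
console.log(passRateInterval(29, 30)); // ≈ { low: 0.90, high: 1.0 }

A single passing run tells you almost nothing; 29 of 30 passes lets you honestly claim a pass rate of roughly 90% or better.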

EvalTest: Single Test Scenario

EvalTest runs one test function multiple times:
import { EvalTest } from "@mcpjam/sdk";

const test = new EvalTest({
  name: "addition-accuracy",
  test: async (agent) => {
    const result = await agent.prompt("Add 2 and 3");
    return result.hasToolCall("add"); // Return true/false
  },
});

const result = await test.run(agent, { iterations: 30 });

console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"
console.log(`${result.successes}/${result.iterations} passed`);

Writing Test Functions

Test functions receive an EvalAgent (implemented by TestAgent and mock agents) and return a boolean:
// Simple: did the right tool get called?
test: async (agent) => {
  const result = await agent.prompt("Add 5 and 3");
  return result.hasToolCall("add");
}

// Detailed: check arguments too
test: async (agent) => {
  const result = await agent.prompt("Add 10 and 20");
  const args = result.getToolArguments("add");
  return args?.a === 10 && args?.b === 20;
}

// Complex: multi-step workflow
test: async (agent) => {
  const r1 = await agent.prompt("Create a project called 'Test'");
  const r2 = await agent.prompt("Add a task to it", { context: r1 });
  return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
}

Run Options

await test.run(agent, {
  iterations: 30,      // How many times to run
  concurrency: 5,      // Parallel runs (careful with rate limits)
  retries: 2,          // Retry failures
  timeoutMs: 30000,    // Per-test timeout
  mcpjam: {
    // Auto-save is enabled when MCPJAM_API_KEY is available
    suiteName: "SDK eval smoke",
    strict: false, // warn by default; true to fail CI on upload errors
  },
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
});

Metrics

After running, access various metrics:
test.accuracy();           // Success rate (0.0 - 1.0)
test.precision();          // Precision metric
test.recall();             // Recall metric
test.averageTokenUse();    // Avg tokens per iteration
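
For example, a short summary printed after a run (only the methods listed above are used; the formatting, and the assumption that each returns a plain number, are ours):

await test.run(agent, { iterations: 30 });

// Print a one-line-per-metric report.
console.log(`accuracy:   ${(test.accuracy() * 100).toFixed(1)}%`);
console.log(`precision:  ${test.precision().toFixed(2)}`);
console.log(`recall:     ${test.recall().toFixed(2)}`);
console.log(`avg tokens: ${test.averageTokenUse().toFixed(0)}`);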

EvalSuite: Multiple Tests

Group related tests together:
import { EvalSuite, EvalTest } from "@mcpjam/sdk";

const suite = new EvalSuite({ name: "Math Operations" });

suite.add(new EvalTest({
  name: "addition",
  test: async (agent) => {
    const r = await agent.prompt("Add 5 and 3");
    return r.hasToolCall("add");
  },
}));

suite.add(new EvalTest({
  name: "multiplication",
  test: async (agent) => {
    const r = await agent.prompt("Multiply 4 by 6");
    return r.hasToolCall("multiply");
  },
}));

suite.add(new EvalTest({
  name: "division",
  test: async (agent) => {
    const r = await agent.prompt("Divide 20 by 4");
    return r.hasToolCall("divide");
  },
}));

const result = await suite.run(agent, { iterations: 30 });
console.log(`Overall: ${(result.aggregate.accuracy * 100).toFixed(1)}%`);

Save Results to MCPJam

Both EvalTest and EvalSuite can automatically save results to MCPJam when a run completes. Set MCPJAM_API_KEY in your environment, and results are saved automatically:
await test.run(agent, {
  iterations: 30,
  mcpjam: {
    suiteName: "Addition Eval",
    passCriteria: { minimumPassRate: 90 },
  },
});
For manual save APIs, CI metadata, and artifact uploads, see the Save Results to MCPJam guide.
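
Because auto-save only triggers when the key is present, a small guard makes local runs explicit about whether results will be uploaded (a minimal sketch; the warning message is illustrative):

// Warn early if eval results won't be uploaded to MCPJam.
if (!process.env.MCPJAM_API_KEY) {
  console.warn("MCPJAM_API_KEY is not set; results will not be saved to MCPJam.");
}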

Suite Results

// Overall accuracy
console.log(`Suite: ${(suite.accuracy() * 100).toFixed(1)}%`);

// Per-test breakdown
for (const test of suite.getAll()) {
  console.log(`  ${test.getName()}: ${(test.accuracy() * 100).toFixed(1)}%`);
}

// Get specific test
const addTest = suite.get("addition");
console.log(`Addition accuracy: ${addTest.accuracy()}`);

Choosing Iteration Count

Scenario            Iterations   Why
Quick smoke test    10           Fast feedback during development
Regular testing     30           Good statistical significance
Pre-release         50-100       High confidence before shipping
Benchmarking        100+         Comparing models or changes
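
In CI it is often convenient to pick the count from context instead of hard-coding it. A minimal sketch mapping the table above to an environment variable (EVAL_PROFILE and the profile names are illustrative, not part of the SDK):

// Choose an iteration count from a named profile.
const profiles: Record<string, number> = {
  smoke: 10,     // quick feedback during development
  regular: 30,   // day-to-day testing
  release: 100,  // high confidence before shipping
};

const iterations = profiles[process.env.EVAL_PROFILE ?? "regular"] ?? 30;
await suite.run(agent, { iterations });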

Best Practices

Use Low Temperature

Lower temperature makes results more deterministic, which is usually what you want in tests:
const agent = new TestAgent({
  // ...
  temperature: 0.1,
});

Handle Rate Limits

Reduce concurrency for rate-limited APIs:
await suite.run(agent, {
  iterations: 30,
  concurrency: 2, // Avoid hitting rate limits
});

Test Edge Cases

Don’t just test the happy path:
suite.add(new EvalTest({
  name: "handles-empty-input",
  test: async (agent) => {
    const r = await agent.prompt("Add numbers"); // No numbers given
    return !r.hasError(); // Should handle gracefully
  },
}));

suite.add(new EvalTest({
  name: "handles-large-numbers",
  test: async (agent) => {
    const r = await agent.prompt("Add 999999999 and 1");
    return r.hasToolCall("add");
  },
}));

Set Quality Thresholds

Fail CI if accuracy drops below a threshold:
await suite.run(agent, { iterations: 30 });

if (suite.accuracy() < 0.90) {
  console.error(`Accuracy ${(suite.accuracy() * 100).toFixed(1)}% is below the 90% threshold`);
  process.exit(1);
}
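
The same gate can be applied per test with the accessors shown under Suite Results, so one strong test cannot mask a weak one (the 0.85 floor is illustrative):

// Collect every test below the per-test floor.
const failing = suite.getAll().filter((t) => t.accuracy() < 0.85);

for (const t of failing) {
  console.error(`${t.getName()}: ${(t.accuracy() * 100).toFixed(1)}% below 85% threshold`);
}

if (failing.length > 0) process.exit(1);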

Generate evals from the Inspector

You can also generate eval code from the MCPJam Inspector. Click ⋮ → Copy markdown for server evals on any server card, then paste it into an LLM. See the Quickstart for details. If you have an MCPJAM_API_KEY, the generated code will automatically save results to the Evals tab in the Inspector. Go to Settings > Workspace API Key to get your key.

Next Steps

  • Testing Across Providers: compare performance across LLMs
  • EvalTest Reference: full EvalTest API
  • EvalSuite Reference: full EvalSuite API
  • Save Results to MCPJam: auto-save, manual APIs, CI metadata, and artifact upload