# Why Run Evals?
A single test pass can be misleading:

- The LLM might get lucky on one attempt
- Temperature introduces randomness
- Different phrasings might fail where others succeed

Running the same test many times replaces anecdotes with statistical confidence:

- “This tool is called correctly 97% of the time”
- “Arguments are correct in 90% of cases”
- “Average latency is 1.2 seconds”
## EvalTest: Single Test Scenario
`EvalTest` runs one test function multiple times:
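The original code example is missing here. As a rough, self-contained sketch of the idea (this is not the real `EvalTest` API; the function and result field names below are illustrative assumptions, so consult the EvalTest Reference for the actual constructor and options):

```typescript
// Illustrative sketch only: run one async test function N times and
// report the pass rate, which is what EvalTest does conceptually.
type TestFn = () => Promise<boolean>;

async function runEvalSketch(test: TestFn, iterations: number) {
  let passes = 0;
  for (let i = 0; i < iterations; i++) {
    if (await test()) passes++; // each iteration is an independent attempt
  }
  return { iterations, passes, accuracy: passes / iterations };
}

// Usage: a trivial always-passing test over 10 iterations.
runEvalSketch(async () => true, 10).then((r) => {
  console.log(r.accuracy); // 1 for an always-passing test
});
```

The key point is that a single boolean outcome becomes a pass rate once the same test is repeated.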
## Writing Test Functions
Test functions receive an `EvalAgent` (implemented by `TestAgent` and mock agents) and return a boolean:
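A hedged sketch of what such a test function might look like. The `EvalAgent` interface below is a hypothetical minimal shape (the `prompt` method, `toolCalls` field, and `get_weather` tool are all assumed for illustration; the real interface is in the EvalTest Reference):

```typescript
// Hypothetical minimal shapes, assumed for illustration only.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface EvalAgent {
  prompt(text: string): Promise<{ toolCalls: ToolCall[] }>;
}

// A test function: drive the agent, return true if it behaved correctly.
async function callsWeatherTool(agent: EvalAgent): Promise<boolean> {
  const result = await agent.prompt("What's the weather in Paris?");
  return result.toolCalls.some((call) => call.name === "get_weather");
}

// A mock agent lets you check the test function itself without an LLM.
const mockAgent: EvalAgent = {
  async prompt() {
    return { toolCalls: [{ name: "get_weather", args: { city: "Paris" } }] };
  },
};
```

Because the function only depends on the interface, the same test runs against both real and mock agents.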
## Run Options
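The original options example was lost. As an illustrative sketch, the option names below (`iterations`, `concurrency`, `temperature`, `timeoutMs`) are assumptions, not confirmed API, so check the EvalTest Reference for the real knobs:

```typescript
// Illustrative option set; every field name here is an assumption.
interface RunOptionsSketch {
  iterations: number;   // how many times the test function runs
  concurrency: number;  // parallel iterations (lower for rate-limited APIs)
  temperature: number;  // forwarded to the model; low = more deterministic
  timeoutMs?: number;   // optional per-iteration timeout
}

const options: RunOptionsSketch = {
  iterations: 30,
  concurrency: 5,
  temperature: 0.1,
};
```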
## Metrics
After running, access various metrics:

## EvalSuite: Multiple Tests
Group related tests together:
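A self-contained sketch of the concept (this mimics what an `EvalSuite` does, running a named group of test functions and collecting per-test accuracy; the real class and its options are in the EvalSuite Reference):

```typescript
// Illustrative only: a "suite" is a named map of test functions,
// each run for the same number of iterations.
type TestFn = () => Promise<boolean>;

async function runSuiteSketch(
  tests: Record<string, TestFn>,
  iterations: number,
): Promise<Record<string, number>> {
  const results: Record<string, number> = {};
  for (const [name, test] of Object.entries(tests)) {
    let passes = 0;
    for (let i = 0; i < iterations; i++) {
      if (await test()) passes++;
    }
    results[name] = passes / iterations; // per-test accuracy
  }
  return results;
}
```

Grouping keeps related scenarios (e.g. all tests for one MCP server) reported together in a single run.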
## Save Results to MCPJam

Both `EvalTest` and `EvalSuite` can automatically save results to MCPJam when a run completes. Set `MCPJAM_API_KEY` in your environment and results are saved automatically:
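The upload itself happens inside the library; the only contract stated here is the environment variable. A minimal sketch of that contract (the helper name is ours, not part of the API):

```typescript
// Sketch of the environment contract: auto-save only activates
// when MCPJAM_API_KEY is present in the environment.
function autoSaveEnabled(env: Record<string, string | undefined>): boolean {
  return Boolean(env.MCPJAM_API_KEY);
}

// Usage in Node: autoSaveEnabled(process.env)
```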
## Suite Results
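The original results example is missing. As a hypothetical sketch (the result field names `name`, `accuracy`, and `avgLatencyMs` are assumptions; see the EvalSuite Reference for the real shape), one common thing to do with suite results is find the weakest test:

```typescript
// Hypothetical per-test result shape; all field names are assumed.
interface SuiteResultSketch {
  name: string;
  accuracy: number;     // passes / iterations for that test
  avgLatencyMs: number;
}

// Find the lowest-accuracy test so you know what to investigate first.
function worstTest(results: SuiteResultSketch[]): SuiteResultSketch {
  return results.reduce((min, r) => (r.accuracy < min.accuracy ? r : min));
}
```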
## Choosing Iteration Count
| Scenario | Iterations | Why |
|---|---|---|
| Quick smoke test | 10 | Fast feedback during development |
| Regular testing | 30 | Good statistical significance |
| Pre-release | 50-100 | High confidence before shipping |
| Benchmarking | 100+ | Comparing models or changes |
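As a rough sanity check on the table above (this is standard statistics, not something specific to MCPJam): the uncertainty of a pass-rate estimate shrinks with the square root of the iteration count, which is why going from 30 to 100+ iterations buys meaningfully tighter confidence.

```typescript
// Standard error of a binomial pass-rate estimate: sqrt(p * (1 - p) / n).
// At an observed pass rate of 0.9, 30 iterations give roughly +/- 5.5
// percentage points of standard error; 100 iterations give roughly +/- 3.
function standardError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}
```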
## Best Practices
### Use Low Temperature
More deterministic results for testing:
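A minimal sketch, assuming `temperature` is exposed as a run option (an assumption; see the EvalTest Reference for the actual option name):

```typescript
// Low temperature reduces sampling randomness, so failures are more
// likely to be real regressions rather than sampling noise.
const lowTempOptions = {
  temperature: 0.1, // assumed option name, for illustration
  iterations: 30,
};
```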
### Handle Rate Limits

Reduce concurrency for rate-limited APIs:
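Conceptually, lowering concurrency caps how many iterations are in flight at once. A generic, self-contained sketch of that mechanism (not MCPJam code):

```typescript
// Run async jobs with at most `limit` in flight at any time.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

With a rate-limited provider, a lower `limit` trades total runtime for fewer 429 responses.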
### Test Edge Cases

Don’t just test the happy path:
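For example (the `EvalAgent` shape and the clarification behavior below are illustrative assumptions, as in the earlier sketches), edge-case tests probe ambiguous or out-of-scope prompts:

```typescript
// Hypothetical minimal agent shape, assumed for illustration.
interface EvalAgent {
  prompt(text: string): Promise<{ toolCalls: { name: string }[] }>;
}

// Edge cases worth testing alongside the happy path.
const edgeCases = [
  "weather",                           // ambiguous: no location given
  "What's the weather on Mars?",       // out-of-scope location
  "Compare weather in Paris and Tokyo" // may need multiple tool calls
];

async function handlesAmbiguity(agent: EvalAgent): Promise<boolean> {
  const result = await agent.prompt(edgeCases[0]);
  // Here we assume a good agent asks for clarification rather than
  // guessing a location, i.e. it makes no tool call.
  return result.toolCalls.length === 0;
}
```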
### Set Quality Thresholds

Fail CI if accuracy drops below a threshold:
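A minimal sketch of the gating logic (where the accuracy value comes from your run's metrics; the helper name is ours):

```typescript
// Compare run accuracy against a floor; the caller decides how to fail.
function meetsThreshold(accuracy: number, floor: number): boolean {
  return accuracy >= floor;
}

// In a CI script you might then exit non-zero so the job fails, e.g.:
// if (!meetsThreshold(results.accuracy, 0.9)) process.exit(1);
```

A non-zero exit code is what makes CI systems mark the job, and therefore the pull request, as failed.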
## Generate evals from the Inspector

You can also generate eval code from the MCPJam Inspector. Click ⋮ → Copy markdown for server evals on any server card, then paste it into an LLM. See the Quickstart for details. If you have an `MCPJAM_API_KEY`, the generated code will automatically save results to the Evals tab in the Inspector. Go to Settings > Workspace API Key to get your key.
## Next Steps
- **Testing Across Providers**: Compare performance across LLMs
- **EvalTest Reference**: Full EvalTest API
- **EvalSuite Reference**: Full EvalSuite API
- **Save Results to MCPJam**: Auto-save, manual APIs, CI metadata, and artifact upload

