## Why Test Multiple Providers?
Your users connect to MCP servers from various clients:

- Claude Desktop (Anthropic)
- ChatGPT plugins (OpenAI)
- Cursor (various models)
- Custom apps (any provider)

Each client pairs your server with different models, and models differ in:

- Tool calling capabilities
- Interpretation of descriptions
- Handling of complex arguments
- Response patterns
## Supported Providers

The SDK supports 9 providers out of the box:

| Provider | Model Format | Example |
|---|---|---|
| Anthropic | anthropic/model | anthropic/claude-sonnet-4-20250514 |
| OpenAI | openai/model | openai/gpt-4o |
| Google | google/model | google/gemini-1.5-pro |
| Azure | azure/model | azure/gpt-4o |
| Mistral | mistral/model | mistral/mistral-large-latest |
| DeepSeek | deepseek/model | deepseek/deepseek-chat |
| Ollama | ollama/model | ollama/llama3 |
| OpenRouter | openrouter/org/model | openrouter/anthropic/claude-3-opus |
| xAI | xai/model | xai/grok-beta |
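Every identifier in the table follows the `provider/model` pattern. As an illustration of the format only (`parse_model_id` is a hypothetical helper, not part of the SDK), splitting at the first slash keeps OpenRouter's nested `org/model` segment intact:

```python
def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model' identifier into (provider, model).

    Splitting at the first '/' preserves OpenRouter's nested segment:
    'openrouter/anthropic/claude-3-opus'
    -> ('openrouter', 'anthropic/claude-3-opus').
    """
    provider, _, model = model_id.partition("/")
    if not model:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, model

print(parse_model_id("anthropic/claude-sonnet-4-20250514"))
print(parse_model_id("openrouter/anthropic/claude-3-opus"))
```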
## Comparing Providers
Create agents for each provider and run the same tests:
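A minimal comparison loop might look like the sketch below. `run_case` is a stub standing in for whatever agent call the SDK actually exposes, so these names are assumptions, not the real API; a real version would create an agent for `model`, send the prompt, and return the name of the tool the model called.

```python
# Sketch of a cross-provider comparison loop (run_case is a stub).
MODELS = [
    "anthropic/claude-sonnet-4-20250514",
    "openai/gpt-4o",
    "google/gemini-1.5-pro",
]

CASES = [
    {"prompt": "Add 2 and 3", "expected_tool": "add"},
    {"prompt": "What's the weather in Paris?", "expected_tool": "get_weather"},
]

def run_case(model: str, prompt: str) -> str:
    # Stub result so the loop runs offline; replace with a real agent call.
    return "add" if "Add" in prompt else "get_weather"

def accuracy(model: str) -> float:
    # Fraction of cases where the model called the expected tool.
    hits = sum(run_case(model, c["prompt"]) == c["expected_tool"] for c in CASES)
    return hits / len(CASES)

for model in MODELS:
    print(f"{model}: {accuracy(model):.0%}")
```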
### Provider Comparison Script

A complete script for benchmarking:
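A benchmark usually repeats each suite several times and reports the mean and spread per model. This runnable sketch stubs out `run_suite` (a real one would execute every test case against the model and return its accuracy):

```python
import random
import statistics

def run_suite(model: str, rng: random.Random) -> float:
    # Stub for one full eval run; a real run returns tool-call accuracy.
    return min(1.0, max(0.0, rng.gauss(0.9, 0.05)))

def benchmark(models: list[str], iterations: int = 5, seed: int = 0) -> dict:
    # Run each model `iterations` times and summarize the scores.
    rng = random.Random(seed)
    report = {}
    for model in models:
        scores = [run_suite(model, rng) for _ in range(iterations)]
        report[model] = {
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
        }
    return report

report = benchmark(["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"])
for model, stats in report.items():
    print(f"{model}: mean={stats['mean']:.2f} stdev={stats['stdev']:.2f}")
```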
### Custom Providers

Add your own OpenAI- or Anthropic-compatible endpoints:
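The exact registration API depends on the SDK, so treat the shape below as an assumption; the essential pieces for any OpenAI-compatible endpoint are a base URL, an API key, and the model names it serves:

```python
import os

# Hypothetical description of a custom OpenAI-compatible endpoint.
# Field names are illustrative; check your SDK for the real schema.
custom_provider = {
    "name": "my-llm",
    "type": "openai-compatible",  # endpoint speaks the OpenAI chat API
    "base_url": "https://llm.example.internal/v1",
    "api_key": os.environ.get("MY_LLM_API_KEY", ""),  # never hard-code keys
    "models": ["my-llm/custom-model-v1"],
}

def validate(provider: dict) -> None:
    # Cheap sanity checks before wiring the endpoint into a test run.
    assert provider["base_url"].startswith("https://"), "use TLS for remote endpoints"
    prefix = provider["name"] + "/"
    assert all(m.startswith(prefix) for m in provider["models"]), (
        "model ids should be namespaced by the provider name"
    )

validate(custom_provider)
```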
### LiteLLM Proxy

Test many models through a single proxy:
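With a LiteLLM proxy in front, every model is reachable through one OpenAI-compatible endpoint (the proxy listens on port 4000 by default), so only the model name changes between runs. The helper below is illustrative; the names must match `model_name` entries in your proxy config:

```python
# Illustrative only: route every test through one LiteLLM proxy endpoint.
PROXY_BASE_URL = "http://localhost:4000/v1"

def endpoint_for(model: str) -> dict:
    # Same base URL for every model; the proxy dispatches by model name.
    return {"base_url": PROXY_BASE_URL, "model": model}

for model in ["gpt-4o", "claude-sonnet-4", "llama3"]:
    print(endpoint_for(model))
```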
## Interpreting Results

When comparing providers, look for:
### Consistent High Performance

If all providers score above 90%, your tool descriptions are clear and well documented.
### One Provider Struggling

If Claude works but GPT-4 doesn't, your descriptions might rely on Claude-specific patterns. Review and generalize them.
### All Providers Struggling

Low accuracy across the board suggests ambiguous tool names or descriptions. Improve your MCP server's documentation.
### High Variance

If the same provider gives 70% one run and 95% the next, try:

- Lower temperature
- More iterations
- Clearer prompts in tests
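The reason more iterations help can be seen in the standard error of the measured accuracy, which shrinks with the square root of the run count (the scores below are made up for illustration):

```python
import statistics

# Illustrative scores from repeated runs of the same suite (not real data).
scores = [0.70, 0.95, 0.85, 0.90, 0.80]

mean = statistics.mean(scores)
# Standard error of the mean: sample stdev / sqrt(n). Quadrupling the
# number of runs halves it, so noisy suites need more iterations.
sem = statistics.stdev(scores) / len(scores) ** 0.5
print(f"mean={mean:.2f} +/- {sem:.2f} over n={len(scores)} runs")
```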
## Next Steps

- **LLM Providers Reference**: all providers and configuration options
- **Running Evals**: statistical evaluation basics

