I’ve been running AI workloads in production across multiple providers since late 2024. Here’s my honest assessment of the major players as of April/May 2025 - not benchmarks, but actual experience building and running things.
Gemini
What’s good: Fast, large context windows, competitive pricing. For high-volume workloads where cost matters, Google’s pricing is hard to beat.
What’s frustrating:
- Non-standard API design. You’ll spend time learning Gemini-specific patterns that don’t transfer to other providers.
- GCP has no usable cost-monitoring dashboard for Gemini. Tracking spend requires workarounds.
- High bug rate. I’ve hit more API-level issues with Gemini than any other provider.
- Reasoning tokens aren’t exposed via the API. You can’t see the model’s chain-of-thought, which matters for debugging.
- Token caching requires separate API calls rather than being handled automatically.
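Given the lack of a usable spend dashboard, one workaround is tracking costs locally from the token counts each response reports. A minimal sketch, with a hypothetical model name and illustrative per-million-token prices (check current provider pricing before relying on these numbers):

```python
from dataclasses import dataclass, field

# Illustrative per-million-token prices; not real quotes.
PRICES = {
    "gemini-flash": {"input": 0.10, "output": 0.40},
}

@dataclass
class SpendTracker:
    """Accumulates token usage per model and estimates cost locally."""
    totals: dict = field(default_factory=dict)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        p = PRICES[model]
        cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
        self.totals[model] = self.totals.get(model, 0.0) + cost
        return cost

tracker = SpendTracker()
tracker.record("gemini-flash", input_tokens=120_000, output_tokens=4_000)
print(f"${tracker.totals['gemini-flash']:.4f}")  # prints $0.0136 at these rates
```

Feeding this from the usage metadata on each API response gives you a running total you can alert on, independent of whatever the provider's console shows.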
For pure cost efficiency on simpler tasks, Gemini is compelling. For production reliability and developer experience, it’s frustrating.
Claude
What’s good: High-quality outputs. Claude consistently produces well-structured, thoughtful responses, particularly on complex reasoning tasks. The developer documentation is genuinely good.
What’s frustrating:
- Approximately 3× more expensive than alternatives for comparable tasks.
- Claude 3.7’s verbosity is a real cost driver. The model has a tendency toward long, thorough responses that add tokens without always adding value. You need to work explicitly against this in your prompts.
- About 85% API reliability in production. The other 15% requires error handling and retry logic that you may not have built for providers with higher uptime.
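At roughly 85% reliability, retry logic is mandatory rather than optional. A minimal, provider-agnostic sketch using exponential backoff with full jitter (the exception types and delays are placeholders to adapt to your client library):

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.5,
                      retriable=(TimeoutError, ConnectionError)):
    """Retry a flaky zero-argument API call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter spreads retries out and avoids thundering herds.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Usage: result = call_with_retries(lambda: client.complete(prompt))
```

Wrapping every call site this way up front is much cheaper than retrofitting it after the first production incident.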
For workloads where quality is the primary constraint, Claude is often the right choice. Budget accordingly.
Grok
What’s good: Interesting capabilities, particularly for certain types of reasoning tasks. The pricing was competitive when I tested it.
What’s frustrating:
- Grok 3 Mini outperformed Grok 3 on most of my benchmarks. Paying for the larger model gave worse results.
- The “fast” variants were actually slower in practice and cost 50% more. I never figured out whether this was a labeling issue or a real infrastructure problem.
- Significant delays between feature announcements and API availability. Features demoed in Grok’s consumer app regularly took weeks or months to appear in the API.
Grok has potential, but the operational inconsistencies make it difficult to plan around.
OpenAI
What’s good: The most reliable API in production. Uptime is consistently better than competitors. The ecosystem around OpenAI’s APIs is the most mature - more tools, more documentation, more community knowledge.
What’s frustrating:
- o1-pro is poor value. The price increase over o4-mini is approximately 100×, and on most tasks o4-mini performs better. There’s a specific class of very deep reasoning problems where o1-pro is the right tool, but it’s a narrow class.
- Markdown formatting requires specific workarounds. There are particular system prompt strings you need to include to get consistent markdown output. It works once you know the trick, but it’s friction that shouldn’t exist.
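The specific prompt strings aren't reproduced here, but the general shape of the workaround is pinning formatting rules in a system message and prepending it to every request. A generic sketch (the prompt text below is illustrative, not the known-working strings):

```python
# Illustrative formatting rules; the exact workaround strings vary.
MARKDOWN_SYSTEM_PROMPT = (
    "Format all responses as GitHub-flavored Markdown. "
    "Use fenced code blocks with language tags for any code."
)

def with_formatting(messages: list) -> list:
    """Prepend the formatting system message unless one is already present."""
    if messages and messages[0].get("role") == "system":
        return messages
    return [{"role": "system", "content": MARKDOWN_SYSTEM_PROMPT}] + messages
```

Centralizing this in one helper keeps the trick out of individual call sites, so when the required string changes you update it in one place.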
For production reliability and ecosystem maturity, OpenAI is still the safe choice.
Mistral
What’s good: Mistral has been a genuine pioneer in open-weight models. Their commitment to releasing open models matters for the ecosystem.
What’s frustrating:
- Their consumer app runs on models served through a private Cerebras arrangement that external developers can't access. The result: the app is dramatically faster than the API. I measured API responses up to 80× slower than the same prompts in the Mistral app.
- This directly contradicts their “open” marketing. If the best performance is locked behind a proprietary arrangement that external developers can’t access, the platform isn’t actually open.
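Measurements like the 80× figure above are easy to reproduce with a small timing harness. A sketch; the callables you pass in (API client call vs. whatever proxy you have for app-side latency) are your own:

```python
import statistics
import time

def measure_latency(fn, runs: int = 10) -> float:
    """Median wall-clock seconds for a zero-argument callable."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # e.g. a lambda wrapping one completion request
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# slowdown = measure_latency(api_call) / measure_latency(reference_call)
```

Using the median rather than the mean keeps one slow outlier request from distorting the comparison.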
Mistral’s open-weight releases are valuable. Their commercial API product is a different story.
Practical Evaluation Framework
When choosing a provider for a new workload, I evaluate:
- Budget predictability: Can I model my costs accurately? Surprise bills are worse than predictable high bills.
- Reliability requirements: What’s my tolerance for API failures? Build your retry logic before you need it.
- Response formatting: Does the model follow instructions consistently? Format compliance varies more than you’d expect.
- API implementation quality: How well-documented is the API? How mature is the client library?
- Support and community: When something goes wrong, can you find help?
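The five criteria above can be made concrete as a weighted scorecard. A sketch; the criterion names, weights, and scores below are illustrative, not a verdict on any provider:

```python
CRITERIA = ["budget", "reliability", "formatting", "api_quality", "support"]

def score_provider(scores: dict, weights: dict) -> float:
    """Weighted average of 1-5 criterion scores; weights encode your constraints."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight

# Illustrative numbers only:
weights = {"budget": 3, "reliability": 5, "formatting": 2,
           "api_quality": 4, "support": 2}
example = {"budget": 4, "reliability": 5, "formatting": 4,
           "api_quality": 4, "support": 5}
print(round(score_provider(example, weights), 2))
```

The useful part isn't the arithmetic, it's being forced to write the weights down: a team that rates reliability 5 and budget 3 should land on a different provider than one with those weights reversed.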
The “best” provider depends on your specific constraints. There isn’t a universal answer - but knowing which constraints matter most to your use case makes the choice much clearer.