It conveys a real-world performance ranking of frontier AI models (Claude, GPT variants, Gemini, and others) specifically on end-to-end coding tasks. Unlike many synthetic or lab-based AI benchmarks (e.g., SWE-Bench), this one uses actual user data from cto.new, measuring how often the AI successfully completes tasks that result in merged code (i.e., pull requests accepted into repositories).
Key points it emphasizes:
Metric: Success rate = percentage of completed tasks that lead to merged code.
Methodology: Based on real usage by thousands of developers; computed over a rolling window (e.g., 4-day or 72-hour) with a lag so pull requests have time to resolve; excludes low-usage models and non-merging teams for reliability (see the sketch after this list).
Purpose: To show which models perform best in practical, production-like coding workflows (using agentic tools such as file reading/writing, grep, and terminal commands), rather than in hypothetical tests.
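To make the metric concrete, here is a minimal sketch of how such a rolling-window, lagged success rate could be computed. The task log schema, field names, window/lag values, and minimum-task threshold are all illustrative assumptions, not cto.new's actual pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical task records: each completed task notes the model used,
# when it finished, and whether its pull request was eventually merged.
tasks = [
    {"model": "model-a", "completed_at": datetime(2025, 11, 20, 9), "merged": True},
    {"model": "model-a", "completed_at": datetime(2025, 11, 21, 14), "merged": False},
    {"model": "model-b", "completed_at": datetime(2025, 11, 22, 11), "merged": True},
]

def success_rates(tasks, now, window=timedelta(days=4),
                  lag=timedelta(hours=72), min_tasks=1):
    """Success rate per model over a rolling window, skipping tasks still
    inside the resolution lag (their PRs may not have had time to merge)
    and models with too few tasks to be reliable."""
    cutoff_new = now - lag            # tasks newer than this are still unresolved
    cutoff_old = cutoff_new - window  # tasks older than this fall out of the window
    counts, merged = {}, {}
    for t in tasks:
        if cutoff_old <= t["completed_at"] <= cutoff_new:
            counts[t["model"]] = counts.get(t["model"], 0) + 1
            merged[t["model"]] = merged.get(t["model"], 0) + (1 if t["merged"] else 0)
    return {
        m: 100.0 * merged[m] / counts[m]
        for m in counts
        if counts[m] >= min_tasks
    }

print(success_rates(tasks, now=datetime(2025, 11, 26)))
# {'model-a': 50.0, 'model-b': 100.0}
```

The key design point is the lag: a task completed five minutes ago cannot fairly count as "not merged," so recent tasks are held out until their pull requests have had a chance to resolve.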
Recent top rankings (from the late-2025 data shown on the page):
Claude Sonnet 4.5 – 88.3%
Devstral 2 – 86.9%
MiniMax M2 – 84.5%
Gemini 3 Pro Preview – 84.4%
GPT 5.2 – 84.2%
Overall, the cto.new platform leverages top-performing models for real developer work while providing transparent, usage-based evidence of model effectiveness in agentic coding scenarios. It's positioned as more representative of "what matters" in day-to-day engineering than traditional benchmarks.