It conveys a real-world performance ranking of frontier AI models (Claude, GPT variants, Gemini, and others) specifically on end-to-end coding tasks. Unlike many synthetic or lab-based AI benchmarks (e.g., SWE-Bench), this one uses actual user data from cto.new, measuring how often the AI successfully completes tasks that result in merged code (i.e., pull requests accepted into repositories).
Key points it emphasizes:
Metric: Success rate = percentage of completed tasks that lead to merged code.
Methodology: Based on real usage by thousands of developers; computed over a rolling window (e.g., 4-day or 72-hour) with a lag so pull requests have time to resolve; excludes low-usage models and non-merging teams for reliability (see the sketch after this list).
Purpose: To show which models perform best in practical, production-like coding workflows (using agentic tools such as file reading/writing, grep, and terminal commands), rather than in hypothetical tests.
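To make the metric concrete, here is a minimal sketch of how such a rolling-window, lagged success rate could be computed. The task log schema, field names, window/lag values, and minimum-task threshold are all illustrative assumptions, not cto.new's actual pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical task records: each completed task notes the model used,
# when it finished, and whether its pull request was eventually merged.
tasks = [
    {"model": "model-a", "completed_at": datetime(2025, 11, 20, 9), "merged": True},
    {"model": "model-a", "completed_at": datetime(2025, 11, 21, 14), "merged": False},
    {"model": "model-b", "completed_at": datetime(2025, 11, 22, 11), "merged": True},
]

def success_rates(tasks, now, window=timedelta(days=4),
                  lag=timedelta(hours=72), min_tasks=1):
    """Success rate per model over a rolling window, skipping tasks still
    inside the resolution lag (their PRs may not have had time to merge)
    and models with too few tasks to be reliable."""
    cutoff_new = now - lag            # tasks newer than this are still unresolved
    cutoff_old = cutoff_new - window  # tasks older than this fall out of the window
    counts, merged = {}, {}
    for t in tasks:
        if cutoff_old <= t["completed_at"] <= cutoff_new:
            counts[t["model"]] = counts.get(t["model"], 0) + 1
            merged[t["model"]] = merged.get(t["model"], 0) + (1 if t["merged"] else 0)
    return {
        m: 100.0 * merged[m] / counts[m]
        for m in counts
        if counts[m] >= min_tasks
    }

print(success_rates(tasks, now=datetime(2025, 11, 26)))
# {'model-a': 50.0, 'model-b': 100.0}
```

The key design point is the lag: a task completed five minutes ago cannot fairly count as "not merged," so recent tasks are held out until their pull requests have had a chance to resolve.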
Recent top rankings (from the late-2025 data shown on the page):
Claude Sonnet 4.5 – 88.3%
Devstral 2 – 86.9%
MiniMax M2 – 84.5%
Gemini 3 Pro Preview – 84.4%
GPT 5.2 – 84.2%
Overall, the cto.new platform leverages top-performing models for real developer work while providing transparent, usage-based evidence of model effectiveness in agentic coding scenarios. It's positioned as more representative of "what matters" in day-to-day engineering than traditional benchmarks.