o3 and o4-mini performance in math and science

Hey there!

Welcome back to The Pulse, where we dive into interesting AI stories and trends backed by data, all presented through simple visuals.

> OpenAI released reasoning models o3 and o4-mini today

> biggest update: the models can independently use any tool + think with images (visual reasoning)

> new SOTA scores on benchmarks - o4-mini hits 99.5% on AIME '25 with tool use, saturating the benchmark

> o3 narrowly beats Gemini 2.5 (the current #1) across benchmarks:

  • Aider: ~15% improvement

  • SWE bench verified: ~8% better

  • Humanity's Last Exam: 8% better without tools

> o3 sets the highest Aider Polyglot score yet

> other benchmarks also show exceptional coding ability:

  • Codeforces Elo: 2706 (o3) & 2719 (o4-mini) - highest yet

  • SWE bench: 69.1% (o3) & 68.1% (o4-mini) - Claude 3.7 scores ~1.5% higher, but only with custom scaffolding

> also released Codex CLI: an open-source coding agent that runs locally on your machine + turns natural language into working code

> as of Oct '24, OpenAI had 5x the annualized revenue of Anthropic

> but the gap in API revenue is much narrower (~$1.3B vs ~$800M)

> Claude is still the programming favorite per OpenRouter's LLM rankings

> but recent progress by Gemini and OpenAI models threatens Claude's longstanding position as the developer favorite