o3 and o4-mini performance in math and science

Hey there!

Welcome back to The Pulse, where we dive into interesting AI stories and trends backed by data, all presented through simple visuals.

> OpenAI released reasoning models o3 and o4-mini today

> biggest update: the models can independently use any tool + think with images (visual reasoning)

> new SOTA scores on benchmarks - o4-mini hits 99.5% on AIME '25 with tool use, saturating the benchmark

> o3 narrowly beats Gemini 2.5 (the current #1) across benchmarks:

  • Aider: ~15% improvement

  • SWE bench verified: ~8% better

  • Humanity's Last Exam: 8% better without tools

> o3 sets the highest Aider Polyglot score yet

> other benchmarks also show exceptional coding ability:

  • Codeforces Elo: 2706 (o3) & 2719 (o4-mini) - highest yet

  • SWE bench: 69.1% (o3) & 68.1% (o4-mini) - Claude 3.7 scores ~1.5% higher, but only with custom scaffolding

> also released Codex CLI: an open-source coding agent that runs locally on your machine + turns natural language into working code

> as of Oct '24, OpenAI had 5x the annualized revenue of Anthropic

> but the gap in API revenue is much narrower (~$1.3B vs ~$800M)

> Claude is still the programming favorite per OpenRouter's LLM rankings

> but recent progress by Gemini and OpenAI models threatens Claude's longstanding position as the developer favorite