o3 and o4-mini performance in math and science
Hey there!
Welcome back to The Pulse, where we dive into interesting AI stories and trends backed by data, all presented through simple visuals.

> OpenAI released reasoning models o3 and o4-mini today
> biggest update: models independently use any tools + think with images (visual reasoning)
> SOTA score on benchmarks - 99.5% by o4-mini on AIME '25 with tool use, saturating the benchmark
> o3 beats Gemini 2.5 (current #1) across benchmarks:
Aider: ~15% improvement
SWE bench verified: ~8% better
Humanity's Last Exam: 8% better without tools

> o3 reaches the highest Aider Polyglot score yet
> other benchmarks also point to exceptional coding ability:
Codeforces Elo: 2706 (o3) & 2719 (o4-mini) - highest yet
SWE bench: 69.1 (o3) & 68.1 (o4-mini) - Claude 3.7 ~1.5% better (only with custom scaffolding)
> also released Codex CLI: an open-source coding agent that runs locally and turns natural language into working code

> as of Oct '24, OpenAI's annualized revenue was ~5x Anthropic's
> but API revenue is much closer (~$1.3B vs ~$800M)
> Claude is still the programming favorite per OpenRouter's LLM rankings
> but recent progress by Gemini and OpenAI models threatens Claude's longstanding position as developer favorite