- The Pulse by 42neurons
Gemini 2.5 Pro Exp's performance in math and science
Hey there!
Welcome back to The Pulse, where we dive into interesting AI stories and trends backed by data, all presented through simple visuals.

> Google released Gemini 2.5 two days ago (Mar 25)
> multimodal + hybrid (includes reasoning)
> #1 on the LMArena leaderboard; largest score jump ever: from 1380 (Gemini 2.0) to 1443, beating Grok 3 by 39 pts
> without any test-time optimizations, beats all other models on GPQA & AIME
> 18.8% on Humanity's Last Exam, the highest among models without tool use
> best Gemini coding model yet - tops Aider's polyglot leaderboard
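To put the 39-point LMArena lead in perspective: arena scores are Elo-style ratings, so a rating gap maps to an expected head-to-head win rate. Assuming the standard Elo formula (LMArena's actual fitting uses a closely related Bradley-Terry model), a quick sketch:

```python
def elo_win_prob(rating_gap: float) -> float:
    """Expected win probability for the higher-rated model,
    given the rating gap, under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# Gemini 2.5's 39-point lead over Grok 3 on LMArena
print(f"{elo_win_prob(39):.1%}")  # ~55.6% expected win rate head-to-head
```

In other words, even the "largest jump ever" corresponds to winning a bit over half of direct matchups, which is how tightly packed the top of the leaderboard is.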

> a difficult LLM benchmark (Humanity's Last Exam) comprising 3,000 questions across 100+ subjects, created with ~1,000 experts
> exponential growth in AI scores: from <5% to 26.6% (Deep Research) in under a year
> unlike other, saturated benchmarks, this test still challenges models
> follows a familiar pattern: recent rapid progress suggests eventual benchmark saturation

> a survey of 730 coders and developers with 0-20+ years of experience
> shows the frequency of words in written responses
> positives hint at AI saving time & helping find solutions
> criticism highlights that AI might offer wrong solutions & struggle to understand context
> interestingly, "time", "writing" and "solutions" appear in both positive & negative contexts
> suggesting responses could be more task- & model-specific
> the study also shows:
>   freelance coders like AI more than full-time coders
>   more experienced coders are more likely to integrate AI into their coding environments