New DeepSeek V3's performance on the Aider polyglot benchmark

March 25, 2025

Hey there!

Welcome back to The Pulse, where we dive into interesting AI stories and trends backed by data, all presented through simple visuals.

> latest update to DeepSeek V3 out yesterday (Mar 24)

> scored 55% on the difficult Aider polyglot benchmark; big 14% improvement on prev version

> major jump in performance across benchmarks + huge improvement across coding tasks

> users claiming it's either the best non-reasoning model or only behind Claude 3.7 Sonnet

> Harvard experiment involving 776 professionals at Procter & Gamble

> avg performance of individuals working with AI = avg performance of teams working without AI

> shows AI functions as an effective second teammate

> teams using AI are 9.2 percentage points more likely to produce top 10% exceptional solutions

> both AI-enabled groups worked 12-16% faster than non-AI groups with longer + more detailed solutions

> today's AI solves 1-hour human tasks; week-long tasks in 2-4 years if doubling trend continues

> Epoch claims 1-4 doublings per year

> looking at only 2024-2025 models -> trend accelerates even more

> on SWE-Bench (real software engineering tasks): 70-day doubling time

> experts argue AI can't perform all 1-hour tasks but agree on general exponential trend
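
For anyone who wants to sanity-check the doubling math above, here is a minimal back-of-the-envelope sketch. It assumes a "week-long task" means 40 working hours (my assumption, not a figure from the study), and the function name `years_to_reach` is made up for illustration. The rates plugged in are the ones quoted above: 1-4 doublings per year and the 70-day doubling time on SWE-Bench.

```python
import math

# Assumption: a "week-long" human task is about 40 working hours.
WEEK_HOURS = 40


def years_to_reach(start_hours, target_hours, doublings_per_year):
    """Years until task length grows from start_hours to target_hours,
    given a constant number of doublings per year (hypothetical helper)."""
    doublings_needed = math.log2(target_hours / start_hours)
    return doublings_needed / doublings_per_year


# Rates quoted in the notes above, converted to doublings per year.
scenarios = {
    "1 doubling per year": 1,
    "4 doublings per year": 4,
    "70-day doubling time": 365 / 70,
}

for label, rate in scenarios.items():
    years = years_to_reach(1, WEEK_HOURS, rate)
    print(f"{label}: {years:.1f} years from 1-hour to week-long tasks")
```

Going from 1 hour to 40 hours takes a bit over 5 doublings, so the answer depends almost entirely on which doubling rate you believe. The 2-4 year figure sits inside the range you get from 1-4 doublings per year, and the faster rates pull the date in considerably.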