AI Today, 18 Feb 2025

REVEALED: The truth about AI coding

Imagine a world where software engineers are replaced by other software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model, an LLM, could actually earn a living as a freelance software engineer - the ultimate test of practical AI coding ability.

Think about it. As a freelancer, you're judged solely on your output. There's no hiding behind a team. You have to deliver, or you don't get paid. So, the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often, making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models of the time, including the one many coders consider the best of the bunch, struggled to complete even a fifth of the tasks a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But what a model can do in theory and what it delivers on a real job are still two very different things.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch - but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.


