REVEALED: The truth about AI coding
AI Today - 18 February 2025


Imagine a world where software engineers are replaced by other software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model (LLM) could actually earn a living as a freelance software engineer - the ultimate test of practical AI coding ability.

Think about it. As a freelancer, you're judged solely on your output. There's no hiding behind a team. You deliver, or you don't get paid. So the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often, making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models of the time, including the one many coders consider the best of the bunch, failed to complete even a fifth of the coding tasks a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But it highlights a crucial gap – the gap between theoretical capability and practical application.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.

Episodes (90)

What happens when AI fires all the hirers?


Recruitment is being radically remodelled by AI. And according to a brand new piece of research, AI is already humiliating humans at hiring. Hear the story behind the headlines that AI-led interviews in...

21 August 2025 - 1h 5min

DeepMind's wet dream is M3-Agent's reality: how long-term multimodal memory is modelling the real world


Google DeepMind's Demis Hassabis and his team have a bold mission: penetrating the 4D chess game that's AI embracing our ever-changing biological, physical world. Taking a snapshot is one thing. Rememb...

17 August 2025 - 56min

The AI revelation: unlocking simpler, superior LLMs


Wrestling with the 'Wild West' of Large Language Models (LLMs)? While LLMs are poised to redefine business, the crucial 'secret sauce' of reinforcement learning (RL) has become a labyrinth of conflicti...

12 August 2025 - 40min

Faster, Smarter, Better: How vibe coding transforms product development


Businesses are looking at vibe coding all wrong. They're trying to brute force products using 0 engineers, all vibe coding. It's a bugger's muddle. You can't win. AI doesn't understand you, your custom...

11 August 2025 - 53min

Secrets of writing with AI - from a 30-year journalist


That journalist is me, your host and producer of AI Today - Dave Thackeray. I was approached by a researcher from the data labs at London School of Economics who wanted to find out how writing had chan...

1 August 2025 - 47min

ASI made easy?


ASI-ARCH is an Artificial Superintelligence (ASI) that's a game-changer for AI research. Like a tireless super-scientist, it has autonomously invented 106 ground-breaking AI 'brains', unearthing surpri...

1 August 2025 - 16min

The secret of AI mastery that no one wants to share...


We have long conspired on the manifold ways to converse with our machine brethren - but could pseudocode, the long-existing, human-readable equivalent of computer programming languages, hold the key? T...

15 June 2025 - 52min

Meet the team: AI agents running The Grand Serenity Hotel


I just finished the second part of my presentation on agentic AI in hotel operations. It's impossible to overlook the immense opportunities in AI across any business. People don't have time, and have t...

5 June 2025 - 14min