REVEALED: The truth about AI coding
AI Today, 18 February 2025

Imagine a world where human software engineers are replaced by entirely digital ones.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model (LLM) could actually earn a living as a freelance software engineer.

Think about it. Freelancing is the ultimate test. You're judged solely on your output. There's no hiding behind a team. You have to deliver, or you don't get paid. So, the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often, making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models of the time, including the one many coders consider the best of the bunch, failed to complete even a fifth of the coding tasks a human freelancer would routinely handle.
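
For anyone who wants the arithmetic behind figures like these, a pass rate is simply the number of tasks solved divided by the number attempted. The sketch below is illustrative only: the `pass_rate` helper and the list of outcomes are hypothetical, and neither comes from the SWE-Lancer study.

```python
def pass_rate(outcomes):
    """Return the percentage of tasks marked as solved (True)."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for ten tasks: True means the submission passed.
# These values are invented for illustration, not taken from the study.
example_outcomes = [True, False, False, False, False,
                    False, False, False, False, False]

print(f"{pass_rate(example_outcomes):.1f}% of tasks solved")
```

One solved task out of ten gives 10.0%, which is roughly the level reported for the coding tasks above.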

This isn't to say AI is useless in software engineering. Far from it. But what a model can do in theory has yet to translate into work a paying client would accept.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch, but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.
