REVEALED: The truth about AI coding
AI Today · 18 February 2025

Imagine a world where software engineers are replaced by other software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model, an LLM, could actually earn a living as a freelance software engineer - the ultimate test of practical AI coding ability.

Think about it. As a freelancer, you're judged solely on your output. There's no hiding behind a team. You deliver, or you don't get paid. So the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.
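To make the "deliver, or you don't get paid" idea concrete, here is a minimal sketch of how a pay-or-nothing benchmark could be scored. It is an illustration only, not the researchers' actual harness: the `TaskResult` class, the `score` function, the task names and the dollar amounts are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One freelance job attempted by a model (hypothetical structure)."""
    task_id: str
    payout_usd: float  # what the job pays if the work is accepted
    passed: bool       # True only if the submitted work was accepted


def score(results):
    """Return (pass_rate, earned_usd) under all-or-nothing payment."""
    if not results:
        return 0.0, 0.0
    pass_rate = sum(r.passed for r in results) / len(results)
    # No partial credit: a failed job pays nothing.
    earned = sum(r.payout_usd for r in results if r.passed)
    return pass_rate, earned


# Invented example data, not figures from the study.
results = [
    TaskResult("fix-login-bug", 250.0, True),
    TaskResult("add-export-feature", 1000.0, False),
    TaskResult("refactor-payments", 4000.0, False),
    TaskResult("update-copy", 50.0, True),
]

pass_rate, earned = score(results)
available = sum(r.payout_usd for r in results)
print(f"pass rate: {pass_rate:.0%}")
print(f"earned: ${earned:,.2f} of ${available:,.2f}")
```

In this invented example the model passes half the jobs but collects only $300 of the $5,300 on offer, because the jobs it fails are the expensive ones. That is why counting money earned can tell a different story from counting tasks passed.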

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often, making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models available at the time, including the one many coders consider the best of the bunch, struggled to complete even a fifth of the tasks a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But it highlights a crucial gap – the gap between theoretical capability and practical application.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch - but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.
