REVEALED: The truth about AI coding
AI Today, 18 February 2025

Imagine a world where software engineers are replaced by other software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model (LLM) could actually earn a living as a freelance software engineer.

Think about it. Freelancing is the ultimate test. You're judged solely on your output. There's no hiding behind a team. You have to deliver, or you don't get paid. So, the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models of the time, including the one many coders consider the best of the bunch, struggled to complete even a fifth of the tasks a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But it highlights a crucial gap – the gap between theoretical capability and practical application.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch - but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.

Episodes (90)

Your Data, Your AI: Unlock the Power of Decentralised Learning

Navigating the high costs and data challenges of cloud-based AI is a significant barrier for many businesses looking to innovate. But there's a powerful, practical alternative emerging. This episode exp...

10 May 2025, 14 min

When full stack AI businesses rule the world...

Fasten your seatbelts, business leaders! We're diving deep into Y Combinator's Summer 2025 Request for Startups, their signal flare for what's NEXT in innovation. 2025 is shaping up to be the year of th...

9 May 2025, 14 min

How to get your ideas heard at work

I'd just about had it with bosses choosing to hear your ideas spoken by consulting firms - when they could have saved a fortune listening to them coming from their creator, many months ago. Now, with A...

3 May 2025, 13 min

Dogfooding The Era of Experience with Mobility AI

On the last episode we discussed a new way to train AI models: themselves, by capturing signals and insights from our world. Today we look at one such approach - Mobility AI, another Google initiative ...

24 April 2025, 12 min

Where AI goes next: The Age of Experience

Now generative AI has inhaled all human knowledge, it's time to create its own. We review a very exciting new paper, called The Age of Experience, that explains how AI agents will create their own dat...

21 April 2025, 18 min

How to create an annual report with AI

I built a team of AI agents to create an annual report - one of the journalist's worst nightmares. And it did a remarkable job. Read all about it: https://medium.com/@DaveThackeray/how-to-create-an-annu...

15 April 2025, 17 min

Do everything faster, and smarter - with Google's A2A

Are your AI agents brilliant but lonely? Do they operate in isolation, unable to tap into data and capabilities across your organisation, hindering your potential for true automation and growth? Then ge...

14 April 2025, 15 min

How to avoid being scammed by AI

We're seeing a continuing growth in the number of duplicitous attacks by AI agents on individuals. Previously cyber criminals focused most of their efforts where the greatest gains were to be made - ph...

14 March 2025, 13 min