REVEALED: The truth about AI coding
AI Today · 18 February 2025

Imagine a world where software engineers are replaced by other software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see whether a large language model (LLM) could actually earn a living as a freelance software engineer, a far more practical measure of AI coding ability.

Think about it. Freelancing is the ultimate test. You're judged solely on your output. There's no hiding behind a team. You have to deliver, or you don't get paid. So, the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.
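To make the "deliver, or you don't get paid" rule concrete, here is a minimal sketch of all-or-nothing scoring. It is an illustration only: the task titles, prices and outcomes are invented, and the names `Task` and `score` are hypothetical rather than taken from the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One freelance job (hypothetical example data, not from the benchmark)."""
    title: str
    payout_usd: float
    delivered: bool  # True only if the submitted work was accepted


def score(tasks):
    """Return (pass_rate, earned_usd, available_usd) under all-or-nothing pay."""
    available = sum(t.payout_usd for t in tasks)
    # No partial credit: a task pays its full price or nothing at all.
    earned = sum(t.payout_usd for t in tasks if t.delivered)
    pass_rate = sum(1 for t in tasks if t.delivered) / len(tasks)
    return pass_rate, earned, available


# Invented jobs, for illustration only.
tasks = [
    Task("Fix a typo in a settings screen", 250.0, True),
    Task("Repair a broken payment flow", 1000.0, False),
    Task("Add pagination to a search API", 500.0, False),
    Task("Resolve a crash on login", 250.0, False),
]

pass_rate, earned, available = score(tasks)
print(f"Completed {pass_rate:.0%} of tasks, earned ${earned:,.0f} of ${available:,.0f}")
```

In this made-up run the model completes a quarter of the jobs but earns only an eighth of the money on offer, because the one job it finished was among the cheapest. The share of tasks completed and the share of money earned can tell different stories.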

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases, and often, making engineering management decisions.

The kind of decisions that usually require years of experience.

The results? Well, they were… sobering.

GPT-4 successfully completed only 10.2% of the coding tasks. Claude 3.5 fared slightly worse, at 8.7%.

And when it came to those crucial management decisions? GPT-4's accuracy was a mere 21.4%.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models of the time, including the one many coders consider the best of the bunch, struggled to complete even a fifth of the tasks a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But it highlights a crucial gap – the gap between theoretical capability and practical application.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch, but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.
