REVEALED: The truth about AI coding
AI Today, 18 February 2025

Imagine a world where human software engineers are replaced by software engineers that are entirely digital.

No coffee breaks, no office politics, just pure, unadulterated code. It sounds like science fiction, doesn't it?

But the question is: how far off is it, really?

That's the question a team of researchers at OpenAI sought to answer with the SWE-Lancer benchmark.

They didn't just want to test an AI's ability to write snippets of code.

They wanted to see if a large language model, an LLM, could actually earn a living as a freelance software engineer - the ultimate test of practical AI coding ability.

Think about it. As a freelancer, you're judged solely on your output. There's no hiding behind a team. You have to deliver, or you don't get paid. So, the researchers took real-world freelance jobs from Upwork, a popular platform for freelancers, and fed them to some of the most advanced LLMs available.

These weren't simple tasks. They involved understanding complex requirements, navigating existing codebases and, often, making engineering management decisions.

The kind of decisions that usually require years of experience.
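To make "deliver, or you don't get paid" concrete, here is a minimal sketch of all-or-nothing, payout-weighted scoring. It is an illustration only, not the benchmark's actual harness: the `TaskResult` type, the `score` function, the task titles and the dollar amounts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One freelance job attempted by a model (all values here are made up)."""
    title: str
    payout_usd: float  # what the job pays if the work is accepted
    passed: bool       # True only if the submission passed every check


def score(results: list[TaskResult]) -> tuple[float, float, float]:
    """Return (pass rate, dollars earned, dollars available)."""
    if not results:
        return 0.0, 0.0, 0.0
    available = sum(r.payout_usd for r in results)
    # All-or-nothing: a failed task earns nothing, however close it came.
    earned = sum(r.payout_usd for r in results if r.passed)
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate, earned, available


# Hypothetical results: two cheap fixes pass, two larger jobs fail.
results = [
    TaskResult("Fix typo in settings label", 250.0, True),
    TaskResult("Repair broken search filter", 1000.0, False),
    TaskResult("Add offline mode to chat", 8000.0, False),
    TaskResult("Correct date formatting", 500.0, True),
]
pass_rate, earned, available = score(results)
print(f"Passed {pass_rate:.0%} of tasks, earned ${earned:,.0f} of ${available:,.0f}")
```

In this made-up run the model passes half the tasks but earns well under a tenth of the money on offer, which is why a dollar figure can tell a different story from a raw pass rate.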

The results? Well, they were… sobering.

Claude 3.5 Sonnet, the strongest model tested, successfully completed only 26.2% of the coding tasks on the benchmark's public task set. OpenAI's GPT-4o and o1 scored lower still.

And when it came to those crucial management decisions? Even the best model picked the right option just 44.9% of the time.

These numbers highlight the significant gap between theoretical AI prowess and real-world problem-solving.

Let those numbers sink in. Even the best AI models (of the time, but including what's considered by many coders as the best of the bunch) failed roughly three out of four coding tasks that a human freelancer would routinely handle.

This isn't to say AI is useless in software engineering. Far from it. But it highlights a crucial gap – the gap between theoretical capability and practical application.

AI models were tested on the entire workflow a freelancer might face, including tasks that go far beyond just writing code.

The study revealed several key weaknesses. Many errors stemmed from the LLMs misunderstanding the requirements. Others came from incorrectly handling API calls or failing to adapt to the existing codebase.

These are all areas where human engineers, with their years of experience and contextual understanding, excel.

However, it's crucial to note that the field of AI is rapidly evolving, and performance on specific benchmarks can change quickly as models are updated and refined.

But the story doesn't end there.

Researchers also identified specific areas where AI did show promise. LLMs were relatively good at writing new code from scratch - but struggled with modifying existing code, which often requires a deep understanding of the original programmer's intent.

This suggests that AI might be best suited, for now, to tasks that involve generating new content, rather than those requiring complex reasoning and adaptation.

Think of AI as a junior developer, capable of handling well-defined tasks, but needing guidance and oversight from a more experienced (human) engineer.

This also highlights the need for improved training data and techniques that allow LLMs to better understand and reason about existing codebases.

Episodes (90)

Eye of Horus with xAI's API

Now anyone can know what everyone is doing in real time! xAI's API gives you access to the 560Gb of data generated every day by the millions of users on X, formerly Twitter. There are some fantastic o...

22 October 2024, 9 min

Automation and collaboration with AI communication protocol Agora

Business is a spider's web of complex processes. And up until now, it's been super hard to get AI working on multi-task processes, without some PhD-level hacking which, unless you're using open source...

21 October 2024, 11 min

Secrets to next-level AI results

Efficiency. Productivity. Growth. That's why we're here. And on today's show, we're hitting all three with the force of a bullet train. 3 AI experts share how they create the best results, in three to...

20 October 2024, 16 min

Understanding our physical world, with Archetype AI's Newton Large Behaviour Model

AI's made its name in the digital space. But thanks to Archetype AI, it's broken away from its silicon prison to learn about our physical world. Archetype's Newton, a Large Behaviour Model, is current...

18 October 2024, 22 min

Get to the heart of Microsoft's AI for Health

Imagine a world where a doctor, armed with an AI-powered assistant, can diagnose diseases like pancreatic cancer earlier, potentially saving thousands of lives annually. This isn't science fiction; it...

17 October 2024, 7 min

Thinking of the future - with Meta's Thought Preference Optimisation

Imagine your marketing team brainstorming hundreds of genuinely inspired ideas in seconds. That's the potential of Thought Preference Optimisation (TPO), a new AI technology from Meta. TPO is like giv...

16 October 2024, 6 min

How to win a $180,000 AI job

Former Deloitte consultant Varun Kulkarni spent 8 months honing his application strategy to land a huge payday as a senior AI product manager with Cisco. On today's show we explore how he did it ...

16 October 2024, 11 min

Playbooks to moonshots: how AI is upsetting our apple carts...

AI is a paradigm shift for thinking and understanding. Our old frameworks and models are being challenged into obscurity by a new way of looking at the world. And even the things we used to think of a...

16 October 2024, 8 min