
Y’all, we need to talk about the most awkward disconnect happening in corporate America right now. It's like watching two people describe completely different movies after seeing the same film.
On one side, you have the "AI agents are replacing everyone" hype. On the other side, we finally have some cold, hard data that says... maybe don't fire your legal team just yet.
The Reality Bomb: Training-data giant Mercor just dropped a benchmark called APEX-Agents. Think of it as the SATs for AI agents trying to do actual white-collar work. They had investment bankers, lawyers, and consultants create real, soul-crushing tasks they do every day. Then, they let the AI loose in a "digital world" filled with Slack messages, Google Drive files, and messy spreadsheets.
The Verdict: Even the "god-tier" models like Google’s Gemini 3 Flash and OpenAI’s GPT-5.2 only got about 24% of the tasks right. Imagine hiring an intern who screws up three out of every four assignments.
But here's where it gets REALLY interesting. While AI is objectively failing most workplace tasks, there's a massive perception gap between the corner office and everyone else.
A new survey by research firm Section turned up some jaw-dropping numbers that perfectly illustrate this AI reality distortion field. They surveyed 5,000 white-collar workers at companies with 1,000+ employees, and the divide is almost comical.
It turns out CEOs are living their best life:
70%+ are "excited" about AI.
19% say it saves them more than 12 hours a week.
Only 2% say it saves them zero time.
However, actual workers are in the trenches:
Nearly 70% feel "anxious or overwhelmed."
40% say it saves them zero time per week.
Just 2% are saving 12+ hours.
Let that sink in for a second. Nineteen percent of executives are saving a full day of work every week, while 40% of workers are saving literally nothing. You could not design a more perfect illustration of the AI hype bubble if you tried.
So what's going on here?
According to Mercor CEO Brendan Foody (the 22-year-old billionaire college dropout), the problem is something called "multi-domain reasoning."
Here's the thing: AI can handle ONE task in ONE place pretty well. But the moment you need it to connect information from Slack AND Google Drive AND your email AND that PDF someone sent you three weeks ago? It completely falls apart. And guess what? That's literally how all work happens.
The APEX-Agents benchmark simulates this exact scenario.
Instead of asking trivia questions or testing general knowledge like most benchmarks do, APEX-Agents creates an entire fake workplace. We're talking realistic project scenarios with emails, Slack messages, Google Drive files, PDFs, spreadsheets, calendars—the whole nine yards. Then they give the AI tasks like "analyze this company's data export and tell me if it violates Article 49 under their own policies."
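To make that concrete, here's a rough sketch of what one of these task records might look like, in Python. To be clear: the field names and structure below are our illustrative guesses, not Mercor's actual schema.

```python
# A rough, hypothetical sketch of an APEX-Agents-style task record.
# Field names and values are illustrative guesses, not Mercor's schema.
apex_task = {
    "domain": "legal",  # tasks were written by bankers, lawyers, consultants
    "instructions": (
        "Analyze the company's data export and determine whether it "
        "violates Article 49 under the company's own policies."
    ),
    # The simulated "digital world" the agent has to dig through:
    "environment": {
        "slack_channels": ["#legal", "#data-eng"],
        "drive_files": ["data_policy_2024.pdf", "export_log.xlsx"],
        "emails": 37,
        "calendar_events": 5,
    },
    "expert_time_hours": 3.5,  # roughly what a 7+ year professional needs
}
```

The point of all that sprawl: the answer isn't sitting in any one file, so the agent has to connect dots across tools. That's exactly the multi-domain reasoning that's tripping these models up.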
The leaderboard is fascinating (in a slightly depressing way for AI hype folks):
Gemini 3 Flash: 24.0% accuracy
GPT-5.2: 23.0% accuracy
Claude Opus 4.5: 18.4% accuracy
Gemini 3 Pro: 18.4% accuracy
GPT-5: 18.3% accuracy
Grok 4: 15.2% accuracy
For context, these tasks take a human pro with 7+ years of experience about 3.5 hours to finish.
But here's the plot twist: This isn't necessarily BAD news. It's actually GOOD that someone finally created a realistic test instead of everyone just vibing off hype. And honestly? The progress is pretty wild when you zoom out. Foody pointed out that last year, AI was getting these tasks right 5-10% of the time. Now it's at 24%. That's the success rate more than doubling in a single year.
So why are CEOs having such a different experience than their employees?
Turns out, when you're a CEO, you use AI for high-level tasks like summarizing a report or drafting a "great job team" email. If the AI is 80% right, that's good enough, because someone else will catch the errors.
But for employees? They use AI for precise technical work where errors have real consequences. A Workday survey found that while 85% of workers save some time with AI, most of those gains are eaten up by "The AI Tax." This is the time spent correcting, clarifying, or completely redoing the low-quality content the AI generates.
So what does this all mean?
The APEX-Agents benchmark gives us the hard data: AI agents aren't ready for primetime yet. They're getting better fast, but "better than last year" doesn't mean "ready to do your job."
The good news? The benchmark is now open source. Mercor released all 480 tasks on Hugging Face along with their entire evaluation infrastructure, called Archipelago. Now every AI lab in the world knows exactly what they need to fix to make these agents actually useful.
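If you want to poke at it yourself, a minimal sketch using the standard Hugging Face datasets library might look like this. Heads up: the dataset identifier below is our guess, not a confirmed name, so check Mercor's Hugging Face page for the real one.

```python
# Minimal sketch: load the APEX-Agents tasks from Hugging Face.
# "mercor/apex-agents" is a hypothetical ID; look up the actual one.
from datasets import load_dataset

tasks = load_dataset("mercor/apex-agents", split="test")

print(len(tasks))  # should be 480 if every task ships in one split
print(tasks[0])    # peek at a single task record
```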
So what about you? Is AI actually clearing your plate, or are you just the robot's full-time editor now?
Hit reply and let us know!
