Did xAI Mislead People About Grok 3’s Performance?

The world of AI is no stranger to controversy, and this time, xAI is at the center of it.

This week, an OpenAI employee called out Elon Musk’s AI company, xAI, for allegedly exaggerating the benchmark performance of its latest AI model, Grok 3. Meanwhile, xAI co-founder Igor Babushkin defended the company’s claims, insisting that their data was accurate. But as with most debates, the truth is likely somewhere in between.

In a blog post, xAI shared a graph showing Grok 3’s performance on AIME 2025, a test designed to measure a model’s ability to solve difficult math problems. While some experts argue that AIME isn’t the most reliable benchmark, it’s still widely used to evaluate AI math skills. According to xAI’s chart, both Grok 3 Reasoning Beta and Grok 3 Mini Reasoning outperformed OpenAI’s o3-mini-high model on AIME 2025.

But OpenAI employees quickly pointed out a crucial detail: xAI’s chart left out o3-mini-high’s score at “cons@64” (consensus at 64 attempts). Simply put, this scoring method lets a model attempt each question 64 times and takes the most common answer as its final response—which naturally boosts its score. Omitting this metric could make Grok 3 look stronger than it actually is.
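To make the cons@64 idea concrete, here is a minimal sketch of consensus scoring via majority vote. The sampled answers below are hypothetical, and real evaluations involve details (answer normalization, tie-breaking) that this sketch ignores:

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Majority-vote over repeated attempts at one question.

    Sketch of the 'cons@N' (consensus) scoring idea: sample the model
    N times on the same problem and report the most common answer.
    """
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical: 64 sampled answers to a single math problem.
samples = ["42"] * 40 + ["41"] * 15 + ["43"] * 9
print(consensus_answer(samples))  # prints "42", the majority answer
```

Even if the model answers correctly on only a minority of single attempts, consensus over many samples can land on the right answer far more often—which is why cons@64 scores run higher than first-attempt (“@1”) scores.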

When comparing first-attempt scores (“@1”), Grok 3 Reasoning Beta actually falls short of OpenAI’s o3-mini-high and even trails slightly behind OpenAI’s o1 model set to “medium” compute. Yet xAI still touts Grok 3 as the “world’s smartest AI.”

Babushkin fired back, arguing that OpenAI has also published charts that could be seen as misleading—though usually when comparing its own models. In the midst of the back-and-forth, an independent researcher attempted to create a more “balanced” graph that included every model’s performance at cons@64.

However, as AI researcher Nathan Lambert pointed out, there’s one key detail that remains a mystery: how much computing power (and money) each company had to spend to achieve these scores. This highlights a bigger problem—AI benchmarks often fail to tell the full story. They might show strengths, but they also hide the limitations. And in the race to claim the AI throne, every company wants to look like it’s leading the pack.