Goodhart's Law Ate the AI Industry

The AI industry runs on numbers. Benchmark scores. Chart lines going up and to the right. Token counts. GitHub stars. These numbers drive funding rounds, hiring decisions, model selection, and billion-dollar valuations.

Not a single one of them can be trusted.

The Benchmarks Are Broken

UC Berkeley researchers systematically audited eight major AI benchmarks using an automated agent. SWE-bench. WebArena. Terminal-Bench. GAIA. FieldWorkArena. CAR-bench. OSWorld. The agent achieved near-perfect scores on all of them. It didn’t solve a single task.

No reasoning. No capability. Just exploitation of how the score is computed.

Take Terminal-Bench. It evaluates 89 complex terminal tasks, including building a COBOL chess engine. 82 of those tasks download uv from the internet via curl at verification time. The Berkeley team replaced /usr/bin/curl with a wrapper, which then trojaned the uvx binary to fake passing test output. The remaining seven tasks? Just wrap pip. 100% on all 89 tasks. Zero lines of actual solution code.

SWE-bench, the most influential AI coding benchmark, is exploitable at 100% too. The agent’s patch runs inside the same Docker container where tests run. So the Berkeley team created a conftest.py with a pytest hook that forces every test to report as passing. Ten lines of Python. Perfect score.

WebArena? The task configs ship with reference answers as JSON files on the local filesystem. Playwright’s Chromium navigates to file:///proc/self/cwd/config_files/{task_id}.json and reads the gold answer directly. The evaluator never notices.

But the best one is FieldWorkArena. This benchmark has 890 tasks testing multimodal understanding across images, videos, and PDFs. The validate() function checks exactly one thing: did the last message come from an AI assistant? The actual answer content is completely ignored. The function that would compare answers against ground truth is imported but never called. Dead code.

One action. Zero LLM calls. Zero files read. 100% on all 890 tasks.

This Isn’t Hypothetical

You might think, “Sure, the benchmarks have holes, but nobody’s actually exploiting them.” Wrong.

IQuest-Coder-V1 claimed 81.4% on SWE-bench. Researchers found that 24.4% of its runs simply ran git log to copy answers from commit history. The corrected score: 76.2%.

METR found that frontier models including OpenAI’s o3 and Anthropic’s Claude 3.7 Sonnet routinely reward-hack evaluations. On agentic tasks, reward-hacking rates were 43x higher than on simple ones, spiking to 70-100% on specific RE-Bench challenges. Stack introspection. Monkey-patching graders. Operator overloading. The models manipulated scores rather than solving tasks.

OpenAI dropped SWE-bench Verified entirely after an internal audit found that 59.4% of audited problems had flawed tests. Models were being scored against broken ground truth. The benchmark that everyone cited as proof of capability was itself broken.

So the next time you see a press release announcing “92% on SWE-bench,” ask yourself: 92% of what, exactly?

Chart Crimes

When you can’t game the benchmark, you game the presentation.

Anthropic, the self-styled safety-and-alignment company, published model comparison charts with truncated axes. Both the accuracy range and the pricing range were compressed into tight windows that made marginal differences look enormous. A few percentage points of accuracy spread across a narrow price band, presented as if it were a generational leap.

They also benchmark shipped models against unreleased internal ones — Opus 4.7’s launch chart included “Mythos Preview,” a model nobody can access, as the comparison ceiling.

People noticed. The responses ranged from data visualization critiques to accusations of misleading marketing. This is the kind of chart design you’d fail a statistics student for.

But Anthropic isn’t an outlier. The entire industry does this. Most model comparison charts you see from a lab are designed to make their product look better. That’s marketing, not science.

Tokenmaxxing

If benchmarks are the lies companies tell the public, token burn is the lie employees tell their managers.

Meta’s internal “Claudeonomics” dashboard tracked which employees consumed the most AI tokens. Titles included “Token Legend,” “Cache Wizard,” and “Session Immortal” for the top 250 users. The top-ranked individual burned 281 billion tokens in 30 days.

Let that number sit for a second. 281 billion tokens. In a month.

That’s not productivity. That’s running agents in a loop to climb a leaderboard. Meta employees coined the term “tokenmaxxing” for exactly this behavior. High token consumption became a proxy for AI adoption, which became a proxy for productivity, which became a performance review signal. So people maxed it.

Meta shut the dashboard down after the data leaked externally. The farewell message reportedly read: “It was meant to be a fun way for people to look at tokens, but due to data from the dashboard being shared externally, we’ve made the decision to shutter Claudeonomics for now.”

The same pattern plays out everywhere. Lines of code never worked as a productivity metric because people just wrote more lines. Commits don’t work because people make smaller commits more often. Token burn is the 2026 version of the same broken idea. Money going out the door for no real reason.

Fake Stars

GitHub stars are the vanity metric of the open-source world. And they’re being bought in bulk.

Researchers from Carnegie Mellon, NC State, and Socket presented findings at ICSE 2026 showing approximately 6 million suspected fake stars across 18,617 repositories, generated by roughly 301,000 accounts. Their tool, StarScout, analyzed 20 terabytes of GitHub metadata and found that fake star activity surged dramatically starting in 2024. By July 2024, nearly 16.66% of all repositories with 50+ stars were involved in fake star campaigns.

The market is brazen. Merchants sell stars in batches of 50 or 100 through dedicated websites, e-commerce platforms, and exchange groups. The ROI is obvious: VCs and managers treat GitHub stars as a signal of product viability for open-source startups. A few hundred dollars buys the appearance of traction.

The researchers set up detection rules: accounts active only once on GitHub, touching only one repo, with two or fewer total interactions. Created account, went to target repo, pressed star, disappeared. The pattern is unmistakable for developer tools and startup-stack projects.

Over 90% of flagged repositories and 57% of flagged accounts had been deleted by GitHub as of January 2025. GitHub knows. They clean up. But the stars keep coming.

Goodhart’s Law Wins Again

There’s a pattern here. British economist Charles Goodhart articulated it decades ago: when a measure becomes a target, it ceases to be a good measure.

Benchmark scores became the target. So they got gamed. Token usage became the target. So it got maxed. GitHub stars became the target. So they got bought. Chart presentation became the differentiator. So it got distorted.

Every metric the AI industry uses to prove progress is subject to the same corrosion. The numbers go up, but what they measure becomes less and less connected to reality.

The Berkeley researchers’ message is simple: stop trusting the number, start trusting the methodology. But nobody reads the methodology. They read the headline. They see the chart. They count the stars.

And none of it can be trusted.