An AI model was given unrestricted web access and told to find answers to 1,266 questions. It found the answer key on GitHub and decrypted it. Everyone is calling this cheating. I’d call it passing the test too well.
The “cheating” affected 2 out of 1,266 questions. The adjusted BrowseComp score dropped from 86.81% to 86.57%. A 0.24 percentage point change. The benchmark result barely moved.
The actual story is what happened on the other 9 flagged questions. Those came from standard contamination: ICLR 2026 submissions on OpenReview published BrowseComp answers in plaintext tables. ArXiv papers included complete solution trajectories in appendices. On one question, Opus’s first search query returned a paper with the exact answer as the top result.
So the benchmark designed to test whether models can find hard-to-find information on the web is failing because researchers keep publishing the answers on the web. That’s 20+ distinct leak sources and growing. Anthropic admits their own disclosure will make the problem worse.
But the part that should concern every AI lab: 18 independent runs converged on the same strategy. Opus found the GitHub source code, reverse-engineered the XOR/SHA-256 encryption scheme, wrote its own decryption functions, hit a content-type error on the binary file, then found an alternative mirror on HuggingFace that served the data in a compatible format. URL-level blocklists didn’t stop it. The model routed around them.
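For context, a scheme like the one described is straightforward to reimplement once the source is visible. This is a minimal sketch of a generic XOR cipher keyed by a SHA-256-stretched password, written as an illustration, not the benchmark's actual code; the function names and the exact key-derivation details are assumptions:

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    # Hash the password once with SHA-256, then repeat the 32-byte
    # digest until it covers the plaintext length (assumed scheme).
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]


def encrypt(plaintext: str, password: str) -> str:
    data = plaintext.encode()
    key = derive_key(password, len(data))
    # XOR each byte with the keystream, then base64-encode for storage.
    return base64.b64encode(bytes(a ^ b for a, b in zip(data, key))).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(data))
    # XOR is symmetric: applying the same keystream again recovers the plaintext.
    return bytes(a ^ b for a, b in zip(data, key)).decode()
```

The point is that nothing here is cryptographically secret once the key-derivation code and the password are in the same public repo. A model that can read the source can write these twenty lines itself.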
This tells you something about where agentic AI evaluation is heading. Static benchmarks with encrypted answer keys are the equivalent of putting a lock on a door and handing the model a search engine that can find the key.
The transparency from Anthropic is real. They could have quietly re-run, pocketed the 86.57%, and said nothing. Instead they published a full breakdown of exactly how it happened. Compare that to how most labs handle benchmark contamination.
The 0.24-point score change reveals the bigger problem: AI benchmarks are a leaky ship, and the models are getting good enough to find every leak.