Exploring the Frontier of AI

May 28

https://kaggle.com/competitions/kaggle-measuring-agi/writeups/teds-bench-temporal-epistemic-disclosure-score-be

My family ran a vacation competition. Everyone built a full proposal. Ranked-choice voting.

My oldest pitched Madrid. My wife pitched Vancouver. My youngest pitched the campsite we went to three weeks prior — sweet, but zero votes.

I went big. Southeast Asia — Singapore, Bangkok, Krabi. I used AI to build a fully itemized proposal: hotels, core experiences, budget to the dollar. Unanimous winner.

The Scarlet Singapore — the hotel it recommended — still appears in AI search results today as open and available. It is not. Go ahead and check. The flights were priced from training-era data. The proposal was extraordinary. It was just built on a world that had already moved on and become 20-30% more expensive.

We adjusted, absorbed the difference, and went anyway. One of the best trips of our lives.

But I couldn't shake the mechanism. The AI didn't make anything up. It retrieved real information that was accurate at some point in the past and presented it with complete confidence. Not one disclosure. Not one hedge. At the exact moment I was making irreversible financial commitments based on it.

So I built a benchmark and submitted it to Kaggle's "Measuring Progress Toward AGI" competition, hosted by Google DeepMind.

The short version of what I found:

Models are getting more convincing without getting more accurate about time-sensitive information. The most capable models produce the longest, most authoritative, most detailed stale answers. Users trust longer responses more — regardless of whether they're current.

The problem gets worse as the models improve. That's the part that matters.

The fix has to happen at the training level. Users cannot protect themselves by asking differently. I tested that too.

The full benchmark, methodology, and findings are on Kaggle at the link in this post.

The trip was worth every penny, including the unplanned ones. Zero regrets.

Ryan Broderick