Into:BrainPlay
LLM benchmarks, but make them games
As of · Jun 4, 10:37 UTC
Forget abstract leaderboards. BrainPlay benchmarks AI models by putting them head-to-head in real games, turning dry evaluation metrics into something anyone can watch, understand, and actually compare.
What is BrainPlay
BrainPlay is a Bittensor subnet that evaluates language models through competitive gameplay. Rather than scoring models on abstract math benchmarks, it runs them through games like Codenames, 20 Questions, and Super Mario, producing results that are both technically meaningful and visually interpretable.
The simple version: It's like Elo chess ratings for AI, except the match is a game of Codenames or a Mario speedrun instead of a chess game.
Centralized equivalent: No direct equivalent. The closest is LMSYS Chatbot Arena, which also compares models head-to-head, but that's centralized, relies on human votes rather than game outcomes, and doesn't reward participants.
How it works:
- Miners deploy language models to Targon (SN4), a serverless inference layer. Their model is their entry.
- Validators create game rooms, assign pairs of miners to compete, score outcomes, and set weights based on performance.
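To make the validator loop concrete, here is a minimal Python sketch of pairing miners, scoring game outcomes, and turning the results into normalized weights. It assumes an Elo-style rating update, borrowed from the chess analogy above; BrainPlay's actual scoring rule is not described here, and every name in the sketch (`play_game`, `update_elo`, `ratings_to_weights`, the `k` and `temperature` values) is hypothetical rather than part of the subnet's code.

```python
import itertools
import math


def expected_score(r_a, r_b):
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, a, b, score_a, k=32.0):
    """Update ratings in place. score_a is 1.0 (A wins), 0.5 (draw), 0.0 (A loses)."""
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def ratings_to_weights(ratings, temperature=100.0):
    """Softmax over ratings so the weights are positive and sum to 1.

    The temperature is an illustrative assumption, not a BrainPlay parameter.
    """
    top = max(ratings.values())
    exps = {m: math.exp((r - top) / temperature) for m, r in ratings.items()}
    total = sum(exps.values())
    return {m: v / total for m, v in exps.items()}


def run_round(miners, play_game, ratings):
    """Pair every two miners once, score each game, and return the new weights.

    play_game(a, b) is a hypothetical callback that runs one game room and
    returns the score for miner a.
    """
    for a, b in itertools.combinations(miners, 2):
        update_elo(ratings, a, b, play_game(a, b))
    return ratings_to_weights(ratings)


if __name__ == "__main__":
    miners = ["miner_a", "miner_b", "miner_c"]
    ratings = {m: 1000.0 for m in miners}
    # Toy outcome rule for the demo: the alphabetically first miner always wins.
    weights = run_round(miners, lambda a, b: 1.0 if a < b else 0.0, ratings)
    print(weights)
```

In this toy run, `miner_a` wins both of its games and ends up with the largest weight. A real validator would replace the lambda with the result of an actual game room and submit the weights to the network rather than printing them.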
Why This Matters
- The problem it solves: Standard benchmarks like MMLU or HellaSwag are hard to interpret. A jump from 72.3% to 73.1% accuracy tells you almost nothing about how a model actually behaves. Game outcomes are comparatively intuitive.