Research · Subnet 117 · 4 min read

Into: BrainPlay

LLM benchmarks, but make them games

By vaNlabs Research · April 29, 2026

Price: τ0.00349

Market cap: 7.0k τ

Momentum: 45 / 100

Unique holders: 1.18k

Emission: +0.55%

Net flow (7d): -334.0 τ

As of Jun 4, 10:37 UTC

Forget abstract leaderboards. BrainPlay benchmarks AI models by putting them head-to-head in real games, turning dry evaluation metrics into something anyone can watch, understand, and actually compare.

What is BrainPlay

BrainPlay is a Bittensor subnet that evaluates language models through competitive gameplay. Rather than scoring models on abstract math benchmarks, it runs them through games like Codenames, 20 Questions, and Super Mario, producing results that are both technically meaningful and visually interpretable.

The simple version: It's like Elo chess ratings for AI, except the match is a game of Codenames or a Mario speedrun instead of a chess game.
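To make the analogy concrete, here is the standard Elo update for a single match. This is a sketch of the chess-style rating the comparison refers to, not BrainPlay's scoring: nothing in the source says the subnet computes Elo ratings, and the K-factor of 32 and the starting ratings are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one match.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k (assumed value) controls how far a single result moves a rating.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1 - score_a) - (1 - ea))
    return new_a, new_b


# Two equally rated models; A wins the match.
new_a, new_b = update(1500, 1500, score_a=1.0)
```

The winner gains exactly what the loser drops, and an upset against a higher-rated opponent moves both ratings further than an expected win does.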

Centralized equivalent: No direct equivalent. The closest is LMSYS Chatbot Arena, which also compares models head-to-head, but that's centralized, relies on human votes rather than game outcomes, and doesn't reward participants.

How it works:

Miners deploy language models to Targon (SN4), a serverless inference layer. Their model is their entry.
Validators create game rooms, assign pairs of miners to compete, score outcomes, and set weights based on performance.
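The validator's room-assignment step can be sketched roughly as follows. This is a minimal illustration, not the subnet's code: the `GameRoom` type, the `create_rooms` helper, the random pairing strategy, and the miner UIDs are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class GameRoom:
    room_id: int
    game: str
    miners: tuple  # pair of (hypothetical) miner UIDs


def create_rooms(miner_uids, game, seed=None):
    """Shuffle miners and pair them off; with an odd count, one miner sits out."""
    rng = random.Random(seed)
    shuffled = list(miner_uids)
    rng.shuffle(shuffled)
    # zip stops at the shorter slice, so a leftover miner is simply unpaired.
    pairs = zip(shuffled[0::2], shuffled[1::2])
    return [GameRoom(i, game, pair) for i, pair in enumerate(pairs)]


rooms = create_rooms([11, 23, 42, 57, 64], game="codenames", seed=0)
for room in rooms:
    print(room.room_id, room.game, room.miners)
```

Each room would then be played out by querying the two miners' endpoints, with the outcome feeding into the weights the validator sets.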

Why This Matters

The problem it solves: Standard benchmarks like MMLU or HellaSwag are hard to interpret. A jump from 72.3% to 73.1% accuracy tells you almost nothing about how a model actually behaves. Game outcomes are comparatively intuitive.

Full Analysis

Category: Other: LLM Benchmarking and Evaluation | Centralized Competitor: LMSYS Chatbot Arena

The AI benchmarking space is crowded with static test sets. BrainPlay's bet is that game-based, adversarial evaluation is more robust: models can't overfit to a fixed dataset if the competition is dynamic and the games are diverse. That's a real problem in model evaluation today, where optimizing directly for benchmark scores is common practice.

Mechanism:

BrainPlay v2.0 integrates with Targon (SN4) as its inference backbone. Miners don't run their own servers. Instead, they deploy models via Targon's TVM (Targon Virtual Machine) and receive a serverless endpoint. Validators create shared game rooms and query those endpoints to run matches. Both miners and validators require a Targon API key, and miners need sufficient Targon credits to participate.

Three games are currently live, per the official repo: Codenames (language, social deduction), 20 Questions (language, logical inference), and SuperMario (vision, policy control). Codenames and 20 Questions run through the LLM weight group (mechid 0); SuperMario runs through the vision weight group (mechid 1). Emissions are split equally between the two groups.
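The game-to-group mapping can be written down as a small lookup. The sketch below assumes that what is split equally between the two groups is the subnet's emissions; the dictionary layout, key names, and helper functions are illustrative and not taken from the repo.

```python
# Game -> weight group (mechid), as described in the text.
# The key spellings are assumptions for illustration.
GAME_MECHID = {
    "codenames": 0,      # LLM group
    "20_questions": 0,   # LLM group
    "super_mario": 1,    # vision group
}


def split_emission(total, mechids=(0, 1)):
    """Divide a total equally across the weight groups."""
    share = total / len(mechids)
    return {m: share for m in mechids}


def games_in_group(mechid):
    """List the games scored under a given weight group."""
    return sorted(g for g, m in GAME_MECHID.items() if m == mechid)


print(split_emission(100.0))
print(games_in_group(0), games_in_group(1))
```

Under an equal split, SuperMario alone accounts for as much of the total as the two language games combined.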

Reward logic: winning teams receive a normalized score of 1.0; losing teams score proportionally lower. When no valid games complete in a round, the validator publishes burn weights for that group rather than preserving stale miner scores. This keeps the scoring system clean but also means no emissions reach miners during inactive periods.
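One way to read the reward logic is sketched below, for a single weight group. The source does not give the exact formula, so this assumes "proportionally lower" means dividing each raw score by the round's top score, and it uses a hypothetical `BURN_UID` to stand in for wherever burn weights are directed.

```python
BURN_UID = 0  # hypothetical: UID that receives all weight when a round is burned


def normalize_scores(raw_scores):
    """Scale raw game scores so the round's best performer gets 1.0."""
    top = max(raw_scores.values())
    return {uid: s / top for uid, s in raw_scores.items()}


def round_weights(raw_scores):
    """Turn one round's scores into weights that sum to 1.

    raw_scores maps miner UID -> raw score from valid, completed games.
    With no valid games, fall back to burn weights instead of reusing
    scores from earlier rounds.
    """
    if not raw_scores or max(raw_scores.values()) <= 0:
        return {BURN_UID: 1.0}
    scores = normalize_scores(raw_scores)
    total = sum(scores.values())
    return {uid: s / total for uid, s in scores.items()}


print(round_weights({7: 9.0, 8: 6.0}))  # winner 7, loser 8
print(round_weights({}))                # inactive round -> burn
```

The empty-round branch is what makes inactive periods costly for miners: all weight goes to the burn target rather than to the last known scores.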

Price has gained 16.5% over 7 days against a backdrop of modest net inflows (~289 TAO over 7 days). Market cap stands at approximately 7,097 TAO. The [...] is 0.32, meaning roughly two-thirds of the [...] reflects organic demand rather than protocol subsidy. That's a reasonable signal of genuine staker interest for a subnet at this stage.

The most notable data point right now: on-chain data shows zero active miners. With no miners deploying models, validators produce burn weights and no benchmarks run. Given recent commit activity (April 20, 2026), this looks like a build-first phase rather than stagnation, but it's worth watching.
