Synthetic + public-methodology-shaped · no real Arena data · noindex · metric-SEV

A published rank just flipped. Did it page anyone, or wait for the next paper?

A leaderboard-reliability console. Each scenario reshuffles a synthetic top five while true model quality is held fixed, then asks one question: under a stability bound, does this rank change page the on-call within a day, or does it drift silently until an outside audit notices? Click a scenario; the verdict and the runbook come first. Numbers are synthetic or shaped from Arena's public methodology, never a real Arena rank or vote.

Pick a reliability scenario: one click runs it

The rank change, on the stability bound

INSIDE BOUND. No page.
Elo score (95% CI bar) · stability bound: a flip past it pages
Model · Was · Now · Elo (±95% CI) · Flag

What the on-call sees

● SEV · PAGES ON-CALL
Signal
Bound
Observed
True-quality change
Blast radius
Runbook — open
Generated post-mortem template
Fine-tune the bound (optional)

How deep the protected band goes. A top-3 bound pages on flips in the first three ranks; widen it and more of the board is on-call. Default: top-3.

A flip between two models whose 95% CIs overlap by at least this much counts as a statistical tie; the page exists to catch the risk of publishing a confident ordering of a tie. Raise it and only deeply overlapping ties page; lower it toward 0 and any flip between overlapping CIs pages. Default: 5 Elo.

These are the two dials a real engagement would negotiate with your science team, then write down. The scenario buttons set sensible defaults; move these to see a borderline case flip from page to no-page.
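
The two dials above can be sketched as a page/no-page rule. This is a minimal illustration under stated assumptions, not the console's actual logic: the `Entry` fields, the `paging_flips` helper, the choice to treat a flip as in-band when either model touches a protected rank, and every number on the board are hypothetical and synthetic.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entry:
    model: str      # hypothetical model name
    was: int        # previously published rank (1 = best)
    now: int        # currently published rank
    elo: float      # current Elo point estimate
    ci_half: float  # half-width of the 95% CI, in Elo


def ci_overlap(a: Entry, b: Entry) -> float:
    """Length of the overlap between two 95% CIs, in Elo (0.0 if disjoint)."""
    lo = max(a.elo - a.ci_half, b.elo - b.ci_half)
    hi = min(a.elo + a.ci_half, b.elo + b.ci_half)
    return max(0.0, hi - lo)


def paging_flips(board, top_k=3, min_overlap=5.0):
    """Return the model pairs whose rank flip would page the on-call.

    Assumed rule: a pair pages when (1) their relative order changed,
    (2) at least one of them sits in the protected top_k band before or
    after the change, and (3) their CIs overlap by at least min_overlap.
    """
    pages = []
    for a, b in combinations(board, 2):
        flipped = (a.was - b.was) * (a.now - b.now) < 0
        in_band = min(a.was, a.now, b.was, b.now) <= top_k
        overlap = ci_overlap(a, b)
        # Disjoint CIs never count as a tie, even with min_overlap = 0.
        if flipped and in_band and overlap > 0 and overlap >= min_overlap:
            pages.append((a.model, b.model))
    return pages


# Synthetic board: ranks 1 and 2 swap, and ranks 4 and 5 swap.
board = [
    Entry("model-a", was=1, now=2, elo=1250.0, ci_half=8.0),
    Entry("model-b", was=2, now=1, elo=1254.0, ci_half=9.0),
    Entry("model-c", was=3, now=3, elo=1230.0, ci_half=6.0),
    Entry("model-d", was=4, now=5, elo=1200.0, ci_half=7.0),
    Entry("model-e", was=5, now=4, elo=1203.0, ci_half=7.0),
]

print(paging_flips(board))                    # defaults: top-3 band, 5 Elo
print(paging_flips(board, top_k=5))           # wider band
print(paging_flips(board, min_overlap=14.0))  # stricter tie threshold
```

With the defaults, only the swap at ranks 1 and 2 pages. Widening the band to the top five also pages the swap at ranks 4 and 5. Raising the overlap threshold from 13 to 14 Elo turns the rank 1 and 2 swap from page to no-page, because those two CIs overlap by exactly 13 Elo in this made-up board.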

Sources & method