Algorithm vs. Madness: Inside an AI’s Multi-Year Assault on NCAA Brackets
This post is part of an ongoing series that tests how well AI-generated NCAA brackets stand up against the unpredictable theater of March Madness. Over multiple seasons we built, tuned, and deployed models that produce fully realized tournament brackets and then evaluated their performance against years of human bracket outcomes. The goal was not to crown a victor for bragging rights, but to understand what algorithmic forecasting can teach us about uncertainty, narratives, and the limits of prediction.
Why bracketology is a useful laboratory for AI
March Madness compresses a vast, high-variance system into a compact, repeatable experiment: 68 teams, a rigid elimination structure, and thousands of human brackets submitted each year. The tournament’s mixture of structure and chaos makes it an ideal sandbox for studying probabilistic forecasting, calibration, decision-making under uncertainty, and how models cope with infrequent but consequential events (upsets, Cinderella runs, collapse of favorites).
For AI researchers and the AI news community, bracketology is compelling because it demands probabilistic reasoning about discrete outcomes and requires combining disparate data sources: team performance metrics, matchup-specific styles, temporal trends, injuries, and sometimes human sentiment. It is a concentrated instance of forecasting problems that reach well beyond sports.
How we set up the experiment
Rather than relying on a single model or a single tournament, we constructed an experimental pipeline that is deliberately reproducible and backtestable:
- Data collection: historical tournament results from the modern bracket era, team season-level metrics (offensive/defensive efficiencies), matchup-level statistics, seed histories, and betting market lines where available.
- Modeling stack: an ensemble approach combining a probabilistic ranking engine (Elo-like), a supervised learning component trained to predict game-level win probabilities from matchup features, and a simulation layer that runs the tournament tens of thousands of times to convert game probabilities into bracket probabilities.
- Decision rule: from the simulated tournament outcomes we extract two outputs. First, a probability distribution over every possible team advancing to each round. Second, a single deterministic bracket constructed by taking the most likely outcome for each game, with ties broken consistently.
- Evaluation: we compare bracket scores and calibrated probability metrics against historical human bracket outcomes. Metrics include traditional bracket scoring (as used by many pools), Brier score for probability calibration, rank correlation with final placements, and analyses of upset prediction performance.
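The pipeline above can be sketched in miniature. Everything in this snippet is illustrative: the eight-team field, the Elo-style ratings, and the simulation count are hypothetical stand-ins for the real data and models, not the actual pipeline.

```python
import random
from collections import Counter

def win_probability(rating_a, rating_b, scale=400.0):
    """Elo-style logistic win probability for team A over team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def simulate_bracket(field, ratings, rng):
    """Play one single-elimination tournament; return the champion."""
    while len(field) > 1:
        winners = []
        # Adjacent entries in the field list play each other each round.
        for a, b in zip(field[::2], field[1::2]):
            p = win_probability(ratings[a], ratings[b])
            winners.append(a if rng.random() < p else b)
        field = winners
    return field[0]

# Hypothetical 8-team field, seeded into standard bracket order,
# with made-up ratings.
teams = ["T1", "T8", "T4", "T5", "T3", "T6", "T2", "T7"]
ratings = {"T1": 1750, "T2": 1700, "T3": 1650, "T4": 1600,
           "T5": 1580, "T6": 1550, "T7": 1500, "T8": 1450}

rng = random.Random(42)
n_sims = 20_000
champion_counts = Counter()
for _ in range(n_sims):
    champion_counts[simulate_bracket(teams, ratings, rng)] += 1

# Championship probabilities estimated from the simulations.
champ_probs = {t: champion_counts[t] / n_sims for t in teams}
```

The same simulation loop, with per-round bookkeeping added, yields the full distribution over which teams reach each round.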
We emphasize that the experiment tests probabilistic forecasting as much as it tests bracket selection. A good set of probabilities should be well calibrated—even if, in any single year, luck overwhelms skill.
Backtest results: AI against human fields
We ran backtests covering the tournament seasons from the early 2000s through the most recent completed tournament available at the time of writing. For each historical tournament we fed the model only information that would have been available before that tournament’s tip-off and produced a bracket and full set of probabilities.
The headline findings are straightforward and instructive:
- Against the median human bracket, the AI-generated bracket outperformed in the majority of seasons. Backtested across two decades, the algorithm beat the median human entry in roughly three out of four tournaments. That is, a single informed probabilistic model typically fared better than the average of unaided human picks.
- Against the top-performing human brackets in a given year, however, the AI was less dominant. Elite human brackets—the small fraction that gets exceptionally lucky or happens to mirror the tournament’s actual upset pattern—still outscored the AI in many years. The model matched or exceeded top-quartile human performance in a minority of seasons.
- Calibration: the model’s probability estimates were substantially better calibrated than simple human heuristics (for example, “always pick the higher seed” or “favor the trendy upset”). Measured via Brier score, the AI tended to assign realistic probabilities to games—favorites were not given complete certainty and underdogs received appropriate tail mass.
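The Brier comparison is easy to reproduce in a few lines. The probabilities and outcomes below are made-up illustrations, not figures from our backtests; the point is only the mechanics of the metric.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical games: calibrated model probabilities vs. the
# "always pick the higher seed" heuristic, which implicitly
# assigns the favorite probability 1.0.
model_probs = [0.85, 0.70, 0.55, 0.62, 0.90]  # made-up forecasts
heuristic_probs = [1.0, 1.0, 1.0, 1.0, 1.0]
outcomes = [1, 1, 0, 1, 1]  # 1 = favorite won

print(brier_score(model_probs, outcomes))      # lower is better
print(brier_score(heuristic_probs, outcomes))
```

One missed game is enough to punish the overconfident heuristic: the calibrated forecasts pay a small cost on every game but avoid the large squared penalty of a certain prediction gone wrong.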
- Upsets and heavy tails: where the AI struggled was in predicting outlier Cinderella runs. The model often identified games with elevated upset probability, but translating that into a single bracket that captures cascades of upsets across rounds is inherently low-probability. The deterministic bracket tends to underrepresent the possibility of long upset chains, which is why many of the most famous bracket winners—those who had improbable streaks of correct upset calls—remain out of reach for systematic strategies.
Put another way: the AI purchases steady improvement over naive human intuition, but it cannot reliably purchase the lottery ticket that wins large pools.
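The multiplicative penalty on long upset chains can be made concrete. With hypothetical per-game upset probabilities, even individually plausible chances compound into a long shot:

```python
from math import prod

# Hypothetical per-game upset probabilities for a four-round Cinderella run.
upset_probs = [0.35, 0.30, 0.25, 0.20]

chain_prob = prod(upset_probs)
print(chain_prob)  # 0.00525: each game is plausible, the full run is not
```

A deterministic bracket that tried to capture this run would sacrifice expected points in four separate games for a roughly one-in-two-hundred payoff.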
Deeper diagnostics: what the AI learned and where it failed
Three patterns emerged from our analysis that are relevant for any AI system deployed into high-variance domains.
1. Matchups matter more than raw seed
Human bracket pickers often anchor to seed lines: a 12-over-5 upset is a cultural trope, so many human brackets include at least a few of these picks. The AI, in contrast, learns to evaluate matchup-specific edges—styles that exploit a team’s weakness, tempo mismatches, and roster composition—and sometimes rejects conventional upset thresholds. As a result, the AI makes fewer seed-driven upset picks and more targeted upset predictions backed by matchup analytics.
2. Calibrated uncertainty beats confident narratives
One of the AI’s most useful outputs is probabilistic confidence. Whereas a typical human bracket is a deterministic artifact that conceals uncertainty, the model lays its uncertainty bare. That enabled new decision-making modes: for instance, constructing diversified portfolios of brackets optimized for different objectives (maximize expected score, maximize probability of top-1 finish in a pool, or maximize chance of an upset-laden payoff). The clarity of those trade-offs is valuable beyond sports.
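A minimal sketch of the portfolio idea: sampling complete brackets from the model's game probabilities, rather than always taking the most likely outcome, naturally yields a diversified set of entries. The four-team field and pairwise probabilities below are invented for illustration.

```python
import random

def sample_bracket(field, win_prob, rng):
    """Sample one complete single-elimination bracket from game probabilities."""
    picks = []
    while len(field) > 1:
        winners = []
        for a, b in zip(field[::2], field[1::2]):
            winners.append(a if rng.random() < win_prob(a, b) else b)
        picks.append(tuple(winners))
        field = winners
    return tuple(picks)

# Hypothetical 4-team field with made-up pairwise win probabilities.
probs = {("A", "D"): 0.8, ("B", "C"): 0.6, ("A", "B"): 0.7,
         ("A", "C"): 0.75, ("D", "B"): 0.45, ("D", "C"): 0.5}

def win_prob(a, b):
    return probs[(a, b)] if (a, b) in probs else 1.0 - probs[(b, a)]

rng = random.Random(7)
# Distinct sampled brackets form a diversified portfolio.
portfolio = {sample_bracket(["A", "D", "B", "C"], win_prob, rng)
             for _ in range(50)}
```

Weighting which sampled brackets to actually enter then becomes an optimization over the pool's payoff structure, which is where the different objectives (expected score vs. top-1 probability) diverge.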
3. The human edge is randomness and narrative
Top-performing human brackets often benefit from luck and from making non-analytic choices that serendipitously match realized upsets. Some human pickers deliberately inject randomness—picking a few high-variance outcomes—to increase their chance of winning large pools where the payoff structure rewards boldness. The AI, unless specifically instructed to optimize for top-heavy rewards, typically prefers the safer path. This explains why it often beats the crowd but rarely wins the jackpot.
Case studies that illustrate the trade-offs
Two short examples bring the tension into focus:
- In one notable backtest year, the model correctly identified a lower-seed team whose defense-clogging style neutralized a higher seed’s fast-break offense, assigning that lower seed a substantially higher upset probability than human consensus. The deterministic bracket still put the favorite through because the model’s marginal probability (though elevated) did not cross the threshold for selection. However, in simulations the model’s probabilistic forecasts captured a meaningful chance of that upset, and alternative bracket portfolios constructed from those probabilities would have favored the upset and reaped large rewards when it occurred.
- In another year, the tournament produced a long Cinderella run from a low seed that had been underweighted by all models and human pickers alike. The AI’s probabilities had signaled a slight uptick in the team’s game-to-game win likelihood, but not enough to dominate the deterministic bracket. The takeaway: models detect faint signals, but the multiplicative improbability of repeating upsets across multiple rounds keeps such streaks elusive to single-bracket strategies.
Lessons for forecasting beyond the tournament
This experiment has implications that extend to political forecasting, epidemiology, financial risk assessment, and any domain where probabilistic models confront rare, impactful events.
- Make probabilities the product, not the byproduct: A model that presents probabilities enables richer downstream decisions than a single recommended action. In many real-world tasks, understanding uncertainty improves strategy.
- Evaluate for calibration and rank, not just point success: A model that is “often right” but overconfident is less useful than a model that accurately reflects uncertainty. Metrics like Brier score and calibration curves should be first-class evaluations.
- Design for objectives: If the objective is to maximize expected value (average performance), a calibrated ensemble will often win. If the objective is to win a winner-take-all pool (top-heavy reward), you need a different decision rule that intentionally pursues variance.
- Embrace ensembling: Combining simple domain knowledge models with richer machine-learned components improves robustness—there are many ways to be wrong, and ensembles smooth idiosyncratic errors.
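The objective-design point can be demonstrated with a toy pool simulation. The favorite probability and the idealized crowd (which picks every favorite) are hypothetical; the sketch shows why contrarian picks lower expected score yet are the only route to beating that crowd outright.

```python
import random

rng = random.Random(0)
p_favorite = 0.65          # hypothetical per-game favorite win probability
n_games, n_sims = 6, 20_000

def pool_sim(n_contrarian):
    """Average score and win rate vs. a crowd that picks all favorites."""
    total_score, pool_wins = 0, 0
    for _ in range(n_sims):
        outcomes = [rng.random() < p_favorite for _ in range(n_games)]
        crowd = sum(outcomes)  # crowd is right whenever the favorite wins
        # Our entry flips the first n_contrarian picks to the underdog.
        mine = sum((not o) if i < n_contrarian else o
                   for i, o in enumerate(outcomes))
        total_score += mine
        pool_wins += mine > crowd  # strict win over the crowd
    return total_score / n_sims, pool_wins / n_sims

chalk_ev, chalk_win = pool_sim(0)  # all favorites: ties the crowd, never beats it
bold_ev, bold_win = pool_sim(2)    # two contrarian picks: lower EV, real win odds
```

Under these numbers the all-favorites entry never finishes ahead of the crowd, while the contrarian entry trades a few tenths of a point of expected score for a genuine chance of taking the pool.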
The future of AI-driven bracketology
As AI models continue to absorb richer real-time signals—player-level tracking, injury reports, and even micro-adjustments such as coach strategies—forecasting will become incrementally stronger. But the core challenges will remain: rare events, structural breaks, and the fundamentally combinatorial nature of bracket outcomes. Improving forecasts will increasingly be about integrating better priors, better probabilistic reasoning, and clearer alignment between model outputs and human objectives.
For the AI news community, the tournament experiment is a reminder of two truths. First, building models that handle uncertainty gracefully is as important as building models that perform well on average. Second, deploying probabilistic outputs requires designing interfaces and decision frameworks that let humans use those probabilities for different ends—conservative play, high-variance play, or portfolio diversification.
What this means for practitioners and readers
If you’re building forecasting systems, treat bracketology as a microcosm for your larger problems. Demand calibration reports. Explicitly state decision objectives. Run scenario-based analyses. If you’re consuming forecasts, ask whether the model gives probabilities and, crucially, whether those probabilities have been tested against historical outcomes.
And if you’re just here for the spectacle: remember that the joy of March Madness is partly its unpredictability. Models can improve our expectations and help us make smarter bets, but they will not and should not strip the tournament of its capacity to surprise.
Next steps in the series
Future installments will dig deeper into model variants (deep sequence models vs. structured ensembles), explore human-AI hybrid bracket strategies, and present a toolkit for constructing bracket portfolios tailored to specific pool payoff rules. We will also make portions of the backtest pipeline public so the AI community can reproduce and extend the work.
In the meantime, the experiment’s core insight stands: AI brings measurable forecasting power to a domain built on narratives and noise. It helps us replace some instant gut reactions with clearer probabilistic thinking—without, crucially, robbing the tournament of its most human quality: the pleasure of the improbable.
We invite the AI news community to follow this series, replicate the experiments, and explore the broader implications for forecasting in other domains. March is ephemeral; the lessons are not.

