Evaluating Reasoning in LLMs Through MTG Deck Building

Update (2026-01-14): Gemini 3 Pro, Gemini 3 Flash, GPT 5.2 (medium), Claude Opus 4.5, and Grok 4.1 Fast added.


Update (2025-08-08): GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4 Turbo 11-06, and GPT-3.5 Turbo added.

Update (2025-08-05): Kimi K2, GPT OSS 120B (low), and GPT OSS 120B (high) added.

Update (2025-07-13): Grok 4, Grok 3, Gemini 2.5 Flash, Claude Sonnet 4 (thinking), and Command A added.

Update (2025-06-11): o3 (high) added after API cost reduction.

Update (2025-06-06): Deepseek R1 05-28 and Gemini 2.5 Pro 06-05 added.

Update (2025-05-22): Claude Sonnet 4 and Opus 4 added.

Update (2025-05-14): Human Baseline, Gemini 2.5 Pro 03-25, Gemini 1.5 Flash, Deepseek V3 03-24, Qwen3 30B 3A added.

Introduction

I have an obsession with applying AI models to my favorite card game, so I created ManaBench, a benchmark for testing reasoning in LLMs using Magic: The Gathering (MTG) deck construction. Deck-building is a good test because the reasons for choosing one card over another are often subtle and require predicting how each card will interact with the rest of the deck during a game. There is also an abundance of data from tournament winners to compare against. This post covers the benchmark’s construction and evaluation methodology.

The core task in ManaBench is as follows: Given a 59-card main deck from a specific MTG constructed format (e.g., Modern, Legacy) - a deck originally constructed by a human player and sourced from tournament results - the LLM must choose the most suitable 60th card from a list of six options. One of these options is the “golden” card - the card that was originally in that slot in the human-designed decklist - while the other five are plausible alternatives generated by Manamorphosis, a diffusion model I trained specifically for completing MTG decks.

This task is difficult because it demands more than just factual recall about individual cards. To score well, models must be able to perform:

Strategic coherence evaluation: the chosen card must align with the deck’s overall strategy (e.g., aggro, control, combo), a judgment that requires understanding the interplay of the existing 59 cards.
System-wide optimization: the card should fit the deck’s mana curve and resource development plan, demonstrating an understanding of resource management within the game system.
Complex interaction analysis (card synergies): effective MTG play relies heavily on card interactions. The LLM needs to identify cards that synergize well with the existing 59 cards, showcasing an ability to reason about emergent properties of combined elements.
Contextual awareness (format knowledge): different MTG formats have distinct card pools and power levels. The choice must be legal and relevant within the specified format, testing the LLM’s ability to operate within defined constraints.
Discernment against plausible alternatives: the alternatives are not entirely random but are generated by a model trained for the task, making them potentially attractive but likely sub-optimal choices. Successfully identifying the golden card requires fine-grained distinction based on strategic fit.

Benchmark Leaderboard

The human baseline has a small sample size (just me, since it’s hard to find skilled MTG players willing to sit through a 200-question test), but is still a useful reference. As the creator of the benchmark and a player of the game for more than a decade, I scored 68% agreement with the other human deck builders. This could be a skill issue on my end, but I think the subjective nature of some questions and the presentation format are bigger factors. To make the comparison fair, I wrote a script that presents the questions to me exactly as they are shown to the LLMs, but I think I could do better using a deck editor. When I first released this benchmark, I beat most of the models by a wide margin. Several months later, GPT-5 came within one point of my score at 67% accuracy, followed by o3 and Gemini 2.5 Pro.

More recently, I evaluated Gemini 3 Pro, which reached 71.50%. I expect scores will plateau somewhere in the 80-90% range, since the “most competitive” card for a deck depends on the exact metagame at a given time. While there are many obviously wrong choices given the limited selection of 6 cards, I think at least some of the questions in the dataset do not have a clear correct answer. If a model scored 95%+, I would suspect it was overfit.

The results also show a gap in performance between the leading American models (GPT-5, Gemini, Claude) and Chinese models like Deepseek R1 (43.5%) and Kimi K2 (41%). These models are powerful, but their performance here suggests their reasoning capabilities may not be as developed as those of some of their US counterparts.

Performance Over Time

This chart tracks the performance of the best scoring model on ManaBench available at each point in time.

Cost Over Time

This chart shows the cheapest cost to reach GPT-4-level accuracy over time.

Accuracy vs. Cost

This chart plots model cost against performance. The x-axis shows a blended cost per million tokens, calculated with a 3:1 weighting of input to output costs. The y-axis shows accuracy on ManaBench. The red line marks the Pareto frontier: no other model is both cheaper and more accurate than a model on it.
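The blended cost and the Pareto frontier can be sketched as follows. This is a minimal illustration, not the script behind the chart: it assumes the 3:1 weighting is normalized as a weighted average, (3 * input + output) / 4, and the model names, prices, and accuracies are hypothetical placeholders.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    input_cost: float   # USD per million input tokens
    output_cost: float  # USD per million output tokens
    accuracy: float     # ManaBench accuracy in percent


def blended_cost(input_cost: float, output_cost: float) -> float:
    """Weighted average with a 3:1 input-to-output weighting (assumed normalization)."""
    return (3 * input_cost + output_cost) / 4


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the models that no cheaper-or-equal model matches or beats on accuracy."""
    # Sort by ascending blended cost; on ties, the most accurate model comes first.
    ordered = sorted(
        points,
        key=lambda p: (blended_cost(p.input_cost, p.output_cost), -p.accuracy),
    )
    frontier = []
    best_accuracy = float("-inf")
    for p in ordered:
        # Keep a model only if it is more accurate than everything cheaper.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Hypothetical models, for illustration only.
models = [
    ModelPoint("model-a", 1.0, 4.0, 40.0),
    ModelPoint("model-b", 2.0, 8.0, 35.0),  # costlier and less accurate than model-a
    ModelPoint("model-c", 4.0, 16.0, 60.0),
]
frontier_names = [p.name for p in pareto_frontier(models)]
```

In this toy data, model-b is dominated by model-a, so only model-a and model-c land on the frontier.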

Benchmark Construction Methodology

The creation of ManaBench involved several stages, from curating source decks to generating challenging multiple-choice questions. The process is designed to be rigorous and reproducible, with a focus on capturing instances of human expert judgment.

Source Data & Deck Curation

The foundation of the benchmark is a large corpus of human-constructed MTG decklists scraped from MTGTop8 (a public database of MTG tournament results and decklists). The curation process involves:

Format filtering: all decks are categorized by format, and only those from non-rotating constructed formats like Modern and Legacy are collected. Standard is excluded because its card pool rotates, which would tie results more closely to knowledge cutoff dates and conflict with the goal of measuring reasoning ability.
Sanity checks: only decks containing exactly 60 main deck cards and 15 sideboard cards are considered. This filters for decks that follow conventional deck-building wisdom and makes it more likely that the creator carefully considered the metagame when constructing the deck.
Format legality enforcement: using the card database from MTGJSON, every card in a potential source deck is verified for legality within its designated format. This step verifies that the decks and subsequent questions are valid within the game’s rules.
Sampling: from the pool of validated decks, 25 decks are randomly sampled for each of the four target formats, resulting in a set of 100 unique, curated decklists for benchmark generation. These decks represent successful strategies designed and tested by human players in tournament play.
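The curation steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual pipeline: the deck dictionaries with "format", "main", and "side" keys are a hypothetical representation, and the `legal_formats` lookup stands in for legality data derived from MTGJSON (its real schema is not reproduced here).

```python
import random


def is_valid_deck(deck: dict, legal_formats: dict[str, set[str]],
                  target_formats: set[str]) -> bool:
    """Apply the format, size, and legality checks to one deck."""
    if deck["format"] not in target_formats:
        return False
    # Sanity check: exactly 60 main deck cards and 15 sideboard cards.
    if sum(deck["main"].values()) != 60 or sum(deck["side"].values()) != 15:
        return False
    # Every card must be legal in the deck's designated format.
    cards = set(deck["main"]) | set(deck["side"])
    return all(deck["format"] in legal_formats.get(card, set()) for card in cards)


def sample_decks(decks: list[dict], legal_formats: dict[str, set[str]],
                 target_formats: set[str], per_format: int = 25, seed: int = 0) -> list[dict]:
    """Randomly sample up to `per_format` validated decks for each target format."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for fmt in sorted(target_formats):
        pool = [d for d in decks
                if d["format"] == fmt and is_valid_deck(d, legal_formats, target_formats)]
        sampled.extend(rng.sample(pool, min(per_format, len(pool))))
    return sampled


# Hypothetical toy data: card name -> formats where the card is legal.
legal_formats = {
    "Lightning Bolt": {"modern", "legacy"},
    "Mountain": {"modern", "legacy"},
    "Brainstorm": {"legacy"},
}
good_deck = {"format": "modern",
             "main": {"Lightning Bolt": 4, "Mountain": 56},
             "side": {"Mountain": 15}}
bad_deck = {"format": "modern",  # Brainstorm is not Modern-legal
            "main": {"Brainstorm": 4, "Mountain": 56},
            "side": {"Mountain": 15}}
curated = sample_decks([good_deck, bad_deck], legal_formats, {"modern"})
```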

Question Generation

For each of the 100 sampled human-constructed decks, two unique questions are generated. This involves selecting a “golden” card to remove and then using the Manamorphosis model to propose alternatives.

Golden card selection:
- The “golden” card represents the correct answer for the deck completion task and matches the original human-designed, tournament-sourced deck.
- It is chosen by randomly selecting one unique card name present in the original 60-card main deck. For instance, if a deck contains four copies of “Lightning Bolt,” “Lightning Bolt” is one possible unique card name that could be selected as the golden card.
- To ensure variety in the questions derived from a single deck, the two golden cards selected from the same source deck must be different card names.
Partial deck creation:
- Once a golden card name is selected, one instance of this card is removed from the original 60-card main deck list. This creates the 59-card partial deck that will be presented to the LLM.
- For example, if “Island” is chosen as the golden card from a deck containing 10 Islands, the partial deck will contain 9 Islands.
Generating plausible alternatives:
- The five incorrect-but-plausible alternatives are generated using Manamorphosis, a Transformer-based diffusion model I trained on a large corpus of MTG decks. It learns to represent cards as high-dimensional embeddings and understands patterns of card co-occurrence and deck structure.
- For benchmark generation, this model takes the 59-card partial deck and, through a reverse diffusion process conditioned on these known cards, predicts embeddings for the missing card. These embeddings are then mapped back to specific card names. This process is designed to generate alternatives that are contextually relevant yet distinct from the golden card.
- For a detailed technical explanation of the diffusion model’s architecture and training process, please refer to the Manamorphosis repository and the accompanying blog post.
- This generation process is repeated to obtain 5 unique card names that are different from the chosen golden card and from each other, serving as challenging distractors for the LLM.
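Golden card selection and partial deck creation can be sketched as below. This is a minimal illustration, not the benchmark's generation code: the function name, the seed handling, and the toy decklist are assumptions, and the Manamorphosis step that produces the five alternatives is not reproduced.

```python
import random
from collections import Counter


def make_partial_decks(main_deck: list[str], n_questions: int = 2, seed: int = 0):
    """Pick distinct golden card names and remove one copy of each from a copy of the deck."""
    rng = random.Random(seed)
    # Selection is over unique card names, so four copies of a card count once.
    unique_names = sorted(set(main_deck))
    # Sampling without replacement guarantees the golden cards differ.
    goldens = rng.sample(unique_names, n_questions)
    questions = []
    for golden in goldens:
        partial = list(main_deck)  # start again from the full 60-card deck
        partial.remove(golden)     # removes exactly one copy
        questions.append((golden, partial))
    return questions


# Toy 60-card deck for illustration only.
deck = ["Island"] * 10 + ["Lightning Bolt"] * 4 + ["Mountain"] * 46
questions = make_partial_decks(deck)
```

Each question starts from the full deck, so the two 59-card partial decks derived from one source deck differ in exactly which card is missing.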

Question structure (JSON): Each generated question is stored in a JSON object with the following structure:

{
  "id": "question_001",
  "deck": ["Card Name 1", "Card Name 2", "Card Name 3", "... list of 59 card names ..."],
  "golden": "Actual Missing Card Name",
  "alternatives": [
    "Alternative Card Name A",
    "Alternative Card Name B",
    "Alternative Card Name C",
    "Alternative Card Name D",
    "Alternative Card Name E"
  ],
  "format": "modern"
}

Evaluation Protocol

Prompting Strategy

Careful prompt engineering is employed to provide the LLM with sufficient context to make a reasoned judgment, minimizing the need for memorized MTG card knowledge so the task tests analytical skill:

Role assignment: the LLM is instructed: "You are an expert Magic: The Gathering player."
Task definition: the prompt explains that a decklist from a specific format is missing one card and asks the model to choose the answer that best completes the deck.
Decklist presentation: the 59-card partial deck is provided. Each card entry includes:
- Its count in the partial deck.
- Its full name.
- Its detailed rules text, including type line, mana cost, power/toughness (for creatures), loyalty (for planeswalkers), and Oracle text. This information is sourced from MTGJSON. The explicit provision of full rules text for all cards in the deck and choices is a key design element, intended to reduce the task’s reliance on the LLM’s pre-existing knowledge of specific cards and instead focus on its ability to reason based on the provided information.
- Example Card Presentation in Prompt: 2x Snapcaster Mage - Creature - Human Wizard - Cost: {1}{U} - P/T: 2/1 - Rules: Flash. When Snapcaster Mage enters the battlefield, target instant or sorcery card in your graveyard gains flashback until end of turn. The flashback cost is equal to its mana cost.
Multiple-choice options:
- The six choices (the golden card and the five alternatives) are presented, shuffled randomly to avoid positional bias, and labeled A through F.
- Each choice is also presented with its full name and detailed rules text (sourced from MTGJSON, same as deck cards).
- Example Choice Presentation in Prompt: A) Brainstorm - Instant - Cost: {U} - Rules: Draw three cards, then put two cards from your hand on top of your library in any order.
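A rough sketch of how the card lines and shuffled choices could be rendered, based on the two examples above. The dictionary keys ("name", "type", "cost", "pt", "loyalty", "text") are hypothetical and are not MTGJSON field names, the "Loyalty:" label is a guess since no planeswalker example is shown, and the alternative cards are placeholders.

```python
import random
import string


def format_card(card: dict) -> str:
    """Render one card as 'Name - Type - Cost: ... - Rules: ...'."""
    parts = [card["name"], card["type"]]
    # Optional fields are included only when the card has them.
    for label, key in [("Cost", "cost"), ("P/T", "pt"),
                       ("Loyalty", "loyalty"), ("Rules", "text")]:
        if card.get(key):
            parts.append(f"{label}: {card[key]}")
    return " - ".join(parts)


def format_deck_line(count: int, card: dict) -> str:
    """Prefix a card line with its count in the partial deck."""
    return f"{count}x {format_card(card)}"


def format_choices(golden: dict, alternatives: list[dict], seed: int = 0):
    """Shuffle the golden card with the alternatives and label them A, B, C, ..."""
    rng = random.Random(seed)
    options = [golden] + list(alternatives)
    rng.shuffle(options)  # random order avoids positional bias
    letters = string.ascii_uppercase[: len(options)]
    lines = [f"{letter}) {format_card(card)}" for letter, card in zip(letters, options)]
    correct = next(l for l, c in zip(letters, options) if c["name"] == golden["name"])
    return lines, correct


snapcaster = {
    "name": "Snapcaster Mage", "type": "Creature - Human Wizard",
    "cost": "{1}{U}", "pt": "2/1",
    "text": ("Flash. When Snapcaster Mage enters the battlefield, target instant or "
             "sorcery card in your graveyard gains flashback until end of turn. "
             "The flashback cost is equal to its mana cost."),
}
brainstorm = {
    "name": "Brainstorm", "type": "Instant", "cost": "{U}",
    "text": ("Draw three cards, then put two cards from your hand on top of "
             "your library in any order."),
}
# Placeholder alternatives, for illustration only.
alternatives = [{"name": f"Alternative {i}", "type": "Instant"} for i in range(1, 6)]

deck_line = format_deck_line(2, snapcaster)
choice_lines, correct_letter = format_choices(brainstorm, alternatives)
```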

Answer Format and Extraction

To standardize evaluation, LLMs are instructed to output their final answer in a specific format: "Respond with only the letter of your choice in the format: ANSWER: [LETTER]"

The evaluation script parses this response using the following logic:

Primary method: looks for the “ANSWER: [LETTER]” pattern (case-insensitive for “ANSWER:”, extracts the letter).
Fallback 1 (single letter): if the primary method fails, it checks if the entire response is a single letter from A to F (e.g., a response like “C”).
Fallback 2 (boxed letter): if both above fail, it looks for the pattern $\boxed{LETTER}$ (e.g., “The answer is $\boxed{A}$”) using a regular expression.
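The three-step parsing logic might look like the sketch below. The regular expressions are my reconstruction from the description, not the evaluation script's actual patterns. Taking the last "ANSWER:" match when several appear, and accepting optional square brackets around the letter, are assumptions.

```python
import re
from typing import Optional

# "ANSWER:" is matched case-insensitively; the letter itself must be A-F.
# The trailing \b stops "ANSWER: Brainstorm" from being read as "B".
ANSWER_RE = re.compile(r"(?i:ANSWER:)\s*\[?([A-F])\b")
BOXED_RE = re.compile(r"\$\\boxed\{([A-F])\}\$")


def extract_answer(response: str) -> Optional[str]:
    """Return the chosen letter, or None if no pattern matches."""
    # Primary method: the "ANSWER: [LETTER]" pattern (last occurrence, by assumption).
    matches = ANSWER_RE.findall(response)
    if matches:
        return matches[-1]
    # Fallback 1: the whole response is a single letter from A to F.
    stripped = response.strip()
    if re.fullmatch(r"[A-F]", stripped):
        return stripped
    # Fallback 2: a LaTeX-style boxed letter such as $\boxed{A}$.
    boxed = BOXED_RE.search(response)
    if boxed:
        return boxed.group(1)
    return None
```

For example, `extract_answer("answer: [B]")` returns `"B"`, while a response that matches none of the patterns returns `None`.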

Metrics

Accuracy: the primary metric is the percentage of questions the LLM answers correctly by selecting the golden card.
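As a small sketch of the metric, assuming each question yields one extracted letter (or `None` when parsing fails) and that unparseable responses count as incorrect, which is my assumption rather than something stated above:

```python
from typing import Optional


def accuracy(predictions: list[Optional[str]], golden_letters: list[str]) -> float:
    """Percentage of questions where the predicted letter is the golden card's letter."""
    if len(predictions) != len(golden_letters):
        raise ValueError("predictions and golden_letters must have the same length")
    # A prediction of None (no parseable answer) never equals a golden letter.
    correct = sum(pred == gold for pred, gold in zip(predictions, golden_letters))
    return 100 * correct / len(golden_letters)


# Illustration: 134 correct answers out of 200 questions.
example_score = accuracy(["A"] * 134 + [None] * 66, ["A"] * 200)
```

With 200 questions, each question is worth 0.5 percentage points, so 134 correct answers correspond to 67%.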

Why ManaBench is a Strong Benchmark for Reasoning

The preliminary results and the design of ManaBench demonstrate several key strengths for evaluating an LLM’s reasoning capabilities:

Measures alignment with human expert judgment: the “golden” answers are derived from decks designed by human players and sourced from MTGTop8, a database of tournament decklists. Success on this benchmark therefore indicates an LLM’s ability to make choices that align with established human expertise.
Clear performance differentiation: compared to other popular evaluation metrics like LMArena Elo ratings, ManaBench does a better job separating models. While a general positive correlation is observed (as indicated by the trendline in the chart below), ManaBench provides a much wider relative spread in scores. For the models included in the comparison, ManaBench accuracies range from 19.5% to 67% (a spread of 47.5 percentage points, representing an increase of approximately 244% from the minimum observed score to the maximum). In contrast, their LMArena Elo scores range from 1257 to 1481 (a spread of 224 Elo points, representing an increase of approximately 17.8% from the minimum observed Elo score to the maximum). The larger proportional range in ManaBench allows for a more granular distinction between models.

Challenge for frontier models: as of the GPT-5 update, GPT-5 reached 67%, just 1 point shy of the 68% human baseline, with o3 (high) at 65% and o3 (low) at 63%, and only GPT-5 and o3 exceeded 60%. Gemini 3 Pro has since reached 71.5%, but scores remain well short of saturation, so the benchmark continues to present a real challenge for frontier models.
Test of generalization vs. benchmark overfitting: the complexity of MTG deck construction, the private nature of the benchmark questions, and the fact that MTG strategy is unlikely to be a direct optimization target for most AI labs, collectively make ManaBench a strong test of generalized reasoning. Performance on this benchmark may reveal whether models are truly capable of applying reasoning to novel, complex systems, or if their high scores on common academic benchmarks (like MMLU or MATH) are partly due to overfitting. For example, a model series like Llama 4, which demonstrated strong performance on many standard benchmarks, gave a much weaker showing here. This aligns with the experiences of many users who reported that Llama 4 struggled with real tasks and underperformed expectations.
Cost-effectiveness and efficiency: with 200 questions, the benchmark is relatively small compared to some larger evaluation suites. This allows for more rapid and cost-effective evaluation cycles, making it feasible to test a wider array of models or fine-tuned variants without incurring prohibitive API costs or excessive computation time, while still providing a strong differentiating signal.
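The relative spread figures quoted in the ManaBench vs. LMArena comparison above can be reproduced with a few lines of arithmetic. The function name is just for illustration; the numbers are the minimum and maximum values given in the text.

```python
def relative_increase(lowest: float, highest: float) -> float:
    """Percentage increase from the lowest observed score to the highest."""
    return 100 * (highest - lowest) / lowest


# ManaBench accuracy range: 19.5% to 67% (a 47.5 point spread).
manabench_increase = relative_increase(19.5, 67.0)

# LMArena Elo range: 1257 to 1481 (a 224 point spread).
lmarena_increase = relative_increase(1257, 1481)

print(f"ManaBench: {manabench_increase:.0f}%")  # about 244%
print(f"LMArena:   {lmarena_increase:.1f}%")    # about 17.8%
```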

Conclusion

The initial results demonstrate ManaBench’s potential as a benchmark, with even frontier models finding the task challenging. To maintain the integrity and long-term utility of the benchmark, the specific questions are not being publicly released at this time. This is to prevent them from being inadvertently included in the training data of future LLMs, which would compromise its validity as an unseen test set.