Evaluating Reasoning in LLMs Through MTG Deck Building

Update (2025-06-11): o3 (high) added after API cost reduction.

Update (2025-06-06): Deepseek R1 05-28 and Gemini 2.5 Pro 06-05 added.

Update (2025-05-22): Claude Sonnet 4 and Opus 4 added.

Update (2025-05-14): Human Baseline, Gemini 2.5 Pro 03-25, Gemini 1.5 Flash, Deepseek V3 03-24, Qwen3 30B 3A added.

Introduction

Evaluating the advanced reasoning capabilities of Large Language Models (LLMs) requires specialized benchmarks that move beyond surface-level NLP tasks. Building upon my obsession applying AI models to my favorite card game, I’ve created ManaBench, a benchmark designed to probe an LLM’s capacity for reasoning using the collectible card game Magic: The Gathering (MTG) as a proxy. MTG, with its intricate interactions and deep strategic layer, serves as an ideal micro-world to test an LLM’s ability to process extensive contextual information, identify sophisticated patterns, and make judgments that align with expert human strategic choices. This post provides a technical overview of the benchmark’s construction and the methodology used for evaluating LLMs, not as a test of MTG-specific card knowledge, but as a measure of their broader reasoning and problem-solving faculties when faced with a constrained, strategic challenge.

The Deck Completion Task

The core task in ManaBench is as follows: Given a 59-card main deck from a specific MTG constructed format (e.g., Modern, Legacy) - a deck originally constructed by a human player and sourced from tournament results - the LLM must choose the most suitable 60th card from a list of six options. One of these options is the “golden” card - the card that was originally in that slot in the human-designed decklist - while the other five are plausible alternatives generated by Manamorphosis, a diffusion model I trained specifically for completing MTG decks.

This task is non-trivial for an LLM because it demands more than just factual recall about individual cards; it requires:

Strategic Coherence Evaluation: The chosen card must align with the deck’s overall strategy (e.g., aggro, control, combo), a judgment that requires understanding the interplay of the existing 59 cards.
System-Wide Optimization: The card should fit the deck’s mana curve and resource development plan, demonstrating an understanding of resource management within the game system.
Complex Interaction Analysis (Card Synergies): Effective MTG play relies heavily on card interactions. The LLM needs to identify cards that synergize well with the existing 59 cards, showcasing an ability to reason about emergent properties of combined elements.
Contextual Awareness (Format Knowledge): Different MTG formats have distinct card pools and power levels. The choice must be legal and relevant within the specified format, testing the LLM’s ability to operate within defined constraints.
Discernment Against Plausible Alternatives: The alternatives are not entire random but are generated by a model trained for the task, making them potentially attractive but incorrect choices. Successfully identifying the golden card requires fine-grained distinction based on strategic fit.

Benchmark Leaderboard

The human baseline has an admittedly small sample size (just myself, it’s difficult to find skilled MTG players who want to sit through a 200 question test), but still provides a valuable reference. As the creator of the benchmark and a player of the game for 10+ years, I scored 68% agreement with the other human deck builders. This could be a skill issue on my end, but I think it is more likely due to the somewhat subjective nature of question, along with the presentation format. To provide a fair comparison, I made a script that presents that questions to me exactly as they are shown to the LLMs, but I think I could do better using a deck editor. Despite my less than perfect score, I still beat most of the models by a wide margin. Among LLMs, o3 (high) comes closest with 65% accuracy, followed by o3 (low) at 63%, and Gemini 2.5 Pro 06-05 at 57.5%. This aligns with qualitative assessments and other benchmarks where these models often excel in real-world tasks. Their success here suggests that ManaBench is effectively capturing a similar type of reasoning aptitude.

The results also highlight a discernible gap in performance between the leading American models (o3, Gemini 2.5 Pro, Claude) and prominent Chinese models like Deepseek R1 (43.5%) and Qwen3 235B A22B (37%). While these models are undoubtedly powerful, their performance on ManaBench suggests that their reasoning capabilities may not be as developed as some of their US counterparts. This observation underscores the utility of specialized benchmarks like ManaBench in revealing subtle but significant differences in model capabilities that might be obscured by broader, more generalized benchmarks.

Benchmark Construction Methodology

The creation of ManaBench involves several stages, from curating source decks to generating challenging multiple-choice questions. The process is designed to be rigorous and reproducible, with a focus on capturing instances of human expert judgment as reflected in competitive deck choices.

Source Data & Deck Curation

The foundation of the benchmark is a large corpus of human-constructed MTG decklists scraped from MTGTop8 (a public database of MTG tournament results and decklists). The curation process involves:

Format Filtering: Decks are categorized by major constructed formats: Modern, Pioneer, Legacy, and Vintage. The benchmark currently focuses on these eternal formats. Standard is excluded due to its rotating nature, which does not align with the goal of measuring reasoning abilities and minimizing the impact of knowledge cutoff dates, thereby keeping the focus on reasoning skills rather than rapidly changing format knowledge.
Strict Validation: Only decks containing exactly 60 main deck cards and 15 sideboard cards are considered. This ensures that they obey conventional deck-building wisdom and that the creator carefully considered the metagame when constructing the deck.
Format Legality Check: Using the card database from MTGJSON, every card in a potential source deck is verified for legality within its designated format. This step is crucial for ensuring that the decks and subsequent questions are valid within the game’s rules.
Sampling: From the pool of validated decks, 25 decks are randomly sampled for each of the four target formats, resulting in a set of 100 unique, curated decklists for benchmark generation. These decks represent successful strategies designed and tested by human players in tournament play.

Question Generation

For each of the 100 sampled human-constructed decks, two unique questions are generated. This involves selecting a “golden” card to remove and then using the Manamorphosis model to propose alternatives.

Golden Card Selection:
- The “golden” card represents the correct answer for the deck completion task, reflecting a choice consistent with the original human-designed, tournament-sourced deck.
- It is chosen by randomly selecting one unique card name present in the original 60-card main deck. For instance, if a deck contains four copies of “Lightning Bolt,” “Lightning Bolt” is one possible unique card name that could be selected as the golden card.
- To ensure variety in the questions derived from a single deck, the two golden cards selected from the same source deck must be different card names.
Partial Deck Creation:
- Once a golden card name is selected, one instance of this card is removed from the original 60-card main deck list. This creates the 59-card partial deck that will be presented to the LLM.
- For example, if “Island” is chosen as the golden card from a deck containing 10 Islands, the partial deck will contain 9 Islands.
Generating Plausible Alternatives:
- The five incorrect-but-plausible alternatives are generated using a Manamorphosis, a Transformer-based diffusion model custom-trained trained on a vast corpus of MTG decks. It learns to represent cards as high-dimensional embeddings and understands patterns of card co-occurrence and deck structure.
- For benchmark generation, this model takes the 59-card partial deck and, through a reverse diffusion process conditioned on these known cards, predicts embeddings for the missing card. These embeddings are then mapped back to specific card names. This process is designed to generate alternatives that are contextually relevant yet distinct from the golden card.
- For a detailed technical explanation of the diffusion model’s architecture and training process, please refer to the Manamorphosis repository and the accompanying blog post.
- This generation process is repeated to obtain 5 unique card names that are different from the chosen golden card and from each other, serving as challenging distractors for the LLM.

Question Structure (JSON): Each generated question is stored in a JSON object with the following structure:

{
  "id": "question_001",
  "deck": ["Card Name 1", "Card Name 2", "Card Name 3", "... list of 59 card names ..."],
  "golden": "Actual Missing Card Name",
  "alternatives": [
    "Alternative Card Name A",
    "Alternative Card Name B",
    "Alternative Card Name C",
    "Alternative Card Name D",
    "Alternative Card Name E"
  ],
  "format": "modern"
}

Evaluation Protocol

Prompting Strategy

Careful prompt engineering is employed to provide the LLM with sufficient context to make a reasoned judgment, minimizing the need for memorized MTG card knowledge and emphasizing analytical skill:

Role Assigment: The LLM is instructed: "You are an expert Magic: The Gathering player."
Task Definition: The prompt explains that a decklist from a specific format is missing one card and asked to choose the answer that best completes the deck.
Decklist Presentation: The 59-card partial deck is provided. Each card entry includes:
- Its count in the partial deck.
- Its full name.
- Its detailed rules text, including type line, mana cost, power/toughness (for creatures), loyalty (for planeswalkers), and Oracle text. This information is sourced from MTGJSON. The explicit provision of full rules text for all cards in the deck and choices is a key design element, intended to reduce the task’s reliance on the LLM’s pre-existing knowledge of specific cards and instead focus on its ability to reason based on the provided information.
- Example Card Presentation in Prompt: 2x Snapcaster Mage - Creature - Human Wizard - Cost: {1}{U} - P/T: 2/1 - Rules: Flash. When Snapcaster Mage enters the battlefield, target instant or sorcery card in your graveyard gains flashback until end of turn. The flashback cost is equal to its mana cost.
Multiple-Choice Options:
- The six choices (the golden card and the five alternatives) are presented, shuffled randomly to avoid positional bias, and labeled A through F.
- Each choice is also presented with its full name and detailed rules text (sourced from MTGJSON, same as deck cards).
- Example Choice Presentation in Prompt: A) Brainstorm - Instant - Cost: {U} - Rules: Draw three cards, then put two cards from your hand on top of your library in any order.

Answer Format and Extraction

To standardize evaluation, LLMs are instructed to output their final answer in a specific format: "Respond with only the letter of your choice in the format: ANSWER: [LETTER]"

The evaluation script parses this response using the following logic:

Primary Method: Looks for the “ANSWER: [LETTER]” pattern (case-insensitive for “ANSWER:”, extracts the letter).
Fallback 1 (Single Letter): If the primary method fails, it checks if the entire response is a single letter from A to F (e.g., a response like “C”).
Fallback 2 (Boxed Letter): If both above fail, it looks for the pattern $\boxed{LETTER}$ (e.g., “The answer is $\boxed{A}$”) using a regular expression.

Metrics

Accuracy: The primary metric is the percentage of questions the LLM answers correctly by selecting the golden card.

Why ManaBench is a Strong Benchmark for Reasoning

The preliminary results and the design of ManaBench highlight several key strengths for evaluating an LLM’s reasoning capabilities:

Measures Alignment with Human Expert Judgment: The “golden” answers are derived from decks designed by human players and sourced from MTGTop8, a database of tournament decklists. Success on this benchmark therefore indicates an LLM’s ability to make choices that align with established human expertise and strategic consensus.
Clear Performance Differentiation: When comparing ManaBench scores to other established LLM evaluation metrics like LMArena ELO ratings, ManaBench demonstrates a significantly stronger ability to differentiate between models. While a general positive correlation is observed (as indicated by the trendline in the chart below), ManaBench provides a much wider relative spread in scores. For the models included in the comparison, ManaBench accuracies range from 19.5% to 65% (a spread of 45.5 percentage points, representing an increase of approximately 233% from the minimum observed score to the maximum). In contrast, their LMArena ELO scores range from 1257 to 1470 (a spread of 213 ELO points, representing an increase of approximately 17% from the minimum observed ELO score to the maximum). The significantly larger proportional range in ManaBench allows for a more granular distinction between models.
Challenge for Frontier Models: With the exception of o3, most of the models are far from matching human performance, let alone exceeding it. There is a 3 point gap between the top model (o3 (high) at 65%) and the human baseline, and every model except for o3 remains under 60%. This shows that the benchmark remains unsaturated and presents a significant challenge for frontier models.
Test of Generalization vs. Benchmark Overfitting: The complexity of MTG deck construction, the private nature of the benchmark questions, and the fact that MTG strategy is unlikely to be a direct optimization target for most LLM labs, collectively make ManaBench a strong test of generalized reasoning. Performance on this benchmark may reveal whether models are truly capable of applying reasoning to novel, complex systems, or if their high scores on common academic benchmarks (like MMLU or MATH) are partly due to overfitting or memorization of those specific test distributions. For example, a model series like Llama 4, which demonstrated strong performance on many standard benchmarks, gave a much weaker showing here, highlighting the value of diverse, specialized evaluations like ManaBench in assessing true generalization. This aligns with the experiences of many users who reported that Llama 4 struggled with real tasks and underperformed expectations.
Cost-Effectiveness and Efficiency: With 200 questions, the benchmark is relatively concise compared to some larger evaluation suites. This allows for more rapid and cost-effective evaluation cycles, making it feasible to test a wider array of models or fine-tuned variants without incurring prohibitive API costs or excessive computation time, while still providing strong differentiating signals as seen in the results.

Benchmark Integrity

To maintain the integrity and long-term utility of ManaBench as an evaluation tool, the specific benchmark questions and code are not being publicly released at this time. This measure is taken to prevent the benchmark from being inadvertently included in the training data of future LLMs, which would compromise its validity as an unseen test set. If you are a researcher and would like private access, please reach out.

Conclusion

ManaBench offers a novel approach to evaluating the sophisticated reasoning capabilities of Large Language Models by leveraging the strategic depth of Magic: The Gathering. The benchmark’s core “deck completion task” – choosing the optimal 60th card for a 59-card deck - demands an understanding of strategic coherence, system-wide optimization, and complex interactions.

The initial results demonstrate ManaBench’s potential as a strong differentiator of LLM reasoning abilities, with even frontier models finding the task challenging. If you would me to add other models to the leaderboard or just think it’s a cool project, consider checking out my other related work or following me on Twitter

Introduction#

The Deck Completion Task#

Benchmark Leaderboard#

Benchmark Construction Methodology#

Source Data & Deck Curation#

Question Generation#

Evaluation Protocol#

Prompting Strategy#

Answer Format and Extraction#

Metrics#

Why ManaBench is a Strong Benchmark for Reasoning#

Benchmark Integrity#

Conclusion#