This post details Manamorphosis, a first-of-its-kind diffusion model developed to complete Magic: The Gathering decklists. It takes a set of known cards and fills in the rest to form a 60-card main deck. Then, using the completed main deck as context, it can complete a 15-card sideboard. The core generative mechanism is based on Denoising Diffusion Probabilistic Models (DDPMs), the same family of models powering many image generation systems like Stable Diffusion and Midjourney, adapted here for the unique domain of card sets. Applying AI models to MTG has long been a pet project of mine, and I'm excited to share this model, as I believe it is the state-of-the-art (and only) AI model dedicated to understanding deck construction.
Demo Video
Card Representation: Doc2Vec Embeddings
Card identity is represented by 128-dimensional vectors (`EMB_DIM=128`). These embeddings are generated by training a Doc2Vec model (`train_embedding_model.py`) on preprocessed text data for each card obtained from MTGJSON's `AtomicCards.json`. This captures semantic relationships based on card text. Much like modern RAG engines use embeddings to understand the meaning behind search queries and documents, these Doc2Vec embeddings allow the system to grasp the functional similarities between cards based on their textual descriptions (cost, type, rules).
Rationale for Using Embeddings
Instead of training the model to directly predict discrete cards from a vocabulary of ~28,000+, Manamorphosis uses pre-trained Doc2Vec embeddings as an intermediate representation. This approach offers several advantages:
- Handling Data Sparsity: The training dataset (~47,000 decks) contains only a fraction (~5,000) of all legal MTG cards. A model predicting cards directly would struggle to learn meaningful representations or generation logic for the vast majority of cards rarely or never seen during training. The embedding model, trained on all card text, provides a representation for every card.
- Generalization to New/Unseen Cards: Because the embedding is derived from card text, the system can generate an embedding for any card, including newly released ones, without retraining the embedding model (though retraining the diffusion model might improve performance with new metagames). The diffusion model learns to operate on the semantic meaning captured in the 128-dimensional embedding space, rather than being limited to the fixed vocabulary seen during its own training. This contrasts with models that predict discrete tokens, which typically require a fixed, predefined vocabulary. (A short sketch of embedding an unseen card appears after this list.)
- Capturing Semantic Relationships: Doc2Vec learns vectors where cards with similar functions, costs, types, or textual patterns (e.g., different variations of counterspells, cheap red burn spells, evasive creatures) are closer together in the embedding space. This allows the diffusion model to learn higher-level concepts (“needs more removal,” “add card draw”) rather than just memorizing specific card co-occurrences, leading to potentially more robust and contextually relevant deck completions. This focus on semantic similarity is analogous to how embedding-based search engines return results that are conceptually related, not just keyword matches.
- Dimensionality Reduction & Decoupling: Working with dense 128-dimensional vectors is more computationally manageable for the transformer architecture than using extremely high-dimensional one-hot vectors (one per card). It also decouples the task of understanding card text semantics (Doc2Vec) from the task of generative deck construction (Diffusion Model).
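To make the generalization point above concrete, here is a hedged sketch of embedding a card that may never appear in the deck dataset, using gensim's Doc2Vec. The model file name is hypothetical, and `get_text` is assumed to be importable from the preprocessing script described in the next subsection:

```python
# Sketch: embedding an arbitrary (possibly unseen) card with the trained Doc2Vec model.
# Assumes gensim is installed and get_text() is importable from train_embedding_model.py;
# the model file name "card_doc2vec.model" is hypothetical.
from gensim.models.doc2vec import Doc2Vec
from train_embedding_model import get_text

model = Doc2Vec.load("card_doc2vec.model")

new_card = {
    "name": "Lightning Strike",
    "manaCost": "{1}{R}",
    "type": "Instant",
    "text": "Lightning Strike deals 3 damage to any target.",
}

tokens = get_text(new_card).split()             # same preprocessing as training
vec = model.infer_vector(tokens)                # 128-dim embedding, no retraining needed
similar = model.dv.most_similar([vec], topn=5)  # nearest cards in embedding space
print(similar)
```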
Card Text Preprocessing (`train_embedding_model.py:get_text`)
The `get_text` function preprocesses card data into a consistent string format suitable for Doc2Vec training. Specific choices include:
- Mana Cost: Replaced curly braces `{}` with pipe symbols `|` (`|W|`, `|U|`, etc.) and ensured spacing around symbols (`{W}{U}` -> `|W| |U|`). This treats each mana symbol as a distinct token and distinguishes them from mana symbols in the card text.
- Power/Toughness: Represented as `$Power$ #Toughness#` (e.g., `$2$ #2#`). This creates unique tokens for P/T values.
- Card Type: The supertype (e.g., "Creature", "Instant") is split into individual tokens surrounded by pipes (`|Creature|`, `|Instant|`). Subtypes (e.g., "Goblin", "Wizard") are kept as single words.
- Rules Text:
  - Card name references replaced with `@`. This prevents the model from overfitting to specific card names and focuses on the actions/effects.
  - Common self-references ("this creature", "this enchantment", etc.) also replaced with `@`.
  - Line breaks and semicolons replaced with spaces. Colons have a space added before them (`:` -> ` :`).
  - Reminder text (within parentheses) is removed using regex (`re.sub(reminder_remover, '', ...)`).
  - Special characters like `&`, `−`, `—`, `'`, `,`, `.`, and `"` are handled (replaced or removed).
  - Text is converted to lowercase.
- Stop Words: Common English stop words (like "the", "a", "is") are removed using `nltk.corpus.stopwords` to reduce noise and focus on meaningful terms.
This preprocessing aims to convert structured card information and natural language text into a sequence of meaningful tokens that the Doc2Vec model can learn relationships from.
```python
# From train_embedding_model.py (Illustrative snippet)
def get_text(card):
    text = ''
    if 'manaCost' in card:
        text += card['manaCost'].replace('}{', '} {').replace('{', '|').replace('}', '|') + ' '
    if 'power' in card:
        text += '$' + card['power'] + '$ #' + card['toughness'] + '# '
    text += ' '.join(['|' + word + '|' for word in card['type'].split(' — ')[0].split()]) + ' '
    if '—' in card['type']:
        text += card['type'].split(' — ')[1] + ' '
    if 'text' in card:
        # ... (Handling basic lands) ...
        processed_text = card['text'].replace('&', 'and').replace(card['name'], '@').replace(card['name'].split(',')[0], '@')  # simplified
        processed_text = processed_text.replace('this creature', '@').replace('this enchantment', '@').replace('this artifact', '@').replace('this land', '@')
        processed_text = processed_text.replace('\n', ' ').replace(';', ' ').replace(':', ' :').replace('|', '•')
        text += processed_text
    text = re.sub(reminder_remover, '', text.lower())  # ... (further lowercasing, punctuation handling, etc.)
    words = [word for word in text.split(' ') if word != '']
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

# Doc2Vec Training (Conceptual)
model = Doc2Vec(vector_size=128, dm=0, dbow_words=0, min_count=2, epochs=200, workers=cores, seed=42)
model.build_vocab(corpus)  # corpus yields TaggedDocument(processed_text.split(), [card_id])
model.train(corpus, ...)
```
A simple linear classifier (`train_embedding_classifier.py`) is trained separately to map these 128-dim embeddings back to unique card indices. This classifier is essential during the reverse diffusion process to identify the most likely card corresponding to a denoised embedding vector. This is more efficient than performing a cosine similarity search for each generated embedding during inference.
```python
# From train_embedding_classifier.py
class CardClassifier(nn.Module):
    def __init__(self, embedding_dim, num_classes):
        super(CardClassifier, self).__init__()
        self.network = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        return self.network(x)
```
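For intuition, here is a minimal sketch (not the repository's exact inference code) of mapping denoised embeddings back to card names with this classifier; `index_to_name` and the tensor shapes are illustrative assumptions:

```python
import torch

# denoised_emb: [batch, 60, 128] embeddings produced by the reverse diffusion process
# classifier:   trained CardClassifier(embedding_dim=128, num_classes=num_cards)
# index_to_name: assumed index -> card name mapping saved alongside the classifier
with torch.no_grad():
    logits = classifier(denoised_emb)      # [batch, 60, num_cards]
    card_indices = logits.argmax(dim=-1)   # most likely card per slot
deck_names = [[index_to_name[i.item()] for i in row] for row in card_indices]
```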
Denoising Diffusion Probabilistic Model (DDPM) Framework
The core generation mechanism is a DDPM, mirroring the approach used in image generation.
Forward Process (Noise Addition): Starting with the true deck embeddings `x0`, Gaussian noise is progressively added over `T` timesteps (here, `T=1000`). This is analogous to how image diffusion models start with a clear image and gradually add noise until only static remains. The noise level at each step is determined by a predefined variance schedule, specifically a cosine schedule (`cosine_beta_schedule`).

```python
# diffusion_model.py
def cosine_beta_schedule(T, s=0.008):
    steps = torch.linspace(0, T, T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return torch.clip(betas, 0, 0.999).float()

# Within DiffusionTrainer class:
beta = cosine_beta_schedule(T).to(device)  # T = TIMESTEPS (e.g., 1000)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
# Precompute terms used in diffusion and sampling:
self.register("sqrt_alpha_bar", torch.sqrt(alpha_bar))
self.register("sqrt_one_minus_alpha_bar", torch.sqrt(1.0 - alpha_bar))
self.register("beta", beta)
self.register("alpha", alpha)
# ... (other registered buffers)
```
The state `x_t` at timestep `t` can be sampled directly using the cumulative product `alpha_bar`: `x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise`

```python
# diffusion_model.py: DiffusionTrainer.q_sample
# Note: The actual implementation modifies x_t for known cards
# based on the mask *after* the initial noise addition,
# returning a mix of original x0 and noised x_t.
def q_sample(self, x0, t, mask):
    mask_expanded = mask.expand_as(x0)
    sqrt_ab = self._extract(self.sqrt_alpha_bar, t, x0.shape, x0.device)
    sqrt_1mab = self._extract(self.sqrt_one_minus_alpha_bar, t, x0.shape, x0.device)
    noise = torch.randn_like(x0)
    # Calculate the fully noised version first
    x_t_noised = sqrt_ab * x0 + sqrt_1mab * noise
    # Return original embeddings for known positions, noised for unknown
    x_t_masked = mask_expanded * x0 + (1 - mask_expanded) * x_t_noised
    # Returns the masked noisy sample and the *original* noise (for loss)
    return x_t_masked, noise
```
Reverse Process (Denoising): The model learns to predict the noise `epsilon` added at timestep `t`. Starting from pure noise `x_T`, the model iteratively refines the embeddings by predicting the noise `epsilon_pred = model(x_t, x0, sb_x_t, t, mask, sb_mask)` and estimating `x_{t-1}`, until `x0`, the original noise-free main deck, is reached. During inference, each denoising step samples `x_{t-1}` by subtracting the predicted noise `epsilon_pred` from `x_t`, then adding a smaller amount of noise for the next timestep.
Model Architecture (`diffusion_model.py:DiffusionModel`)
The model uses a transformer-based architecture, the same core building block behind Large Language Models like GPT and BERT, but adapted for set-based data rather than sequences. It lacks positional embeddings, treating decks as unordered sets, which differs from typical NLP or vision transformer usage where sequence order is crucial. It has distinct paths for main deck and sideboard processing. The internal model dimension is `model_dim=384`, and the embedding dimension is `EMB_DIM=128`.
Time Embeddings: Timestep `t` is encoded using standard sinusoidal embeddings, processed by separate MLPs for the main deck and sideboard paths.

```python
# diffusion_model.py
def sinusoidal_embedding(t: torch.Tensor, dim: int = EMB_DIM):
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / (half - 1))
    args = t[:, None] * freqs[None]
    emb = torch.cat((args.sin(), args.cos()), dim=-1)
    if dim % 2:
        emb = nn.functional.pad(emb, (0, 1))
    return emb

# Within DiffusionModel.__init__
ff_dim = cfg["dim_feedforward"]  # 3072
self.main_time_mlp = nn.Sequential(
    nn.Linear(EMB_DIM, ff_dim),
    nn.SiLU(),
    nn.Linear(ff_dim, EMB_DIM),
)
self.sb_time_mlp = nn.Sequential(  # For Sideboard Decoder path
    nn.Linear(EMB_DIM, ff_dim),
    nn.SiLU(),
    nn.Linear(ff_dim, EMB_DIM),
)
```
Mask Embeddings: Binary masks (1.0 for known, 0.0 for unknown) are processed by separate MLPs.

```python
# Within DiffusionModel.__init__
self.main_mask_mlp = nn.Sequential(
    nn.Linear(1, EMB_DIM),
    nn.SiLU(),
    nn.Linear(EMB_DIM, EMB_DIM),
)
self.sb_mask_mlp = nn.Sequential(  # For Sideboard Decoder path
    nn.Linear(1, ff_dim),
    nn.SiLU(),
    nn.Linear(ff_dim, EMB_DIM),  # Note: intermediate dim is ff_dim here
)
```
Input Processing: Input embeddings `x_t` (main) or `sb_x_t` (sideboard) are combined via addition with their respective time and mask embeddings, then projected to `model_dim`.

```python
# Within DiffusionModel.forward
sin_emb = sinusoidal_embedding(t, EMB_DIM)  # Shape: [Batch, EMB_DIM]

# Main Deck Input
main_t_emb_flat = self.main_time_mlp(sin_emb)  # Shape: [Batch, EMB_DIM]
main_t_emb = main_t_emb_flat[:, None, :].expand(-1, DECK_SIZE, -1)  # Shape: [Batch, DECK_SIZE, EMB_DIM]
main_mask_emb = self.main_mask_mlp(mask)  # mask shape: [Batch, DECK_SIZE, 1] -> Output: [Batch, DECK_SIZE, EMB_DIM]
h_main = x_t + main_t_emb + main_mask_emb  # x_t shape: [Batch, DECK_SIZE, EMB_DIM]
h_main_proj = self.main_input_proj(h_main)  # Linear(EMB_DIM, model_dim) -> Shape: [Batch, DECK_SIZE, model_dim]

# Sideboard Input
sb_decoder_t_emb_flat = self.sb_time_mlp(sin_emb)  # Shape: [Batch, EMB_DIM]
sb_decoder_t_emb = sb_decoder_t_emb_flat[:, None, :].expand(-1, SIDEBOARD_SIZE, -1)  # Shape: [Batch, SIDEBOARD_SIZE, EMB_DIM]
sb_decoder_mask_emb = self.sb_mask_mlp(sb_mask)  # sb_mask shape: [Batch, SIDEBOARD_SIZE, 1] -> Output: [Batch, SIDEBOARD_SIZE, EMB_DIM]
h_sb = sb_x_t + sb_decoder_t_emb + sb_decoder_mask_emb  # sb_x_t shape: [Batch, SIDEBOARD_SIZE, EMB_DIM]
h_sb_proj = self.sb_input_proj(h_sb)  # Linear(EMB_DIM, model_dim) -> Shape: [Batch, SIDEBOARD_SIZE, model_dim]
```
Main Deck Path (Encoder): Processes `h_main_proj` through a standard `nn.TransformerEncoder` (`layers=8`, `nhead=8`). The output is projected back to `EMB_DIM` to predict main deck noise (`main_noise_pred`).

```python
# Within DiffusionModel.__init__
main_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
self.main_transformer_encoder = nn.TransformerEncoder(main_encoder_layer, num_layers=num_layers)
self.main_output_proj = nn.Linear(model_dim, EMB_DIM)

# Within DiffusionModel.forward
main_encoded = self.main_transformer_encoder(h_main_proj)
main_noise_pred = self.main_output_proj(main_encoded)
```
Sideboard Context Path (Encoder): Processes the original main deck embeddings `x0` (noise-free) through a separate, shallow `nn.TransformerEncoder` (`num_layers=1`) to create context (`sb_context_encoded`). This context is used by the sideboard decoder.

```python
# Within DiffusionModel.__init__
self.sb_context_input_proj = nn.Linear(EMB_DIM, model_dim)
sb_context_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
self.sideboard_context_encoder = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=1)

# Within DiffusionModel.forward
h_sb_context = x0  # Original main deck embeddings
h_sb_context_proj = self.sb_context_input_proj(h_sb_context)
sb_context_encoded = self.sideboard_context_encoder(h_sb_context_proj)
```
Sideboard Path (Decoder): Processes projected sideboard embeddings `h_sb_proj` using a `nn.TransformerDecoder` (`num_layers=1`), conditioned on `sb_context_encoded` via cross-attention (`memory`). The decoder output is passed through another `nn.TransformerEncoder` (`sb_layers=8`) before the final projection back to `EMB_DIM` to predict sideboard noise (`sb_noise_pred`). The use of cross-attention to condition the sideboard generation on the main deck context is conceptually similar to how text-to-image models use cross-attention to condition image generation on a text prompt embedding.

```python
# Within DiffusionModel.__init__
sb_decoder_layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=nhead, ...)
self.sb_transformer_decoder = nn.TransformerDecoder(sb_decoder_layer, num_layers=1)
# Uses the same sb_context_encoder_layer definition for the subsequent encoder
self.sb_transformer_output = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=sb_num_layers)
self.sb_output_proj = nn.Linear(model_dim, EMB_DIM)

# Within DiffusionModel.forward
sb_decoded = self.sb_transformer_decoder(tgt=h_sb_proj, memory=sb_context_encoded)
sb_decoded = self.sb_transformer_output(sb_decoded)
sb_noise_pred = self.sb_output_proj(sb_decoded)
```
Conditioning: Masking Strategy
Conditional generation (deck completion) is handled via masking. Known card embeddings are provided by the user (or determined during training). This mechanism is analogous to providing a starting image and a mask for inpainting in image generation models, or providing a text prompt to guide generation.
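For concreteness, here is a hedged sketch of how a user's partial main deck could be turned into the known-embedding and mask tensors described here. The helper names are illustrative; `card_embeddings.pkl` is the embedding file mentioned earlier and is assumed to be a dict from card name to 128-dim vector:

```python
import pickle
import torch

DECK_SIZE, EMB_DIM = 60, 128

with open("card_embeddings.pkl", "rb") as f:
    card_embeddings = pickle.load(f)  # assumed: dict of card name -> 128-dim vector

def encode_partial_deck(known_cards):
    """known_cards: list of card names (with repeats for multiple copies), length <= 60."""
    x0 = torch.zeros(1, DECK_SIZE, EMB_DIM)
    mask = torch.zeros(1, DECK_SIZE, 1)
    for slot, name in enumerate(known_cards):
        x0[0, slot] = torch.tensor(card_embeddings[name])
        mask[0, slot] = 1.0  # 1.0 = known, 0.0 = to be generated
    return x0, mask

x0_known, mask = encode_partial_deck(["Lightning Bolt"] * 4 + ["Mountain"] * 18)
```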
Training: For each training sample (`x0_embeddings`, `x0_indices`, etc.), multiple masks (`masks_per_deck`) are generated dynamically per deck.
- The number of known main deck cards `k_main` is sampled from partitioned ranges within [1, 59] across the generated masks to ensure diverse `k` values are seen.
- The sideboard count `k_sb` is sampled randomly from [1, 14], with a 50% chance of being forced to 0. This is done so that the model performs well at generating sideboards from scratch, which I expect to be a common use case. (A sketch of this k-sampling appears right after this list.)
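As referenced above, a hedged sketch of how these k values might be sampled; the exact partitioning logic in the repository may differ:

```python
import random

def sample_k_values(masks_per_deck):
    """Sample how many cards are 'known' for each of the masks generated per deck (illustrative)."""
    ks = []
    # Partition [1, 59] into masks_per_deck contiguous ranges so diverse k values are covered
    boundaries = [1 + i * 59 // masks_per_deck for i in range(masks_per_deck + 1)]
    for i in range(masks_per_deck):
        lo, hi = boundaries[i], max(boundaries[i], boundaries[i + 1] - 1)
        k_main = random.randint(lo, hi)
        # Sideboard: 50% chance of 0 known cards, otherwise 1-14
        k_sb = 0 if random.random() < 0.5 else random.randint(1, 14)
        ks.append((k_main, k_sb))
    return ks
```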
Masking Logic (`diffusion_model.py:DiffusionTrainer._create_mask_row`): This function generates a single mask row (shape `[deck_size, 1]`) for a target `k`. It identifies unique available card indices in the current deck (`current_deck_indices`). It samples these unique cards without replacement using weights derived from pre-calculated popularity scores (`self.card_popularity`), slightly favoring less popular cards (`0.5 + score`, where score is `1.0 - normalized_count`). It then iterates through these weighted, shuffled unique cards. For each unique card, with 85% probability it masks (marks as known) all available copies (up to `k` remaining); with 15% probability it masks a random number of available copies (from 1 up to the number available, limited by `k` remaining). This process repeats until `k` positions are marked as known (1.0 in the mask). We typically want to mask all copies of a card at once to make the task harder, so that the model has to learn which cards go together rather than simply adding additional copies of a partially revealed card.

```python
# diffusion_model.py: DiffusionTrainer._create_mask_row (Simplified Pseudocode)
def _create_mask_row(k_target, deck_size, current_deck_indices, popularity_scores):
    mask_row = torch.zeros(deck_size, 1)
    available_indices = torch.ones(deck_size, dtype=torch.bool)
    masked_count = 0
    while masked_count < k_target and available_indices.any():
        # 1. Get unique card indices from currently *available* positions
        unique_cards = torch.unique(current_deck_indices[available_indices])
        if unique_cards.numel() == 0:
            break
        # 2. Calculate sampling weights (favor less popular cards)
        weights = torch.tensor([0.5 + popularity_scores.get(idx.item(), 1.0) for idx in unique_cards])
        # 3. Sample unique cards without replacement based on weights
        perm_indices = torch.multinomial(weights, num_samples=len(unique_cards), replacement=False)
        shuffled_unique_cards = unique_cards[perm_indices]
        for card_idx in shuffled_unique_cards:
            if masked_count >= k_target:
                break
            # 4. Find available positions for this specific card_idx
            potential_pos = (current_deck_indices == card_idx).nonzero(as_tuple=True)[0]
            available_pos = potential_pos[available_indices[potential_pos]]  # Filter by availability
            available_count = len(available_pos)
            if available_count == 0:
                continue
            needed = k_target - masked_count
            # 5. Decide how many copies to mark as known (85% all available, 15% random count)
            if random.random() < 0.85:
                num_to_mask = min(available_count, needed)
            else:
                max_can_mask = min(available_count, needed)
                if max_can_mask <= 0:
                    continue
                num_to_mask = random.randint(1, max(1, max_can_mask))
            # 6. Select specific positions to mask and update mask_row / available_indices
            indices_to_mask = available_pos[torch.randperm(available_count)[:num_to_mask]]
            mask_row[indices_to_mask] = 1.0
            available_indices[indices_to_mask] = False
            masked_count += num_to_mask
    return mask_row
```
Loss Calculation: The MSE loss is computed only between the predicted noise (`main_noise_pred`, `sb_noise_pred`) and the true noise (`noise`, `sb_noise` from `q_sample`) for the unknown (mask value 0.0) card slots. This focuses the model on learning to generate the missing parts.

```python
# diffusion_model.py: DiffusionTrainer.p_losses
main_loss = ((noise - main_noise_pred) * (1 - mask.expand_as(noise))).pow(2).mean()
sb_loss = ((sb_noise - sb_noise_pred) * (1 - sb_mask.expand_as(sb_noise))).pow(2).mean()
total_loss = main_loss + sb_loss
```
Inference: During the reverse diffusion process (sampling `x_{t-1}` from `x_t`), the known card embeddings `x0_known` (provided by the user) are reapplied at each step to guide the generation towards the desired completion. A common approach (simplified; a code sketch follows this list):
- Predict noise: `epsilon_pred = model(x_t, x0_context, sb_x_t, t, mask, sb_mask)`.
- Calculate the parameters (mean, variance) of the distribution `p(x_{t-1} | x_t)` using `x_t`, `t`, and `epsilon_pred` according to the diffusion schedule.
- Sample the potential next state `x_{t-1}_sample` from this distribution (adding noise if `t > 0`, otherwise using the mean).
- Re-apply knowns to the sample: `x_{t-1}_conditioned = mask * x0_known + (1 - mask) * x_{t-1}_sample`.
- Use `x_{t-1}_conditioned` as the input `x_t` for the next step (which produces `x_{t-2}`). The main deck context `sb_context_encoded` for the sideboard decoder is generated from the final denoised main deck embeddings (`x0_main_final`).
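Below is a simplified, hedged sketch of such a conditioned sampling loop for the main deck, using the standard DDPM update. The model call signature and the trainer's buffer names follow the descriptions above, but the repository's exact code may differ:

```python
import torch

@torch.no_grad()
def conditioned_sample(model, trainer, x0_known, mask, sb_x_t, sb_mask, T):
    """Reverse diffusion for the main deck, re-applying known embeddings at every step (sketch)."""
    x_t = torch.randn_like(x0_known)           # start from pure noise
    x_t = mask * x0_known + (1 - mask) * x_t   # knowns fixed from the start
    for t in reversed(range(T)):
        t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
        # Assumed to return (main_noise_pred, sb_noise_pred)
        eps_pred, _ = model(x_t, x0_known, sb_x_t, t_batch, mask, sb_mask)
        beta_t = trainer.beta[t]
        alpha_t = trainer.alpha[t]
        sqrt_1m_ab_t = trainer.sqrt_one_minus_alpha_bar[t]
        # Standard DDPM posterior mean: remove the predicted noise contribution
        mean = (x_t - beta_t / sqrt_1m_ab_t * eps_pred) / torch.sqrt(alpha_t)
        if t > 0:
            x_t = mean + torch.sqrt(beta_t) * torch.randn_like(x_t)  # add noise for next step
        else:
            x_t = mean
        # Re-apply the user-provided (known) embeddings
        x_t = mask * x0_known + (1 - mask) * x_t
    return x_t  # denoised main-deck embeddings (x0_main_final)
```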
Training (`diffusion_model.py:DiffusionTrainer`)
The current model was trained on ~47,000 decks scraped from MTGTop8 and is format-agnostic, with the training data covering Standard, Modern, Pioneer, Pauper, Legacy, and Vintage. The full model contains ~56 million parameters. Due to its small size, training was feasible on consumer hardware, specifically a single Nvidia 3050 Laptop GPU with 4GB of VRAM, taking roughly 4 days to complete 100 epochs.
- Dataset: `DeckDataset` loads decks and filters for exact 60-card main deck / 15-card sideboard counts. It converts card names to the pre-trained Doc2Vec embeddings (`card_embeddings.pkl`) and retrieves the corresponding integer indices using the mapping from the trained classifier (`card_classifier.pt`). Decks with cards missing from the embeddings or the classifier map are skipped. It also calculates card popularity scores based on deck frequency for the masking strategy.
- Optimizer: AdamW (`torch.optim.AdamW`) with learning rate `lr=5e-6` (default) and weight decay `diff_weight_decay=1e-3` (default).
- Objective: Minimize the combined MSE loss `total_loss` described above, calculated over `masks_per_deck` different masks for each deck in the batch.
- Process: A standard PyTorch training loop: iterate over epochs, load batches via a DataLoader, compute the loss with `p_losses`, backpropagate (`total_loss.backward()`), clip gradients (`nn.utils.clip_grad_norm_`), and step the optimizer (`opt.step()`). Checkpoints containing the model state dict, epoch, and config are saved periodically (`torch.save`). A minimal sketch of this loop appears below.
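As referenced in the Process bullet, a stripped-down, hedged version of the loop might look like this; the batch size, the `p_losses` signature, and the dataset item format are assumptions rather than the repository's exact code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Assumed setup: dataset = DeckDataset(...); model = DiffusionModel(cfg); trainer = DiffusionTrainer(model, ...)
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size illustrative
opt = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=1e-3)

for epoch in range(100):
    for x0, x0_indices, sb_x0, sb_indices in loader:          # item format assumed
        t = torch.randint(0, trainer.T, (x0.shape[0],), device=x0.device)
        loss = trainer.p_losses(x0, x0_indices, sb_x0, sb_indices, t)  # signature assumed
        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
    # Periodic checkpoint with model state, epoch, and config
    torch.save({"model": model.state_dict(), "epoch": epoch, "config": cfg}, f"ckpt_{epoch}.pt")
```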
Inference: Enforcing Deck Rules with Iterative Refinement
While the diffusion model learns the underlying patterns of deck construction from the training data, it doesn’t inherently guarantee adherence to strict game rules like the 4-copy limit for non-basic cards or format legality during the raw generation process. To address this, the inference functions employ an iterative refinement strategy after the initial denoising pass:
1. Initial Generation: The standard reverse diffusion process is performed once to generate initial embeddings for all unknown card slots, conditioned on any user-provided cards.
2. Classification & Rule Check: The resulting embeddings (both originally known and newly generated) are converted back to card names using the trained linear classifier (`CardClassifier`). The system then checks for violations:
   - 4-Copy Limit: It counts occurrences of each non-basic card name. For sideboard generation, this count considers cards in both the main deck and the current sideboard iteration.
   - Format Legality: Using preloaded card data (derived from MTGJSON's `AtomicCards.json`), it verifies that each generated card is legal in the specified format (e.g., 'Modern', 'Standard'). Basic lands are exempt from this check.
3. Identify Violations: The system identifies the specific generated card slots that violate either the 4-copy limit or format legality. User-provided cards are never marked for regeneration.
4. Mask Update & Regeneration: A new mask is created. User-provided cards and valid generated cards from the current iteration are marked as "known". Slots corresponding to rule violations are marked as "unknown".
5. Re-run Diffusion: The reverse diffusion sampling process is run again, using the updated mask and the embeddings of the known cards (including the valid generated ones) as fixed context. The model only needs to generate new embeddings for the slots marked as unknown due to rule violations.
6. Repeat: Steps 2-5 are repeated up to a fixed maximum number of refinement iterations (`MAX_REFINEMENT_ITERATIONS`). This loop continues until no rule violations are found among the generated cards or the iteration limit is reached. (A sketch of this loop follows.)
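A hedged sketch of the refinement loop described above; the helpers `sample_fn`, `classify_fn`, `is_basic_land`, and `is_legal`, as well as the iteration cap value, are illustrative stand-ins for the logic described in the steps:

```python
from collections import Counter

MAX_REFINEMENT_ITERATIONS = 5  # illustrative value

def refine_main_deck(sample_fn, classify_fn, x0_known, mask, format_name):
    """mask: length-60 sequence, 1.0 = fixed (user-provided), 0.0 = generated (sketch only)."""
    x_gen = sample_fn(x0_known, mask)                  # step 1: initial reverse diffusion
    for _ in range(MAX_REFINEMENT_ITERATIONS):
        names = classify_fn(x_gen)                     # step 2: embeddings -> card names
        copies_seen = Counter()
        bad_slots = []
        for slot, name in enumerate(names):
            copies_seen[name] += 1
            if mask[slot] == 1.0:
                continue                               # user-provided cards are never regenerated
            over_limit = not is_basic_land(name) and copies_seen[name] > 4
            illegal = not is_basic_land(name) and not is_legal(name, format_name)
            if over_limit or illegal:
                bad_slots.append(slot)                 # step 3: identify violating slots
        if not bad_slots:
            break                                      # step 6: stop once everything is valid
        new_mask = [1.0] * len(names)                  # step 4: valid cards become "known"
        for slot in bad_slots:
            new_mask[slot] = 0.0                       # violating slots are regenerated
        x_gen = sample_fn(x_gen, new_mask)             # step 5: re-run diffusion with updated mask
    return x_gen
```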
This refinement loop significantly improves the likelihood of producing valid and legal deck completions by correcting rule violations after the initial generation, leveraging the classifier and external card data to guide the process without needing to bake these complex constraints directly into the diffusion model’s training objective. The final deck combines the original user input with the cards generated and refined through this process.
Conclusion
Manamorphosis applies diffusion models to Magic: The Gathering deck generation and was a fun but time-consuming project. Developing this system, including embedding tuning, model architecture experiments, and refinement loops, was a substantial effort (two weeks of me procrastinating before finals).
While there’s always more to explore, like testing different embeddings or format specializations, the current model provides a solid foundation for AI-driven deck completion.
The complete code is available on GitHub if you’d like to run it yourself or see the nitty-gritty details: Manamorphosis GitHub repository.
Twitter: @JakeABoggs