This post details Manamorphosis, a first-of-its-kind diffusion model I developed to complete Magic: The Gathering decklists. It takes a partial list and fills in the rest to form a 60-card main deck. Subsequently, using the completed main deck as context, it can complete a 15-card sideboard. The core generative mechanism is based on Denoising Diffusion Probabilistic Models (DDPMs), the same family of models powering image generation systems like Stable Diffusion and Midjourney, but adapted here to produce sets of cards. As far as I know, this is the state-of-the-art (and only) AI model trained specifically for this task.
The complete code is available on GitHub if you’d like to run it yourself or see the full implementation.
Demo Video
Card Representation: Doc2Vec Embeddings
Card identity is represented by 128-dimensional vectors. These embeddings are generated by training a Doc2Vec model on preprocessed text data for each card obtained from MTGJSON’s AtomicCards.json. This captures semantic relationships based on card text. Much like how modern RAG engines use embeddings to understand the meaning behind search queries and documents, these Doc2Vec embeddings allow the system to grasp the functional similarities between cards based on their descriptions (cost, type, rules text).
Instead of training the model to directly predict discrete cards from a vocabulary of ~28,000+, Manamorphosis uses pre-trained Doc2Vec embeddings as an intermediate representation. This approach offers several advantages:
- Handling data sparsity: the training dataset (~47,000 decks) contains only a fraction (~5,000) of all legal MTG cards. A model predicting cards directly would struggle to learn meaningful representations or generation logic for the vast majority of cards rarely or never seen during training. The embedding model, trained on all card text, provides a representation for every card.
- Generalization to new and unseen cards: because the embedding is derived from card text, the system can generate an embedding for any card, including newly released ones, without retraining the embedding model (though retraining the diffusion model might improve performance with new metagames). The diffusion model learns to operate on the semantic meaning captured in the 128-dimensional embedding space, rather than being limited to the fixed vocabulary seen during its own training.
- Capturing semantic relationships: Doc2Vec learns vectors where cards with similar functions, costs, types, or textual patterns (e.g., different variations of counterspells, cheap red burn spells, evasive creatures) are closer together in the embedding space. This allows the diffusion model to learn higher-level concepts (“needs more removal,” “add card draw”) rather than just memorizing specific card co-occurrences, leading to more contextually relevant deck completions. This focus on semantic similarity is analogous to how embedding-based search engines return results that are conceptually related rather than exact keyword matches.
- Dimensionality reduction and decoupling: working with dense 128-dimensional vectors is more computationally manageable for the transformer architecture than using extremely high-dimensional one-hot vectors (one per card). It also decouples the task of understanding card text semantics (Doc2Vec) from the task of generative deck construction (Diffusion Model).
Before training the Doc2Vec model, a standardized “document” is created for each card by applying the following transformations to the MTGJSON data:
- Mana cost: curly braces {} are replaced with pipe symbols | (|W|, |U|, etc.), with a space inserted between adjacent symbols ({W}{U} -> |W| |U|). This treats each mana symbol as a distinct token and distinguishes the casting cost from mana symbols that appear in the rules text.
- Power/toughness: represented as $Power$ #Toughness# (e.g., $2$ #2#). This creates unique tokens for P/T values.
- Card type: everything before the dash on the type line (supertypes and card types, e.g., “Legendary”, “Creature”, “Instant”) is split into individual tokens surrounded by pipes (|Creature|, |Instant|). Subtypes (e.g., “Goblin”, “Wizard”) are kept as plain words.
- Rules text:
  - Card name references are replaced with @. This prevents the model from overfitting to specific card names and focuses on the actions/effects.
  - Common self-references (“this creature”, “this enchantment”, etc.) are also replaced with @.
  - Line breaks and semicolons are replaced with spaces. A space is inserted before each colon (“:” becomes “ :”).
  - Reminder text (within parentheses) is removed using regex (re.sub(reminder_remover, '', ...)).
  - Special characters (ampersands, minus signs, long dashes, apostrophes, commas, periods, double quotes) are handled (replaced or removed).
  - Text is converted to lowercase.
- Stop words: common English stop words (like “the”, “a”, “is”) are removed using nltk.corpus.stopwords to reduce noise and focus on meaningful terms.
    # From train_embedding_model.py (illustrative snippet)
    # reminder_remover (a regex) and stop_words (from nltk) are defined elsewhere in the script.
    import re

    def get_text(card):
        text = ''
        if 'manaCost' in card:
            # {W}{U} -> |W| |U|
            text += card['manaCost'].replace('}{', '} {').replace('{', '|').replace('}', '|') + ' '
        if 'power' in card:
            text += '$' + card['power'] + '$ #' + card['toughness'] + '# '
        # Everything before the dash becomes pipe-wrapped tokens; subtypes stay plain
        text += ' '.join(['|' + word + '|' for word in card['type'].split(' — ')[0].split()]) + ' '
        if '—' in card['type']:
            text += card['type'].split(' — ')[1] + ' '
        if 'text' in card:
            # ... (Handling basic lands) ...
            processed_text = card['text'].replace('&', 'and').replace(card['name'], '@').replace(card['name'].split(',')[0], '@')  # simplified
            processed_text = processed_text.replace('this creature', '@').replace('this enchantment', '@').replace('this artifact', '@').replace('this land', '@')
            processed_text = processed_text.replace('\n', ' ').replace(';', ' ').replace(':', ' :').replace('|', '•')
            text += processed_text
        text = re.sub(reminder_remover, '', text.lower())  # ... (punctuation handling elided)
        words = [word for word in text.split(' ') if word != '']
        filtered_words = [word for word in words if word not in stop_words]
        return ' '.join(filtered_words)
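To make the tokenization concrete, here is a minimal, dependency-free sketch of only the mana cost, power/toughness, and type-line steps applied to one card. The helper name header_tokens and the hand-written card dict are assumptions for illustration (the dict mimics the MTGJSON fields used above); rules text, lowercasing, and stop-word removal are left out.

```python
def header_tokens(card):
    """Tokenize mana cost, P/T and type line the way get_text does (rules text omitted)."""
    parts = []
    if 'manaCost' in card:
        # {1}{G} -> |1| |G|
        parts.append(card['manaCost'].replace('}{', '} {').replace('{', '|').replace('}', '|'))
    if 'power' in card:
        parts.append('$' + card['power'] + '$ #' + card['toughness'] + '#')
    # Split the type line at the dash: left side gets pipes, right side stays plain
    main_types, _, subtypes = card['type'].partition(' — ')
    parts.extend('|' + word + '|' for word in main_types.split())
    if subtypes:
        parts.append(subtypes)
    return ' '.join(parts)

# Hypothetical card entry, hand-written in the shape of an MTGJSON record
bears = {'name': 'Grizzly Bears', 'manaCost': '{1}{G}', 'power': '2',
         'toughness': '2', 'type': 'Creature — Bear'}
print(header_tokens(bears))  # |1| |G| $2$ #2# |Creature| Bear
```

The pipes, dollar signs, and hashes keep a casting cost of 2 or a power of 2 from colliding with a plain "2" that appears in rules text.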
After building the dataset, Gensim makes training easy:
    # Doc2Vec training
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(vector_size=128, dm=0, dbow_words=0, min_count=2, epochs=200, workers=cores, seed=42)
    model.build_vocab(corpus)  # corpus yields TaggedDocument(processed_text.split(), [card_id])
    model.train(corpus, ...)
A simple linear classifier is also trained to map these embeddings back to unique card indices. This classifier is used during inference to identify the cards corresponding to each denoised embedding vector. The alternative is to perform a cosine similarity search over the entire set of possible cards, but this is inefficient.
    # From train_embedding_classifier.py
    import torch.nn as nn

    class CardClassifier(nn.Module):
        def __init__(self, embedding_dim, num_classes):
            super(CardClassifier, self).__init__()
            # A single linear layer: one logit per card in the vocabulary
            self.network = nn.Linear(embedding_dim, num_classes)

        def forward(self, x):
            return self.network(x)
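For comparison, the cosine-similarity alternative mentioned above can be sketched without any dependencies. The function names and the three toy 3-dimensional vectors are made up for illustration; the real embeddings are 128-dimensional and come from the Doc2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_card(query, card_vectors):
    """Return the card name whose embedding is most similar to the query vector."""
    return max(card_vectors, key=lambda name: cosine(query, card_vectors[name]))

# Toy vectors, invented for this example
card_vectors = {
    'Counterspell': [0.9, 0.1, 0.0],
    'Lightning Bolt': [0.0, 1.0, 0.1],
    'Llanowar Elves': [0.1, 0.0, 0.95],
}
denoised = [0.1, 0.8, 0.2]  # stand-in for one denoised deck slot
print(nearest_card(denoised, card_vectors))  # Lightning Bolt
```

This search has to be repeated for every slot in the deck, and each lookup compares the query against every card in the pool.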
Diffusion
Forward process (noise addition): starting with the true deck embeddings x0, Gaussian noise is progressively added over T timesteps (here, T=1000). The noise level at each step is determined by a predefined variance schedule, specifically a cosine schedule (cosine_beta_schedule).

    # diffusion_model.py
    def cosine_beta_schedule(T, s=0.008):
        steps = torch.linspace(0, T, T + 1, dtype=torch.float64)
        alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi * 0.5) ** 2
        alpha_bar = alpha_bar / alpha_bar[0]
        betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
        return torch.clip(betas, 0, 0.999).float()

    # Within DiffusionTrainer class:
    beta = cosine_beta_schedule(T).to(device)  # T = TIMESTEPS (e.g., 1000)
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    # Precompute terms used in diffusion and sampling:
    self.register("sqrt_alpha_bar", torch.sqrt(alpha_bar))
    self.register("sqrt_one_minus_alpha_bar", torch.sqrt(1.0 - alpha_bar))
    self.register("beta", beta)
    self.register("alpha", alpha)
    # ... (other registered buffers)

The state x_t at timestep t can be sampled directly using the cumulative product alpha_bar:

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise

    # diffusion_model.py: DiffusionTrainer.q_sample
    # Note: The actual implementation modifies x_t for known cards
    # based on the mask *after* the initial noise addition,
    # returning a mix of original x0 and noised x_t.
    def q_sample(self, x0, t, mask):
        mask_expanded = mask.expand_as(x0)
        sqrt_ab = self._extract(self.sqrt_alpha_bar, t, x0.shape, x0.device)
        sqrt_1mab = self._extract(self.sqrt_one_minus_alpha_bar, t, x0.shape, x0.device)
        noise = torch.randn_like(x0)
        # Calculate the fully noised version first
        x_t_noised = sqrt_ab * x0 + sqrt_1mab * noise
        # Return original embeddings for known positions, noised for unknown
        x_t_masked = mask_expanded * x0 + (1 - mask_expanded) * x_t_noised
        # Returns the masked noisy sample and the *original* noise (for loss)
        return x_t_masked, noise

Reverse process (denoising): the model learns to predict the noise epsilon added at timestep t. Starting from pure noise x_T, the model iteratively refines the embeddings by predicting the noise epsilon_pred = model(x_t, x0, sb_x_t, t, mask, sb_mask) and estimating x_{t-1}, until it reaches an estimate of x0, the noise-free main deck. During inference, each denoising step computes the mean of x_{t-1} by subtracting a scaled copy of the predicted noise epsilon_pred from x_t and rescaling the result, then adds a smaller amount of fresh noise for the next timestep (no noise is added at the final step).
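The reverse step described above can be written out for a single embedding vector. This is a dependency-free sketch of the standard DDPM update, not the repository's sampler: the names cosine_schedule and ddpm_step are made up, and the variance choice sigma_t^2 = beta_t is an assumption, since the post does not say which variance the sampler uses.

```python
import math
import random

def cosine_schedule(T, s=0.008):
    """Betas (clipped to 0.999) and cumulative alpha_bar values for the cosine schedule."""
    f = [math.cos(((i / T) + s) / (1 + s) * math.pi / 2) ** 2 for i in range(T + 1)]
    betas = [min(max(1 - f[i + 1] / f[i], 0.0), 0.999) for i in range(T)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1 - b
        alpha_bars.append(running)
    return betas, alpha_bars

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng=random):
    """One reverse step x_t -> x_{t-1} for a single vector (list of floats)."""
    beta, alpha, alpha_bar = betas[t], 1 - betas[t], alpha_bars[t]
    coef = beta / math.sqrt(1 - alpha_bar)
    # Posterior mean: remove the scaled predicted noise, then rescale
    mean = [(x - coef * e) / math.sqrt(alpha) for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # final step: no noise is added
    sigma = math.sqrt(beta)  # assumption: the simple sigma_t^2 = beta_t choice
    return [m + sigma * rng.gauss(0, 1) for m in mean]

betas, alpha_bars = cosine_schedule(1000)
x0 = [0.5, -1.0, 2.0]
noise = [random.gauss(0, 1) for _ in x0]
# Forward-noise x0 to t=0, then undo it using the *true* noise as the prediction
x_t = [math.sqrt(alpha_bars[0]) * x + math.sqrt(1 - alpha_bars[0]) * n
       for x, n in zip(x0, noise)]
recovered = ddpm_step(x_t, noise, 0, betas, alpha_bars)
print(max(abs(a - b) for a, b in zip(recovered, x0)))  # tiny: float rounding only
```

With a perfect noise prediction the last step recovers x0 exactly (up to rounding), which is a quick sanity check that the forward and reverse formulas agree.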
Model Architecture
The model uses a transformer-based architecture, the same building block behind Large Language Models like the GPT series and BERT. Manamorphosis lacks positional embeddings, treating decks as unordered sets, which differs from typical NLP or vision transformer usage where sequence order matters. It has distinct paths for main deck and sideboard processing. The internal model dimension is model_dim=384, and the embedding dimension is EMB_DIM=128.
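The effect of dropping positional embeddings can be checked with a toy example. The sketch below is a simplified single-head self-attention with identity projections (an assumption made to keep it short; the learned projections and feedforward layers in a real transformer act on each slot independently, so they do not change the argument). Reordering the input slots simply reorders the outputs.

```python
import math

def self_attention(xs):
    """Single-head dot-product self-attention, identity projections, no positional terms."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) / total for i in range(d)])
    return out

deck = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy "card" vectors
perm = [2, 0, 1]                             # reorder the slots
shuffled = [deck[i] for i in perm]

out = self_attention(deck)
out_shuffled = self_attention(shuffled)
# Each card gets the same output (up to float rounding) wherever it sits in the list
same = all(math.isclose(a, b, abs_tol=1e-12)
           for i, j in enumerate(perm)
           for a, b in zip(out_shuffled[i], out[j]))
print(same)  # True
```

This is the property that lets the model treat a deck as a set: the order in which the 60 slots are listed carries no information.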
Time embeddings: timestep t is encoded using standard sinusoidal embeddings, processed by separate MLPs for main deck and sideboard paths.

    # diffusion_model.py
    def sinusoidal_embedding(t: torch.Tensor, dim: int = EMB_DIM):
        half = dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / (half - 1))
        args = t[:, None] * freqs[None]
        emb = torch.cat((args.sin(), args.cos()), dim=-1)
        if dim % 2:
            emb = nn.functional.pad(emb, (0, 1))
        return emb

    # Within DiffusionModel.__init__
    ff_dim = cfg["dim_feedforward"]  # 3072
    self.main_time_mlp = nn.Sequential(
        nn.Linear(EMB_DIM, ff_dim),
        nn.SiLU(),
        nn.Linear(ff_dim, EMB_DIM),
    )
    self.sb_time_mlp = nn.Sequential(  # For Sideboard Decoder path
        nn.Linear(EMB_DIM, ff_dim),
        nn.SiLU(),
        nn.Linear(ff_dim, EMB_DIM),
    )

Mask embeddings: binary masks (1.0 for known, 0.0 for unknown) are processed by separate MLPs.

    # Within DiffusionModel.__init__
    self.main_mask_mlp = nn.Sequential(
        nn.Linear(1, EMB_DIM),
        nn.SiLU(),
        nn.Linear(EMB_DIM, EMB_DIM),
    )
    self.sb_mask_mlp = nn.Sequential(  # For Sideboard Decoder path
        nn.Linear(1, ff_dim),
        nn.SiLU(),
        nn.Linear(ff_dim, EMB_DIM),  # Note: intermediate dim is ff_dim here
    )

Input processing: input embeddings x_t (main) or sb_x_t (sideboard) are combined via addition with their respective time and mask embeddings, then projected to model_dim.

    # Within DiffusionModel.forward
    sin_emb = sinusoidal_embedding(t, EMB_DIM)  # [Batch, EMB_DIM]

    # Main deck input
    main_t_emb_flat = self.main_time_mlp(sin_emb)  # [Batch, EMB_DIM]
    main_t_emb = main_t_emb_flat[:, None, :].expand(-1, DECK_SIZE, -1)  # [Batch, DECK_SIZE, EMB_DIM]
    main_mask_emb = self.main_mask_mlp(mask)  # mask: [Batch, DECK_SIZE, 1] -> [Batch, DECK_SIZE, EMB_DIM]
    h_main = x_t + main_t_emb + main_mask_emb  # x_t: [Batch, DECK_SIZE, EMB_DIM]
    h_main_proj = self.main_input_proj(h_main)  # Linear(EMB_DIM, model_dim) -> [Batch, DECK_SIZE, model_dim]

    # Sideboard input
    sb_decoder_t_emb_flat = self.sb_time_mlp(sin_emb)  # [Batch, EMB_DIM]
    sb_decoder_t_emb = sb_decoder_t_emb_flat[:, None, :].expand(-1, SIDEBOARD_SIZE, -1)  # [Batch, SIDEBOARD_SIZE, EMB_DIM]
    sb_decoder_mask_emb = self.sb_mask_mlp(sb_mask)  # sb_mask: [Batch, SIDEBOARD_SIZE, 1] -> [Batch, SIDEBOARD_SIZE, EMB_DIM]
    h_sb = sb_x_t + sb_decoder_t_emb + sb_decoder_mask_emb  # sb_x_t: [Batch, SIDEBOARD_SIZE, EMB_DIM]
    h_sb_proj = self.sb_input_proj(h_sb)  # Linear(EMB_DIM, model_dim) -> [Batch, SIDEBOARD_SIZE, model_dim]

Main deck path (encoder): processes h_main_proj through a standard nn.TransformerEncoder (layers=8, nhead=8). The output is projected back to EMB_DIM to predict main deck noise (main_noise_pred).

    # Within DiffusionModel.__init__
    main_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
    self.main_transformer_encoder = nn.TransformerEncoder(main_encoder_layer, num_layers=num_layers)
    self.main_output_proj = nn.Linear(model_dim, EMB_DIM)

    # Within DiffusionModel.forward
    main_encoded = self.main_transformer_encoder(h_main_proj)
    main_noise_pred = self.main_output_proj(main_encoded)

Sideboard context path (encoder): processes the original main deck embeddings x0 (noise-free) through a separate, shallow transformer encoder to create context (sb_context_encoded). This context is used by the sideboard decoder.

    # Within DiffusionModel.__init__
    self.sb_context_input_proj = nn.Linear(EMB_DIM, model_dim)
    sb_context_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
    self.sideboard_context_encoder = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=1)

    # Within DiffusionModel.forward
    h_sb_context = x0  # Original main deck embeddings
    h_sb_context_proj = self.sb_context_input_proj(h_sb_context)
    sb_context_encoded = self.sideboard_context_encoder(h_sb_context_proj)

Sideboard path (decoder): processes projected sideboard embeddings h_sb_proj using a transformer decoder, conditioned on sb_context_encoded via cross-attention. The decoder output is passed through another encoder before the final projection back to EMB_DIM to predict sideboard noise (sb_noise_pred). The use of cross-attention to condition the sideboard generation on the main deck context is similar to how text-to-image models condition image generation on a text prompt.

    # Within DiffusionModel.__init__
    sb_decoder_layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=nhead, ...)
    self.sb_transformer_decoder = nn.TransformerDecoder(sb_decoder_layer, num_layers=1)
    # Uses the same sb_context_encoder_layer definition for the subsequent encoder
    # (nn.TransformerEncoder deep-copies the layer, so no weights are shared)
    self.sb_transformer_output = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=sb_num_layers)
    self.sb_output_proj = nn.Linear(model_dim, EMB_DIM)

    # Within DiffusionModel.forward
    sb_decoded = self.sb_transformer_decoder(tgt=h_sb_proj, memory=sb_context_encoded)
    sb_decoded = self.sb_transformer_output(sb_decoded)
    sb_noise_pred = self.sb_output_proj(sb_decoded)
Conditioning: Masking Strategy
Conditional generation (deck completion) is handled via masking. Known card embeddings are provided by the user (or determined during training). This mechanism is analogous to providing a starting image and a mask for inpainting in image generation models, or providing a text prompt to guide generation.
Training: for each training sample (x0_embeddings, x0_indices, etc.), multiple masks (masks_per_deck) are generated dynamically per deck.
- The number of known main deck cards k_main is sampled from partitioned ranges of [1, 59] across the generated masks to ensure diverse k values are seen.
- Sideboard k_sb is sampled randomly from [1, 14], with a 50% chance of being forced to 0. This is done so that the model performs well at generating sideboards from scratch, which I expect to be a common use case.
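One way to read the sampling of k described above is sketched below. The exact partitioning scheme is not spelled out in the post, so splitting [1, 59] into equal contiguous slices (one per mask) is an assumption, as are the function names sample_k_main and sample_k_sb.

```python
import random

def sample_k_main(masks_per_deck, low=1, high=59, rng=random):
    """One k per mask, each drawn from its own contiguous slice of [low, high]."""
    span = high - low + 1
    ks = []
    for i in range(masks_per_deck):
        start = low + (i * span) // masks_per_deck
        end = low + ((i + 1) * span) // masks_per_deck - 1
        # max() guards against empty slices when masks_per_deck exceeds the span
        ks.append(rng.randint(start, max(start, end)))
    return ks

def sample_k_sb(rng=random, p_zero=0.5, low=1, high=14):
    """Sideboard k: forced to 0 half the time, otherwise uniform on [low, high]."""
    return 0 if rng.random() < p_zero else rng.randint(low, high)

print(sample_k_main(4))  # four values, one from each of 1-14, 15-29, 30-44, 45-59
print(sample_k_sb())     # 0 about half the time, otherwise 1-14
```

Drawing one k from each slice guarantees that every deck is seen with few, some, and many known cards within the same batch of masks, rather than leaving that coverage to chance.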
Masking logic: this function generates a single mask row (shape [deck_size, 1]) for a target k. It identifies the unique card indices still available in the current deck (current_deck_indices) and samples them without replacement, using weights derived from pre-calculated popularity scores (self.card_popularity) that slightly favor less popular cards (0.5 + score, where score is 1.0 - normalized_count). It then iterates through these weighted, shuffled unique cards. For each unique card, with 85% probability it marks all available copies as known (up to the k remaining); with 15% probability it marks a random number of copies as known (from 1 up to the available count, limited by the k remaining). This repeats until k positions are marked as known (1.0 in the mask). Marking all copies of a card as known at once makes the task harder: the remaining unknown slots must then be filled with different cards, so the model has to learn which cards go together rather than just adding more copies of cards it can already see.

    # diffusion_model.py: DiffusionTrainer._create_mask_row (simplified pseudocode)
    import random
    import torch

    def _create_mask_row(k_target, deck_size, current_deck_indices, popularity_scores):
        mask_row = torch.zeros(deck_size, 1)
        available_indices = torch.ones(deck_size, dtype=torch.bool)
        masked_count = 0
        while masked_count < k_target and available_indices.any():
            # 1. Get unique card indices from currently *available* positions
            unique_cards = torch.unique(current_deck_indices[available_indices])
            if unique_cards.numel() == 0:
                break
            # 2. Calculate sampling weights (favor less popular)
            weights = torch.tensor([0.5 + popularity_scores.get(idx.item(), 1.0) for idx in unique_cards])
            # 3. Sample unique cards without replacement based on weights
            perm_indices = torch.multinomial(weights, num_samples=len(unique_cards), replacement=False)
            shuffled_unique_cards = unique_cards[perm_indices]
            for card_idx in shuffled_unique_cards:
                if masked_count >= k_target:
                    break
                # 4. Find available positions for this specific card_idx
                potential_pos = (current_deck_indices == card_idx).nonzero(as_tuple=True)[0]
                available_pos = potential_pos[available_indices[potential_pos]]  # Filter by available
                available_count = len(available_pos)
                if available_count == 0:
                    continue
                needed = k_target - masked_count
                # 5. Decide how many copies to mark as known (85% all available, 15% random count)
                if random.random() < 0.85:
                    num_to_mask = min(available_count, needed)
                else:
                    max_can_mask = min(available_count, needed)
                    if max_can_mask <= 0:
                        continue
                    num_to_mask = random.randint(1, max(1, max_can_mask))
                # 6. Select specific positions and update mask_row/available_indices
                indices_to_mask = available_pos[torch.randperm(available_count)[:num_to_mask]]
                mask_row[indices_to_mask] = 1.0
                available_indices[indices_to_mask] = False
                masked_count += num_to_mask
        return mask_row

Loss calculation: the MSE loss is computed only between the predicted noise (main_noise_pred, sb_noise_pred) and the true noise (noise, sb_noise from q_sample) for the unknown (mask value 0.0) card slots. This focuses the model on learning to generate the missing parts.

    # diffusion_model.py: DiffusionTrainer.p_losses
    main_loss = ((noise - main_noise_pred) * (1 - mask.expand_as(noise))).pow(2).mean()
    sb_loss = ((sb_noise - sb_noise_pred) * (1 - sb_mask.expand_as(sb_noise))).pow(2).mean()
    total_loss = main_loss + sb_loss

Inference: during the reverse diffusion process (sampling x_{t-1} from x_t), the known card embeddings x0_known (provided by the user) are reapplied at each step to guide the generation towards the desired completion. A common approach (simplified):
- Predict noise: epsilon_pred = model(x_t, x0_context, sb_x_t, t, mask, sb_mask)
- Calculate the parameters (mean, variance) of the distribution p(x_{t-1} | x_t) using x_t, t, and epsilon_pred according to the diffusion schedule.
- Sample the potential next state x_{t-1}_sample from this distribution (adding noise if t > 0, otherwise using the mean).
- Re-apply knowns to the sample: x_{t-1}_conditioned = mask * x0_known + (1 - mask) * x_{t-1}_sample.
- Use x_{t-1}_conditioned as the input for the next step, which samples x_{t-2}.

The main deck context sb_context_encoded for the sideboard decoder is generated from the final denoised main deck embeddings (x0_main_final).
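The re-apply step is the heart of the conditioning, so here is a small dependency-free sketch of the sampling loop around it. The names reapply_known, sample_conditioned, and toy_step are hypothetical, and toy_step is only a stand-in for the real "predict noise, then sample x_{t-1}" step; overwriting the known slots before the first step is also an assumption.

```python
import random

def reapply_known(sample, x0_known, mask):
    """mask * x0_known + (1 - mask) * sample, per slot (mask is 1.0 for known slots)."""
    return [[m * k + (1 - m) * s for k, s in zip(known_vec, sample_vec)]
            for known_vec, sample_vec, m in zip(x0_known, sample, mask)]

def sample_conditioned(x0_known, mask, steps, denoise_step, rng=random):
    dim = len(x0_known[0])
    # Start every slot from pure noise, then overwrite the known slots
    x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in x0_known]
    x = reapply_known(x, x0_known, mask)
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                # stand-in for the model-driven update
        x = reapply_known(x, x0_known, mask)  # known cards are restored every step
    return x

def toy_step(x, t):
    """Toy update: shrink values and add a little noise, except at the final step."""
    return [[0.9 * v + (0.01 * random.gauss(0, 1) if t > 0 else 0.0) for v in vec]
            for vec in x]

known = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]]  # slot 1 is a placeholder (unknown)
mask = [1.0, 0.0, 1.0]
result = sample_conditioned(known, mask, steps=50, denoise_step=toy_step)
print(result[0], result[2])  # [1.0, 2.0] [3.0, -1.0]
```

Whatever the denoising step does to the known slots is discarded, so the user's cards come out unchanged while still influencing the unknown slots through attention in the real model.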
Training
The current model was trained using ~47,000 decks scraped from MTGTop8 and is format agnostic, with training data covering formats from Standard through Vintage. The full model contains ~56 million parameters. Due to its small size, training was feasible on consumer hardware, specifically a single Nvidia 3050 Laptop GPU with 4GB of VRAM, taking roughly 4 days to complete 100 epochs.
- Dataset: DeckDataset loads decks and filters for exact 60 main deck / 15 sideboard card counts. It converts card names to the pre-trained Doc2Vec embeddings and retrieves corresponding integer indices using the mapping from the linear classifier. Decks with cards missing from embeddings or the classifier map are skipped. It also calculates card popularity scores based on deck frequency for the masking strategy.
- Optimizer: AdamW with weight decay.
- Objective: minimize the combined MSE loss total_loss described above, calculated over masks_per_deck different masks for each deck in the batch.
- Process: standard PyTorch training loop that iterates through epochs, loads batches via DataLoader, calculates loss using p_losses, performs backpropagation, clips gradients, and updates the optimizer. Checkpoints save the model state together with the epoch number and config.
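The masked objective in the list above can be checked by hand on a tiny example. The sketch below mirrors the p_losses expression in plain Python (the function name masked_mse and the toy numbers are made up): errors on known slots are zeroed out, and the mean is taken over all elements.

```python
def masked_mse(noise, noise_pred, mask):
    """Mean over ALL elements of ((noise - pred) * (1 - mask))^2, mirroring p_losses."""
    total, count = 0.0, 0
    for true_vec, pred_vec, m in zip(noise, noise_pred, mask):
        for a, b in zip(true_vec, pred_vec):
            total += ((a - b) * (1 - m)) ** 2
            count += 1
    return total / count

noise      = [[1.0, 1.0], [0.0, 2.0]]
noise_pred = [[5.0, 5.0], [0.0, 0.0]]  # slot 0 is badly wrong, but it is a known slot
mask       = [1.0, 0.0]                # slot 0 known, slot 1 unknown
print(masked_mse(noise, noise_pred, mask))  # 1.0
```

Only slot 1 contributes: its squared errors are 0 and 4, and dividing by all four elements gives 1.0. Because the mean runs over every element, known slots add zeros to the average, so the same per-slot error produces a smaller loss value when more cards are known.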
Inference: Enforcing Deck Rules with Iterative Refinement
While the diffusion model learns the underlying patterns of deck construction from the training data, it doesn’t inherently guarantee adherence to strict game rules like the 4-copy limit for non-basic cards or format legality during the raw generation process. To address this, the inference functions employ an iterative refinement strategy after the initial denoising pass:
1. Initial generation: the standard reverse diffusion process is performed once to generate initial embeddings for all unknown card slots, conditioned on any user-provided cards.
2. Classification and rule check: the resulting embeddings (both originally known and newly generated) are converted back to card names using the trained linear classifier. The system then checks for violations:
   - 4-copy limit: it counts occurrences of each non-basic card name. For sideboard generation, this count considers cards in both the main deck and the current sideboard iteration.
   - Format legality: each generated card is checked for legality in the specified format (e.g., ‘Modern’, ‘Standard’). Basic lands are exempt from this check.
3. Identify violations: the system identifies the specific generated card slots that violate either the 4-copy limit or format legality. User-provided cards are never marked for regeneration.
4. Mask update and regeneration: a new mask is created. User-provided cards and valid generated cards from the current iteration are marked as “known”. Slots corresponding to rule violations are marked as “unknown”.
5. Re-run diffusion: the reverse diffusion sampling process is run again, using the updated mask and the embeddings of the known cards (including the valid generated ones) as fixed context. The model only needs to generate new embeddings for the slots marked as unknown due to rule violations.
6. Repeat: steps 2-5 are repeated up to a fixed maximum number of refinement iterations. This loop continues until no rule violations are found among the generated cards or the iteration limit is reached.
This refinement loop makes legal deck completions more likely by correcting rule violations after the initial generation, leveraging the classifier and external card data to guide the process without needing to bake these complex constraints directly into the diffusion model’s training objective. The final deck combines the original user input with the cards generated through this process.
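The rule check and mask update at the core of this loop can be sketched in a few lines. Everything here is illustrative rather than taken from the repository: the function names, the simplified basic-land set, and the hand-written legality set are all assumptions.

```python
from collections import Counter

BASIC_LANDS = {'Plains', 'Island', 'Swamp', 'Mountain', 'Forest'}  # simplified set

def find_violations(cards, user_mask, legal_cards, max_copies=4, already_used=None):
    """Return indices of generated slots that break the copy limit or format legality."""
    counts = Counter(already_used or [])  # e.g. main deck cards when checking a sideboard
    violations = []
    # Count user-provided cards first so they always keep their slots
    order = sorted(range(len(cards)), key=lambda i: not user_mask[i])
    for i in order:
        name = cards[i]
        counts[name] += 1
        if user_mask[i] or name in BASIC_LANDS:
            continue  # user cards are never regenerated; basics are exempt
        if name not in legal_cards or counts[name] > max_copies:
            violations.append(i)
    return sorted(violations)

def refined_mask(n_slots, violations):
    """1.0 = keep as known context, 0.0 = regenerate on the next diffusion pass."""
    bad = set(violations)
    return [0.0 if i in bad else 1.0 for i in range(n_slots)]

cards = ['Lightning Bolt'] * 5 + ['Black Lotus', 'Mountain']
user_mask = [True, True, False, False, False, False, False]
legal = {'Lightning Bolt', 'Mountain'}  # hypothetical legality list for this example

bad_slots = find_violations(cards, user_mask, legal)
print(bad_slots)                            # [4, 5]
print(refined_mask(len(cards), bad_slots))  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0]
```

Slot 4 is the fifth copy of a card and slot 5 is not in the legality set, so only those two slots are handed back to the diffusion model on the next pass.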
Final Thoughts
This was a fun but time-consuming project (two weeks of me procrastinating before finals). Some day I’ll probably train v2, but until then you can keep up with what I’m up to on Twitter: @JakeABoggs