Introducing Manamorphosis: A Diffusion Model for MTG Deck Generation

This post details Manamorphosis, a first-of-its-kind diffusion model I developed to complete Magic: The Gathering decklists. It takes a partial list and fills in the rest to form a 60-card main deck. Subsequently, using the completed main deck as context, it can complete a 15-card sideboard. The core generative mechanism is based on Denoising Diffusion Probabilistic Models (DDPMs), the same family of models powering image generation systems like Stable Diffusion and Midjourney, but adapted here to produce sets of cards. As far as I know, this is the state-of-the-art (and only) AI model trained specifically for this task.

The complete code is available on GitHub if you’d like to run it yourself or see the full implementation.

Demo Video

Card Representation: Doc2Vec Embeddings

Card identity is represented by 128-dimensional vectors. These embeddings are generated by training a Doc2Vec model on preprocessed text data for each card obtained from MTGJSON’s AtomicCards.json. This captures semantic relationships based on card text. Much like how modern RAG engines use embeddings to understand the meaning behind search queries and documents, these Doc2Vec embeddings allow the system to grasp the functional similarities between cards based on their descriptions (cost, type, rules text).

Instead of training the model to directly predict discrete cards from a vocabulary of more than 28,000 cards, Manamorphosis uses pre-trained Doc2Vec embeddings as an intermediate representation. This approach offers several advantages:

Handling data sparsity: the training dataset (~47,000 decks) contains only a fraction (~5,000) of all legal MTG cards. A model predicting cards directly would struggle to learn meaningful representations or generation logic for the vast majority of cards rarely or never seen during training. The embedding model, trained on all card text, provides a representation for every card.
Generalization to new and unseen cards: because the embedding is derived from card text, the system can generate an embedding for any card, including newly released ones, without retraining the embedding model (though retraining the diffusion model might improve performance with new metagames). The diffusion model learns to operate on the semantic meaning captured in the 128-dimensional embedding space, rather than being limited to the fixed vocabulary seen during its own training.
Capturing semantic relationships: Doc2Vec learns vectors where cards with similar functions, costs, types, or textual patterns (e.g., different variations of counterspells, cheap red burn spells, evasive creatures) are closer together in the embedding space. This allows the diffusion model to learn higher-level concepts (“needs more removal,” “add card draw”) rather than just memorizing specific card co-occurrences, leading to more contextually relevant deck completions. This focus on semantic similarity is analogous to how embedding-based search engines return results that are conceptually related rather than exact keyword matches.
Dimensionality reduction and decoupling: working with dense 128-dimensional vectors is more computationally manageable for the transformer architecture than using extremely high-dimensional one-hot vectors (one per card). It also decouples the task of understanding card text semantics (Doc2Vec) from the task of generative deck construction (Diffusion Model).

Before training the Doc2Vec model, a standardized “document” is created for each card by applying the following transformations to the MTGJSON data:

Mana cost: curly braces {} are replaced with pipe symbols | (|W|, |U|, etc.) and adjacent symbols are separated by a space ({W}{U} -> |W| |U|). This treats each mana symbol as a distinct token and distinguishes cost symbols from mana symbols that appear in the card text.
Power/toughness: represented as $Power$ #Toughness# (e.g., $2$ #2#). This creates unique tokens for P/T values.
Card type: everything before the dash on the type line (supertypes and card types, e.g., “Legendary”, “Creature”, “Instant”) is split into individual tokens surrounded by pipes (|Creature|, |Instant|). Subtypes (e.g., “Goblin”, “Wizard”) are kept as plain words.
Rules text:
- Card name references replaced with @. This prevents the model from overfitting to specific card names and focuses on the actions/effects.
- Common self-references (“this creature”, “this enchantment”, etc.) also replaced with @.
- Line breaks and semicolons are replaced with spaces. Colons get a space inserted before them (":" -> " :") so that they become separate tokens.
- Reminder text (within parentheses) is removed using regex (re.sub(reminder_remover, '', ...)).
- Special characters (&, −, —, apostrophes, commas, periods, and quotation marks) are replaced or removed.
- Text is converted to lowercase.
Stop words: common English stop words (like “the”, “a”, “is”) are removed using nltk.corpus.stopwords to reduce noise and focus on meaningful terms.

# From train_embedding_model.py (Illustrative snippet)
def get_text(card):
    text = ''

    if 'manaCost' in card:
        text += card['manaCost'].replace('}{', '} {').replace('{', '|').replace('}', '|') + ' '
    if 'power' in card:
        text += '$' + card['power'] + '$ #' + card['toughness'] + '# '
    text += ' '.join(['|' + word + '|' for word in card['type'].split(' — ')[0].split()]) + ' '
    if '—' in card['type']:
        text += card['type'].split(' — ')[1] + ' '
    if 'text' in card:
        # ... (Handling basic lands) ...
        processed_text = card['text'].replace('&', 'and').replace(card['name'], '@').replace(card['name'].split(',')[0], '@') # simplified
        processed_text = processed_text.replace('this creature', '@').replace('this enchantment', '@').replace('this artifact', '@').replace('this land', '@')
        processed_text = processed_text.replace('\\n', ' ').replace(';', ' ').replace(':', ' :').replace('|', '•')
        text += processed_text

    text = re.sub(reminder_remover, '', text.lower()... ) # Lowercasing, punctuation, etc.
    words = [word for word in text.split(' ') if word != '']
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)
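To make the token format concrete, here is a small worked example covering only the mana cost, power/toughness, and type line steps. header_tokens is a hypothetical helper written for this illustration (it is not part of train_embedding_model.py), and the card dict is a made-up entry that follows MTGJSON's field names.

```python
def header_tokens(card: dict) -> str:
    """Build the mana cost, P/T, and type-line tokens for one card (illustrative only)."""
    parts = []
    if "manaCost" in card:
        # {1}{R} -> {1} {R} -> |1| |R|
        parts.append(card["manaCost"].replace("}{", "} {").replace("{", "|").replace("}", "|"))
    if "power" in card:
        parts.append(f"${card['power']}$ #{card['toughness']}#")
    # Everything before the dash gets pipes; subtypes stay plain words.
    main_types, _, subtypes = card["type"].partition(" — ")
    parts.append(" ".join(f"|{word}|" for word in main_types.split()))
    if subtypes:
        parts.append(subtypes)
    return " ".join(parts).lower()


# Hypothetical card entry, not taken from the dataset.
example_card = {"manaCost": "{1}{R}", "power": "2", "toughness": "2", "type": "Creature — Goblin Scout"}
print(header_tokens(example_card))
# |1| |r| $2$ #2# |creature| goblin scout
```

The rules text steps (name replacement, reminder text removal, stop words) are left out here because the snippet above elides parts of them.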

After building the dataset, Gensim makes training easy:

# Doc2Vec Training
model = Doc2Vec(vector_size=128, dm=0, dbow_words=0, min_count=2, epochs=200, workers=cores, seed=42)
model.build_vocab(corpus) # corpus yields TaggedDocument(processed_text.split(), [card_id])
model.train(corpus, ...)

A simple linear classifier is also trained to map these embeddings back to unique card indices. This classifier is used during inference to identify the card corresponding to each denoised embedding vector. The alternative is to perform a cosine similarity search over the entire set of possible cards, but this is inefficient.

# From train_embedding_classifier.py
class CardClassifier(nn.Module):
    def __init__(self, embedding_dim, num_classes):
        super(CardClassifier, self).__init__()
        self.network = nn.Linear(embedding_dim, num_classes)
    
    def forward(self, x):
        return self.network(x)
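The two decoding options mentioned above can be sketched side by side. This is a toy setup under stated assumptions: the card embeddings are random stand-ins for the Doc2Vec vectors, NUM_CARDS is an arbitrary small value, and the linear layer is untrained, so only the cosine search returns meaningful ids here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM, NUM_CARDS = 128, 1000  # NUM_CARDS is a toy value, not the real vocabulary size

card_embeddings = torch.randn(NUM_CARDS, EMB_DIM)  # stand-in for the Doc2Vec vectors
classifier = nn.Linear(EMB_DIM, NUM_CARDS)         # same shape as CardClassifier, untrained here

# Pretend these are denoised embeddings that landed near cards 3, 17 and 42.
denoised = card_embeddings[[3, 17, 42]] + 0.05 * torch.randn(3, EMB_DIM)

# Option 1: classifier decode, one forward pass followed by argmax over the logits.
with torch.no_grad():
    classifier_ids = classifier(denoised).argmax(dim=-1)

# Option 2: cosine similarity search against every card embedding.
similarity = F.normalize(denoised, dim=-1) @ F.normalize(card_embeddings, dim=-1).T
nearest_ids = similarity.argmax(dim=-1)

print(nearest_ids.tolist())  # [3, 17, 42]
```

With a trained classifier, classifier_ids plays the role that nearest_ids plays in this toy run.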

Diffusion

Forward process (noise addition): starting with the true deck embeddings x0, Gaussian noise is progressively added over T timesteps (here, T=1000). The noise level at each step is determined by a predefined variance schedule, specifically a cosine schedule (cosine_beta_schedule).

# diffusion_model.py
def cosine_beta_schedule(T, s=0.008):
    steps = torch.linspace(0, T, T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return torch.clip(betas, 0, 0.999).float()

# Within DiffusionTrainer class:
beta = cosine_beta_schedule(T).to(device) # T = TIMESTEPS (e.g., 1000)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
# Precompute terms used in diffusion and sampling:
self.register("sqrt_alpha_bar", torch.sqrt(alpha_bar))
self.register("sqrt_one_minus_alpha_bar", torch.sqrt(1.0 - alpha_bar))
self.register("beta", beta)
self.register("alpha", alpha)
# ... (other registered buffers)
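A quick way to get a feel for the schedule is to print how much signal and noise remain at a few timesteps. The sketch below reuses cosine_beta_schedule exactly as shown above; the chosen timesteps and the variable names signal_scale and noise_scale are just for illustration.

```python
import torch


def cosine_beta_schedule(T, s=0.008):
    # Copied from the snippet above.
    steps = torch.linspace(0, T, T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return torch.clip(betas, 0, 0.999).float()


T = 1000
beta = cosine_beta_schedule(T)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

for t in (0, 250, 500, 750, 999):
    signal_scale = alpha_bar[t].sqrt().item()        # multiplies x0
    noise_scale = (1.0 - alpha_bar[t]).sqrt().item()  # multiplies the Gaussian noise
    print(f"t={t:4d}  signal={signal_scale:.4f}  noise={noise_scale:.4f}")
```

At t=0 almost all of x0 survives, and by the final timestep the sample is essentially pure noise, which is what lets sampling start from a standard Gaussian.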

The state x_t at timestep t can be sampled directly using the cumulative product alpha_bar: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise

# diffusion_model.py: DiffusionTrainer.q_sample
# Note: The actual implementation modifies x_t for known cards
# based on the mask *after* the initial noise addition,
# returning a mix of original x0 and noised x_t.
def q_sample(self, x0, t, mask):
    mask_expanded = mask.expand_as(x0)
    sqrt_ab = self._extract(self.sqrt_alpha_bar, t, x0.shape, x0.device)
    sqrt_1mab = self._extract(self.sqrt_one_minus_alpha_bar, t, x0.shape, x0.device)
    noise = torch.randn_like(x0)
    # Calculate the fully noised version first
    x_t_noised = sqrt_ab * x0 + sqrt_1mab * noise
    # Return original embeddings for known positions, noised for unknown
    x_t_masked = mask_expanded * x0 + (1 - mask_expanded) * x_t_noised
    # Returns the masked noisy sample and the *original* noise (for loss)
    return x_t_masked, noise

Reverse process (denoising): the model learns to predict the noise epsilon added at timestep t. Starting from pure noise x_T, sampling iteratively refines the embeddings by predicting the noise epsilon_pred = model(x_t, x0, sb_x_t, t, mask, sb_mask) and estimating x_{t-1}, until it reaches an estimate of x0, the noise-free main deck. At each inference step, x_{t-1} is sampled by removing a scaled portion of the predicted noise epsilon_pred from x_t and then, for every step except the last, adding a smaller amount of fresh noise.

Model Architecture

The model uses a transformer-based architecture, the same building block behind Large Language Models like the GPT series and BERT. Manamorphosis lacks positional embeddings, treating decks as unordered sets, which differs from typical NLP or vision transformer usage where sequence order matters. It has distinct paths for main deck and sideboard processing. The internal model dimension is model_dim=384, and the embedding dimension is EMB_DIM=128.
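The "unordered set" property can be checked directly: without positional embeddings, a transformer encoder is permutation-equivariant, so shuffling the card slots just shuffles the outputs in the same way. The sketch below uses a small stand-in encoder; the sizes (d_model=32, two layers) and batch_first=True are assumptions for the demo, not the real configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in encoder; dropout is disabled so the comparison is deterministic.
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

deck = torch.randn(1, 60, 32)  # [Batch, DECK_SIZE, model_dim] with toy model_dim
perm = torch.randperm(60)      # a random reordering of the 60 slots

with torch.no_grad():
    out = encoder(deck)
    out_shuffled = encoder(deck[:, perm])

# Encoding the shuffled deck equals shuffling the encoded deck (up to float error).
is_equivariant = torch.allclose(out[:, perm], out_shuffled, atol=1e-4)
print(is_equivariant)
```

This is why slot order in a decklist carries no meaning for the model: only the multiset of embeddings matters.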

Time embeddings: timestep t is encoded using standard sinusoidal embeddings, processed by separate MLPs for main deck and sideboard paths.

# diffusion_model.py
def sinusoidal_embedding(t: torch.Tensor, dim: int = EMB_DIM):
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / (half - 1))
    args = t[:, None] * freqs[None]
    emb = torch.cat((args.sin(), args.cos()), dim=-1)
    if dim % 2:
        emb = nn.functional.pad(emb, (0, 1))
    return emb

# Within DiffusionModel.__init__
ff_dim = cfg["dim_feedforward"] # 3072
self.main_time_mlp = nn.Sequential(
    nn.Linear(EMB_DIM, ff_dim), nn.SiLU(), nn.Linear(ff_dim, EMB_DIM),
)
self.sb_time_mlp = nn.Sequential( # For Sideboard Decoder path
    nn.Linear(EMB_DIM, ff_dim), nn.SiLU(), nn.Linear(ff_dim, EMB_DIM),
)

Mask embeddings: binary masks (1.0 for known, 0.0 for unknown) are processed by separate MLPs.

# Within DiffusionModel.__init__
self.main_mask_mlp = nn.Sequential(
    nn.Linear(1, EMB_DIM), nn.SiLU(), nn.Linear(EMB_DIM, EMB_DIM),
)
self.sb_mask_mlp = nn.Sequential( # For Sideboard Decoder path
    nn.Linear(1, ff_dim), nn.SiLU(), nn.Linear(ff_dim, EMB_DIM), # Note: intermediate dim is ff_dim here
)

Input processing: input embeddings x_t (main) or sb_x_t (sideboard) are combined via addition with their respective time and mask embeddings, then projected to model_dim.

# Within DiffusionModel.forward
sin_emb = sinusoidal_embedding(t, EMB_DIM) # Shape: [Batch, EMB_DIM]

# Main Deck Input
main_t_emb_flat = self.main_time_mlp(sin_emb) # Shape: [Batch, EMB_DIM]
main_t_emb = main_t_emb_flat[:, None, :].expand(-1, DECK_SIZE, -1) # Shape: [Batch, DECK_SIZE, EMB_DIM]
main_mask_emb = self.main_mask_mlp(mask) # mask shape: [Batch, DECK_SIZE, 1] -> Output: [Batch, DECK_SIZE, EMB_DIM]
h_main = x_t + main_t_emb + main_mask_emb # x_t shape: [Batch, DECK_SIZE, EMB_DIM]
h_main_proj = self.main_input_proj(h_main) # Linear(EMB_DIM, model_dim) -> Shape: [Batch, DECK_SIZE, model_dim]

# Sideboard Input
sb_decoder_t_emb_flat = self.sb_time_mlp(sin_emb) # Shape: [Batch, EMB_DIM]
sb_decoder_t_emb = sb_decoder_t_emb_flat[:, None, :].expand(-1, SIDEBOARD_SIZE, -1) # Shape: [Batch, SIDEBOARD_SIZE, EMB_DIM]
sb_decoder_mask_emb = self.sb_mask_mlp(sb_mask) # sb_mask shape: [Batch, SIDEBOARD_SIZE, 1] -> Output: [Batch, SIDEBOARD_SIZE, EMB_DIM]
h_sb = sb_x_t + sb_decoder_t_emb + sb_decoder_mask_emb # sb_x_t shape: [Batch, SIDEBOARD_SIZE, EMB_DIM]
h_sb_proj = self.sb_input_proj(h_sb) # Linear(EMB_DIM, model_dim) -> Shape: [Batch, SIDEBOARD_SIZE, model_dim]

Main deck path (encoder): processes h_main_proj through a standard nn.TransformerEncoder (layers=8, nhead=8). The output is projected back to EMB_DIM to predict main deck noise (main_noise_pred).

# Within DiffusionModel.__init__
main_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
self.main_transformer_encoder = nn.TransformerEncoder(main_encoder_layer, num_layers=num_layers)
self.main_output_proj = nn.Linear(model_dim, EMB_DIM)

# Within DiffusionModel.forward
main_encoded = self.main_transformer_encoder(h_main_proj)
main_noise_pred = self.main_output_proj(main_encoded)

Sideboard context path (encoder): processes the original main deck embeddings x0 (noise-free) through a separate, shallow transformer encoder to create context (sb_context_encoded). This context is used by the sideboard decoder.

# Within DiffusionModel.__init__
self.sb_context_input_proj = nn.Linear(EMB_DIM, model_dim)
sb_context_encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead, ...)
self.sideboard_context_encoder = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=1)

# Within DiffusionModel.forward
h_sb_context = x0 # Original main deck embeddings
h_sb_context_proj = self.sb_context_input_proj(h_sb_context)
sb_context_encoded = self.sideboard_context_encoder(h_sb_context_proj)

Sideboard path (decoder): processes projected sideboard embeddings h_sb_proj using a transformer decoder, conditioned on sb_context_encoded via cross-attention. The decoder output is passed through another encoder before the final projection back to EMB_DIM to predict sideboard noise (sb_noise_pred). The use of cross-attention to condition the sideboard generation on the main deck context is similar to how text-to-image models condition image generation on a text prompt.

# Within DiffusionModel.__init__
sb_decoder_layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=nhead, ...)
self.sb_transformer_decoder = nn.TransformerDecoder(sb_decoder_layer, num_layers=1)
# Uses the same sb_context_encoder_layer definition for the subsequent encoder
self.sb_transformer_output = nn.TransformerEncoder(sb_context_encoder_layer, num_layers=sb_num_layers)
self.sb_output_proj = nn.Linear(model_dim, EMB_DIM)

# Within DiffusionModel.forward
sb_decoded = self.sb_transformer_decoder(tgt=h_sb_proj, memory=sb_context_encoded)
sb_decoded = self.sb_transformer_output(sb_decoded)
sb_noise_pred = self.sb_output_proj(sb_decoded)

Conditioning: Masking Strategy

Conditional generation (deck completion) is handled via masking. Known card embeddings are provided by the user (or determined during training). This mechanism is analogous to providing a starting image and a mask for inpainting in image generation models, or providing a text prompt to guide generation.

Training: for each training sample (x0_embeddings, x0_indices, etc.), multiple masks (masks_per_deck) are generated dynamically per deck.
- The number of known main deck cards k_main is sampled from partitioned ranges [1, 59] across the generated masks to ensure diverse k values are seen.
- Sideboard k_sb is sampled randomly from [1, 14], with a 50% chance of being forced to 0. This is done so that the model performs well at generating sideboards from scratch, which I expect to be a common use case.
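One way to realize the two sampling rules above is sketched here. The exact partitioning used in the training code is not shown in this post, so the equal-width split below is an assumption, and sample_k_main and sample_k_sideboard are hypothetical helper names.

```python
import random


def sample_k_main(masks_per_deck, low=1, high=59, rng=random):
    """Draw one k per mask, each from its own contiguous slice of [low, high].

    Assumes masks_per_deck <= high - low + 1 so every slice is non-empty.
    """
    span = high - low + 1
    ks = []
    for i in range(masks_per_deck):
        slice_low = low + (i * span) // masks_per_deck
        slice_high = low + ((i + 1) * span) // masks_per_deck - 1
        ks.append(rng.randint(slice_low, slice_high))
    return ks


def sample_k_sideboard(low=1, high=14, p_zero=0.5, rng=random):
    """With probability p_zero return 0 (sideboard from scratch), else a value in [low, high]."""
    return 0 if rng.random() < p_zero else rng.randint(low, high)


random.seed(0)
print(sample_k_main(masks_per_deck=4))  # one value from each of [1,14], [15,29], [30,44], [45,59]
print(sample_k_sideboard())
```

Splitting the range this way guarantees that every deck is seen with both nearly empty and nearly complete partial lists within a single batch of masks.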

Masking logic: _create_mask_row generates a single mask row (shape [deck_size, 1]) for a target number of known cards k. It collects the unique card indices still available in the current deck (current_deck_indices) and orders them by weighted sampling without replacement, using weights derived from pre-calculated popularity scores (self.card_popularity) that slightly favor less popular cards (0.5 + score, where score is 1.0 - normalized_count). It then walks through the cards in this order. For each unique card, with 85% probability it marks all available copies as known (capped by the remaining budget of k); with 15% probability it marks a random number of copies as known (from 1 up to the available count, with the same cap). This repeats until k positions are marked as known (1.0 in the mask). Revealing a card's copies together is deliberate: it makes the task harder, because the model has to learn which cards go together rather than just adding more copies of cards it can already see.

# diffusion_model.py: DiffusionTrainer._create_mask_row (Simplified Pseudocode)
import random

import torch

def _create_mask_row(k_target, deck_size, current_deck_indices, popularity_scores):
    mask_row = torch.zeros(deck_size, 1)
    available_indices = torch.ones(deck_size, dtype=torch.bool)
    masked_count = 0

    while masked_count < k_target and available_indices.any():
        # 1. Get unique card indices from currently *available* positions
        unique_cards = torch.unique(current_deck_indices[available_indices])
        if unique_cards.numel() == 0: break  # `not tensor` is ambiguous for multi-element tensors

        # 2. Calculate sampling weights (favor less popular)
        weights = torch.tensor([0.5 + popularity_scores.get(idx.item(), 1.0) for idx in unique_cards])

        # 3. Sample unique cards without replacement based on weights
        perm_indices = torch.multinomial(weights, num_samples=len(unique_cards), replacement=False)
        shuffled_unique_cards = unique_cards[perm_indices]

        for card_idx in shuffled_unique_cards:
            if masked_count >= k_target: break

            # 4. Find available positions for this specific card_idx
            potential_pos = (current_deck_indices == card_idx).nonzero(as_tuple=True)[0]
            available_pos = potential_pos[available_indices[potential_pos]] # Filter by available
            available_count = len(available_pos)
            if available_count == 0: continue

            needed = k_target - masked_count

            # 5. Decide how many copies to mask (85% all available, 15% random count)
            if random.random() < 0.85:
                num_to_mask = min(available_count, needed)
            else:
                max_can_mask = min(available_count, needed)
                if max_can_mask <= 0: continue
                num_to_mask = random.randint(1, max(1, max_can_mask))

            # 6. Select specific positions to mask and update mask_row/available_indices
            indices_to_mask = available_pos[torch.randperm(available_count)[:num_to_mask]]
            mask_row[indices_to_mask] = 1.0
            available_indices[indices_to_mask] = False
            masked_count += num_to_mask

    return mask_row

Loss calculation: the MSE loss is computed only between the predicted noise (main_noise_pred, sb_noise_pred) and the true noise (noise, sb_noise from q_sample) for the unknown (mask value 0.0) card slots. This focuses the model on learning to generate the missing parts.

# diffusion_model.py: DiffusionTrainer.p_losses
main_loss = ((noise - main_noise_pred) * (1 - mask.expand_as(noise))).pow(2).mean()
sb_loss = ((sb_noise - sb_noise_pred) * (1 - sb_mask.expand_as(sb_noise))).pow(2).mean()
total_loss = main_loss + sb_loss
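A small numeric check shows what the masked loss does and does not see. The tensors are random placeholders with the post's main deck shapes, and the 45/15 known/unknown split is an arbitrary choice for the demo.

```python
import torch

torch.manual_seed(0)
B, DECK_SIZE, EMB_DIM = 2, 60, 128

noise = torch.randn(B, DECK_SIZE, EMB_DIM)       # true noise from q_sample
noise_pred = torch.randn(B, DECK_SIZE, EMB_DIM)  # placeholder for the model output

mask = torch.zeros(B, DECK_SIZE, 1)
mask[:, :45] = 1.0                               # first 45 slots known, last 15 unknown
unknown = 1 - mask.expand_as(noise)

loss = ((noise - noise_pred) * unknown).pow(2).mean()

# Changing predictions at known slots does not change the loss at all.
changed_pred = noise_pred.clone()
changed_pred[:, :45] += 10.0
loss_changed = ((noise - changed_pred) * unknown).pow(2).mean()

# .mean() divides by every element, including the zeroed known slots, so the value
# scales with the fraction of unknown slots. A mean over unknown slots only:
per_unknown_loss = ((noise - noise_pred) * unknown).pow(2).sum() / unknown.sum()

print(loss.item(), loss_changed.item(), per_unknown_loss.item())
```

With 15 of 60 slots unknown, per_unknown_loss is four times loss. Whether that scaling matters in practice depends on the optimizer settings; the post's training uses the plain .mean() form.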

Inference: during the reverse diffusion process (sampling x_{t-1} from x_t), the known card embeddings x0_known (provided by the user) are reapplied at each step to guide the generation towards the desired completion. A common approach (simplified):
1. Predict noise: epsilon_pred = model(x_t, x0_context, sb_x_t, t, mask, sb_mask)
2. Calculate the parameters (mean, variance) of the distribution p(x_{t-1} | x_t) using x_t, t, and epsilon_pred according to the diffusion schedule.
3. Sample the potential next state x_{t-1}_sample from this distribution (adding noise if t > 0, otherwise using the mean).
4. Re-apply knowns to the sample: x_{t-1}_conditioned = mask * x0_known + (1 - mask) * x_{t-1}_sample.
5. Use x_{t-1}_conditioned as the input for the next iteration, which runs at timestep t-1.
For sideboard generation, the main deck context sb_context_encoded for the sideboard decoder is generated from the final denoised main deck embeddings (x0_main_final).
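The steps above can be sketched as a short sampling loop. This is a minimal sketch under several assumptions: eps_model has a simplified (x_t, t, mask) signature rather than the full model call, the reverse variance is set to beta_t (one standard DDPM choice), T is reduced to 50 to keep the demo fast, and sample_with_knowns is a hypothetical name.

```python
import torch


def cosine_beta_schedule(T, s=0.008):
    steps = torch.linspace(0, T, T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return torch.clip(betas, 0, 0.999).float()


@torch.no_grad()
def sample_with_knowns(eps_model, x0_known, mask, T=50):
    beta = cosine_beta_schedule(T)
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)

    # Start from pure noise, with the known slots already pinned to their embeddings.
    x_t = mask * x0_known + (1 - mask) * torch.randn_like(x0_known)
    for t in reversed(range(T)):
        t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
        eps_pred = eps_model(x_t, t_batch, mask)                      # step 1
        mean = (x_t - beta[t] / torch.sqrt(1 - alpha_bar[t]) * eps_pred) / torch.sqrt(alpha[t])  # step 2
        if t > 0:
            x_prev = mean + torch.sqrt(beta[t]) * torch.randn_like(x_t)  # step 3
        else:
            x_prev = mean
        x_t = mask * x0_known + (1 - mask) * x_prev                   # steps 4 and 5
    return x_t


torch.manual_seed(0)
x0_known = torch.randn(1, 60, 128)
mask = torch.zeros(1, 60, 1)
mask[:, :40] = 1.0  # 40 user-provided cards, 20 slots to generate

# Stand-in noise predictor that always predicts zero noise; a trained model goes here.
dummy_eps_model = lambda x_t, t, mask: torch.zeros_like(x_t)
out = sample_with_knowns(dummy_eps_model, x0_known, mask)
```

Because the known slots are overwritten after every step, the user's cards come out of the loop unchanged no matter what the model predicts for them.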

Training

The current model was trained using ~47,000 decks scraped from MTGTop8 and is format agnostic, with training data covering formats from Standard through Vintage. The full model contains ~56 million parameters. Due to its small size, training was feasible on consumer hardware, specifically a single Nvidia 3050 Laptop GPU with 4GB of VRAM, taking roughly 4 days to complete 100 epochs.

Dataset: DeckDataset loads decks and filters for exact 60 main deck / 15 sideboard card counts. It converts card names to the pre-trained Doc2Vec embeddings and retrieves corresponding integer indices using the mapping from the linear classifier. Decks with cards missing from embeddings or the classifier map are skipped. It also calculates card popularity scores based on deck frequency for the masking strategy.
Optimizer: AdamW with weight decay.
Objective: minimize the combined MSE loss total_loss described above, calculated over masks_per_deck different masks for each deck in the batch.
Process: standard PyTorch training loop that iterates through epochs, loads batches via DataLoader, calculates loss using p_losses, performs backpropagation, clips gradients, and updates the optimizer. Checkpoints save the model state together with the epoch number and config.

While the diffusion model learns the underlying patterns of deck construction from the training data, it doesn’t inherently guarantee adherence to strict game rules like the 4-copy limit for non-basic cards or format legality during the raw generation process. To address this, the inference functions employ an iterative refinement strategy after the initial denoising pass:

1. Initial generation: the standard reverse diffusion process is performed once to generate initial embeddings for all unknown card slots, conditioned on any user-provided cards.
2. Classification and rule check: the resulting embeddings (both originally known and newly generated) are converted back to card names using the trained linear classifier. The system then checks for violations:
- 4-copy limit: it counts occurrences of each non-basic card name. For sideboard generation, this count considers cards in both the main deck and the current sideboard iteration.
- Format legality: each generated card is checked for legality in the specified format (e.g., ‘Modern’, ‘Standard’). Basic lands are exempt from this check.
3. Identify violations: the system identifies the specific generated card slots that violate either the 4-copy limit or format legality. User-provided cards are never marked for regeneration.
4. Mask update and regeneration: a new mask is created. User-provided cards and valid generated cards from the current iteration are marked as “known”. Slots corresponding to rule violations are marked as “unknown”.
5. Re-run diffusion: the reverse diffusion sampling process is run again, using the updated mask and the embeddings of the known cards (including the valid generated ones) as fixed context. The model only needs to generate new embeddings for the slots marked as unknown due to rule violations.
6. Repeat: steps 2-5 are repeated up to a fixed number of maximum refinement iterations. This loop continues until no rule violations are found among the generated cards or the iteration limit is reached.

This refinement loop makes legal deck completions more likely by correcting rule violations after the initial generation, leveraging the classifier and external card data to guide the process without needing to bake these complex constraints directly into the diffusion model’s training objective. The final deck combines the original user input with the cards generated through this process.
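The rule check and mask update at the heart of this loop can be sketched in plain Python. This is an illustrative sketch, not the repository's inference code: find_violations is a hypothetical helper, the legality lookup is reduced to a set of legal names, and BASIC_LANDS lists only the five classic basics as a simplification.

```python
from collections import Counter

BASIC_LANDS = {"Plains", "Island", "Swamp", "Mountain", "Forest"}  # simplified set


def find_violations(card_names, user_provided, legal_cards, max_copies=4, already_used=()):
    """Return indices of generated slots that break the copy limit or format legality.

    already_used lets a sideboard check count copies that are in the main deck.
    """
    counts = Counter(already_used)
    # Count user-provided cards first; they are never marked for regeneration.
    for name, fixed in zip(card_names, user_provided):
        if fixed:
            counts[name] += 1

    violations = []
    for slot, (name, fixed) in enumerate(zip(card_names, user_provided)):
        if fixed or name in BASIC_LANDS:
            continue
        if name not in legal_cards:
            violations.append(slot)
            continue
        counts[name] += 1
        if counts[name] > max_copies:
            violations.append(slot)
    return violations


# Toy 12-slot "deck": the first two cards came from the user, the rest were generated.
deck = ["Lightning Bolt"] * 5 + ["Mountain"] * 6 + ["Black Lotus"]
user_provided = [True, True] + [False] * 10
legal_cards = {"Lightning Bolt", "Mountain"}  # stand-in for a format legality lookup

violations = find_violations(deck, user_provided, legal_cards)
print(violations)  # [4, 11]

# New mask for the next diffusion pass: 1.0 = keep as known, 0.0 = regenerate.
new_mask = [0.0 if slot in violations else 1.0 for slot in range(len(deck))]
```

Slot 4 is the fifth Lightning Bolt and slot 11 is a card outside the toy legal set, so only those two slots would be regenerated on the next pass.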

Final Thoughts

This was a fun but time-consuming project (two weeks of me procrastinating before finals). Some day I’ll probably train v2, but until then you can keep up with what I’m up to on Twitter: @JakeABoggs