Hi there 👋

I’m Jake, an obsessive AI / ML researcher, currently at Endeavor AI. I’m the guy with the Crocs.

Translating Historical Manuscripts with Gemini 3

I’ve spent a lot of time working on document understanding for products like automated order entry, so I’m always looking for new ways to evaluate the visual capabilities of LLMs. About a month ago, I stumbled across this post by Mark Humphries about Gemini 3 Pro’s impressive ability to transcribe historical texts. It seemed interesting enough to spend an evening building an app around (and I wanted to impress my girlfriend, who majors in anthropology and is interested in medieval medicine). ...

January 19, 2026 · 4 min · 794 words · Jake Boggs

Scalable Reinforcement Learning with LLMs - Atropos Guide

This weekend, I’ll be attending the Nous Research – RL Environments Hackathon, so to prepare I’ve been playing around with Atropos, their new RL framework that we’ll be using for the event. After failing to find any guides online, I decided to write my own. Update: I got 2nd place with VR-CLImax, my implementation of Verified Rewards via Completion Likelihood Improvement, an RL environment for teaching LLMs how to make jokes! You can find the code merged into the Atropos repository. ...

May 16, 2025 · 9 min · 1896 words · Jake Boggs

Evaluating Reasoning in LLMs Through MTG Deck Building

Update (2026-01-14): Gemini 3 Pro, Gemini 3 Flash, GPT 5.2 (medium), Claude Opus 4.5, and Grok 4.1 Fast added.
Update (2025-08-08): GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4 Turbo 11-06, and GPT-3.5 Turbo added.
Update (2025-08-05): Kimi K2, GPT OSS 120B (low), and GPT OSS 120B (high) added.
Update (2025-07-13): Grok 4, Grok 3, Gemini 2.5 Flash, Claude Sonnet 4 (thinking), and Command A added.
Update (2025-06-11): o3 (high) added after API cost reduction.
Update (2025-06-06): Deepseek R1 05-28 and Gemini 2.5 Pro 06-05 added.
Update (2025-05-22): Claude Sonnet 4 and Opus 4 added.
Update (2025-05-14): Human Baseline, Gemini 2.5 Pro 03-25, Gemini 1.5 Flash, Deepseek V3 03-24, Qwen3 30B 3A added.

Introduction
I have an obsession with applying AI models to my favorite card game, so I’ve created ManaBench, a benchmark designed to probe an LLM’s capacity for reasoning using the collectible card game Magic: The Gathering (MTG). With its intricate interactions and deep strategy, MTG serves as an ideal micro-world to test an LLM’s ability to process contextual information, identify patterns, and make judgments that align with expert human choices. This post provides an overview of the benchmark’s construction and the methodology used for evaluation. ...

May 9, 2025 · 12 min · 2494 words · Jake Boggs

Introducing Manamorphosis: A Diffusion Model for MTG Deck Generation

This post details Manamorphosis, a first-of-its-kind diffusion model developed to complete Magic: The Gathering decklists. It takes a set of known cards and fills in the rest to form a 60-card main deck. Subsequently, using the completed main deck as context, it can complete a 15-card sideboard. The core generative mechanism is based on Denoising Diffusion Probabilistic Models (DDPMs), the same family of models powering many image generation systems like Stable Diffusion and Midjourney, but adapted here to produce sets of cards. I’m excited to share this model, as I believe it is the state-of-the-art (and only) AI model trained specifically for decklist generation. ...

May 5, 2025 · 15 min · 3166 words · Jake Boggs

Gait Analysis for Physical Therapy with YOLOv11

Analyzing how people walk using video is common in research and clinical settings, but getting accurate joint angles usually means either expensive equipment or manually annotating frames, which is slow and tedious. Over Easter weekend, I built a Python tool to help my mother with her research study analyzing patient videos. It uses YOLOv11-pose for automatic detection and adds an interactive interface for manual adjustments. ...
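As a rough illustration of the core measurement (not the tool’s actual code), a joint angle can be computed from three pose keypoints, e.g. hip, knee, and ankle, via the angle between the two limb vectors meeting at the joint; the function name `joint_angle` is my own:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by points a and c (2D pixel coords)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b  # limb vectors from the joint outward
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# e.g. hip above the knee, ankle to its right -> a right angle at the knee
print(joint_angle((0, 1), (0, 0), (1, 0)))  # 90.0
```

With YOLOv11-pose, the keypoints themselves would come from the model’s per-frame predictions; clipping the cosine guards against floating-point values slightly outside [-1, 1].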

April 21, 2025 · 5 min · 960 words · Jake Boggs

AccountaBuddy: Your AI Accountability Partner - HackNC 2024

Check out the project on Devpost and view the source code on GitHub. I spent a weekend at HackNC building something to help manage all of my other side projects. What started as “Wouldn’t it be cool if an AI actually rang you to ask if you’d done your work?” turned into AccountaBuddy, a lightweight app that does more than fire off push notifications - it actually calls you, celebrates your wins, and helps you problem-solve when progress stalls. ...

November 4, 2024 · 2 min · 349 words · Jake Boggs

Daily Yap: A Synthetically Generated Conversational Audio Dataset

Training multimodal models often requires large, high-quality conversational audio datasets, which are scarce at the time of writing. Existing conversational audio datasets present several limitations:
Content Scope: Many datasets focus on assistant-user interactions, lacking the breadth of topics found in general human dialogues.
Audio-Text Alignment: Datasets with precise alignment between high-quality audio and accurate transcriptions are uncommon.
Speaker Diversity: The use of few speakers limits the generalizability of models trained on these datasets.
Scalability: Human recording is resource-intensive, hindering the creation of large-scale datasets.
Daily Yap was created to overcome these challenges by providing a synthetically generated conversational audio resource suitable for training real-time conversational audio models. ...

June 23, 2024 · 5 min · 876 words · Jake Boggs

Large Language Models for Magic: the Gathering

Context: this was a project from one of my classes. I dumped the content of my final paper here, with some slight tweaks to make it read as a blog post. Introduction Magic: The Gathering (MTG) has always fascinated me with its complexity and depth, thanks to its extensive rulebook and vast array of unique cards. Despite the game’s popularity, AI systems specifically designed for MTG have been few and far between, often falling short due to their inability to accurately interpret the intricate rules and interactions between cards. This blog post chronicles my recent endeavor to bridge this gap with large language models (LLMs) by creating a specialized dataset and evaluation metric to improve AI performance in MTG-related tasks. ...

May 24, 2024 · 8 min · 1493 words · Jake Boggs