Hi there 👋

I’m Jake, an obsessive AI / ML researcher, currently at Endeavor AI. I’m the guy with the Crocs.

Translating Historical Manuscripts with Gemini 3

I’ve spent a lot of time working on document understanding for products like automated order entry, so I’m always looking for new ways to evaluate the visual capabilities of LLMs. About a month ago, I stumbled across this post by Mark Humphries about Gemini 3 Pro’s impressive ability to transcribe historical texts. It seemed interesting enough to spend an evening building an app around (and I wanted to impress my girlfriend, who majors in anthropology and is interested in medieval medicine). ...

January 19, 2026 · 4 min · 794 words · Jake Boggs

Scalable Reinforcement Learning with LLMs - Atropos Guide

This weekend, I’ll be attending the Nous Research – RL Environments Hackathon, so to prepare I’ve been playing around with Atropos, their new RL framework that we’ll be using for the event. After failing to find any guides online, I decided to write my own. Update: I got 2nd place with VR-CLImax, my implementation of Verified Rewards via Completion Likelihood Improvement, an RL environment for teaching LLMs how to make jokes! You can find the code merged into the Atropos repository. ...

May 16, 2025 · 9 min · 1896 words · Jake Boggs

Evaluating Reasoning in LLMs Through MTG Deck Building

Update (2026-01-14): Gemini 3 Pro, Gemini 3 Flash, GPT 5.2 (medium), Claude Opus 4.5, and Grok 4.1 Fast added.
Update (2025-08-08): GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4 Turbo 11-06, and GPT-3.5 Turbo added.
Update (2025-08-05): Kimi K2, GPT OSS 120B (low), and GPT OSS 120B (high) added.
Update (2025-07-13): Grok 4, Grok 3, Gemini 2.5 Flash, Claude Sonnet 4 (thinking), and Command A added.
Update (2025-06-11): o3 (high) added after API cost reduction.
Update (2025-06-06): Deepseek R1 05-28 and Gemini 2.5 Pro 06-05 added.
Update (2025-05-22): Claude Sonnet 4 and Opus 4 added.
Update (2025-05-14): Human Baseline, Gemini 2.5 Pro 03-25, Gemini 1.5 Flash, Deepseek V3 03-24, Qwen3 30B 3A added.

Introduction
I have an obsession with applying AI models to my favorite card game, so I’ve created ManaBench, a benchmark designed to probe an LLM’s capacity for reasoning using the collectible card game Magic: The Gathering (MTG). With its intricate interactions and deep strategy, MTG serves as an ideal micro-world to test an LLM’s ability to process contextual information, identify patterns, and make judgments that align with expert human choices. This post provides an overview of the benchmark’s construction and the methodology used for evaluation. ...

May 9, 2025 · 12 min · 2494 words · Jake Boggs

Introducing Manamorphosis: A Diffusion Model for MTG Deck Generation

This post details Manamorphosis, a first-of-its-kind diffusion model developed to complete Magic: The Gathering decklists. It takes a set of known cards and fills in the rest to form a 60-card main deck. Subsequently, using the completed main deck as context, it can complete a 15-card sideboard. The core generative mechanism is based on Denoising Diffusion Probabilistic Models (DDPMs), the same family of models powering many image generation systems like Stable Diffusion and Midjourney, but adapted here to produce sets of cards. I’m excited to share this model, as I believe it is the state-of-the-art (and only) AI model trained specifically for decklist generation. ...

May 5, 2025 · 15 min · 3166 words · Jake Boggs

Gait Analysis for Physical Therapy with YOLOv11

Analyzing how people walk using video is common in research and clinical settings, but getting accurate joint angles usually means either expensive equipment or manually annotating frames, which is slow and tedious. Over Easter weekend, I built a Python tool to help my mother with her research study analyzing patient videos. It uses YOLOv11-pose for automatic detection and adds an interactive interface for manual adjustments. ...
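As a rough illustration of the core measurement (not the tool’s actual code), a joint angle can be computed from three pose keypoints, e.g. hip, knee, and ankle, via the angle between the two limb vectors meeting at the joint; the function name `joint_angle` is my own:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by points a and c (2D pixel coords)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b  # limb vectors from the joint outward
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# e.g. hip above the knee, ankle to its right -> a right angle at the knee
print(joint_angle((0, 1), (0, 0), (1, 0)))  # 90.0
```

With YOLOv11-pose, the keypoints themselves would come from the model’s per-frame predictions; clipping the cosine guards against floating-point values slightly outside [-1, 1].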

April 21, 2025 · 5 min · 960 words · Jake Boggs

AccountaBuddy: Your AI Accountability Partner - HackNC 2024

Check out the project on Devpost and view the source code on GitHub. I spent a weekend at HackNC building something to help manage all of my other side projects. What started as “Wouldn’t it be cool if an AI actually rang you to ask if you’d done your work?” turned into AccountaBuddy, a lightweight app that does more than fire off push notifications - it actually calls you, celebrates your wins, and helps you problem-solve when progress stalls. ...

November 4, 2024 · 2 min · 349 words · Jake Boggs

Daily Yap: A Synthetically Generated Conversational Audio Dataset

Training multimodal models often requires large, high-quality conversational audio datasets, which are scarce at the time of writing. Existing conversational audio datasets present several limitations:
Content Scope: Many datasets focus on assistant-user interactions, lacking the breadth of topics found in general human dialogues.
Audio-Text Alignment: Datasets with precise alignment between high-quality audio and accurate transcriptions are uncommon.
Speaker Diversity: The use of few speakers limits the generalizability of models trained on these datasets.
Scalability: Human recording is resource-intensive, hindering the creation of large-scale datasets.
Daily Yap was created to overcome these challenges by providing a synthetically generated conversational audio resource suitable for training real-time conversational audio models. ...

June 23, 2024 · 5 min · 876 words · Jake Boggs

Large Language Models for Magic: the Gathering

Context: this was a project from one of my classes. I dumped the content of my final paper here, with some slight tweaks to make it read as a blog post. Introduction Magic: The Gathering (MTG) has always fascinated me with its complexity and depth, thanks to its extensive rulebook and vast array of unique cards. Despite the game’s popularity, AI systems specifically designed for MTG have been few and far between, often falling short due to their inability to accurately interpret the intricate rules and interactions between cards. This blog post chronicles my recent endeavor to bridge this gap with large language models (LLMs) by creating a specialized dataset and evaluation metric to improve AI performance in MTG-related tasks. ...

May 24, 2024 · 8 min · 1493 words · Jake Boggs