Jake Boggs

I'm Jake, an obsessive AI / ML researcher. You might recognize me as the guy with the patriotic Crocs.

Posts

HieroglyphBench: A Modern OCR Benchmark Using Ancient Text

Jun 2026 · 5 min

Standard OCR benchmarks are mostly saturated, so I made something harder. HieroglyphBench tests how well VLMs can read ancient Egyptian hieroglyphs.

How Well Do Benchmarks Predict Which Models People Use?

Jun 2026 · 8 min

Evaluating LLM benchmarks by analyzing how they correlate with OpenRouter market share.

Benchmark Dashboard

May 2026 · 6 min

My own version of the Epoch Capabilities Index for tracking niche benchmarks, plus my takes on how to identify the ones worth paying attention to.

Translating Historical Manuscripts with Gemini 3

Jan 2026 · 4 min

An app for transcribing and translating historical manuscripts using Gemini 3 and GPT-5.2. Built as an early Christmas present for my girlfriend.

Scalable Reinforcement Learning with LLMs - Atropos Guide

May 2025 · 9 min

Quick intro to Atropos, the RL framework from Nous Research. Written while prepping for their hackathon, where I got 2nd place with an environment for teaching LLMs how to write jokes.

Evaluating Reasoning in LLMs Through MTG Deck Building

May 2025 · 12 min

ManaBench is my benchmark for testing LLM reasoning using Magic: The Gathering deck building. Models are tasked with picking the best card from 6 options to complete a deck.

Introducing Manamorphosis: A Diffusion Model for MTG Deck Generation

May 2025 · 15 min

I trained a diffusion model to complete Magic: The Gathering decklists. Give it some cards, it fills in the rest.

Gait Analysis for Physical Therapy with YOLOv11

Apr 2025 · 4 min

Built a video analysis tool for my mom's physical therapy research. Uses YOLOv11-pose for automatic joint detection with a drag-to-correct interface.

Daily Yap: A Synthetically Generated Conversational Audio Dataset

Jun 2024 · 4 min

90-hour conversational audio dataset. Used GPT-4o to clean up DailyDialog transcripts and XTTS v2 to synthesize speech with 8 different voices. Available on Hugging Face.

Large Language Models for Magic: The Gathering

May 2024 · 7 min

One of my early attempts at teaching LLMs to understand MTG. Created a dataset of 80k question-answer pairs covering card text, rulings, and combos, then fine-tuned Llama 3 8B with QLoRA.