Benchmark Dashboard
May 2026 · 5 min
My own version of the Epoch Capabilities Index for tracking niche benchmarks and my takes on how to identify ones worth paying attention to.
I'm Jake, an obsessive AI / ML researcher. You might recognize me as the guy with the patriotic Crocs.
An app for transcribing and translating historical manuscripts using Gemini 3 and GPT-5.2. Built as an early Christmas present for my girlfriend.
Quick intro to Atropos, the RL framework from Nous Research. Written while prepping for their hackathon, where I got 2nd place with an environment for teaching LLMs how to write jokes.
ManaBench is my benchmark for testing LLM reasoning using Magic: The Gathering deck building. Models are tasked with picking the best card from 6 options to complete a deck.
I trained a diffusion model to complete Magic: The Gathering decklists. Give it some cards, it fills in the rest. Trained on 47k tournament decks.
Built a video analysis tool for my mom's physical therapy research. Uses YOLOv11-pose for automatic joint detection with a drag-to-correct interface.
90-hour conversational audio dataset. Used GPT-4o to clean up DailyDialog transcripts and XTTS v2 to synthesize speech in 8 different voices. Available on Hugging Face.
One of my early attempts at teaching LLMs to understand MTG. Created a dataset of 80k question-answer pairs covering card text, rulings, and combos, then fine-tuned Llama 3 8B with QLoRA.