Jake Boggs

HieroglyphBench: A Modern OCR Benchmark Using Ancient Text

Today I’m releasing HieroglyphBench, a challenging benchmark that tests how well VLMs can transcribe ancient Egyptian hieroglyphs. You can check out the dataset on Hugging Face and the source code on GitHub.

Given a column of hieroglyphs, models must output the signs they see as Gardiner sign-list codes, in reading order. A Gardiner code names a sign’s type regardless of which way it faces: the owl is G17, the mouth is D21, and the seated man is A1. The results are scored using edit distance against ground-truth transcriptions.

Every reasoning-capable model is run at medium effort. The three open-weight models, Kimi K2.6, MiniMax M3, and Qwen 3.7 Plus, use their official providers to avoid third-party hosting issues.

Many visual understanding benchmarks are nearing saturation and struggle to differentiate between the latest frontier models, despite substantial differences in their capabilities. HieroglyphBench exposes this gap: the best models (all of which are from the Gemini family) barely reach 50% sign accuracy. Models from the other labs struggle even more, and most do not reach 20%.

Examples

[Image: photographed column of hieroglyphs, Pyramid of Unas plate 20 (col 3)]

  • Ground truth: D21 G17 O4 D21 G43 N5 Q3 N35 G17 M17 D4 E23
  • Gemini 3.5 Flash (83%): D21 G17 O4 D21 G43 N5 N35 G14 M17 D4 E23
  • GPT-5.5 (42%): D21 G1 O4 D21 G1 N5 U6 N35 G1 S29 D21 E34
  • Claude Opus 4.8 (17%): N1 G17 O4 N37 G17 N5 X1 N35 G29 S29 D58 G43 N35 T22

[Image: photographed column of hieroglyphs, Pyramid of Unas plate 21 (col 11)]

  • Ground truth: N31 N35 V28 D21 X1 N31 E34 N35 M17 S29 Q3 M17 D4 G43 F13 Q3 X1 N35 N37 N35
  • Gemini 3.1 Pro (60%): U33 N35 S34 D21 X1 U33 I3 E34 N35 M17 M17 M17 Q3 D4 G43 F31 Q3 X1 N35 N37 N35
  • Kimi K2.6 (20%): O36 F40 D21 X1 I9 G5 N35 Z1 Z1 M17 O29 D21 G39 I9 D36 X8 N35 O29
  • Qwen 3.7 Plus (20%): O29 Z1 D21 Q3 M17 L2 D21 N5 G1 D54 F21 G17 N35 Z1 N35

[Image: photographed column of hieroglyphs, Pyramid of Unas plate 7 (col 12)]

  • Ground truth: D46 V4 N14 S29 N35 V13 G43 G17 D21 N35 V31 G43 Q3 N35
  • Gemini 3 Flash Preview (64%): D21 D46 V1 N14 S29 N35 V31 G43 G17 D21 N35 V30 G43 N35
  • GPT-5.4 Mini (21%): O1 Z9 Z1 N35 G43 G43
  • MiniMax M3 (7%): Aa2 N14 Z4 V31 T8 G5 Z4 G17 O11

The top models track the column sign by sign and slip mostly on signs that look alike: one bird for another, a reed for a forearm. Lower-scoring models catch an occasional sign but lose the order or fill the gaps with noise.

Dataset construction

The source is the Pyramid of Unas dataset (Morris Franken’s GlyphDataset), built from Alexandre Piankoff’s 1955 photographic plates of the Pyramid Texts. The texts are inscribed in the tomb of the pharaoh Unas, which dates to the late 5th Dynasty, around 2350 BCE.

From this, I build column-level inscription items by detecting columns and splitting them into chunks. Each chunk is cropped from the original plate photograph using the bounding box of its signs plus padding, and the ground truth is the ordered list of Gardiner codes in that crop. This produced ~200 images, which I skimmed manually to pick 30 examples for the final dataset. Thirty items are enough to keep the noise low (re-running moves the scores by no more than ~1%) while staying cheap enough for me to update the leaderboard as new models are released.
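To make the cropping step concrete, here is a minimal sketch of how a chunk's crop region could be derived from its sign boxes. This is not the benchmark's actual code: the function name `crop_box`, the `(left, top, right, bottom)` box layout, and the default padding of 20 pixels are all assumptions for illustration.

```python
from typing import Iterable, Tuple

# Assumed box layout: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]


def crop_box(sign_boxes: Iterable[Box], image_size: Tuple[int, int], pad: int = 20) -> Box:
    """Union of all sign boxes in a chunk, grown by `pad` and clamped to the image.

    `image_size` is (width, height). The padding value is a hypothetical default.
    """
    boxes = list(sign_boxes)
    if not boxes:
        raise ValueError("a chunk needs at least one sign box")
    width, height = image_size

    # Smallest rectangle that contains every sign, expanded by the padding.
    left = min(b[0] for b in boxes) - pad
    top = min(b[1] for b in boxes) - pad
    right = max(b[2] for b in boxes) + pad
    bottom = max(b[3] for b in boxes) + pad

    # Clamp so the crop never extends past the plate photograph.
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))


# Two signs stacked in a column near the left edge of a 200x150 image.
region = crop_box([(10, 30, 50, 80), (15, 90, 45, 140)], image_size=(200, 150))
print(region)
```

The returned tuple uses the same `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so it could be passed straight to that call if the plates are loaded with Pillow.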

Gathering the data was by far the most time-consuming part of this project. I tried scraping a bunch of different sources, but they were all too low quality. I was hoping for more diversity from different monuments, but I think the current version is good enough for now. If you’re reading this and you know of another high-quality source I could add, please reach out and maybe I’ll be able to add it to a v2!

Scoring

The model returns a JSON array of Gardiner codes, enforced with a strict JSON-schema structured output. From this, I compute two metrics:

  • Sign error rate: the Levenshtein distance between the predicted sequence and the ground-truth, divided by the number of ground-truth signs.
  • Sign accuracy: 1 − sign error rate, floored at 0, so a prediction with more errors than there are signs scores 0 rather than going negative.
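The two metrics above can be sketched in a few lines. This is an illustrative implementation under the definitions given, not the benchmark's own code; the function names are hypothetical. The example input is the plate 20 (col 3) ground truth and the Gemini 3.5 Flash prediction from the Examples section.

```python
from typing import Sequence


def levenshtein(pred: Sequence[str], truth: Sequence[str]) -> int:
    """Edit distance over whole sign codes (insert, delete, substitute each cost 1)."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # drop a predicted sign
                            curr[j - 1] + 1,      # add a missing sign
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def sign_error_rate(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Edit distance divided by the number of ground-truth signs."""
    if not truth:
        raise ValueError("ground truth must contain at least one sign")
    return levenshtein(pred, truth) / len(truth)


def sign_accuracy(pred: Sequence[str], truth: Sequence[str]) -> float:
    """1 - sign error rate, floored at 0."""
    return max(0.0, 1.0 - sign_error_rate(pred, truth))


truth = "D21 G17 O4 D21 G43 N5 Q3 N35 G17 M17 D4 E23".split()
pred = "D21 G17 O4 D21 G43 N5 N35 G14 M17 D4 E23".split()
print(levenshtein(pred, truth))              # 2: one missing Q3, one G17 read as G14
print(round(sign_accuracy(pred, truth), 2))  # 0.83
```

Two errors against twelve ground-truth signs gives 1 − 2/12 ≈ 0.83, which matches the 83% shown for that example. The floor matters for long, wrong outputs: a five-sign prediction against a one-sign ground truth has an error rate of 5 and would otherwise score −4.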

Codes are canonicalized before comparison (G017 -> G17, aa1 -> Aa1) so formatting differences don’t count as errors. Both metrics are computed per inscription and then averaged across the whole eval set, weighting each inscription equally. Random guessing scores ~0, since there are over a thousand possible signs to guess from.
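Here is a rough sketch of what canonicalization and equal-weight averaging could look like. The regex and the rules it encodes (strip leading zeros, capitalize the category, treat "aa" as "Aa", uppercase any variant suffix) are my assumptions based on the two examples given, and `canonicalize` and `macro_average` are hypothetical names rather than the benchmark's real functions.

```python
import re
from typing import Sequence

# Assumed shape of a code: 1-2 category letters, digits, optional variant letter.
_CODE = re.compile(r"^\s*([A-Za-z]{1,2})0*(\d+)([A-Za-z]?)\s*$")


def canonicalize(code: str) -> str:
    """Normalize formatting so that e.g. 'G017' and 'G17' compare equal."""
    m = _CODE.match(code)
    if m is None:
        return code.strip()  # leave anything unrecognized untouched
    letters, number, variant = m.groups()
    # 'Aa' is the one mixed-case category; every other category is uppercase.
    category = "Aa" if letters.lower() == "aa" else letters.upper()
    return f"{category}{number}{variant.upper()}"


def macro_average(per_inscription_scores: Sequence[float]) -> float:
    """Plain mean: every inscription counts once, regardless of its length."""
    if not per_inscription_scores:
        raise ValueError("need at least one inscription score")
    return sum(per_inscription_scores) / len(per_inscription_scores)


print(canonicalize("G017"))        # G17
print(canonicalize("aa1"))         # Aa1
print(macro_average([0.5, 1.0]))   # 0.75
```

Averaging per inscription means a short column and a long column carry the same weight in the final score, which is different from pooling all signs and dividing total errors by total signs.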

Thoughts

Last December, after Gemini 3 was released, I had some fun using it to translate medieval manuscripts. This made me wonder how far the models could be pushed, but I couldn’t find any evaluations online, so it just stayed as a nagging idea in the back of my head. I recently had a bit of time to do more testing and this benchmark is the result. None of the current models are reliable enough yet to be used for serious research, but they’re improving quickly and it seems plausible that they’ll saturate this task within a generation or two. This is very interesting to me: while it is possible that labs are training specifically for this task, that seems unlikely, and I suspect this is an emergent capability. I plan to track this as new models are released and will update the leaderboard.