Abstract

Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling.
A paper homepage focused on the actual claim, not just the benchmark assets.
Core Finding
Color is easy. Style is not.
Most strong models reach 95% to 100% on color, but only 23% to 34% on style.
Scaling Paradox
Bigger models do not reliably see fonts better.
Performance is non-monotonic across Qwen families, which points to a data bottleneck.
Remedy
Small synthetic fine-tuning closes much of the gap.
LoRA on 3,000 samples lifts Qwen3-VL-8B from 51.1% to 60.8% overall.
Method
FontBench isolates typographic perception from content understanding.
Benchmark Scope
- 250 rendered images
- 1,000 multiple-choice questions
- 26 fonts across four scripts
- 25% random baseline for each property
- 4 properties: font family, size, style, and color
- 3 difficulty levels: easy, medium, and hard distractor settings
- 15 models covering open-source and commercial VLMs
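The scope above implies how an item is posed: each question offers one correct answer among four options, which yields the 25% random baseline per property. A minimal sketch of how such an item could be assembled, assuming a hypothetical schema (the font list, field names, and `build_question` helper are illustrative, not the paper's actual data format):

```python
import random

# Illustrative font pool; FontBench's real pool spans 26 fonts and four scripts.
FONTS = ["Arial", "Times New Roman", "Courier New", "Georgia", "Verdana"]

def build_question(image_path: str, target_font: str, rng: random.Random) -> dict:
    """Build a 4-way multiple-choice item: 1 correct answer + 3 distractors."""
    distractors = rng.sample([f for f in FONTS if f != target_font], k=3)
    options = distractors + [target_font]
    rng.shuffle(options)
    return {
        "image": image_path,
        "property": "family",
        "question": "Which font family is the text rendered in?",
        "options": options,
        "answer": options.index(target_font),  # index of the correct choice
    }

q = build_question("renders/sample_001.png", "Georgia", random.Random(0))
print(q["options"], q["answer"])
```

With four options per item, a model that guesses uniformly scores 25% in expectation on each property, which is the baseline the tables below should be read against.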
Main Results
The benchmark exposes a clean perception hierarchy in current VLMs.
Tables
Key quantitative tables from the paper.
- Best Overall: 66.7% (Gemini-3-Flash)
- Best Family: 80.8% (Gemini-3-Flash)
- Best Size: 52.4% (Gemini-3-Flash)
- Best Fine-Tuned: 60.8% (Qwen3-VL-8B + LoRA)

Main Results
Accuracy (%) on FontBench by property and overall.
| Model | Type | Family | Size | Style | Color | Overall |
|---|---|---|---|---|---|---|
| **Commercial and API Models** | | | | | | |
| Gemini-3-Flash (#1 overall) | Proprietary | 80.8 | 52.4 | 33.6 | 100.0 | 66.7 |
| GPT-5.2 (#2 overall) | Proprietary | 58.8 | 50.0 | 31.2 | 99.6 | 59.9 |
| Claude-Sonnet-4.6 (#3 overall) | Proprietary | 64.0 | 44.8 | 28.0 | 97.6 | 58.6 |
| Doubao-Seed-1.6 | Proprietary | 44.8 | 44.4 | 30.4 | 98.8 | 54.6 |
| Qwen3-Max | Proprietary | 46.0 | 32.4 | 31.2 | 99.2 | 52.2 |
| Gemini-3-Pro | Proprietary | 40.8 | 41.2 | 32.0 | 94.0 | 52.0 |
| GLM-4.5V | MoE | 25.2 | 22.0 | 22.8 | 26.8 | 24.2 |
| **Open-Source and Weight-Available Models** | | | | | | |
| Qwen3-VL-30B-A3B | MoE | 49.6 | 40.4 | 28.0 | 99.6 | 54.4 |
| Qwen3-VL-32B | Open | 42.4 | 37.2 | 26.0 | 100.0 | 51.4 |
| Qwen2.5-VL-7B | Open | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 |
| Qwen2.5-VL-72B | Open | 38.8 | 36.0 | 33.2 | 96.4 | 51.1 |
| Qwen3-VL-8B | Open | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 |
| GLM-4.6V | Open | 39.2 | 35.6 | 25.6 | 100.0 | 50.1 |
| Qwen2.5-VL-32B | Open | 36.0 | 34.4 | 30.4 | 89.2 | 47.5 |
| Pixtral-12B | Open | 26.0 | 27.2 | 28.4 | 24.8 | 26.6 |
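The Overall column is consistent with an unweighted mean of the four per-property accuracies, which makes sense given each property is probed with the same number of items. A quick check over three rows copied from the table above:

```python
# (family, size, style, color, reported overall) per the Main Results table.
rows = {
    "Gemini-3-Flash": (80.8, 52.4, 33.6, 100.0, 66.7),
    "GPT-5.2": (58.8, 50.0, 31.2, 99.6, 59.9),
    "Qwen3-VL-8B": (36.8, 39.2, 28.8, 99.6, 51.1),
}
for model, (family, size, style, color, overall) in rows.items():
    mean = round((family + size + style + color) / 4, 1)
    assert mean == overall, (model, mean, overall)
    print(f"{model}: mean {mean} == reported overall {overall}")
```

Reading the table this way makes the hierarchy explicit: near-ceiling color scores prop up every overall number, while style sits barely above the 25% random baseline for all models.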
Fine-Tuning
LoRA results on FontBench, with FRB overall accuracy as a transfer check.
| Model | Family | Size | Style | Color | Overall | FRB |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 | 6.9 |
| Qwen3-VL-8B | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 | 9.6 |
| Qwen2.5-VL-7B + LoRA | 46.0 | 66.4 | 27.6 | 98.0 | 59.5 | 9.6 |
| Qwen3-VL-8B + LoRA (best adapted model) | 52.8 | 60.0 | 30.4 | 100.0 | 60.8 | 13.1 |
| Qwen2.5-VL-32B + LoRA | 43.6 | 34.4 | 31.2 | 94.8 | 51.0 | 13.3 |
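One reason a 3,000-sample LoRA run can move overall accuracy by nearly ten points is that the adapter trains only a tiny low-rank slice of each weight matrix. A back-of-envelope sketch; the rank, scaling factor, and layer size below are illustrative assumptions, not the paper's actual training configuration:

```python
# LoRA freezes the dense weight W (d_out x d_in) and learns a low-rank
# update via factors B (d_out x r) and A (r x d_in):
#   W' = W + (alpha / r) * B @ A
# Illustrative numbers for a single square projection layer:
d_in, d_out, r = 4096, 4096, 16

full_params = d_out * d_in        # params in a dense update of W
lora_params = r * (d_in + d_out)  # params in the two low-rank factors
fraction = lora_params / full_params

print(f"dense: {full_params:,}  lora: {lora_params:,}  trained: {fraction:.2%}")
```

Under these assumed shapes the adapter trains well under 1% of the layer's parameters, which is why a few thousand synthetic samples suffice to shift font-family and size perception without retraining the backbone.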
FRB Cross-Benchmark
Easy, hard, and overall accuracy on the 15-font Stroop-style task.
| Model | Easy | Hard | Overall |
|---|---|---|---|
| Gemini-3-Flash (best FRB) | 56.7 | 29.8 | 40.5 |
| Claude-Sonnet-4.6 | 34.0 | 15.6 | 22.9 |
| GPT-5.2 | 26.7 | 16.9 | 20.8 |
| Gemini-3-Pro | 24.7 | 10.7 | 16.3 |
| Qwen3-VL-8B + LoRA | 22.7 | 6.7 | 13.1 |
| Qwen3-Max | 15.3 | 6.7 | 10.1 |
| Qwen3-VL-30B-A3B | 14.7 | 6.7 | 9.9 |
| Qwen3-VL-8B | 14.0 | 6.7 | 9.6 |
Analysis
The failure is robust, structured, and visible across multiple probes.
Resources
Use the paper, code, benchmark metadata, and citation directly from this page.
BibTeX
@misc{zhou2026readingneqseeingdiagnosing,
  title={Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models},
  author={Heng Zhou and Ao Yu and Li Kang and Yuchen Fan and Yutao Fan and Xiufeng Song and Hejia Geng and Yiran Qin},
  year={2026},
  eprint={2603.08497},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08497},
}