Abstract

Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling.
A paper homepage focused on the actual claim, not just the benchmark assets.
Core Finding
Color is easy. Style is not.
Most strong models reach 95% to 100% on color, but only 23% to 34% on style.
Scaling Paradox
Bigger models do not reliably see fonts better.
Performance is non-monotonic across Qwen families, which points to a data bottleneck.
Remedy
Small synthetic fine-tuning closes much of the gap.
LoRA on 3,000 samples lifts Qwen3-VL-8B from 51.1% to 60.8% overall.
Method
FontBench isolates typographic perception from content understanding.
Benchmark Scope
- 250 rendered images
- 1,000 multiple-choice questions
- 26 fonts across four scripts
- 25% random baseline for each property
- 4 properties: font family, size, style, and color
- 3 difficulty levels: easy, medium, and hard distractor settings
- 15 models covering open-source and commercial VLMs
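The scope above implies how an item is posed: each question offers one correct answer among four options, which yields the 25% random baseline per property. A minimal sketch of how such an item could be assembled, assuming a hypothetical schema (the font list, field names, and `build_question` helper are illustrative, not the paper's actual data format):

```python
import random

# Illustrative font pool; FontBench's real pool spans 26 fonts and four scripts.
FONTS = ["Arial", "Times New Roman", "Courier New", "Georgia", "Verdana"]

def build_question(image_path: str, target_font: str, rng: random.Random) -> dict:
    """Build a 4-way multiple-choice item: 1 correct answer + 3 distractors."""
    distractors = rng.sample([f for f in FONTS if f != target_font], k=3)
    options = distractors + [target_font]
    rng.shuffle(options)
    return {
        "image": image_path,
        "property": "family",
        "question": "Which font family is the text rendered in?",
        "options": options,
        "answer": options.index(target_font),  # index of the correct choice
    }

q = build_question("renders/sample_001.png", "Georgia", random.Random(0))
print(q["options"], q["answer"])
```

With four options per item, a model that guesses uniformly scores 25% in expectation on each property, which is the baseline the tables below should be read against.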
Main Results
The benchmark exposes a clean perception hierarchy in current VLMs.
Tables
Key quantitative tables from the paper.
- Best Overall: 66.7% (Gemini-3-Flash)
- Best Family: 80.8% (Gemini-3-Flash)
- Best Size: 52.4% (Gemini-3-Flash)
- Best Fine-Tuned: 60.8% (Qwen3-VL-8B + LoRA)

Main Results
Accuracy (%) on FontBench by property and overall.
| Model | Type | Family | Size | Style | Color | Overall |
|---|---|---|---|---|---|---|
| **Commercial and API Models** | | | | | | |
| Gemini-3-Flash (#1 overall) | Proprietary | 80.8 | 52.4 | 33.6 | 100.0 | 66.7 |
| GPT-5.2 (#2 overall) | Proprietary | 58.8 | 50.0 | 31.2 | 99.6 | 59.9 |
| Claude-Sonnet-4.6 (#3 overall) | Proprietary | 64.0 | 44.8 | 28.0 | 97.6 | 58.6 |
| Doubao-Seed-1.6 | Proprietary | 44.8 | 44.4 | 30.4 | 98.8 | 54.6 |
| Qwen3-Max | Proprietary | 46.0 | 32.4 | 31.2 | 99.2 | 52.2 |
| Gemini-3-Pro | Proprietary | 40.8 | 41.2 | 32.0 | 94.0 | 52.0 |
| GLM-4.5V | MoE | 25.2 | 22.0 | 22.8 | 26.8 | 24.2 |
| **Open-Source and Weight-Available Models** | | | | | | |
| Qwen3-VL-30B-A3B | MoE | 49.6 | 40.4 | 28.0 | 99.6 | 54.4 |
| Qwen3-VL-32B | Open | 42.4 | 37.2 | 26.0 | 100.0 | 51.4 |
| Qwen2.5-VL-7B | Open | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 |
| Qwen2.5-VL-72B | Open | 38.8 | 36.0 | 33.2 | 96.4 | 51.1 |
| Qwen3-VL-8B | Open | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 |
| GLM-4.6V | Open | 39.2 | 35.6 | 25.6 | 100.0 | 50.1 |
| Qwen2.5-VL-32B | Open | 36.0 | 34.4 | 30.4 | 89.2 | 47.5 |
| Pixtral-12B | Open | 26.0 | 27.2 | 28.4 | 24.8 | 26.6 |
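The Overall column is consistent with an unweighted mean of the four per-property accuracies, which makes sense given each property is probed with the same number of items. A quick check over three rows copied from the table above:

```python
# (family, size, style, color, reported overall) per the Main Results table.
rows = {
    "Gemini-3-Flash": (80.8, 52.4, 33.6, 100.0, 66.7),
    "GPT-5.2": (58.8, 50.0, 31.2, 99.6, 59.9),
    "Qwen3-VL-8B": (36.8, 39.2, 28.8, 99.6, 51.1),
}
for model, (family, size, style, color, overall) in rows.items():
    mean = round((family + size + style + color) / 4, 1)
    assert mean == overall, (model, mean, overall)
    print(f"{model}: mean {mean} == reported overall {overall}")
```

Reading the table this way makes the hierarchy explicit: near-ceiling color scores prop up every overall number, while style sits barely above the 25% random baseline for all models.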
Fine-Tuning
LoRA results on FontBench, with FRB overall accuracy as a transfer check.
| Model | Family | Size | Style | Color | Overall | FRB |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 | 6.9 |
| Qwen3-VL-8B | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 | 9.6 |
| Qwen2.5-VL-7B + LoRA | 46.0 | 66.4 | 27.6 | 98.0 | 59.5 | 9.6 |
| Qwen3-VL-8B + LoRA (best adapted model) | 52.8 | 60.0 | 30.4 | 100.0 | 60.8 | 13.1 |
| Qwen2.5-VL-32B + LoRA | 43.6 | 34.4 | 31.2 | 94.8 | 51.0 | 13.3 |
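One reason a 3,000-sample LoRA run can move overall accuracy by nearly ten points is that the adapter trains only a tiny low-rank slice of each weight matrix. A back-of-envelope sketch; the rank, scaling factor, and layer size below are illustrative assumptions, not the paper's actual training configuration:

```python
# LoRA freezes the dense weight W (d_out x d_in) and learns a low-rank
# update via factors B (d_out x r) and A (r x d_in):
#   W' = W + (alpha / r) * B @ A
# Illustrative numbers for a single square projection layer:
d_in, d_out, r = 4096, 4096, 16

full_params = d_out * d_in        # params in a dense update of W
lora_params = r * (d_in + d_out)  # params in the two low-rank factors
fraction = lora_params / full_params

print(f"dense: {full_params:,}  lora: {lora_params:,}  trained: {fraction:.2%}")
```

Under these assumed shapes the adapter trains well under 1% of the layer's parameters, which is why a few thousand synthetic samples suffice to shift font-family and size perception without retraining the backbone.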
FRB Cross-Benchmark
Easy, hard, and overall accuracy on the 15-font Stroop-style task.
| Model | Easy | Hard | Overall |
|---|---|---|---|
| Gemini-3-Flash (best FRB) | 56.7 | 29.8 | 40.5 |
| Claude-Sonnet-4.6 | 34.0 | 15.6 | 22.9 |
| GPT-5.2 | 26.7 | 16.9 | 20.8 |
| Gemini-3-Pro | 24.7 | 10.7 | 16.3 |
| Qwen3-VL-8B + LoRA | 22.7 | 6.7 | 13.1 |
| Qwen3-Max | 15.3 | 6.7 | 10.1 |
| Qwen3-VL-30B-A3B | 14.7 | 6.7 | 9.9 |
| Qwen3-VL-8B | 14.0 | 6.7 | 9.6 |
Analysis
The failure is robust, structured, and visible across multiple probes.
Resources
Use the paper, code, benchmark metadata, and citation directly from this page.
BibTeX
@misc{zhou2026readingneqseeingdiagnosing,
  title={Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models},
  author={Heng Zhou and Ao Yu and Li Kang and Yuchen Fan and Yutao Fan and Xiufeng Song and Hejia Geng and Yiran Qin},
  year={2026},
  eprint={2603.08497},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08497},
}