Project Page / FontBlind Code Release / 2026

Reading != Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models

Heng Zhou, Ao Yu, Li Kang, Yuchen Fan, Yutao Fan, Xiufeng Song, Hejia Geng, Yiran Qin

Heng Zhou and Ao Yu contributed equally.

Vision-Language Models read text almost perfectly, but still struggle to perceive how that text is rendered. This project page presents the paper, benchmark, figures, and code release for diagnosing that gap.

Teaser figure showing that VLMs read text correctly but fail on font family, size, and style.
Reading != Seeing: four strong VLMs read the text correctly, but only color is recognized consistently. Family, size, and style remain weak.

Abstract

A paper homepage focused on the central claim, not just the benchmark assets.

Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling.

Core Finding

Color is easy. Style is not.

Most strong models reach 95% to 100% on color, but only 23% to 34% on style.

Scaling Paradox

Bigger models do not reliably see fonts better.

Performance is non-monotonic across Qwen families, which points to a data bottleneck.

Remedy

Small synthetic fine-tuning closes much of the gap.

LoRA on 3,000 samples lifts Qwen3-VL-8B from 51.1% to 60.8% overall.
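As background on the remedy, LoRA adds a trainable low-rank update to a frozen weight, computing y = W x + (alpha / r) * B (A x) with B initialized to zero so training starts from the base model. The sketch below illustrates only that arithmetic on toy matrices; it is not the paper's training setup, and all dimensions and hyperparameter values here are illustrative assumptions.

```python
def matvec(M, v):
    # Dense matrix-vector product for small illustrative matrices.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16, r=2):
    """Frozen layer plus low-rank adapter: y = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight; A (r x d_in) and B (d_out x r)
    are the small trainable adapter matrices.
    """
    base = matvec(W, x)              # frozen-path output
    delta = matvec(B, matvec(A, x))  # low-rank update path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: d_in = 3, d_out = 2, rank r = 2.
W = [[1, 0, 0],
     [0, 1, 0]]
x = [1, 2, 3]
A = [[1, 1, 1],
     [0, 1, 0]]
B_zero = [[0, 0], [0, 0]]  # standard LoRA init: adapter contributes nothing
B_id = [[1, 0], [0, 1]]

y_init = lora_forward(x, W, A, B_zero)          # equals the frozen output W x
y_tuned = lora_forward(x, W, A, B_id, alpha=2)  # scale = 1, adds B(Ax) = [6, 2]
```

With B at zero the adapted layer reproduces the frozen layer exactly, which is why LoRA can be bolted onto a pretrained VLM without disturbing its initial behavior.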

Method

FontBench isolates typographic perception from content understanding.

Pipeline figure showing text corpus, rendering, question generation, and VLM evaluation.
Text rendering, MCQ generation, and model evaluation are controlled so the task tests typographic perception rather than OCR alone.
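The question-generation step can be pictured as sampling distractors against a known ground truth and shuffling. A minimal sketch, with hypothetical option pools; the paper's actual font lists, prompts, and distractor policy are not reproduced here.

```python
import random

# Hypothetical option pools for illustration only; not the benchmark's real lists.
POOLS = {
    "family": ["Arial", "Times New Roman", "Courier New", "Garamond", "Futura"],
    "size":   ["12 pt", "18 pt", "24 pt", "36 pt", "48 pt"],
    "style":  ["regular", "bold", "italic", "bold italic"],
    "color":  ["black", "red", "blue", "green", "orange"],
}

def make_mcq(prop, answer, rng, n_options=4):
    """Build one multiple-choice question: ground truth plus sampled distractors."""
    distractors = [opt for opt in POOLS[prop] if opt != answer]
    options = rng.sample(distractors, n_options - 1) + [answer]
    rng.shuffle(options)  # hide the answer position
    return {
        "question": f"Which {prop} is used to render the text in the image?",
        "options": options,
        "answer_index": options.index(answer),
    }

rng = random.Random(0)
q = make_mcq("family", "Garamond", rng)
```

Four options per question is what gives each property its 25% random baseline.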

Benchmark Scope

  • 250 rendered images
  • 1,000 multiple-choice questions
  • 26 fonts across four scripts
  • 25% random baseline for each property
  • 4 properties: font family, size, style, and color
  • 3 difficulty levels: easy, medium, and hard distractor settings
  • 15 models covering open-source and commercial VLMs
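Scoring against the 25% random baseline reduces to per-property and overall accuracy over graded answers. A minimal sketch, assuming each graded record is a (property, correct) pair:

```python
from collections import defaultdict

def score(records, chance=0.25):
    """Per-property and overall accuracy over (property, correct) records,
    plus each property's margin over the 4-option chance baseline."""
    hits, totals = defaultdict(int), defaultdict(int)
    for prop, correct in records:
        totals[prop] += 1
        hits[prop] += int(correct)
    per_prop = {p: hits[p] / totals[p] for p in totals}
    overall = sum(hits.values()) / sum(totals.values())
    margin = {p: acc - chance for p, acc in per_prop.items()}
    return per_prop, overall, margin

# Toy records, not benchmark data.
records = [("color", True), ("color", True), ("style", True), ("style", False)]
per_prop, overall, margin = score(records)
```

A property whose margin hovers near zero, as style does for many models in the tables below, is effectively being answered at chance.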
Sample gallery showing the four typographic properties across multiple scripts.
Representative benchmark samples across family, size, style, and color.
Dataset statistics figure showing scripts, difficulty, and font category breakdowns.
Dataset composition across scripts, difficulty levels, and font categories.

Main Results

The benchmark exposes a clean perception hierarchy in current VLMs.

Heatmap of model accuracy by font family, size, style, and color.
Model-by-property accuracy: color is largely solved, while style remains consistently weak.
Concept figure showing the hierarchy from color to style in terms of perceptual complexity.
Perception hierarchy: color maps to low-level statistics, while style requires relational reasoning.
Radar charts showing per-property model profiles and the effect of fine-tuning.
All models share a spiky profile, and fine-tuning changes that profile unevenly across properties.
Bar chart comparing baseline and LoRA fine-tuned models.
Targeted LoRA fine-tuning closes much of the overall gap and especially improves size perception.

Tables

Key quantitative tables from the paper.

Best Overall

66.7% Gemini-3-Flash

Best Family

80.8% Gemini-3-Flash

Best Size

52.4% Gemini-3-Flash

Best Fine-Tuned

60.8% Qwen3-VL-8B + LoRA

Main Results

Accuracy (%) on FontBench by property and overall.

| Model | Type | Family | Size | Style | Color | Overall |
|---|---|---|---|---|---|---|
| **Commercial and API Models** | | | | | | |
| GPT-5.2 (#2 overall) | Proprietary | 58.8 | 50.0 | 31.2 | 99.6 | 59.9 |
| Claude-Sonnet-4.6 (#3 overall) | Proprietary | 64.0 | 44.8 | 28.0 | 97.6 | 58.6 |
| Doubao-Seed-1.6 | Proprietary | 44.8 | 44.4 | 30.4 | 98.8 | 54.6 |
| Qwen3-Max | Proprietary | 46.0 | 32.4 | 31.2 | 99.2 | 52.2 |
| Gemini-3-Pro | Proprietary | 40.8 | 41.2 | 32.0 | 94.0 | 52.0 |
| GLM-4.5V | MoE | 25.2 | 22.0 | 22.8 | 26.8 | 24.2 |
| **Open-Source and Weight-Available Models** | | | | | | |
| Qwen3-VL-30B-A3B | MoE | 49.6 | 40.4 | 28.0 | 99.6 | 54.4 |
| Qwen3-VL-32B | Open | 42.4 | 37.2 | 26.0 | 100.0 | 51.4 |
| Qwen2.5-VL-7B | Open | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 |
| Qwen2.5-VL-72B | Open | 38.8 | 36.0 | 33.2 | 96.4 | 51.1 |
| Qwen3-VL-8B | Open | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 |
| GLM-4.6V | Open | 39.2 | 35.6 | 25.6 | 100.0 | 50.1 |
| Qwen2.5-VL-32B | Open | 36.0 | 34.4 | 30.4 | 89.2 | 47.5 |
| Pixtral-12B | Open | 26.0 | 27.2 | 28.4 | 24.8 | 26.6 |

Fine-Tuning

LoRA results on FontBench and FRB overall transfer.

| Model | Family | Size | Style | Color | Overall | FRB |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 35.2 | 44.4 | 27.6 | 97.6 | 51.2 | 6.9 |
| Qwen3-VL-8B | 36.8 | 39.2 | 28.8 | 99.6 | 51.1 | 9.6 |
| Qwen2.5-VL-7B + LoRA | 46.0 | 66.4 | 27.6 | 98.0 | 59.5 | 9.6 |
| Qwen2.5-VL-32B + LoRA | 43.6 | 34.4 | 31.2 | 94.8 | 51.0 | 13.3 |

FRB Cross-Benchmark

Easy, hard, and overall accuracy on the 15-font Stroop-style task.

| Model | Easy | Hard | Overall |
|---|---|---|---|
| Claude-Sonnet-4.6 | 34.0 | 15.6 | 22.9 |
| GPT-5.2 | 26.7 | 16.9 | 20.8 |
| Gemini-3-Pro | 24.7 | 10.7 | 16.3 |
| Qwen3-VL-8B + LoRA | 22.7 | 6.7 | 13.1 |
| Qwen3-Max | 15.3 | 6.7 | 10.1 |
| Qwen3-VL-30B-A3B | 14.7 | 6.7 | 9.9 |
| Qwen3-VL-8B | 14.0 | 6.7 | 9.6 |
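A Stroop-style item pairs a displayed font name with a rendering font that may or may not match it, so reading the word and seeing the typeface give conflicting answers. A minimal sketch of item generation under that assumption; the font names are placeholders, not the benchmark's 15-font list, and the real task's prompts and rendering are not reproduced here.

```python
import random

# Placeholder font names for illustration; not the benchmark's actual fonts.
FONTS = ["Arial", "Courier New", "Garamond"]

def stroop_items(rng, n, p_congruent=0.5):
    """Yield (word, rendering_font, congruent) triples: the displayed word is a
    font name that either matches the actual rendering font or deliberately
    conflicts with it."""
    items = []
    for _ in range(n):
        font = rng.choice(FONTS)
        if rng.random() < p_congruent:
            word = font  # congruent: the text names its own typeface
        else:
            word = rng.choice([f for f in FONTS if f != font])  # conflicting name
        items.append((word, font, word == font))
    return items

congruent = stroop_items(random.Random(0), 20, p_congruent=1.0)
conflicting = stroop_items(random.Random(0), 20, p_congruent=0.0)
```

A model that answers from the written word rather than the rendered shapes will score well on congruent items and collapse on conflicting ones, which is the signature the FRB table above shows between easy and hard splits.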

Analysis

The failure is robust, structured, and visible across multiple probes.

Figure showing resolution ablation and robustness under corruptions.
Resolution sensitivity and corruption robustness reveal a capability-fragility trade-off.
Stroop effect figure on the FRB benchmark.
FRB cross-benchmark evaluation confirms the typographic Stroop effect across a wider model set.
Attention heatmaps showing failure modes in font family, size, and style recognition.
Attention analysis suggests family, size, and style fail for different mechanistic reasons rather than a single generic weakness.

Resources

Use the paper, code, benchmark metadata, and citation directly from this page.

BibTeX

@misc{zhou2026readingneqseeingdiagnosing,
  title={Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models},
  author={Heng Zhou and Ao Yu and Li Kang and Yuchen Fan and Yutao Fan and Xiufeng Song and Hejia Geng and Yiran Qin},
  year={2026},
  eprint={2603.08497},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08497},
}