
Revealing the Bottleneck of Visual Counting in Vision-Language Models

2026-05-28 Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan (ETH Zurich) 程思齐Sophie

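This paper studies the systematic-generalization bottleneck of large vision-language models (VLMs) on visual counting. The authors decompose visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping.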

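Core claim: VLMs interpolate well but fail catastrophically at systematic generalization, most notably in visual counting. The failure does not come from weak perception or a weak understanding of quantity; it comes from an inability to map visually perceived quantities onto symbolic labels.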

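Method: The authors build a synthetic laboratory from Go boards and linear probes, with a strictly controlled training distribution and a decoupled training curriculum: the model is trained on visual counting only up to 49, while its language decoder is trained on textual counting up to 99. They also validate the findings on Qwen3-VL, a state-of-the-art VLM.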
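The decoupled curriculum splits ground-truth counts into regimes. The minimal sketch below uses the boundaries quoted in the paper excerpt on this page (visual training up to 49, textual training up to 99, evaluation up to 120). The function name, the constant names, and the label for counts above 120 are illustrative assumptions, not part of the paper.

```python
# Boundaries taken from the paper excerpt; all names are hypothetical.
VISUAL_TRAIN_MAX = 49   # highest quantity seen with visual-text pairing
TEXT_TRAIN_MAX = 99     # highest quantity the decoder counted in text
EVAL_MAX = 120          # upper end of the full-extrapolation range


def counting_regime(n: int) -> str:
    """Return an illustrative regime label for a ground-truth count n."""
    if n < 0:
        raise ValueError("count must be non-negative")
    if n <= VISUAL_TRAIN_MAX:
        return "in-distribution"
    if n <= TEXT_TRAIN_MAX:
        # The label exists in text, but no image was ever paired with it.
        return "visual extrapolation"
    if n <= EVAL_MAX:
        # Neither the visual density nor the textual quantity was seen.
        return "full extrapolation"
    return "outside evaluated range"


for n in (10, 49, 50, 99, 100, 120):
    print(n, counting_regime(n))
```

Tagging each evaluation example this way makes it possible to report accuracy per regime instead of one pooled number.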

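Key data and findings:

Baseline paradox: The model counts perfectly in text, but in visual counting its accuracy drops to 0% once the quantity exceeds the training boundary.
Hypothesis tests:
- Hypothesis A (visual individuation stage): The authors apply linear probes to the visual encoder's output to read out the hidden quantity (NH). NH stays linearly separable across the whole range, so the model does individuate objects and perceptual failure is ruled out.
- Hypothesis B (magnitude awareness stage): The authors compare results on the counting task with results on a comparison task (judging whether two sets of objects contain the same number). The model can reason with abstract quantity information even outside the training range, which shows that it retains magnitude awareness.
- Hypothesis C (symbolic mapping stage): By analyzing error topology and circuit differences, the authors find that the error distribution in visual counting is non-Gaussian and that the attention heads used for visual counting and for textual counting are almost entirely disjoint. This indicates that the model cannot map visually perceived quantities onto symbolic labels, supporting the "fractured magnitude" hypothesis.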
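To make the linear-probe idea in Hypothesis A concrete, here is a minimal sketch on synthetic data. It is not the authors' probe or their data: `make_features` is a hypothetical stand-in for an encoder output in which quantity lies along a fixed direction plus small noise, and the training range starting at 1 is an assumption. Because the features are linear by construction, the result only shows what "linearly decodable beyond the training range" means; it is not evidence about any real model.

```python
import random


def make_features(n, rng, noise=0.01):
    """Hypothetical encoder output: quantity n along a fixed direction, plus noise."""
    direction = (0.5, -0.2, 0.1)
    return [n * d + rng.gauss(0.0, noise) for d in direction]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def fit_linear_probe(features, targets):
    """Least-squares fit of targets ~ w . h + b via the normal equations."""
    rows = [h + [1.0] for h in features]  # append a bias column
    dim = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    moment = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(dim)]
    return solve(gram, moment)


def probe_predict(weights, h):
    """Apply the fitted probe (weights plus bias) to one feature vector."""
    return sum(w * x for w, x in zip(weights, h + [1.0]))


rng = random.Random(0)
train_counts = list(range(1, 50))    # visually seen quantities (lower bound assumed)
test_counts = list(range(50, 100))   # visual extrapolation regime
weights = fit_linear_probe([make_features(n, rng) for n in train_counts], train_counts)
errors = [abs(probe_predict(weights, make_features(n, rng)) - n) for n in test_counts]
max_error = max(errors)
print(f"max absolute probe error on 50-99: {max_error:.3f}")
```

A probe error that stays below 0.5 means the rounded readout recovers every held-out count, which is the sense in which a representation can carry quantity information even when the decoder fails to name it.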

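Conclusion: VLMs lack a universal number space that connects modalities. Instead they learn two separate, modality-specific statistical manifolds, so they cannot map visually perceived quantities onto symbolic labels. Closing this generalization gap therefore requires inductive priors that enforce a unified representation, not just more data.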

Xingzhou Pang*¹, Yifan Hou*¹, Junling Wang¹, Mrinmaya Sachan¹
{xingzhou.pang, yifan.hou, junling.wang, mrinmaya.sachan}@inf.ethz.ch
arXiv:2605.30170v1 [cs.MM] 28 May 2026

Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

1. Introduction

Systematic generalization, the ability to learn a rule from finite examples and apply it to inputs outside the training distribution, remains the central chasm between biological and artificial intelligence (Fodor & Pylyshyn, 1988; Marcus, 2003; Lake & Baroni, 2018; Hupkes et al., 2020). On the one hand, while Large Vision-Language Models (VLMs) have demonstrated impressive proficiency in describing visual scenes and solving visual reasoning (OpenAI, 2023; Team, 2023; 2025; Bai et al., 2025), others claim they are mere statistical interpolators, often excelling within the support of their training distribution but are error-prone when required to extrapolate (Thrush et al., 2022a; Yüksekgönül et al., 2023b; Paiss et al., 2023). This limitation is most clearly evident in the task of visual counting.

Counting serves as a canonical testbed for reasoning as it isolates the problem of extrapolation in its purest form. Grounded in a simple recursive algorithm (n → n + 1), counting allows humans to enumerate arbitrary quantities zero-shot once the principles of cardinality are acquired (Dedekind, 1965; Carey, 2000; Dehaene, 2011; Piantadosi et al., 2012). In contrast, neural models treat counting as a pattern-matching problem, degrading catastrophically when object quantities exceed those observed during training (Wallace et al., 2019; Bender & Koller, 2020; Press et al., 2022; Anil et al., 2022). This failure raises a fundamental diagnostic question: Does the failure to count stem from an inability to perceive distinct objects, an inability to comprehend quantity, or an inability to map that quantity to a label?

We investigate the bottleneck of visual counting in VLMs by deconstructing it into three cognitive stages (Fig. 1): visual individuation (recognition), magnitude awareness (numerosity), and symbolic mapping (articulation). To rigorously isolate architectural biases from the noise inherent in real-world datasets, we construct a synthetic laboratory using Go game boards (§ 2.1). Crucially, we employ a decoupled training curriculum similar to VLM training: the model is trained to visually count only up to N = 49, while its language decoder is pretrained to count textually up to N = 99. This creates a specific visual extrapolation regime (50–99), where the model possesses the linguistic labels but with a lack of visual-textual pairings, and a full extrapolation scenario (100–120) where both the visual densities and the textual quantities are unseen. We further validate our findings on a state-of-the-art VLM (Qwen3-VL) to confirm that the observed failure mechanisms persist in real-world architectures regardless of pretraining scale (§ 2.2).

2. Experimental Framework

To rigorously investigate the visual counting bottleneck, we adopt a two-fold experimental design. First, we construct a synthetic laboratory using a custom VLM trained from scratch, allowing us to strictly control the training distribution and decouple visual exposure from textual priors (§ 2.1). Second, we perform a real-world validation using a state-of-the-art pretrained VLM to confirm that our findings hold in foundation models trained at scale (§ 2.2).

2.1. Study 1: The Synthetic Laboratory

Architecture. For initial analysis, we train a lightweight "Toy VLM" composed of standard architectural primitives: a vision Transformer encoder (ViT-Base configuration but only with 2 layers) for visual perception (Dosovitskiy et al., 2021), connected to a causal Transformer decoder (GPT-2 style, 2 layers) for text generation (Radford et al., 2019). This mirrors the architectural
