
Revealing the Bottleneck of Visual Counting in Vision-Language Models

2026-05-28 Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan, ETH Zurich

Xingzhou Pang* 1, Yifan Hou* 1, Junling Wang 1, Mrinmaya Sachan 1
{xingzhou.pang, yifan.hou, junling.wang, mrinmaya.sachan}@inf.ethz.ch

arXiv:2605.30170v1 [cs.MM] 28 May 2026

Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

1. Introduction

Systematic generalization, the ability to learn a rule from finite examples and apply it to inputs outside the training distribution, remains the central chasm between biological and artificial intelligence (Fodor & Pylyshyn, 1988; Marcus, 2003; Lake & Baroni, 2018; Hupkes et al., 2020). On the one hand, while Large Vision-Language Models (VLMs) have demonstrated impressive proficiency in describing visual scenes and solving visual reasoning (OpenAI, 2023; Team, 2023; 2025; Bai et al., 2025), others claim they are mere statistical interpolators, often excelling within the support of their training distribution but are error-prone when required to extrapolate (Thrush et al., 2022a; Yüksekgönül et al., 2023b; Paiss et al., 2023). This limitation is most clearly evident in the task of visual counting.

Counting serves as a canonical testbed for reasoning as it isolates the problem of extrapolation in its purest form. Grounded in a simple recursive algorithm (n → n + 1), counting allows humans to enumerate arbitrary quantities zero-shot once the principles of cardinality are acquired (Dedekind, 1965; Carey, 2000; Dehaene, 2011; Piantadosi et al., 2012). In contrast, neural models treat counting as a pattern-matching problem, degrading catastrophically when object quantities exceed those observed during training (Wallace et al., 2019; Bender & Koller, 2020; Press et al., 2022; Anil et al., 2022). This failure raises a fundamental diagnostic question: Does the failure to count stem from an inability to perceive distinct objects, an inability to comprehend quantity, or an inability to map that quantity to a label?

We investigate the bottleneck of visual counting in VLMs by deconstructing it into three cognitive stages (Fig. 1): visual individuation (recognition), magnitude awareness (numerosity), and symbolic mapping (articulation). To rigorously isolate architectural biases from the noise inherent in real-world datasets, we construct a synthetic laboratory using Go game boards (§ 2.1). Crucially, we employ a decoupled training curriculum similar to VLM training: the model is trained to visually count only up to N = 49, while its language decoder is pretrained to count textually up to N = 99. This creates a specific visual extrapolation regime (50−99), where the model possesses the linguistic labels but with a lack of visual-textual pairings, and a full extrapolation scenario (100−120) where both the visual densities and the textual quantities are unseen.
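The count ranges of the decoupled curriculum can be summarized in a few lines of Python. This is only an illustrative sketch, not the authors' code: the boundaries 49, 99, and 120 come from the text, while the function name `regime`, the label "in-distribution", and the assumption that counts start at 0 are hypothetical.

```python
# Illustrative sketch of the count regimes described in the text.
# The boundaries (49, 99, 120) are taken from the text; all names and the
# lower bound of 0 are assumptions made for this example.
VISUAL_TRAIN_MAX = 49   # largest count trained with visual-textual pairings
TEXT_TRAIN_MAX = 99     # largest count the language decoder saw as text
EVAL_MAX = 120          # largest count in the full extrapolation scenario


def regime(n: int) -> str:
    """Return the regime that an object count n falls into."""
    if n < 0 or n > EVAL_MAX:
        raise ValueError(f"count {n} is outside the assumed range 0..{EVAL_MAX}")
    if n <= VISUAL_TRAIN_MAX:
        return "in-distribution"        # seen both visually and textually
    if n <= TEXT_TRAIN_MAX:
        return "visual extrapolation"   # label known, no visual pairing seen
    return "full extrapolation"         # unseen in both modalities


if __name__ == "__main__":
    for n in (10, 49, 50, 99, 100, 120):
        print(n, regime(n))
```

Writing the split this way makes the diagnostic value of the middle regime explicit: for counts 50 to 99 the decoder already knows the number tokens, so an enumeration failure there cannot be blamed on a missing label.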
We further validate our findings on a state-of-the-art VLM (Qwen3-VL) to confirm that the observed failure mechanisms persist in real-world architectures regardless of pretraining scale (§ 2.2).

2. Experimental Framework

To rigorously investigate the visual counting bottleneck, we adopt a two-fold experimental design. First, we construct a synthetic laboratory using a custom VLM trained from scratch, allowing us to strictly control the training distribution and decouple visual exposure from textual priors (§ 2.1). Second, we perform a real-world validation using a state-of-the-art pretrained VLM to confirm that our findings hold in foundation models trained at scale (§ 2.2).

2.1. Study 1: The Synthetic Laboratory

Architecture. For initial analysis, we train a lightweight "Toy VLM" composed of standard architectural primitives: a vision Transformer encoder (ViT-Base configuration but only with 2 layers) for visual perception (Dosovitskiy et al., 2021), connected to a causal Transformer decoder (GPT-2 style, 2 layers) for text generation (Radford et al., 2019). This mirrors the architectural