Xingzhou Pang*1, Yifan Hou*1, Junling Wang1, Mrinmaya Sachan1
{xingzhou.pang, yifan.hou, junling.wang, mrinmaya.sachan}@inf.ethz.ch

arXiv:2605.30170v1 [cs.MM] 28 May 2026

Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on a state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

1. Introduction

Systematic generalization, the ability to learn a rule from finite examples and apply it to inputs outside the training distribution, remains the central chasm between biological and artificial intelligence (Fodor & Pylyshyn, 1988; Marcus, 2003; Lake & Baroni, 2018; Hupkes et al., 2020). Large Vision-Language Models (VLMs) have demonstrated impressive proficiency in describing visual scenes and solving visual reasoning tasks (OpenAI, 2023; Team, 2023; 2025; Bai et al., 2025). Others, however, argue that they are mere statistical interpolators: they excel within the support of their training distribution but become error-prone when required to extrapolate (Thrush et al., 2022a; Yüksekgönül et al., 2023b; Paiss et al., 2023). This limitation is most clearly evident in the task of visual counting.

Counting serves as a canonical testbed for reasoning because it isolates the problem of extrapolation in its purest form. Grounded in a simple recursive algorithm (n → n + 1), counting allows humans to enumerate arbitrary quantities zero-shot once the principles of cardinality are acquired (Dedekind, 1965; Carey, 2000; Dehaene, 2011; Piantadosi et al., 2012). In contrast, neural models treat counting as a pattern-matching problem, degrading catastrophically when object quantities exceed those observed during training (Wallace et al., 2019; Bender & Koller, 2020; Press et al., 2022; Anil et al., 2022). This failure raises a fundamental diagnostic question: Does the failure to count stem from an inability to perceive distinct objects, an inability to comprehend quantity, or an inability to map that quantity to a label?

We investigate the bottleneck of visual counting in VLMs by deconstructing it into three cognitive stages (Fig. 1): visual individuation (recognition), magnitude awareness (numerosity), and symbolic mapping (articulation). To rigorously isolate architectural biases from the noise inherent in real-world datasets, we construct a synthetic laboratory using Go game boards (§ 2.1). Crucially, we employ a decoupled training curriculum similar to VLM training: the model is trained to visually count only up to N = 49, while its language decoder is pretrained to count textually up to N = 99. This creates a specific visual extrapolation regime (50–99), where the model possesses the linguistic labels but lacks visual-textual pairings, and a full extrapolation scenario (100–120), where both the visual densities and the textual quantities are unseen.
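As an illustrative sketch (not the authors' code), the count regimes implied by this curriculum can be written as a simple partition of the evaluated range. All function and constant names below are hypothetical, the label "interpolation" for the visually trained range is our naming, and the lower bound of 1 is an assumption, since the text specifies only the upper limits.

```python
from collections import Counter

# Hypothetical constants; only the upper limits come from the text.
MIN_COUNT = 1     # ASSUMPTION: smallest count considered (not stated in the text)
VISUAL_MAX = 49   # largest count trained with paired image-text supervision
TEXT_MAX = 99     # largest count seen by the language decoder in text-only pretraining
EVAL_MAX = 120    # largest count in the full extrapolation scenario


def counting_regime(n: int) -> str:
    """Return the regime a count n falls into under the decoupled curriculum."""
    if not MIN_COUNT <= n <= EVAL_MAX:
        raise ValueError(f"count {n} is outside [{MIN_COUNT}, {EVAL_MAX}]")
    if n <= VISUAL_MAX:
        # Both the visual quantity and its textual label were seen in training.
        return "interpolation"
    if n <= TEXT_MAX:
        # The label exists in the decoder's vocabulary of counts,
        # but no image was ever paired with it.
        return "visual_extrapolation"
    # Neither the visual density nor the textual quantity was seen.
    return "full_extrapolation"


def regime_sizes() -> Counter:
    """Count how many distinct quantities fall into each regime."""
    return Counter(counting_regime(n) for n in range(MIN_COUNT, EVAL_MAX + 1))
```

Under these assumptions, `regime_sizes()` yields 49 interpolation counts, 50 visual extrapolation counts, and 21 full extrapolation counts. Keeping the two thresholds as separate constants mirrors the decoupling described above: the visual and textual limits can be moved independently.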
We further validate our findings on a state-of-the-art VLM (Qwen3-VL) to confirm that the observed failure mechanisms persist in real-world architectures regardless of pretraining scale (§ 2.2).

2. Experimental Framework

To rigorously investigate the visual counting bottleneck, we adopt a two-fold experimental design. First, we construct a synthetic laboratory using a custom VLM trained from scratch, allowing us to strictly control the training distribution and decouple visual exposure from textual priors (§ 2.1). Second, we perform a real-world validation using a state-of-the-art pretrained VLM to confirm that our findings hold in foundation models trained at scale (§ 2.2).

2.1. Study 1: The Synthetic Laboratory

Architecture. For initial analysis, we train a lightweight "Toy VLM" composed of standard architectural primitives: a Vision Transformer encoder (ViT-Base configuration, but with only 2 layers) for visual perception (Dosovitskiy et al., 2021), connected to a causal Transformer decoder (GPT-2 style, 2 layers) for text generation (Radford et al., 2019). This mirrors the architectural