
Malika Aubakirova∗†, Alex Atallah‡, Chris Clark‡, Justin Summerville‡, and Anjney Midha†
‡OpenRouter Inc.   †a16z (Andreessen Horowitz)
December 2025

Abstract

The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5, 2024, the field shifted from single-pass pattern generation to multi-step deliberative inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, an AI inference provider spanning a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of the creative-roleplay and coding-assistance categories (beyond just the productivity tasks many assume dominate), and the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than that of later cohorts. We term this phenomenon the Cinderella “Glass Slipper” effect. These findings underscore that the way developers and end-users engage with LLMs “in the wild” is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.

1 Introduction

Just a year ago, the landscape of large language models looked fundamentally different. Prior to late 2024, state-of-the-art systems were dominated by single-pass, autoregressive predictors optimized to continue text sequences. Several precursor efforts attempted to approximate reasoning through advanced instruction following and tool use. For instance, Anthropic’s Sonnet 2.1 & 3 models excelled at sophisticated tool use and Retrieval-Augmented Generation (RAG), and Cohere’s Command R models incorporated structured tool-planning tokens. Separately, open-source projects such as Reflection explored supervised chain-of-thought and self-critique loops during training. Although these advanced techniques produced reasoning-like outputs and superior instruction following, the fundamental inference procedure remained a single forward pass, emitting a surface-level trace learned from data rather than performing iterative, internal computation.

This paradigm evolved on December 5, 2024, when OpenAI released the first full version of its o1 reasoning model (codenamed Strawberry) [4]. The preview released on September 12, 2024 had already indicated a departure from conventional autoregressive inference. Unlike prior systems, o1 employed an expanded inference-time computation process involving internal multi-step deliberation, latent planning, and iterative refinement before generating a final output. Empirically, this enabled systematic improvements in mathematical reasoning, logical consistency, and multi-step decision-making, reflecting a shift from pattern completion to structured internal cognition. In retrospect, last year marked the field’s true inflection point: earlier approaches gestured toward reasoning, but o1 introduced the first generally deployed architecture that performed reasoning through deliberate multi-stage computation rather than merely describing it [6, 7].
While recent advances in LLM capabilities have been widely documented, systematic evidence about how these models are actually used in practice remains limited [3, 5]. Existing accounts tend to emphasize qualitative demonstrations or benchmark performance rather than large-scale behavioral data. To bridge this gap, we undertake an empirical study of LLM usage, leveraging a 100-trillion-token dataset from OpenRouter, a multi-model AI inference platform that serves as a hub for diverse LLM queries.

OpenRouter’s vantage point provides a unique window into fine-grained usage patterns. Because it orchestrates requests across a wide array of models (spanning both closed-source APIs and open-weight deployments), OpenRouter captures a representative cross-section of how developers and end-users actually invoke language models for various tasks. By analyzing this rich dataset, we can observe which models are chosen for which tasks, how usage varies across geographic regions and over time, and how external factors such as pricing or new model launches influence behavior.

In this paper, we draw inspiration from prior empirical studies of AI adoption, including Anthropic’s economic-impact and usage analyses [1] and OpenAI’s report How People Use ChatGPT [2], aiming for a neutral, evidence-driven discussion. We first describe our dataset and methodology, including how we categorize tasks and models. We then delve into a series of analyses that illuminate d
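To make the kind of aggregation described above concrete, the sketch below shows how monthly token share by model, and token share by task category within each region, could be computed from per-request logs. This is an illustrative example only: the column names (timestamp, model, task_category, country, total_tokens) and the toy rows are hypothetical placeholders rather than OpenRouter’s actual schema, and the paper’s real categorization methodology is described in the sections that follow.

# Illustrative sketch only; schema and values are hypothetical, not OpenRouter's.
import pandas as pd

# Hypothetical per-request log: one row per inference request.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-12-05", "2025-01-14", "2025-01-20"]),
    "model": ["openai/o1", "meta-llama/llama-3.3-70b", "openai/o1"],
    "task_category": ["coding", "roleplay", "coding"],
    "country": ["US", "DE", "US"],
    "total_tokens": [12_000, 3_500, 8_200],
})

# Monthly token share per model: which models are chosen, and how the mix
# shifts over time (e.g., around a new model launch or a price change).
monthly = (
    logs.assign(month=logs["timestamp"].dt.to_period("M"))
        .groupby(["month", "model"])["total_tokens"]
        .sum()
)
model_share = monthly / monthly.groupby(level="month").transform("sum")

# Token share per task category within each region, for the geographic cut.
task_by_region = (
    logs.groupby(["country", "task_category"])["total_tokens"]
        .sum()
        .groupby(level="country")
        .transform(lambda s: s / s.sum())
)

print(model_share)
print(task_by_region)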