Together AI, published Feb 6, 2026

What do LLMs think when you don't tell them what to think about?

Research

Published 2/6/2026

Authors

Yongchan Kwon and James Zou

Summary

Our interactions with large language models (LLMs) are dominated by task- or topic-specific questions, such as “solve this coding problem” or “what is democracy?” This framing strongly shapes how LLMs generate responses, constraining the range of behaviors and knowledge that can be observed. We study the behavior of LLMs using minimal, topic-neutral prompts. Despite the absence of explicit task or topic specification, LLMs generate diverse content; however, each model family exhibits distinct topical preferences. GPT-OSS favors programming and math, Llama leans literary, DeepSeek often produces religious content, and Qwen tends toward multiple-choice questions. We further observe that generations in this setting can degenerate into repetitive or meaningless outputs, revealing model-specific quirks, such as Llama generating personal social media URLs.

Paper: https://arxiv.org/abs/2602.01689
Project Page: https://tinyurl.com/mzr5cckz

What do we study?

Most LLM behavior analyses are constrained: the prompt heavily shapes what the model can say [1,2]. This is useful, but it is also deeply limiting. Prompts act like filters. A math question forces mathematical reasoning; a chat template forces the model into an assistant persona. Large parts of the model’s generative space are never exercised at all. What we observe is not the model’s natural behavior, but the behavior induced by our instructions.

So we ask a more basic question: What does a language model generate when you do not tell it what to generate?

To answer this, we study near-unconstrained generation. We use topic-neutral, open-ended seed prompts such as “Actually,” “Let’s think step by step,” or even just punctuation like “.” We remove chat templates entirely (no system prompt, no roles) and use standard decoding. This setup approximates the model’s top-of-mind behavior: a glimpse of its learned generative prior before alignment and prompting take over.

Why this matters

If you care about model auditing, behavioral monitoring, LLM fingerprinting, or safety and privacy risks, conditional benchmarks alone are not enough. We need to look at raw model generations and see how LLMs behave under diverse inputs and prompting conditions. As an underexplored but scientifically intriguing setting, near-unconstrained generation allows us to observe what models prefer to talk about, what kinds of content they over-represent, and how they fail when fixed chat templates and special tags are removed. Crucially, these signals turn out to be systematic rather than anecdotal.

Results: What do LLMs think when you don't tell them what to think about?

Result 1: Model families have distinct knowledge priors

Figure 1. Embedding visualization of near-unconstrained LLM generations. Each point represents a generated sample; colors denote semantic categories inferred post-hoc; dotted lines denote a high-density region. Even without explicit topics, model outputs cluster into clear regions, and each model family shows different semantic preferences. An interactive figure is at https://tinyurl.com/mzr5cckz.

As shown in Figure 1(a), despite the lack of explicit instructions or topics in the prompts, LLMs generate a broad range of topics. The outputs span many categories, including the liberal arts (e.g., literature, philosophy, and education), science and engineering (e.g., physics, mathematics, and programming), as well as areas such as law, finance, music, sports, cooking, agriculture, archaeology, military, and fashion.

More surprisingly, as shown in Figure 1(b), different model families gravitate toward different parts of the semantic space, even when given the same minimal prompts. GPT-OSS overwhelmingly defaults to programming (27.1%) and mathematics (24.6%). More than half of this model family’s output concentrates in these two domains! Llama produces far more literary and narrative text (9.1%), with less emphasis on technical domains (see Figure 2). DeepSeek generates religious content at a substantially higher rate than other families. Qwen frequently outputs multiple-choice exam questions, complete with answer options.
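The per-family percentages above are category shares: the fraction of a family's samples that carry a given semantic label. For GPT-OSS, 27.1% + 24.6% = 51.7%, which is where "more than half" comes from. The following is a minimal sketch of that bookkeeping, not the authors' pipeline. The function name `category_shares`, the family names, and the toy labels are all hypothetical and invented for illustration; none of the numbers are data from the paper.

```python
from collections import Counter, defaultdict


def category_shares(samples):
    """Return {family: {category: share}} from (family, category) pairs.

    Shares within each family sum to 1.0.
    """
    counts = defaultdict(Counter)
    for family, category in samples:
        counts[family][category] += 1

    shares = {}
    for family, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders categories from most to least frequent
        shares[family] = {cat: n / total for cat, n in counter.most_common()}
    return shares


# Toy labels, invented for illustration only (not data from the paper).
toy_samples = (
    [("family_a", "programming")] * 5
    + [("family_a", "mathematics")] * 3
    + [("family_a", "literature")] * 2
    + [("family_b", "literature")] * 6
    + [("family_b", "programming")] * 4
)

shares = category_shares(toy_samples)

# Combined share of the two most frequent categories for one family
top_two_a = sum(sorted(shares["family_a"].values(), reverse=True)[:2])
print(shares["family_a"], top_two_a)
```

In practice the `(family, category)` pairs would come from whatever semantic labeler is applied to the generated samples; the sketch assumes one label per sample.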

Figure 2. Top semantic categories by model family under near-unconstrained generation. Each family exhibits a stable and interpretable topic distribution.

What is striking here is consistency. These distributions persist across different prompts, embedding models, and semantic labelers. The behavior looks less like noise and more like a population-level fingerprint. Figure 3 shows representative examples.
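One way to make the "fingerprint" intuition concrete is to compare topic distributions directly: if a family's distribution is stable, the distance between its distributions under two different prompts should be small relative to the distance between two different families. The sketch below uses total variation distance for this. That metric is an assumption chosen for simplicity, not the measure used in the paper, and the distributions are toy values invented for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two categorical distributions.

    p and q map category -> probability; missing categories count as 0.
    The result lies in [0, 1], with 0 meaning identical distributions.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy distributions, invented for illustration only (not data from the paper).
family_a_prompt1 = {"programming": 0.50, "mathematics": 0.30, "literature": 0.20}
family_a_prompt2 = {"programming": 0.45, "mathematics": 0.35, "literature": 0.20}
family_b_prompt1 = {"programming": 0.20, "mathematics": 0.10, "literature": 0.70}

within = total_variation(family_a_prompt1, family_a_prompt2)   # same family, different prompt
between = total_variation(family_a_prompt1, family_b_prompt1)  # different families, same prompt
print(within, between)
```

A within-family distance that stays well below the between-family distance is what a population-level fingerprint would look like under this particular measure.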

Figure 3. Examples of model outputs under unconstrained settings. More examples are available in the paper.

Result 2: Depth is part of the prior too

The differences are not only about what models talk about and how often, but also about how deep they go. We assess the complexity of programming and mathematical text in Figure 4. We find that GPT-OSS frequently produces advanced or expert-level content (68.2%), such as depth-first search, breadth-first search, or dynamic programming (see Figure 3). Llama and Qwen skew much more toward basic or intermediate material. These depth differences remain even when controlling for labeling models and evaluation setups.

Figure 4. Distribution of difficulty levels in mathematics and programming text. Difficulty labels are annotated by Claude‑4.5‑Opus. Advanced and expert-level content appears far more often in GPT-OSS outputs.

Result 3: Degenerate text is a signal, not just noise

Lastly, following previous work [3,4], we also observe that models sometimes fall into repetitive or degenerate patterns, especially when constraints are removed. This behavior is usually discarded as gibberish. We treated it as data.…
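Repetitive degeneration of the kind described above can be flagged with a simple heuristic before any closer analysis. The sketch below measures the fraction of word n-grams that repeat an earlier n-gram in the same text. This is an illustrative assumption, not the detection method used in the paper (the excerpt does not describe one), and the function name `repeated_ngram_fraction` and the example strings are hypothetical.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text.

    Returns 0.0 for text with no repeated n-grams (or fewer than n words)
    and approaches 1.0 as the text collapses into a loop.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Toy strings, invented for illustration only.
looping = "the cat sat " * 10
varied = "a quick brown fox jumps over the lazy dog near a quiet river"

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

Any cutoff for calling a sample degenerate would need to be tuned on real generations; a word-level measure like this also misses character-level loops and meaningless but non-repeating text.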
