Reasoning Models and Deep Reasoning in LLMs: Chain-of-Thought, Tree of Thoughts, and Test-Time Compute

Language models don’t reason. Not in the way humans do. They predict the next token based on patterns learned from training data. But something interesting happens when you force them to show their work: the outputs get dramatically better. Not because the model suddenly “thinks” — but because the structure of the prompt shapes the computation in ways that produce more accurate results.

This post covers the three major strategies for eliciting reasoning behavior from LLMs: Chain-of-Thought prompting, Tree of Thoughts, and Test-Time Compute Scaling. These are not incremental prompt tricks. They represent a shift in how we architect interactions with language models — from single-shot question-answer to structured, multi-step inference pipelines.


Chain-of-Thought Prompting: Forcing the Model to Show Its Work

Chain-of-Thought (CoT) prompting was introduced by Wei et al. at Google Research in 2022. The idea is deceptively simple: instead of asking the model for a final answer directly, you provide examples that include intermediate reasoning steps — and the model learns to generate its own.

How It Works

Standard prompting:

Q: If a store has 23 apples and sells 17, how many remain?
A: 6

Chain-of-Thought prompting:

Q: If a store has 23 apples and sells 17, how many remain?
A: The store starts with 23 apples. It sells 17. 23 - 17 = 6. The store has 6 apples remaining.

The difference looks trivial. The performance difference is not.
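To make this concrete, here is a minimal sketch of assembling a few-shot CoT prompt in Python. The `generate` call, the worked example, and the exact formatting are illustrative assumptions rather than anything prescribed by the paper; swap in whatever completion API you actually use.

```python
# Minimal few-shot Chain-of-Thought prompt assembly (illustrative sketch).
# `generate(prompt: str) -> str` is a placeholder for your completion API.

COT_EXAMPLES = [
    {
        "question": "If a store has 23 apples and sells 17, how many remain?",
        "reasoning": "The store starts with 23 apples. It sells 17. 23 - 17 = 6.",
        "answer": "6",
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples, including their reasoning steps, before the new question."""
    parts = []
    for ex in COT_EXAMPLES:
        parts.append(f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}.")
    parts.append(f"Q: {question}\nA:")  # the model continues with its own reasoning chain
    return "\n\n".join(parts)

prompt = build_cot_prompt("A train travels 60 km/h for 2.5 hours. How far does it travel?")
# completion = generate(prompt)  # expected to contain intermediate steps, then a final answer
```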

Why It Works

When the model generates intermediate steps, it effectively decomposes a complex problem into simpler sub-problems that it can solve sequentially. Each intermediate token generated becomes part of the context for the next prediction. The model doesn’t “plan” — it creates a chain of computations where each step constrains and informs the next.

Wei et al. demonstrated that CoT prompting with PaLM (540B parameters) achieved state-of-the-art accuracy on the GSM8K math benchmark, surpassing even fine-tuned GPT-3 with a verifier. The gains were significant across arithmetic reasoning, commonsense reasoning, and symbolic reasoning tasks.

The Critical Caveat: Scale Dependency

CoT prompting only works reliably in large models. In smaller models (below roughly 100B parameters), chain-of-thought prompting often produces plausible-looking but incorrect reasoning chains. The model generates steps that look logical but contain errors — and because the steps look coherent, these errors are harder to detect than a simple wrong answer.

This is an important architectural consideration: if you’re building a system that relies on CoT reasoning, model size is not optional. Using CoT with an undersized model doesn’t just degrade gracefully — it can actively mislead.


Self-Consistency: Majority Voting Over Reasoning Paths

A natural extension of CoT, introduced by Wang et al. at Google Brain (ICLR 2023), is Self-Consistency. The insight: for any complex problem, there are usually multiple valid reasoning paths that arrive at the same correct answer.

How It Works

  1. Sample multiple reasoning paths. Instead of generating a single chain-of-thought with greedy decoding, sample 5, 10, or 40 diverse reasoning chains using temperature sampling
  2. Extract the final answer from each chain. Ignore the intermediate reasoning — just collect the answers
  3. Majority vote. The most common answer across all sampled chains is selected as the final output (a code sketch of this procedure follows the list)
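A minimal sketch of those three steps, assuming a hypothetical `sample_chain` function that returns one temperature-sampled CoT completion and a deliberately naive answer extractor (neither is the paper's implementation):

```python
import re
from collections import Counter

def sample_chain(question: str, temperature: float = 0.7) -> str:
    """Placeholder: one temperature-sampled chain-of-thought completion from your model."""
    raise NotImplementedError

def extract_answer(chain: str) -> str | None:
    """Naive extraction: treat the last number in the chain as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def self_consistency(question: str, n_samples: int = 10) -> str | None:
    answers = []
    for _ in range(n_samples):
        chain = sample_chain(question)    # step 1: sample a diverse reasoning path
        answer = extract_answer(chain)    # step 2: keep only the final answer
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # step 3: majority vote
```

In practice the extraction step is task-specific; the voting logic is the whole trick.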

Why It Matters

Self-Consistency treats the reasoning chain as a stochastic process rather than a deterministic one. Any single chain might contain errors. But if you sample enough chains, the correct answer tends to appear more frequently than any specific incorrect answer — because there are many ways to reason correctly, but errors tend to be more random and distributed.

The empirical results are substantial: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA. These are large gains from a technique that requires no additional training — only more inference-time computation.

The trade-off is direct: you’re spending N times the compute for significantly higher accuracy. Whether that trade-off is worth it depends on the cost of being wrong.


Tree of Thoughts: Deliberate Search Over Reasoning Space

Chain-of-Thought is linear. You generate one chain, step by step, left to right. If a reasoning step goes wrong early, everything downstream is compromised. There’s no backtracking, no exploration of alternatives.

Tree of Thoughts (ToT), introduced by Yao et al. at Princeton (NeurIPS 2023), addresses this by turning reasoning into a search problem.

How It Works

Instead of generating a single linear chain, ToT:

  1. Decomposes the problem into intermediate “thoughts” — coherent reasoning units (a sentence, a paragraph, a partial solution)
  2. Generates multiple candidate thoughts at each step — branching the reasoning tree
  3. Evaluates each candidate — using the model itself to assess which thoughts are most promising
  4. Searches the tree — using breadth-first search (BFS) or depth-first search (DFS) to explore the most promising paths
  5. Backtracks when needed — abandoning dead-end reasoning paths and exploring alternatives (a skeleton of the search loop follows the list)
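Here is the BFS variant in skeletal form. `propose_thoughts` and `score_thought` stand in for model calls; the branching factor, beam width, and depth are illustrative defaults, not the paper's exact settings.

```python
def propose_thoughts(state: str, k: int) -> list[str]:
    """Placeholder: ask the model for k candidate next 'thoughts' given the partial solution."""
    raise NotImplementedError

def score_thought(state: str) -> float:
    """Placeholder: ask the model how promising this partial solution is (higher is better)."""
    raise NotImplementedError

def tree_of_thoughts_bfs(problem: str, depth: int = 3, branch: int = 3, beam: int = 2) -> str:
    """Breadth-first Tree of Thoughts: expand, score, and keep only the best partial solutions."""
    frontier = [problem]  # each state is the problem text plus the thoughts generated so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state, branch):   # branch the reasoning tree
                candidates.append(state + "\n" + thought)
        candidates.sort(key=score_thought, reverse=True)      # evaluate each candidate
        frontier = candidates[:beam]                          # keep the most promising paths
    return frontier[0]                                        # best reasoning path found
```

Pruning to the top `beam` states at each level is what lets the method abandon unpromising paths without exhaustively expanding the tree.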

The Results Are Striking

On the Game of 24 (a mathematical reasoning task), GPT-4 with standard CoT prompting achieved 4% success. With Tree of Thoughts: 74%. That’s not a marginal improvement — it’s a qualitative shift in capability.

The Engineering Reality

ToT is powerful but expensive. Each “thought” evaluation requires a model call. A tree with branching factor 3 and depth 5 requires dozens to hundreds of inference calls per problem. For latency-sensitive applications, this is prohibitive. For high-stakes decisions where accuracy matters more than speed — architecture reviews, certification analysis, complex debugging — the trade-off may be worth it.

There’s also a deeper point: ToT demonstrates that the reasoning bottleneck is often in the inference strategy, not the model itself. The same model (GPT-4) goes from 4% to 74% accuracy by changing how it explores the problem space. The weights are identical. The architecture of the interaction is what changed.


Test-Time Compute Scaling: Spending More Compute Where It Matters

The most recent evolution in reasoning strategies is Test-Time Compute Scaling (TTS) — the principle behind OpenAI’s o1 and o3 models, and an increasingly active area of open-source research.

The idea: instead of fixing the computation budget at inference time, allocate more compute to harder problems. Let the model “think longer” when the problem demands it.

How It Works

TTS models are trained to produce extended reasoning traces before committing to a final answer. The model generates an internal chain-of-thought — sometimes hundreds or thousands of tokens — working through the problem step by step before producing its output.

Two key mechanisms:

Sequential scaling: The model generates longer reasoning chains for harder problems. More tokens = more intermediate computation = (in theory) better answers. This is what o1 does internally.

Parallel scaling: Sample multiple independent reasoning attempts and select the best one — either through majority voting (like Self-Consistency) or through a learned verifier that scores each attempt.
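As a sketch of the parallel variant, assuming a hypothetical sampled-solution call and a learned verifier score: replacing the verifier with the majority vote shown earlier recovers Self-Consistency.

```python
def sample_solution(problem: str) -> str:
    """Placeholder: one independently sampled reasoning attempt plus its final answer."""
    raise NotImplementedError

def verifier_score(problem: str, solution: str) -> float:
    """Placeholder: a learned verifier's estimate that this solution is correct."""
    raise NotImplementedError

def best_of_n(problem: str, n: int = 8) -> str:
    """Parallel test-time scaling: sample n attempts, keep the one the verifier likes best."""
    attempts = [sample_solution(problem) for _ in range(n)]
    return max(attempts, key=lambda s: verifier_score(problem, s))
```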

What the Research Shows

Recent large-scale studies reveal important nuances that temper the initial enthusiasm:

No single strategy universally dominates. A study spanning 30+ billion tokens across eight open-source models (7B–235B parameters) found that optimal TTS strategies depend on problem difficulty, model size, and trace length. There is no one-size-fits-all approach.

Longer chains don’t always help. Research on o1-like models (QwQ, DeepSeek-R1, LIMO) found that correct solutions are often shorter than incorrect ones. The models’ self-revision capabilities in longer chains frequently degrade performance — the model talks itself out of a correct answer. This is a direct challenge to the assumption that “more thinking = better answers.”

Parallel beats sequential in many cases. Sampling multiple independent solutions achieves better coverage and scalability than letting a single chain run longer. This has practical implications: it’s often more effective to generate 10 short reasoning attempts and vote than to generate one very long chain.

Simple methods can be surprisingly effective. The s1 model demonstrated that fine-tuning on just 1,000 curated reasoning examples, combined with budget forcing (a decoding-time intervention that caps the model’s reasoning trace or extends it by appending tokens like “Wait”), exceeded o1-preview on competition math by up to 27%. Massive training budgets are not always necessary.
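For a sense of what budget forcing looks like mechanically, here is a loose conceptual sketch. The delimiter string, the step granularity, and the control flow are assumptions for illustration; the s1 implementation operates on the model’s actual end-of-thinking token rather than plain strings.

```python
END_OF_THINKING = "</think>"  # assumed delimiter separating reasoning from the final answer

def generate_step(context: str) -> str:
    """Placeholder: one decoding step (a token or short span) continuing the context."""
    raise NotImplementedError

def think_with_budget(prompt: str, max_steps: int, min_steps: int = 0) -> str:
    """Conceptual budget forcing: cap the reasoning trace, or extend it by appending 'Wait'."""
    trace = prompt
    for step in range(max_steps):
        piece = generate_step(trace)
        if END_OF_THINKING in piece and step < min_steps:
            # Model tried to stop too early: suppress the delimiter and nudge it to keep thinking.
            trace += piece.replace(END_OF_THINKING, "Wait")
            continue
        trace += piece
        if END_OF_THINKING in piece:
            return trace                      # model finished within budget
    return trace + END_OF_THINKING            # budget exhausted: force the end of thinking
```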


The Hierarchy of Reasoning Strategies

These techniques form a natural progression in complexity and capability:

| Strategy | Mechanism | Compute Cost | Best For |
| --- | --- | --- | --- |
| Standard prompting | Direct question → answer | 1x | Simple factual queries |
| Chain-of-Thought | Linear step-by-step reasoning | 1x (longer output) | Arithmetic, multi-step logic |
| Self-Consistency | Multiple CoT chains + majority vote | Nx (N samples) | High-stakes decisions where accuracy matters |
| Tree of Thoughts | Branching search with evaluation and backtracking | 10–100x | Complex planning, search problems |
| Test-Time Compute Scaling | Dynamic compute allocation per problem | Variable | Hard reasoning, competition-level problems |

Each level trades compute for accuracy. The engineering question is always: what’s the cost of being wrong?


What This Means for Engineers

These Are Architectural Decisions, Not Prompt Tricks

Choosing between CoT, Self-Consistency, ToT, and TTS is an infrastructure decision. It affects latency, cost, reliability, and the failure modes of your system. Treat it like choosing a database or a caching strategy — not like choosing a font.

Reasoning Quality Is Bounded by Verification

All of these strategies produce more confident-looking output. That makes verification more important, not less. A model that generates a 500-token reasoning chain with a wrong conclusion is harder to catch than one that outputs a single wrong answer. The reasoning chain creates an illusion of rigor.

If you’re in a regulated domain — payments, medical, legal — you need to architect verification into the pipeline, not just trust that more reasoning steps equals more accuracy.

The Model Is Not Reasoning — It’s Computing

This is worth repeating. These techniques improve output quality by structuring computation, not by enabling understanding. The model doesn’t “know” whether its intermediate steps are correct. It doesn’t have beliefs or intentions. It’s generating tokens that are statistically likely given the preceding context.

This isn’t a philosophical quibble. It has practical engineering consequences: the model can generate a perfectly structured, internally consistent reasoning chain that reaches a confidently stated wrong answer. The chain looks logical. The conclusion is wrong. And the better the reasoning strategy, the more convincing the wrong answers become.

Build for verification. Not for trust.


References

  • Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. arxiv.org/abs/2201.11903
  • Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. arxiv.org/abs/2203.11171
  • Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. arxiv.org/abs/2305.10601
  • “The Art of Scaling Test-Time Compute for Large Language Models.” 2025. arxiv.org/abs/2512.02008
  • Muennighoff, N. et al. “s1: Simple Test-Time Scaling.” 2025. arxiv.org/abs/2501.19393
  • “Revisiting the Test-Time Scaling of o1-like Models.” ACL 2025. aclanthology.org
  • The Obsolescence Paradox: Why the Best Engineers Will Thrive in the AI Era — engineering judgment in the age of AI reasoning systems
  • Prompt Engineering for POS — practical CoT applications in payment systems
  • AI Sycophancy — why confident-looking AI output still requires verification