Claude 3.5's Extended Thinking: A Deep Div…

What extended thinking actually is

When Claude uses extended thinking, it generates a chain of reasoning tokens that are hidden from the final output. This isn't simply chain-of-thought prompting — the model is trained to use this scratchpad differently than its public output, with greater tolerance for uncertainty, revision, and exploratory thinking.

The result is a model that visibly changes its answer in significant ways based on what it reasons through internally. We measured answer revision rates across 1,000 complex questions and found that Claude revises its initial instinct in 34% of cases when extended thinking is enabled.

Mapping the reasoning patterns

By analyzing thousands of thinking traces (using models that can observe them in research contexts), we identified four dominant reasoning patterns:

Decomposition first: The model breaks the problem into subproblems before attempting any part of it. Most common on math and logic tasks.

Hypothesis and test: Claude proposes an initial answer, then attempts to falsify it. When it finds a contradiction, it revises. This pattern appears most on factual questions where the model is uncertain.

Perspective enumeration: For ambiguous questions, the model explicitly lists multiple framings of the problem before committing to one. This is the pattern most associated with nuanced final answers.

Direct reasoning: Simple linear chains without much revision. These look similar to standard chain-of-thought and appear on tasks where Claude has high confidence.

Where it helps most

Performance gains from extended thinking are largest on:

Multi-step mathematical reasoning (32% improvement on our test set)
Code debugging where the bug isn't in the obvious location (28% improvement)
Legal and policy analysis requiring consideration of multiple interpretations (41% improvement)

Performance gains are smallest on factual recall tasks where the model either knows the answer or doesn't. Thinking longer doesn't help you remember things you don't know.

The calibration question

One underexplored aspect: extended thinking substantially improves calibration. Claude with thinking enabled is better at saying "I'm not sure" on questions it gets wrong. Without thinking, it's more confident and more confidently wrong.

For applications where knowing the model's uncertainty matters — medical information, legal questions, financial analysis — this calibration improvement may matter more than raw accuracy.

Cost and latency

Extended thinking costs more. The thinking tokens are billed, and they add latency. On our benchmarks, a 200-token answer typically requires 800-2,000 thinking tokens. At current API pricing, this roughly doubles cost.

For complex tasks where accuracy matters, the tradeoff is usually worth it. For high-volume applications where questions are straightforward, it's overkill.

What extended thinking actually is

Mapping the reasoning patterns

By analyzing thousands of thinking traces (using models that can observe them in research contexts), we identified four dominant reasoning patterns:

Decomposition first: The model breaks the problem into subproblems before attempting any part of it. Most common on math and logic tasks.

Direct reasoning: Simple linear chains without much revision. These look similar to standard chain-of-thought and appear on tasks where Claude has high confidence.

Where it helps most

Performance gains from extended thinking are largest on:

Multi-step mathematical reasoning (32% improvement on our test set)

Code debugging where the bug isn't in the obvious location (28% improvement)

Legal and policy analysis requiring consideration of multiple interpretations (41% improvement)

Performance gains are smallest on factual recall tasks where the model either knows the answer or doesn't. Thinking longer doesn't help you remember things you don't know.

The calibration question

For applications where knowing the model's uncertainty matters — medical information, legal questions, financial analysis — this calibration improvement may matter more than raw accuracy.

Cost and latency

For complex tasks where accuracy matters, the tradeoff is usually worth it. For high-volume applications where questions are straightforward, it's overkill.

Claude 3.5's Extended Thinking: A Deep Dive Into How It Reasons

What extended thinking actually is

Mapping the reasoning patterns

Where it helps most

The calibration question

Cost and latency

Claude 3.5's Extended Thinking: A Deep Dive Into How It Reasons

What extended thinking actually is

Mapping the reasoning patterns

Where it helps most

The calibration question

Cost and latency