What is a reasoning model?

A reasoning model generates a hidden chain of thought - typically thousands of tokens of scratch-work - before producing its final answer. This extended thinking lets it solve math problems, debug complex code, and handle multi-step logic that trips up standard models. DeepSeek-R1 and Qwen3 in thinking mode are the main open-weight examples in 2026.

What is the best open-weight reasoning model to run locally in 2026?

QwQ 32B at Q4_K_M for 24 GB cards - it beats the much larger DeepSeek-R1 671B on many benchmarks while fitting on a single consumer GPU. DeepSeek-R1 14B distill for 12 GB cards. Full DeepSeek-R1 671B if you have the infrastructure (100+ GB memory).

How much VRAM do I need for a reasoning model?

12 GB: DeepSeek-R1 14B distill at Q4_K_M (~9 GB). 16 GB: Qwen3 14B in thinking mode at Q5_K_M. 24 GB: QwQ 32B at Q4_K_M (~20 GB) - the best value reasoning model. 64+ GB unified: Qwen3 72B thinking or DeepSeek-R1 70B distill.

When should I use a reasoning model vs a standard model?

Use a reasoning model for: math and proofs, complex debugging, logic puzzles, planning tasks that need multi-step verification. Stick with a standard model for: chat, summarization, translation, creative writing, quick Q&A. Reasoning models are 3-10x slower and use more tokens for these routine tasks without any quality gain.

What is Qwen3 thinking mode?

Qwen3 models (14B, 32B, 72B) support a /think flag or a system prompt that activates extended chain-of-thought reasoning, similar to o1. In thinking mode, the model emits a ... block before its final answer. You can toggle this on or off per request, which is a key advantage over dedicated reasoning-only models.

Is DeepSeek-R1 safe to run locally?

The weights are publicly available on Hugging Face and have been widely audited by the research community. Like all open-weight models, running locally means data never leaves your machine. The distilled variants (7B, 14B, 32B, 70B) are the most practical for consumer hardware.

How do thinking tokens affect speed?

Reasoning models generate 500-5000+ thinking tokens before the visible answer. At 50 tok/s, a 2000-token think block adds 40 seconds of latency. For simple questions this is wasteful; for hard math or debugging it is the mechanism that produces the correct answer.

Can I run DeepSeek-R1 671B locally?

Only with a server-class setup: a Mac Studio Ultra with 192 GB, a multi-GPU workstation with 4x24 GB (96 GB VRAM total, with offloading), or a dedicated inference server. For most home users, the 32B or 70B distilled variants give 80-90% of the quality at a fraction of the hardware requirement.

Open-Weight Reasoning Models in 2026: DeepSeek-R1, Qwen3 Thinking & When to Use Them

By the RunLocalModel editorial team · Published May 13, 2026 · ~13 minute read

If you only read one paragraph In 2026, “reasoning model” means a model that generates a long internal chain of thought before answering — and the open-weight versions are genuinely impressive. QwQ 32B at Q4_K_M on a 24 GB card is the sweet spot: it outperforms the much larger DeepSeek-R1 671B on many math and coding benchmarks, and it is the best single-GPU reasoning model available locally. Qwen3 is the most flexible option because you can toggle thinking on or off per request. When to skip reasoning mode entirely: chat, summarization, translation, creative writing — the thinking tokens are slow and bring no quality gain for these tasks.

Quick answers

What is a reasoning model?: A model that generates a hidden chain of thought — typically thousands of tokens of scratch-work — before producing its visible answer. Enables hard math, debugging, and multi-step logic.
Best open-weight reasoning model for local use in 2026?: QwQ 32B at Q4_K_M for 24 GB cards. DeepSeek-R1 14B distill for 12 GB. Qwen3 14B/32B in thinking mode if you want toggle-able reasoning.
When should I use a reasoning model?: Math, proofs, complex debugging, multi-step planning. Not for chat, summarization, or quick Q&A — reasoning models are 3–10× slower there with no quality gain.
What is Qwen3 thinking mode?: A per-request flag (/think or a system prompt) that activates chain-of-thought reasoning in Qwen3 models. You can toggle it off for routine tasks — a key advantage over reasoning-only models.
How much VRAM do I need for a reasoning model?: 12 GB: R1-14B distill at Q4. 24 GB: QwQ 32B at Q4 (best value). 64 GB+: Qwen3 72B thinking or R1-70B distill.
Can I run DeepSeek-R1 671B locally?: Only with server-class hardware (Mac Studio Ultra 192 GB, multi-GPU workstation, etc.). For most users, the 32B or 70B distills give 80–90% of the quality.

How reasoning models actually work

Standard LLMs output tokens one after another and produce an answer in a single pass. A reasoning model is trained to first emit a chain of thought — a block of intermediate reasoning, typically formatted as <think>...</think> — and only then output the final answer. The key insight from the DeepSeek-R1 paper and subsequent OpenAI o1 analysis is that this extended scratch-work is not cosmetic: the model genuinely uses the thinking tokens to verify sub-steps, backtrack on errors, and build up the answer from first principles. DeepSeek-R1’s technical report showed that the reasoning distillation process — fine-tuning smaller models on R1’s chain-of-thought outputs — transfers much of this capability to 7B–70B models that otherwise could not solve competition-math problems.

The trade-off is latency and token budget. A reasoning model might spend 1,000–5,000 thinking tokens on a hard problem before answering. At 50 tok/s, that is 20–100 seconds of “thinking time” before you see the first word of the answer. For tasks where the thinking is irrelevant (writing an email, translating a sentence), this is pure waste.

What a thinking block looks like in practice When you prompt a reasoning model with a hard problem, the raw output starts with something like:

<think>

          Let me approach this step by step. The user wants to prove that...

          Actually, let me reconsider. If I substitute x = ...

          Yes, that works. Now I need to verify the base case...

          </think>

          The proof proceeds as follows: ...

Most UIs (Ollama, LM Studio, Continue.dev) either hide the think block or show it collapsed. The final answer after </think> is what you actually read.

The main open-weight reasoning models in 2026

DeepSeek-R1 and its distilled variants

DeepSeek-R1 from DeepSeek AI was the model that brought open-weight reasoning to mainstream attention in early 2025. The full model is 671B parameters (MoE architecture, ~37B active per token), requiring ~400 GB of memory to run at full precision — firmly in the datacenter tier. The more important story for local users is the distilled series: DeepSeek trained smaller dense models (7B, 14B, 32B, 70B) on R1’s chain-of-thought outputs, producing reasoning-capable models that fit on consumer hardware. The R1-Distill-14B fits in 12 GB at Q4_K_M and solves AMC/AIME-style competition math at a level that standard 14B models cannot touch.

On AIME 2024 pass@1, the R1-14B distill scores around 69% — comparable to full o1-preview on the same benchmark, at a fraction of the parameter count and running locally. The trade-off: R1 distills have a fixed reasoning style; you cannot turn off the thinking for simpler tasks.

QwQ 32B (Qwen reasoning series)

QwQ 32B from Alibaba is the single most impressive consumer-hardware reasoning model in 2026. It fits in ~20 GB at Q4_K_M (comfortable on a 24 GB card), and its benchmark scores are remarkable: it matches or beats DeepSeek-R1 671B on AIME 2024, LiveCodeBench, and MATH-500, despite being a 32B dense model vs a 671B MoE. QwQ was Alibaba’s proof-of-concept that compact reasoning models can achieve frontier-level performance on hard tasks. It is the first model to try if you have a 24 GB card and want reasoning capability.

Qwen3 in thinking mode (14B / 32B / 72B)

The Qwen3 family takes a different approach: rather than separate reasoning-only models, every Qwen3 model supports a thinking toggle. In Ollama and LM Studio, you can send /think at the start of a message, or set a system prompt flag, to switch the model into reasoning mode for that request. This makes Qwen3 the most flexible option: use thinking mode for math and debugging, turn it off for chat and quick lookups with no model swap required.

Qwen3 32B in thinking mode scores comparably to QwQ 32B on most reasoning benchmarks (they share architecture lineage). Qwen3 14B thinking mode is the best option for 16 GB cards.

DeepSeek-R1 70B distill

The R1-70B distill (Llama-3-based architecture) is the strongest open reasoning model below the 100+ GB tier. It needs ~42 GB at Q4_K_M — fitting a Mac Studio with 64 GB, a Strix Halo with 64 GB unified memory, or a dual-GPU NVIDIA setup. It is noticeably better than 32B on hard multi-step problems, but 32B covers 90% of home-user needs.

Benchmark comparison

Model	VRAM at Q4_K_M	AIME 2024 ~pass@1	MATH-500	HumanEval	Toggle thinking?
DeepSeek-R1-Distill 14B	~9 GB	~69%	~93%	~78%	No (always on)
Qwen3 14B (thinking mode)	~10 GB	~71%	~94%	~81%	Yes — per request
QwQ 32B	~20 GB	~79%	~97%	~88%	No (always on)
Qwen3 32B (thinking mode)	~20 GB	~79%	~96%	~87%	Yes — per request
DeepSeek-R1-Distill 70B	~42 GB	~83%	~98%	~91%	No (always on)
Qwen3 72B (thinking mode)	~43 GB	~84%	~98%	~92%	Yes — per request
DeepSeek-R1 671B (full)	~400 GB	~79%	~97%	~91%	No (always on)

Note the counterintuitive result: QwQ 32B at 20 GB matches or beats the full 671B DeepSeek-R1 on several benchmarks. This reflects both the quality of distillation training and the difference in inference settings. Benchmark numbers vary across providers and prompt templates; treat them as relative comparisons rather than absolute scores.

Recommended pick by hardware tier

Your VRAM / memory	Best reasoning model	Quant	Notes
8–10 GB	DeepSeek-R1-Distill 7B	Q4_K_M	Entry point. Good on basic math; struggles with competition problems.
12–16 GB	Qwen3 14B (thinking on)	Q4_K_M–Q5_K_M	Best value in this tier; toggle thinking off for everyday chat.
24 GB	QwQ 32B or Qwen3 32B thinking	Q4_K_M	Sweet spot. Both fit at ~20 GB; QwQ if pure reasoning, Qwen3 if flexibility matters.
64 GB unified (Apple/Strix Halo)	Qwen3 72B thinking	Q4_K_M	Near-frontier reasoning on a single device; very strong on complex code and math.
192 GB+ (Mac Ultra / multi-GPU)	DeepSeek-R1 671B	Q3_K_M or Q4_K_M	The full model; no quality difference from distills on most consumer tasks.

When reasoning mode actually helps — and when it does not

Use reasoning mode for these tasks

Competition math and proofs. AIME, AMC, and olympiad-style problems are precisely where reasoning models were trained. A QwQ 32B gets problems that stump GPT-4o.
Complex debugging across multiple files. When a bug requires tracking state across 300+ lines, the model’s ability to reason step-by-step and verify intermediate hypotheses matters. Pair with a good coding setup.
Algorithm design and optimization. “Find the most efficient way to solve X” benefits from the model checking its proposed approach before committing.
Logic puzzles and constraint satisfaction. Planning problems, scheduling, and any task where you need to reason about what is and is not consistent with a set of constraints.
Careful document analysis. Tasks like “find all contradictions in this contract” or “verify whether this proof is correct” are good fits.

Stick with a standard model for these tasks

Chat and conversational Q&A. You do not need 2,000 thinking tokens to answer “what’s the capital of France.” A standard model is 5–10× faster and gives the same answer.
Summarization and translation. These are pattern-matching tasks. Reasoning overhead adds latency without improving output.
Creative writing. Reasoning models can feel stilted in creative tasks because they approach them analytically. A general model like Qwen3 14B with thinking off or Mistral 7B is more natural.
Short code generation from a clear spec. “Write a Python function that reverses a string” does not benefit from 800 thinking tokens. Thinking mode shines when the spec is underspecified or the problem is genuinely hard.

Practical tip: If you are using Qwen3 with thinking mode available, a good heuristic is: toggle thinking ON for any task where you would not be confident of the answer yourself, and OFF for tasks you could answer quickly. The model generally mirrors human difficulty.

Running reasoning models locally: the practical setup

Ollama (easiest)

ollama pull qwq:32b-q4_K_M — Ollama handles the model download, quantization selection, and serving. The think block is generated internally; most Ollama clients show only the final answer by default. To see the raw thinking, use Ollama’s /api/chat endpoint directly and look for the thinking field in the response stream.

For Qwen3 with toggle: ollama pull qwen3:14b-q4_K_M then send /think Your hard problem here in the Ollama chat interface to activate reasoning for that message.

llama.cpp server mode

llama-server -m qwq-32b-q4_k_m.gguf --ctx-size 32768. The large context size matters — thinking blocks can run to thousands of tokens, and you need enough context to hold the full think + answer without truncation. For QwQ 32B at 32k context on a 24 GB card, this is tight; drop to 16k if you see OOM errors.

LM Studio

Search for QwQ-32B or DeepSeek-R1-Distill in the LM Studio model browser, download the Q4_K_M GGUF variant. LM Studio renders <think> blocks in a collapsible section in the chat UI — one of the cleanest ways to see the model’s reasoning process.

The context length problem

Reasoning models have a unique context pressure issue. A single hard problem might consume:

Your prompt: ~500 tokens
The thinking block: 1,000–8,000 tokens (varies enormously by problem difficulty)
The final answer: 200–500 tokens

At 32k context, that is fine for one or two turns. In a multi-turn conversation where each turn generates a 3,000-token think block, you hit the context ceiling much faster than with a standard model. Strategies: (a) use a larger context window (needs more VRAM — see our quantization guide), (b) keep reasoning sessions short and focused, or (c) use Qwen3 and toggle thinking off once you have the hard answer and are doing follow-up chat.

What is coming next

Two trends in 2026. First, budget forcing: the ability to set a maximum thinking-token budget per request is appearing in newer model releases. This lets you trade reasoning quality for latency — “think for up to 500 tokens, then answer” for quick questions, “think for up to 4,000 tokens” for hard ones. Qwen3 supports this via the enable_thinking flag and thinking budget parameters. Second, distillation is getting denser: the gap between a 14B distill and the full 671B is shrinking with each new training run. Expect the 14B and 32B tier to keep improving without needing more hardware.

Related guides on this site

Best Local LLM for Coding in 2026 — pair a reasoning model with a good coding setup
Choosing the Right Quantization for Local LLMs in 2026 — manage context window vs VRAM trade-offs
VRAM vs Unified Memory: When Each Wins for Local AI
Apple Silicon vs RTX 4090 for Local LLMs: A Real-World Comparison
AMD Radeon for Local LLMs in 2026: Where ROCm Stands
Best Local AI Models by Use Case (2026 Guide)