RunLocalModel.com

Open-Weight Reasoning Models in 2026: DeepSeek-R1, Qwen3 Thinking & When to Use Them

By the RunLocalModel editorial team · Published May 13, 2026 · ~13 minute read

If you only read one paragraph In 2026, “reasoning model” means a model that generates a long internal chain of thought before answering — and the open-weight versions are genuinely impressive. QwQ 32B at Q4_K_M on a 24 GB card is the sweet spot: it outperforms the much larger DeepSeek-R1 671B on many math and coding benchmarks, and it is the best single-GPU reasoning model available locally. Qwen3 is the most flexible option because you can toggle thinking on or off per request. When to skip reasoning mode entirely: chat, summarization, translation, creative writing — the thinking tokens are slow and bring no quality gain for these tasks.
Quick answers
What is a reasoning model?
A model that generates a hidden chain of thought — typically thousands of tokens of scratch-work — before producing its visible answer. Enables hard math, debugging, and multi-step logic.
Best open-weight reasoning model for local use in 2026?
QwQ 32B at Q4_K_M for 24 GB cards. DeepSeek-R1 14B distill for 12 GB. Qwen3 14B/32B in thinking mode if you want toggle-able reasoning.
When should I use a reasoning model?
Math, proofs, complex debugging, multi-step planning. Not for chat, summarization, or quick Q&A — reasoning models are 3–10× slower there with no quality gain.
What is Qwen3 thinking mode?
A per-request flag (/think or a system prompt) that activates chain-of-thought reasoning in Qwen3 models. You can toggle it off for routine tasks — a key advantage over reasoning-only models.
How much VRAM do I need for a reasoning model?
12 GB: R1-14B distill at Q4. 24 GB: QwQ 32B at Q4 (best value). 64 GB+: Qwen3 72B thinking or R1-70B distill.
Can I run DeepSeek-R1 671B locally?
Only with server-class hardware (Mac Studio Ultra 192 GB, multi-GPU workstation, etc.). For most users, the 32B or 70B distills give 80–90% of the quality.

How reasoning models actually work

Standard LLMs output tokens one after another and produce an answer in a single pass. A reasoning model is trained to first emit a chain of thought — a block of intermediate reasoning, typically formatted as <think>...</think> — and only then output the final answer. The key insight from the DeepSeek-R1 paper and subsequent OpenAI o1 analysis is that this extended scratch-work is not cosmetic: the model genuinely uses the thinking tokens to verify sub-steps, backtrack on errors, and build up the answer from first principles. DeepSeek-R1’s technical report showed that the reasoning distillation process — fine-tuning smaller models on R1’s chain-of-thought outputs — transfers much of this capability to 7B–70B models that otherwise could not solve competition-math problems.

The trade-off is latency and token budget. A reasoning model might spend 1,000–5,000 thinking tokens on a hard problem before answering. At 50 tok/s, that is 20–100 seconds of “thinking time” before you see the first word of the answer. For tasks where the thinking is irrelevant (writing an email, translating a sentence), this is pure waste.

What a thinking block looks like in practice When you prompt a reasoning model with a hard problem, the raw output starts with something like:

<think>
Let me approach this step by step. The user wants to prove that...
Actually, let me reconsider. If I substitute x = ...
Yes, that works. Now I need to verify the base case...
</think>
The proof proceeds as follows: ...


Most UIs (Ollama, LM Studio, Continue.dev) either hide the think block or show it collapsed. The final answer after </think> is what you actually read.

The main open-weight reasoning models in 2026

DeepSeek-R1 and its distilled variants

DeepSeek-R1 from DeepSeek AI was the model that brought open-weight reasoning to mainstream attention in early 2025. The full model is 671B parameters (MoE architecture, ~37B active per token), requiring ~400 GB of memory to run at full precision — firmly in the datacenter tier. The more important story for local users is the distilled series: DeepSeek trained smaller dense models (7B, 14B, 32B, 70B) on R1’s chain-of-thought outputs, producing reasoning-capable models that fit on consumer hardware. The R1-Distill-14B fits in 12 GB at Q4_K_M and solves AMC/AIME-style competition math at a level that standard 14B models cannot touch.

On AIME 2024 pass@1, the R1-14B distill scores around 69% — comparable to full o1-preview on the same benchmark, at a fraction of the parameter count and running locally. The trade-off: R1 distills have a fixed reasoning style; you cannot turn off the thinking for simpler tasks.

QwQ 32B (Qwen reasoning series)

QwQ 32B from Alibaba is the single most impressive consumer-hardware reasoning model in 2026. It fits in ~20 GB at Q4_K_M (comfortable on a 24 GB card), and its benchmark scores are remarkable: it matches or beats DeepSeek-R1 671B on AIME 2024, LiveCodeBench, and MATH-500, despite being a 32B dense model vs a 671B MoE. QwQ was Alibaba’s proof-of-concept that compact reasoning models can achieve frontier-level performance on hard tasks. It is the first model to try if you have a 24 GB card and want reasoning capability.

Qwen3 in thinking mode (14B / 32B / 72B)

The Qwen3 family takes a different approach: rather than separate reasoning-only models, every Qwen3 model supports a thinking toggle. In Ollama and LM Studio, you can send /think at the start of a message, or set a system prompt flag, to switch the model into reasoning mode for that request. This makes Qwen3 the most flexible option: use thinking mode for math and debugging, turn it off for chat and quick lookups with no model swap required.

Qwen3 32B in thinking mode scores comparably to QwQ 32B on most reasoning benchmarks (they share architecture lineage). Qwen3 14B thinking mode is the best option for 16 GB cards.

DeepSeek-R1 70B distill

The R1-70B distill (Llama-3-based architecture) is the strongest open reasoning model below the 100+ GB tier. It needs ~42 GB at Q4_K_M — fitting a Mac Studio with 64 GB, a Strix Halo with 64 GB unified memory, or a dual-GPU NVIDIA setup. It is noticeably better than 32B on hard multi-step problems, but 32B covers 90% of home-user needs.

Benchmark comparison

ModelVRAM at Q4_K_MAIME 2024 ~pass@1MATH-500HumanEvalToggle thinking?
DeepSeek-R1-Distill 14B~9 GB~69%~93%~78%No (always on)
Qwen3 14B (thinking mode)~10 GB~71%~94%~81%Yes — per request
QwQ 32B~20 GB~79%~97%~88%No (always on)
Qwen3 32B (thinking mode)~20 GB~79%~96%~87%Yes — per request
DeepSeek-R1-Distill 70B~42 GB~83%~98%~91%No (always on)
Qwen3 72B (thinking mode)~43 GB~84%~98%~92%Yes — per request
DeepSeek-R1 671B (full)~400 GB~79%~97%~91%No (always on)

Note the counterintuitive result: QwQ 32B at 20 GB matches or beats the full 671B DeepSeek-R1 on several benchmarks. This reflects both the quality of distillation training and the difference in inference settings. Benchmark numbers vary across providers and prompt templates; treat them as relative comparisons rather than absolute scores.

Recommended pick by hardware tier

Your VRAM / memoryBest reasoning modelQuantNotes
8–10 GBDeepSeek-R1-Distill 7BQ4_K_MEntry point. Good on basic math; struggles with competition problems.
12–16 GBQwen3 14B (thinking on)Q4_K_M–Q5_K_MBest value in this tier; toggle thinking off for everyday chat.
24 GBQwQ 32B or Qwen3 32B thinkingQ4_K_MSweet spot. Both fit at ~20 GB; QwQ if pure reasoning, Qwen3 if flexibility matters.
64 GB unified (Apple/Strix Halo)Qwen3 72B thinkingQ4_K_MNear-frontier reasoning on a single device; very strong on complex code and math.
192 GB+ (Mac Ultra / multi-GPU)DeepSeek-R1 671BQ3_K_M or Q4_K_MThe full model; no quality difference from distills on most consumer tasks.

When reasoning mode actually helps — and when it does not

Use reasoning mode for these tasks

Stick with a standard model for these tasks

Practical tip: If you are using Qwen3 with thinking mode available, a good heuristic is: toggle thinking ON for any task where you would not be confident of the answer yourself, and OFF for tasks you could answer quickly. The model generally mirrors human difficulty.

Running reasoning models locally: the practical setup

Ollama (easiest)

ollama pull qwq:32b-q4_K_M — Ollama handles the model download, quantization selection, and serving. The think block is generated internally; most Ollama clients show only the final answer by default. To see the raw thinking, use Ollama’s /api/chat endpoint directly and look for the thinking field in the response stream.

For Qwen3 with toggle: ollama pull qwen3:14b-q4_K_M then send /think Your hard problem here in the Ollama chat interface to activate reasoning for that message.

llama.cpp server mode

llama-server -m qwq-32b-q4_k_m.gguf --ctx-size 32768. The large context size matters — thinking blocks can run to thousands of tokens, and you need enough context to hold the full think + answer without truncation. For QwQ 32B at 32k context on a 24 GB card, this is tight; drop to 16k if you see OOM errors.

LM Studio

Search for QwQ-32B or DeepSeek-R1-Distill in the LM Studio model browser, download the Q4_K_M GGUF variant. LM Studio renders <think> blocks in a collapsible section in the chat UI — one of the cleanest ways to see the model’s reasoning process.

The context length problem

Reasoning models have a unique context pressure issue. A single hard problem might consume:

At 32k context, that is fine for one or two turns. In a multi-turn conversation where each turn generates a 3,000-token think block, you hit the context ceiling much faster than with a standard model. Strategies: (a) use a larger context window (needs more VRAM — see our quantization guide), (b) keep reasoning sessions short and focused, or (c) use Qwen3 and toggle thinking off once you have the hard answer and are doing follow-up chat.

What is coming next

Two trends in 2026. First, budget forcing: the ability to set a maximum thinking-token budget per request is appearing in newer model releases. This lets you trade reasoning quality for latency — “think for up to 500 tokens, then answer” for quick questions, “think for up to 4,000 tokens” for hard ones. Qwen3 supports this via the enable_thinking flag and thinking budget parameters. Second, distillation is getting denser: the gap between a 14B distill and the full 671B is shrinking with each new training run. Expect the 14B and 32B tier to keep improving without needing more hardware.

Related guides on this site