RunLocalModel.com

Choosing the Right Quantization for Local LLMs in 2026: Q4 vs Q5 vs Q6 vs Q8

By the RunLocalModel editorial team · Published May 6, 2026 · ~12 minute read

If you only read one paragraph: For 99% of desktop users in 2026, Q4_K_M is still the right default. Move up to Q5_K_M or Q6_K if the model is small enough that your VRAM is not the bottleneck. Only use Q8_0 if you are doing evaluation, fine-tuning verification, or running on a workstation with tons of VRAM to spare. The new IQ family (e.g. IQ3_XXS, IQ4_XS) is genuinely useful when you are 1-2 GB short of fitting a model.

Why quantization exists at all

A modern open-weights model is shipped, by default, in 16-bit floating point (FP16 or BF16). At that precision, every parameter takes 2 bytes. An 8-billion parameter model is therefore 16 GB before you even load a single token of context. That number is bigger than the VRAM in most consumer GPUs, including the very common 8 GB and 12 GB cards.

Quantization is the trick that makes local inference work on normal hardware. Instead of storing each weight as a 16-bit float, you store it in fewer bits - 8, 6, 5, 4, or even 2 - using clever per-block scaling so that the rounding error stays small. The result is a much smaller file that still produces almost the same outputs.
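To make "per-block scaling" concrete, here is a minimal Python sketch of symmetric 4-bit block quantization. It is an illustration of the idea only - the real GGUF formats add per-block minimums, super-blocks, and (for the K and IQ variants) more elaborate scale selection - and the function names here are ours, not llama.cpp's.

```python
import numpy as np

def quantize_block_q4(block: np.ndarray):
    """Map one block of float weights to 4-bit ints plus a single FP16 scale."""
    scale = np.abs(block).max() / 7.0                    # one scale per block
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, np.float16(scale)

def dequantize_block(q: np.ndarray, scale: np.float16) -> np.ndarray:
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=32).astype(np.float32)  # one 32-weight block
q, scale = quantize_block_q4(block)
err = np.abs(block - dequantize_block(q, scale)).mean()
print(f"mean rounding error: {err:.6f}")
# Storage: 32 weights x 4 bits + one 16-bit scale = 144 bits, vs 512 bits at FP16.
```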

The format that has won on the desktop is GGUF - the file format used by llama.cpp, which in turn powers Ollama, LM Studio, Jan, KoboldCpp, and most other desktop runtimes. Almost everything in this article assumes you are picking a GGUF file.

The quantization names, decoded

GGUF quantizations have intimidating names, but they follow a pattern: Q<bits>_<variant>. The number is the approximate bits per weight, so a Q4 file is roughly a quarter of the FP16 size. The suffix describes the scheme: _0 and _1 are the older legacy formats, _K marks the newer "k-quants" with smarter per-block scaling, and a trailing _S, _M, or _L (small / medium / large) tells you how many of the most sensitive tensors are kept at higher precision. Names that start with IQ are the "i-quants", which squeeze better quality out of very low bit counts and are usually built with an importance matrix derived from calibration data.

You will also occasionally see F16 (the original, unquantized 16-bit weights) and the brand-new MXFP4 format used by some 2026-era frontier-class open models. Treat F16 as "the reference" and MXFP4 as a 4-bit format optimized for newer GPUs.

How much quality do you actually lose?

The honest answer is: a lot less than you would expect, until you go below 4 bits.

On standard perplexity benchmarks against the FP16 reference, the typical penalty for the most popular quantizations on a well-trained 7B-or-larger model looks roughly like this, based on our reading of public llama.cpp evaluations and our own spot checks:

| Quant | Bits / weight | File size vs FP16 | Quality loss | What it feels like |
|---|---|---|---|---|
| Q8_0 | ~8.5 | ~53% | Negligible | Indistinguishable from FP16 in casual use. |
| Q6_K | ~6.6 | ~41% | Very small | Indistinguishable from FP16 in casual use. |
| Q5_K_M | ~5.7 | ~36% | Small | Slightly more "off" answers on hard reasoning, basically fine for chat. |
| Q4_K_M | ~4.8 | ~30% | Noticeable but acceptable | The community default. Occasional small errors, still feels "smart". |
| IQ4_XS | ~4.3 | ~27% | Similar to Q4_K_M, sometimes better | Great for "I am 500 MB short of fitting it". |
| IQ3_XXS | ~3.1 | ~19% | Visible degradation | Noticeably "dumber". Use only if Q4 will not fit at all. |
| Q2_K | ~2.6 | ~16% | Severe | Often produces broken or repetitive output. Last resort. |
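Where the "file size vs FP16" column comes from: it is essentially bits-per-weight divided by 16. A rough estimator, as a sketch using the effective bits-per-weight figures from the table (real GGUF files come out slightly larger because of metadata, the tokenizer, and a few tensors kept at higher precision):

```python
# Effective bits per weight, taken from the table above.
QUANT_BITS = {
    "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "IQ4_XS": 4.3, "IQ3_XXS": 3.1, "Q2_K": 2.6,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight-file size in decimal GB."""
    return params_billions * QUANT_BITS[quant] / 8

for quant, bits in QUANT_BITS.items():
    print(f"{quant:8s} {bits / 16:4.0%} of FP16   8B model ~ {gguf_size_gb(8, quant):4.1f} GB")

# The 70B-vs-8B comparison discussed below:
print(f"70B at Q4_K_M ~ {gguf_size_gb(70, 'Q4_K_M'):.0f} GB, "
      f"8B at Q8_0 ~ {gguf_size_gb(8, 'Q8_0'):.1f} GB")
```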

Two things to keep in mind. First, quality loss scales inversely with model size: a 70B at Q4 feels almost identical to FP16, while a 3B at Q4 starts to noticeably hallucinate. Second, perplexity is a blunt instrument; it does not capture how a model behaves on coding, agentic tool use, or long-context tasks, all of which we have observed to degrade faster than perplexity alone suggests.

Counterintuitive but important: a larger model at a lower quantization is almost always better than a smaller model at a higher quantization, as long as both fit. Llama 3.1 70B at Q4_K_M beats Llama 3.1 8B at Q8_0 on essentially every benchmark we have looked at, while taking roughly 5× the memory (about 42 GB vs 8.5 GB) rather than the nearly 9× that the raw parameter counts would suggest.

How quantization affects speed

A common oversimplification: "smaller quant = faster." That is mostly true on GPU, and only sometimes true on CPU.

Inference speed for local LLMs is dominated by memory bandwidth, not raw compute. Every generated token requires reading the entire model weights from memory at least once. So if your 8B model is 8 GB at Q8 and 4.5 GB at Q4, the Q4 version will run roughly 1.7× faster on the same GPU - simply because there is less data to move.
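A back-of-the-envelope way to see the bandwidth argument: divide memory bandwidth by the bytes that must be read per token. The sketch below ignores compute, caches, and the KV cache, so the absolute numbers are optimistic ceilings; the ratio between two quants is the useful part. The 500 GB/s figure is an assumed, roughly RTX 4070-class bandwidth.

```python
def tokens_per_second_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
    # Each generated token has to stream the full set of weights at least once.
    return bandwidth_gb_s / model_gb

BANDWIDTH = 500.0  # GB/s, assumed RTX 4070-class card
for label, size_gb in [("Q8_0 (~8 GB)", 8.0), ("Q4_K_M (~4.5 GB)", 4.5)]:
    print(f"{label}: at most ~{tokens_per_second_ceiling(size_gb, BANDWIDTH):.0f} tok/s")
# 8.0 / 4.5 is about 1.8, which is where the "roughly 1.7x faster" figure comes from.
```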

Two important caveats:

  1. On CPU, the exotic IQ quants are more expensive to decode than the K quants, so an IQ4_XS file can generate tokens more slowly than a larger Q4_K_M despite being smaller on disk.
  2. The biggest speed cliff is not between quant levels but between fitting and not fitting: once layers spill out of VRAM into system RAM, throughput drops by far more than any quantization choice can win back.

What about the KV cache?

People obsess over weight quantization and forget that the KV cache can be quantized too. The KV cache stores the attention state for every token in your context window. At long context (32k+ tokens), it can rival the size of the model weights themselves.
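To put numbers on that, here is a back-of-the-envelope KV cache calculation using Llama-3.1-8B-like dimensions (32 layers, 8 KV heads thanks to grouped-query attention, head size 128). The bytes-per-element figures for the quantized cache types are approximate effective values that include the block scales.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: float) -> float:
    # K and V are each stored for every layer, KV head, and token in the context.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens / 1e9

# Llama-3.1-8B-class geometry, 32k context.
for cache_type, bpe in [("f16", 2.0), ("q8_0", 1.06), ("q4_0", 0.56)]:
    print(f"cache type {cache_type}: ~{kv_cache_gb(32, 8, 128, 32_768, bpe):.1f} GB")
# f16 comes out around 4.3 GB - the same ballpark as the ~4.8 GB of Q4_K_M weights
# for that model, which is why long-context KV cache size matters so much.
```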

llama.cpp lets you quantize the KV cache to q8_0 or even q4_0 via the --cache-type-k and --cache-type-v flags (Ollama exposes this through its OLLAMA_KV_CACHE_TYPE environment variable). In our testing:

Our home-page compatibility checker already accounts for KV cache quantization in its VRAM estimate. The methodology is documented on the how-it-works page.

Concrete recommendations by hardware

8 GB VRAM (RTX 3050, 3060 Ti, 4060, M-series base)

Stick to 7B-8B models at Q4_K_M. If a model is 1 GB too big, try IQ4_XS or quantize the KV cache to q8_0. Avoid going above 8B - Q3 on a 13B will almost always feel worse than Q4_K_M on a 7B.
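The "1 GB too big" situation is easy to sanity-check yourself: weights plus KV cache plus some runtime overhead must stay under your VRAM. A rough sketch with illustrative numbers (the 0.5 GB overhead and the KV cache figures are assumptions, not measurements):

```python
def fits(vram_gb: float, weights_gb: float, kv_cache_gb: float,
         overhead_gb: float = 0.5) -> bool:
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

# 8B at Q4_K_M (~4.9 GB weights) with an 8k f16 KV cache (~1.1 GB) on an 8 GB card:
print(fits(8.0, 4.9, 1.1))   # True, with a little room to spare
# A 12B at Q4_K_M (~7.5 GB) does not fit; IQ4_XS (~6.7 GB) plus a q8_0 cache might:
print(fits(8.0, 7.5, 1.3))   # False
print(fits(8.0, 6.7, 0.7))   # True, but only just
```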

12 GB VRAM (RTX 3060 12GB, 4070)

The sweet spot. 7B-8B at Q5_K_M or Q6_K. Or 12B-14B at Q4_K_M with a 4k-8k context. This tier is where local LLMs start to feel genuinely useful for daily work.

16 GB VRAM (RTX 4060 Ti 16GB, 4070 Ti Super, M-series 16GB unified)

12B-14B at Q5_K_M, or 20B-24B at Q4_K_M with a 4k context (a 30B-32B at Q4_K_M is roughly 19-20 GB of weights alone and only works with partial CPU offload). On Apple Silicon, watch out for unified memory pressure - leave at least 4 GB of headroom for the OS.

24 GB VRAM (RTX 3090, 4090, 7900 XTX)

30B-32B at Q4_K_M with a comfortable 8k context (Q5_K_M only if you keep the context short), or a 70B if you really want to feel the difference - but note that even IQ3_XXS on a 70B is ~27 GB, so on 24 GB that means a 2-bit IQ quant or partial CPU offload, with the speed and quality hit that implies. This is where Q6_K and Q8_0 on 8B-14B models become viable for users who care about marginal quality gains.

48 GB+ (workstation, dual-GPU, or 64GB+ Apple Silicon)

70B at Q4_K_M or Q5_K_M with a 16k+ context window. At this tier, the quantization choice matters less than the engine and OS configuration.

You can plug your exact hardware into the home-page checker to see these recommendations applied to specific models from our database, with VRAM and tokens-per-second estimates.

The five-minute decision rule

When in doubt, this is the algorithm we use ourselves (a code sketch follows the list):

  1. Pick the largest model that fits at Q4_K_M with your target context length.
  2. If that model leaves more than 25% of your VRAM unused, step up to Q5_K_M.
  3. If it still leaves more than 25% unused, step up to Q6_K.
  4. If you cannot fit your target model at Q4_K_M, try IQ4_XS or quantize the KV cache to q8_0 before dropping to a smaller model.
  5. Only consider Q8_0 when you are explicitly evaluating quality and have memory to burn.
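The same rule as a code sketch. The bit counts come from the table earlier in the article; the fit test here is weights-only (no KV cache or runtime overhead), and the helper names are ours, so treat the output as a starting point rather than a verdict.

```python
BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "IQ4_XS": 4.3}

def weights_gb(params_billions: float, quant: str) -> float:
    return params_billions * BITS[quant] / 8

def pick_quant(params_billions: float, usable_vram_gb: float) -> str | None:
    """Apply the five-minute rule to one model size (steps 1-4, weights only)."""
    if weights_gb(params_billions, "Q4_K_M") > usable_vram_gb:
        # Step 4: try IQ4_XS before giving up on this model size.
        if weights_gb(params_billions, "IQ4_XS") <= usable_vram_gb:
            return "IQ4_XS"
        return None                        # drop to a smaller model
    choice = "Q4_K_M"
    for step_up in ("Q5_K_M", "Q6_K"):     # steps 2 and 3
        if (weights_gb(params_billions, choice) < 0.75 * usable_vram_gb
                and weights_gb(params_billions, step_up) <= usable_vram_gb):
            choice = step_up
    return choice

print(pick_quant(8, 12))    # Q6_K  - an 8B leaves a 12 GB card mostly empty
print(pick_quant(32, 16))   # None  - a 32B will not fit on 16 GB even at IQ4_XS
```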

What is changing in 2026

Two trends are worth watching. First, several major open releases now ship in MXFP4 as a first-class format rather than as a quantized derivative. When a model is trained with quantization in mind, the quality loss is dramatically smaller, and you can expect more 4-bit-native releases throughout the year. Second, the gap between IQ and K quants on consumer GPUs continues to shrink as runtimes improve their dequantization kernels - IQ4_XS may eventually replace Q4_K_M as the community default.

For now, though, our recommendation has not changed in 18 months: Q4_K_M is the right default, and you should only deviate when you have a concrete reason to.
