Choosing the Right Quantization for Local LLMs in 2026: Q4 vs Q5 vs Q6 vs Q8
Q4_K_M is still the right default. Move up to Q5_K_M or Q6_K if the model is small enough that your VRAM is not the bottleneck. Only use Q8_0 if you are doing evaluation, fine-tuning verification, or running on a workstation with tons of VRAM to spare. The new IQ family (e.g. IQ3_XXS, IQ4_XS) is genuinely useful when you are 1-2 GB short of fitting a model.
Why quantization exists at all
A modern open-weights model is shipped, by default, in 16-bit floating point (FP16 or BF16). At that precision, every parameter takes 2 bytes. An 8-billion parameter model is therefore 16 GB before you even load a single token of context. That number is bigger than the VRAM in most consumer GPUs, including the very common 8 GB and 12 GB cards.
Quantization is the trick that makes local inference work on normal hardware. Instead of storing each weight as a 16-bit float, you store it in fewer bits - 8, 6, 5, 4, or even 2 - using clever per-block scaling so that the rounding error stays small. The result is a much smaller file that still produces almost the same outputs.
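The per-block scaling idea is easy to see in miniature. Below is a toy sketch, loosely in the spirit of GGUF's simplest format (`Q8_0`): weights are split into fixed-size blocks, and each block stores one float scale plus small integers. The block size matches GGUF's classic 32-weight blocks, but the rounding and storage details here are illustrative, not the actual GGUF layout.

```python
import random

BLOCK_SIZE = 32  # GGUF's classic quantization blocks hold 32 weights

def quantize_block(weights):
    """Map a block of floats to small ints plus one per-block scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1e-9
    return scale, [round(w / scale) for w in weights]

def dequantize_block(scale, ints):
    """Recover approximate floats: each int is multiplied back by the scale."""
    return [scale * q for q in ints]

random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(BLOCK_SIZE)]
scale, ints = quantize_block(block)
restored = dequantize_block(scale, ints)

# Rounding error is bounded by half the scale, which is tiny because the
# scale adapts to each block's largest weight.
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"max rounding error: {max_err:.6f}")
```

Because the scale is chosen per block rather than per tensor, one outlier weight only inflates the error of its own 32-weight neighborhood, which is the core reason the rounding error stays small.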
The format that has won on the desktop is GGUF - the file format used by llama.cpp, which in turn powers Ollama, LM Studio, Jan, KoboldCpp, and most other desktop runtimes. Almost everything in this article assumes you are picking a GGUF file.
The quantization names, decoded
GGUF quantizations have intimidating names, but they follow a pattern: `Q<bits>_<variant>`.

- The number is the average bits per weight. `Q4` means roughly 4 bits per weight; `Q8` means roughly 8.
- The suffix tells you which weights get more bits. `_K_M` ("K-quants, medium") gives more bits to the most sensitive weights (attention layers, output projection) and fewer to the rest. `_0` is the older, uniform style. `_S` is small, `_L` is large.
- The `IQ` prefix (e.g. `IQ4_XS`, `IQ3_XXS`) marks the newer "importance-aware" quantizations that use a calibration dataset to decide what to round. They are smaller than the equivalent K-quants at similar quality, but slower on CPU.
You will also occasionally see F16 (the original, unquantized 16-bit weights) and the brand-new MXFP4 format used by some 2026-era frontier-class open models. Treat F16 as "the reference" and MXFP4 as a 4-bit format optimized for newer GPUs.
How much quality do you actually lose?
The honest answer is: a lot less than you would expect, until you go below 4 bits.
On standard perplexity benchmarks against the FP16 reference, the typical penalty for the most popular quantizations on a well-trained 7B-or-larger model looks roughly like this, based on our reading of public llama.cpp evaluations and our own spot checks:
| Quant | Bits / weight | File size vs FP16 | Quality loss | What it feels like |
|---|---|---|---|---|
| Q8_0 | ~8.5 | ~53% | Negligible | Indistinguishable from FP16 in casual use. |
| Q6_K | ~6.6 | ~41% | Very small | Indistinguishable from FP16 in casual use. |
| Q5_K_M | ~5.7 | ~36% | Small | Slightly more "off" answers on hard reasoning, basically fine for chat. |
| Q4_K_M | ~4.8 | ~30% | Noticeable but acceptable | The community default. Occasional small errors, still feels "smart". |
| IQ4_XS | ~4.3 | ~27% | Similar to Q4_K_M, sometimes better | Great for "I am 500 MB short of fitting it". |
| IQ3_XXS | ~3.1 | ~19% | Visible degradation | Noticeably "dumber". Use only if Q4 will not fit at all. |
| Q2_K | ~2.6 | ~16% | Severe | Often produces broken or repetitive output. Last resort. |
Two things to keep in mind. First, quality loss shrinks as models grow: a 70B at Q4 feels almost identical to FP16, while a 3B at Q4 starts to hallucinate noticeably. Second, perplexity is a blunt instrument; it does not capture how a model behaves on coding, agentic tool use, or long-context tasks, all of which we have observed to degrade faster than perplexity alone suggests.
Llama 3.1 70B at Q4_K_M beats Llama 3.1 8B at Q8_0 on essentially every benchmark we have looked at. If you can afford the memory (roughly 5× more, about 42 GB vs 8.5 GB), the bigger model at lower precision wins.
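The bits-per-weight column in the table translates directly into file sizes. A quick sketch, using the approximate effective bits from the table (effective bits include the per-block scale metadata, which is why Q4_K_M is ~4.8 bits rather than exactly 4):

```python
# Approximate effective bits per weight, taken from the table above.
BITS_PER_WEIGHT = {
    "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "IQ4_XS": 4.3, "IQ3_XXS": 3.1, "Q2_K": 2.6,
}

def file_size_gb(n_params: float, quant: str) -> float:
    """Rough GGUF weight-file size in GB for a model with n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q8_0", "Q4_K_M", "IQ3_XXS"):
    print(f"8B at {quant}: ~{file_size_gb(8e9, quant):.1f} GB")
```

This is the weight file only; the KV cache and runtime overhead come on top, which is why the hardware recommendations later in this article leave headroom.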
How quantization affects speed
A common misconception: "smaller quant = faster." That is mostly true on GPU, and only sometimes true on CPU.
Inference speed for local LLMs is dominated by memory bandwidth, not raw compute. Every generated token requires reading the entire model weights from memory at least once. So if your 8B model is 8 GB at Q8 and 4.5 GB at Q4, the Q4 version will run roughly 1.7× faster on the same GPU - simply because there is less data to move.
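That bandwidth-bound ceiling is easy to estimate. A back-of-envelope sketch, using a hypothetical GPU with 450 GB/s of memory bandwidth (real throughput is lower because of KV cache reads, attention compute, and kernel overhead):

```python
def decode_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: each generated token streams the full weights once."""
    return bandwidth_gb_s / weight_gb

# Hypothetical GPU with ~450 GB/s of memory bandwidth, 8B model:
q8 = decode_tokens_per_sec(8.0, 450)  # ~8 GB file at Q8
q4 = decode_tokens_per_sec(4.5, 450)  # ~4.5 GB file at Q4
print(f"Q8 ceiling: ~{q8:.0f} tok/s, Q4 ceiling: ~{q4:.0f} tok/s")
print(f"speedup: {q4 / q8:.2f}x")  # the 'roughly 1.7x' quoted above
```

The speedup ratio is just the ratio of file sizes, which is why halving the bits roughly doubles decode speed on a bandwidth-bound GPU.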
Two important caveats:
- The `IQ` family is slower on CPU. The same calibration that makes `IQ4_XS` smaller than `Q4_K_M` at similar quality also makes it more expensive to dequantize on the fly. On a laptop without a GPU, expect `IQ4_XS` to be 10-25% slower than `Q4_K_M`. On a recent NVIDIA GPU the gap is usually under 5%.
- Going below Q4 rarely helps speed. Once you are reading less than the GPU's bandwidth budget per token, the bottleneck shifts elsewhere (the KV cache, the attention computation), and `Q3` vs `Q4` on the same model can be within margin of error on tokens per second while costing you noticeable quality.
What about the KV cache?
People obsess over weight quantization and forget that the KV cache can be quantized too. The KV cache stores the attention state for every token in your context window. At long context (32k+ tokens), it can rival the size of the model weights themselves.
llama.cpp lets you quantize the KV cache to `q8_0` or even `q4_0` via the `--cache-type-k` and `--cache-type-v` flags (Ollama exposes this through its `OLLAMA_KV_CACHE_TYPE` environment variable). In our testing:
- A `q8_0` KV cache is essentially free quality-wise and halves your KV memory. Use it.
- A `q4_0` KV cache is risky. It works, but you will see degraded long-context behavior - particularly for code and structured output. Only do this if you are desperate to fit a longer context.
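To see why this matters at long context, here is a rough KV cache size estimate. The layer count, KV head count, and head dimension below are the public Llama-3-8B shape and are meant as an illustration; the per-element byte costs assume GGUF-style blocks (a `q8_0` block stores 32 one-byte values plus a 2-byte scale, hence 1.0625 bytes per element):

```python
# Approximate bytes per stored KV element, including per-block scale overhead.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, cache_type: str = "f16") -> float:
    """Rough KV cache size for a GQA transformer at a given context length."""
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 1e9

# Llama-3-8B-like shape: 32 layers, 8 KV heads, head_dim 128, 32k context.
for ct in ("f16", "q8_0", "q4_0"):
    print(f"32k ctx, {ct}: ~{kv_cache_gb(32, 8, 128, 32768, ct):.2f} GB")
```

At 32k context the `f16` cache lands around 4.3 GB, roughly the size of a Q4 8B weight file, which is exactly the "rivals the model weights" situation described above.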
Our home-page compatibility checker already accounts for KV cache quantization in its VRAM estimate. The methodology is documented on the how-it-works page.
Concrete recommendations by hardware
8 GB VRAM (RTX 3050, 3060 Ti, 4060, M-series base)
Stick to 7B-8B models at Q4_K_M. If a model is 1 GB too big, try IQ4_XS or quantize the KV cache to q8_0. Avoid going above 8B - Q3 on a 13B will almost always feel worse than Q4_K_M on a 7B.
12 GB VRAM (RTX 3060 12GB, 4070)
The sweet spot. 7B-8B at Q5_K_M or Q6_K. Or 12B-14B at Q4_K_M with a 4k-8k context. This tier is where local LLMs start to feel genuinely useful for daily work.
16 GB VRAM (RTX 4060 Ti 16GB, 4070 Ti Super, M-series 16GB unified)
12B-14B at Q5_K_M, or 30B-32B at Q4_K_M with a 4k context. On Apple Silicon, watch out for the unified memory pressure - leave at least 4 GB headroom for the OS.
24 GB VRAM (RTX 3090, 4090, 7900 XTX)
30B-34B at Q5_K_M with a comfortable 8k context, or 70B at IQ3_XXS if you really want to feel the difference (and accept the speed hit). This is where Q6_K and Q8_0 on 8B-14B models become viable for users who care about marginal quality gains.
48 GB+ (workstation, dual-GPU, or 64GB+ Apple Silicon)
70B at Q4_K_M or Q5_K_M with a 16k+ context window. At this tier, the quantization choice matters less than the engine and OS configuration.
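All of the tiers above boil down to the same budget check: weights plus KV cache plus runtime overhead must stay under usable VRAM. A minimal sketch of that check (`fits` is a hypothetical helper, and the 1.5 GB overhead is a hand-wavy allowance for the CUDA/Metal context and activations, not a measured figure):

```python
def fits(vram_gb: float, weight_gb: float, kv_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights + KV cache + fixed overhead stay under available VRAM."""
    return weight_gb + kv_gb + overhead_gb <= vram_gb

# 8B at Q4_K_M (~4.8 GB weights) with a modest ~1 GB KV cache on an 8 GB card:
print(fits(8, 4.8, 1.0))  # True: the 8 GB recommendation above
# The same model at Q8_0 (~8.5 GB weights) on the same card:
print(fits(8, 8.5, 1.0))  # False: does not fit, no matter the KV settings
```

The real accounting (done by our home-page checker) is more detailed, but this inequality is the shape of every recommendation in this section.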
The five-minute decision rule
When in doubt, this is the algorithm we use ourselves:
1. Pick the largest model that fits at `Q4_K_M` with your target context length.
2. If that model leaves more than 25% of your VRAM unused, step up to `Q5_K_M`.
3. If it still leaves more than 25% unused, step up to `Q6_K`.
4. If you cannot fit your target model at `Q4_K_M`, try `IQ4_XS` or quantize the KV cache to `q8_0` before dropping to a smaller model.
5. Only consider `Q8_0` when you are explicitly evaluating quality and have memory to burn.
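The rule above can be sketched as a function. Everything here is illustrative: `pick_quant` is a hypothetical helper, the weight sizes are the approximate 8B figures from the table earlier, and the flat 1 GB KV allowance stands in for a real per-context estimate.

```python
def pick_quant(vram_gb: float, sizes: dict, kv_gb: float = 1.0):
    """Apply the five-minute rule: sizes maps quant name -> weight GB."""
    budget = vram_gb - kv_gb
    if sizes["Q4_K_M"] > budget:
        # Step 4: try the smaller IQ quant before giving up on the model.
        return "IQ4_XS" if sizes.get("IQ4_XS", float("inf")) <= budget else None
    choice = "Q4_K_M"
    for step in ("Q5_K_M", "Q6_K"):
        # Steps 2-3: step up only while >25% of VRAM would sit unused.
        if budget - sizes[choice] > 0.25 * vram_gb and sizes.get(step, float("inf")) <= budget:
            choice = step
    return choice

# Approximate 8B weight sizes (GB) from the quality table earlier.
SIZES_8B = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "IQ4_XS": 4.3}
print(pick_quant(12, SIZES_8B))   # a 12 GB card has headroom to step up
print(pick_quant(6, SIZES_8B))    # a tight fit stays at the default
print(pick_quant(5.5, SIZES_8B))  # Q4_K_M no longer fits, so try the IQ quant
```

A `None` return means even `IQ4_XS` does not fit, which is the cue to drop to a smaller model rather than reach for Q3 and below.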
What is changing in 2026
Two trends are worth watching. First, several major open releases now ship in MXFP4 as a first-class format rather than as a quantized derivative. When a model is trained with quantization in mind, the quality loss is dramatically smaller, and you can expect more 4-bit-native releases throughout the year. Second, the gap between IQ and K quants on consumer GPUs continues to shrink as runtimes improve their dequantization kernels - IQ4_XS may eventually replace Q4_K_M as the community default.
For now, though, our recommendation has not changed in 18 months: Q4_K_M is the right default, and you should only deviate when you have a concrete reason to.
Related guides on this site
- How we estimate VRAM and tokens/second - the math behind every number on the home page.
- Best Local AI Models by Use Case (2026) - which models we actually recommend for chat, coding, reasoning, and vision.
- How to run local AI - end-to-end setup walkthrough.
- Ollama vs LM Studio - which runtime to install first.
- FAQ - quick answers to the questions we get most often.