Best Local AI Models by Use Case (2026 Guide)
This guide is a hand-picked reference to the most practical open-weight AI models for running locally in 2026. Instead of listing every model on Hugging Face, we focus on well-known, widely-used models that work well with Ollama, LM Studio, and llama.cpp. Each entry includes a short original description, hardware hints, and a link to the model page.
Q4_K_M.
New to running AI locally? Start with our step-by-step local AI setup guide for Windows, macOS, and Linux, then come back here to choose your first model.
General Chat Assistants
These models are good first choices for general questions, writing help, and everyday conversation. They are well supported by Ollama and LM Studio, and there are many community fine-tunes available.
Meta's Llama 3 8B is a strong general-purpose assistant that runs comfortably on modern laptops and desktops. It is a good default choice when you are not sure which model to pick first.
chat, general
Smaller and much faster than an 8B model. A good fit for older laptops, low-memory machines, and first-time testing before trying larger models.
chat, lightweight
Mistral 7B has been a community favorite for years because of its clean instruction-following and low memory use. Good for summaries, short answers, and quick prompts.
chat, instruct
Google's Gemma 2 9B is a solid open-weight assistant that often feels more conversational than similarly sized alternatives. A good option if Llama 3 is not a fit for your workflow.
chat, general
Qwen 2.5 7B handles multilingual conversation and general Q&A well. It is a great alternative if you need strong support for Chinese or other Asian languages.
chat, multilingual
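Once one of the chat models above is installed, you can talk to it from a script as well as from the app. The sketch below assumes Ollama is running on its default local port and uses its `/api/generate` endpoint with streaming turned off. The model tag `llama3` is an assumption for illustration; use whatever name `ollama list` shows on your machine. The helper names `build_generate_request` and `ask` are our own, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    # With "stream": False the server returns one JSON object
    # whose "response" field holds the full answer.
    return body["response"]
```

Calling `ask("llama3", "Summarize this paragraph: ...")` only works while the Ollama server is running and the model has already been downloaded. The generous timeout is deliberate, since the first request also has to load the model into memory.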
Coding and Developer Assistants
Coding models are trained or fine-tuned on source code. They are good for autocomplete-style suggestions, bug explanations, and short code rewrites. Pairing a coding model with a code editor integration (for example in VS Code) is a common workflow.
A compact coding model that performs well for small refactors, snippet generation, and code explanations.
code
A strong general coding model. Works well for Python, JavaScript, and many other mainstream languages. A good default coding assistant on mid-range hardware.
code, multilingual
Meta's Code Llama family is the best fit if you want a larger coding model and have around 10 GB of VRAM or Unified Memory available.
code
Reasoning and Math
These models are tuned or designed for multi-step problem solving, math, and structured answers. They often trade some conversational style for better logical accuracy.
Llama 3.1 improved on Llama 3 in long-context reasoning and tool use. It is still very practical to run on consumer hardware.
reasoning, general
A step up in reasoning quality compared to 7B-class models, while still feasible on a single modern GPU or a well-specced Apple Silicon Mac.
reasoning, multilingual
Microsoft's Phi family is known for math and structured-reasoning results that are strong for its small size.
reasoning, lightweight
Vision and Multimodal
Multimodal models accept images along with text prompts. They are useful for describing screenshots, understanding diagrams, and answering visual questions.
A practical multimodal model that can describe images and reason about visual content. Works well through LM Studio and several Ollama-compatible runtimes.
vision, multimodal
LLaVA is one of the most established open-source multimodal families. It is a good fit if you want a well-tested image-to-text model.
vision, multimodal
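If you run a vision model through Ollama, the image travels in the same JSON request as the text prompt. A minimal sketch, assuming Ollama's `/api/generate` endpoint, which accepts an `images` list of base64-encoded files for multimodal models. The function name `build_vision_payload` and the model tag `llava` are assumptions for illustration.

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Return the JSON body for an Ollama /api/generate call with one image."""
    # Images are sent as base64 text, not as raw bytes or file paths.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
    }
    return json.dumps(payload)
```

In practice you would read the bytes from a file, for example `open("screenshot.png", "rb").read()`, and POST the resulting string to your local server. Large screenshots make for large requests, so resizing the image first can help on slower machines.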
Lightweight Models for Older or Low-Memory Machines
If your computer has 8 GB of RAM or less, or if you only have integrated graphics, start with a small model. These are also great for quick tests before downloading heavier models.
The smallest of the Llama 3.2 family. Surprisingly usable for short prompts, summaries, and testing your local AI setup.
lightweight, chat
Gemma 2 2B is a good low-memory alternative to Llama for general chat on CPU-only or low-VRAM setups.
lightweight, chat
A solid 3B Qwen model with strong multilingual behavior, which is helpful for users outside English-only workflows.
lightweight, multilingual
Embeddings and Retrieval
Embedding models don't chat. Instead, they turn text into vectors that let you build local search, RAG (retrieval-augmented generation), and semantic similarity tools. They are small, fast, and run well on CPU.
A classic small embedding model. Fast on CPU, good general baseline for search and similarity tasks.
embeddings
A well-benchmarked English embedding model. Good default for document search and simple RAG pipelines.
embeddings
A multilingual embedding model that works in many languages. Useful if you want to build search or clustering over non-English text.
embeddings, multilingual
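To make the "text into vectors" idea concrete, here is a small sketch of how search over embeddings usually works: embed the query, embed each document, and rank documents by cosine similarity. The three-dimensional vectors below are made-up toy values standing in for real model output, which typically has hundreds of dimensions. The function names are our own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: list[float], doc_vecs: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (document name, score) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors (made up for illustration, not produced by any real model).
docs = {
    "gpu-setup": [0.9, 0.1, 0.0],
    "pasta-recipe": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]

ranking = rank_documents(query, docs)
print(ranking[0][0])  # prints "gpu-setup"
```

A real pipeline would replace the toy vectors with output from one of the embedding models above and would usually store the document vectors once instead of recomputing them for every query.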
How to Choose
Our general suggestion is to start with a 3B model, then move up to 7B or 8B, and only try 13B+ or specialized models once you understand how your machine behaves under load. For background on our estimates, see how RunLocalModel estimates VRAM and speed.
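For a rough sense of why model size matters so much, a back-of-the-envelope calculation helps: weight memory is approximately the parameter count times the bits stored per weight. The sketch below is our own simplification, not the method used by any particular tool. Both defaults are assumptions: 4.5 bits per weight is a round figure for 4-bit quantized files, and the 1.5 GB overhead is a placeholder for context cache and runtime buffers, which in reality grow with context length.

```python
def estimate_weight_memory_gb(
    params_billions: float, bits_per_weight: float = 4.5
) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def estimate_total_gb(
    params_billions: float,
    bits_per_weight: float = 4.5,  # Assumed average for 4-bit quantization.
    overhead_gb: float = 1.5,      # Assumed placeholder, not a measured value.
) -> float:
    """Weights plus a flat allowance for cache and runtime buffers."""
    return estimate_weight_memory_gb(params_billions, bits_per_weight) + overhead_gb


for size in (3, 8, 13, 70):
    print(f"{size}B model: roughly {estimate_total_gb(size):.1f} GB")
```

Treat the numbers as a starting point only. Actual usage depends on the exact quantization, the context window you configure, and how much memory your operating system and other apps already use.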
Quick reference
- Under 8 GB RAM: start with Llama 3.2 1B or 3B, Gemma 2 2B, or Phi-3.5 mini.
- 8-16 GB: Llama 3 8B, Mistral 7B, Gemma 2 9B, Qwen 2.5 7B.
- 16-32 GB: Qwen 2.5 14B, CodeLlama 13B, Qwen 2.5 Coder 7B with larger context.
- Apple Silicon 32 GB+ Unified Memory: larger 13B-30B models usually run comfortably.
- 48 GB+ VRAM or Unified Memory: 70B-class models become possible, but they remain demanding to run.
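If you prefer the quick reference as something you can script, it maps naturally onto a small lookup table. The sketch below copies the tiers from the list above. Two things are our own simplifications: boundaries are treated as half-open ranges (exactly 8 GB falls in the 8-16 tier, exactly 16 GB in the 16-32 tier), and the 32 GB+ tier is applied to any machine even though the list names Apple Silicon specifically. The function name `suggest_models` is hypothetical.

```python
# Tiers copied from the quick reference: (upper bound in GB, suggestions).
# Assumption: ranges are half-open, so 8 GB belongs to the 8-16 GB tier.
TIERS = [
    (8, ["Llama 3.2 1B", "Llama 3.2 3B", "Gemma 2 2B", "Phi-3.5 mini"]),
    (16, ["Llama 3 8B", "Mistral 7B", "Gemma 2 9B", "Qwen 2.5 7B"]),
    (32, ["Qwen 2.5 14B", "CodeLlama 13B", "Qwen 2.5 Coder 7B"]),
    (48, ["13B-30B models"]),
]


def suggest_models(memory_gb: float) -> list[str]:
    """Return the quick-reference suggestions for a given amount of memory."""
    for upper_bound, suggestions in TIERS:
        if memory_gb < upper_bound:
            return suggestions
    # 48 GB and above: the largest tier in the quick reference.
    return ["70B-class models"]


print(suggest_models(12))
```

The table only encodes the starting points listed above. It does not account for context length, other running apps, or the difference between system RAM and dedicated VRAM.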
Want to double-check whether a specific model fits your hardware?
Run the compatibility checker