About RunLocalModel
Why we built this
Running large language models on your own machine has gone from "almost impossible" to "surprisingly practical" in just a few years. But the experience of getting started is still painful: you find a model on Hugging Face, see four different quantization options (Q4_K_M, Q5_K_M, Q6_K, Q8_0), six file sizes, and no clear answer to the only question that actually matters - "will this run on my computer, and how fast?"
Most people end up downloading an 8 GB file, watching their laptop crawl, and giving up. We wanted to fix the 30 seconds of decision-making that happens before the download starts. That is it. No newsletter, no paid tier, no account.
Who runs the site
RunLocalModel is maintained by a small, independent team of software engineers who have been running local LLMs since the early llama.cpp days. We work on the site in our spare time. We are not affiliated with NVIDIA, AMD, Apple, Hugging Face, Ollama, LM Studio, or any model vendor. When we recommend a model or a tool, it is because we have personally used it on the hardware we are talking about.
If you want to reach us - bug reports, hardware data corrections, partnership questions, anything - please use our contact page.
What the site actually does
- Compatibility checker on the home page: pick your GPU (or enter your VRAM and memory bandwidth manually), pick a model and quantization, and we tell you whether it fits, what context length you can afford, and a rough tokens-per-second estimate (a sketch of how that estimate works follows this list).
- Editorial guides we have written based on our own testing - Best Local AI Models by Use Case (2026), How to Run Local AI, Choosing the Right Quantization in 2026, and Ollama vs LM Studio.
- A clear methodology page that documents exactly how every number on the site is calculated, with formulas.
- A FAQ for the questions we keep getting in email.
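To give a sense of where the tokens-per-second figure comes from, here is a minimal sketch in TypeScript. It is not the site's actual code; it just illustrates the standard first-order approximation that token generation is memory-bound, so throughput is roughly memory bandwidth divided by the quantized model size, scaled by an assumed efficiency factor. All numbers and names below are illustrative - the exact formulas are documented on the methodology page.

```typescript
// Illustrative sketch only - not the site's implementation.
// Assumption: token generation is memory-bound, so each new token
// requires reading every weight once from memory.
function roughTokensPerSecond(
  memoryBandwidthGBps: number, // e.g. ~1008 GB/s for an RTX 4090
  modelFileSizeGB: number,     // e.g. ~4.9 GB for an 8B model at Q4_K_M
  efficiency: number = 0.6     // assumed fudge factor for real-world overhead
): number {
  return (memoryBandwidthGBps / modelFileSizeGB) * efficiency;
}

// Example: an 8B-class Q4_K_M model on a ~1 TB/s GPU lands around 120 tok/s.
console.log(roughTokensPerSecond(1008, 4.9).toFixed(0));
```

Real-world throughput varies with the runtime build, drivers, and thermals, which is why the site presents this as an estimate rather than a benchmark.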
Our editorial principles
- Show the math. Every estimate on the site is reproducible. The formulas are documented on the methodology page, not buried in our codebase.
- Be honest about uncertainty. VRAM estimation is not exact. Real performance depends on your OS, background apps, thermal throttling, the exact build of llama.cpp you are using, and even your GPU driver version. We try to flag this everywhere it matters.
- No invented benchmarks. When we cite a tokens-per-second number, it either comes from our own measurement on real hardware or from a clearly linked third-party source.
- No affiliate-driven recommendations. Our model and tool recommendations are based on what we actually use day to day. We do not change a recommendation because someone paid us to.
- Update or retract. If we get something wrong, we fix it and add a note to the changelog below.
Where our data comes from
- Model metadata - pulled from the public Hugging Face API and refreshed at build time.
- GPU specifications - manually compiled from official datasheets published by NVIDIA, AMD, Apple, and Intel.
- VRAM and KV-cache estimation logic - based on the actual memory allocation behavior of llama.cpp, which is the engine that powers Ollama, LM Studio, and most desktop runtimes. A rough sketch of the calculation follows below.
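As a rough illustration of the weights + KV cache + overhead approach, here is a minimal sketch. Again, this is not the site's code, and every parameter value is an illustrative assumption - llama.cpp's real allocation behavior also includes per-backend buffers and compute-graph memory.

```typescript
// Illustrative sketch only - not the site's implementation.
interface ModelSpec {
  fileSizeGB: number; // quantized GGUF size, e.g. ~4.9 GB for an 8B Q4_K_M
  layers: number;     // e.g. 32
  kvHeads: number;    // KV heads (grouped-query attention), e.g. 8
  headDim: number;    // e.g. 128
}

function estimateVramGB(model: ModelSpec, contextTokens: number): number {
  // KV cache: 2 tensors (K and V) x layers x tokens x kvHeads x headDim,
  // assuming fp16 (2 bytes per element).
  const kvCacheGB =
    (2 * model.layers * contextTokens * model.kvHeads * model.headDim * 2) / 1e9;
  const overheadGB = 0.8; // assumed allowance for compute buffers and runtime
  return model.fileSizeGB + kvCacheGB + overheadGB;
}

// Example: an 8B-class model at 8k context comes out to roughly 6.8 GB.
console.log(estimateVramGB(
  { fileSizeGB: 4.9, layers: 32, kvHeads: 8, headDim: 128 },
  8192
).toFixed(1));
```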
What this site is not
- It is not a benchmark database. We provide estimates, not measurements. If you need exact numbers for purchasing decisions, please run the model yourself.
- It is not a model download host. We link to the original Hugging Face pages and never re-host weights.
- It is not a chat product. We do not run any model on your behalf - everything you do here happens client-side in your browser.
Privacy
We do not require an account, do not collect personal information, and do not have a database of users. The site uses Google Analytics (anonymous, aggregated traffic only) and may serve ads via Google AdSense in the future. Full details are on our privacy page.
Changelog
- May 6, 2026 - Published the Choosing the Right Quantization long-form guide. Added About and Contact pages. Redesigned the site header and footer. Expanded the home page introduction.
- May 5, 2026 - Published llama.cpp vs Ollama vs LM Studio vs Hugging Face vs MLX - a plain-English explainer of the local AI ecosystem.
- May 4, 2026 - Published How to Run Gemma on Your Phone - a practical Android and iOS guide with model recommendations, real tokens-per-second numbers, and step-by-step setup.
- April 29, 2026 - Published Apple Silicon vs RTX 4090 for Local LLMs - a head-to-head with concrete throughput, capacity, power, and cost comparisons.
- April 22, 2026 - Published VRAM vs Unified Memory - when each architecture wins for local AI workloads.
- April 2026 - Added the How to Run Local AI walkthrough and the 2026 Model Guide. Refreshed the Hugging Face model database.
- March 2026 - Added the llama.cpp-style VRAM estimator (weights + KV cache + overhead) alongside the original simple estimator.
- February 2026 - First public release with the GPU database, model database, and a single-method VRAM estimator.