
How to Run Gemma on Your Phone in 2026: Android & iOS Guide

By the RunLocalModel editorial team · Published May 4, 2026 · ~11 minute read

If you only read one paragraph: on a phone made in the last two years, you can absolutely run Gemma locally - and in 2026 the experience is good, not just a demo. For most people: install PocketPal AI from the App Store or Play Store, download Gemma 3 1B (any phone) or Gemma 3 4B (8 GB+ RAM), and you are chatting in five minutes. Anything bigger than 4B is technically possible on a top-end phone, but the user experience falls off a cliff. The right reason to do this on a phone is privacy and offline use, not raw speed.

What is Gemma, in one paragraph

Gemma is Google's family of open-weight models. Gemma 1 launched in February 2024, Gemma 2 in mid-2024, and Gemma 3 in early 2025 with first-class multimodal (vision) support. The headline reason Gemma matters for phones is the small sizes: Gemma 3 ships in a 1B and a 4B variant alongside the larger 12B and 27B models. The 1B and 4B were explicitly trained with mobile and edge deployment in mind - Google publishes them in formats that map cleanly to the inference engines built into Android and iOS phones.

The license is friendly enough for personal and commercial use (with a few use-case restrictions). Weights are hosted on Hugging Face and on Google's own Kaggle and AI Studio properties. If you are new to where models live, see our local AI tools explainer.

Can your phone actually run it?

The answer almost always boils down to RAM. Gemma at Q4_K_M quantization (the standard 4-bit format used by the llama.cpp family of mobile runtimes) takes roughly:

| Model | File size | Working RAM needed | Realistic phone fit |
|---|---|---|---|
| Gemma 3 1B | ~700 MB | ~1.2-1.5 GB | Any phone with 4 GB+ total RAM |
| Gemma 2 2B | ~1.4 GB | ~2.0-2.4 GB | Any phone with 6 GB+ total RAM |
| Gemma 3 4B | ~2.5 GB | ~3.5-4.2 GB | Phones with 8 GB+ RAM (most flagships from 2023 onward) |
| Gemma 2 9B | ~5.5 GB | ~6.5-7.5 GB | Only on 12-16 GB Android flagships, very tight |
| Gemma 3 12B | ~7.0 GB | ~8.5-9.5 GB | 16 GB Android flagships only; will often crash other apps |

The "working RAM" column matters more than the file size - you need room for the weights, the KV cache for the conversation, the chat app itself, and the OS. iOS in particular caps how much RAM a single app can consume (typically 50-60% of total system RAM), which is why an 8 GB iPhone behaves like a "4-5 GB usable" device for LLM apps. Android is more permissive but the OS will still kill backgrounded apps to free RAM for the model.

Our recommendation: for almost everyone, the right Gemma to run on a phone in 2026 is either Gemma 3 1B (universal compatibility, instant responses) or Gemma 3 4B (much smarter, requires a flagship phone from the last 2-3 years). Skip 9B and 12B unless you are explicitly testing the limit.

Realistic speed expectations

Tokens-per-second varies enormously by chip generation, runtime, and whether the inference is running on CPU, GPU, or NPU. Approximate sustained throughput from public benchmarks and our own measurements on Gemma at Q4_K_M:

| Phone | Chip | Gemma 1B | Gemma 2-4B |
|---|---|---|---|
| iPhone 15 Pro / 16 | A17 Pro / A18 | 50-80 tok/s | 20-30 tok/s |
| iPhone 17 Pro | A19 Pro | 70-110 tok/s | 30-45 tok/s |
| Pixel 9 Pro | Tensor G4 | 40-60 tok/s | 15-22 tok/s |
| Pixel 10 Pro | Tensor G5 | 55-80 tok/s | 22-32 tok/s |
| Galaxy S24 Ultra | Snapdragon 8 Gen 3 | 55-80 tok/s | 25-35 tok/s |
| Galaxy S25 Ultra | Snapdragon 8 Elite | 80-120 tok/s | 35-55 tok/s |
| Galaxy S26 Ultra | Snapdragon 8 Elite Gen 2 | 100-140 tok/s | 45-70 tok/s |
| Mid-range 2024 phone | Dimensity 7000-class | 15-25 tok/s | too slow to be pleasant |

Two patterns to notice. First, 1B is fast everywhere - even mid-range phones produce tokens faster than you can read. Second, on 4B the chip generation matters a lot: a 2024 flagship is roughly twice as fast as a 2022 one, because mobile NPUs and the inference engines that target them have improved meaningfully each year.
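
For a sense of scale: brisk reading is roughly 250-300 words per minute, which works out to something like 5-7 tokens per second. Even a mid-range phone's 15-25 tok/s on the 1B model comfortably outruns that, and at a flagship's 20-45 tok/s on 4B, a 300-token answer arrives in roughly 7-15 seconds.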

Sustained workloads will throttle. Phones, especially fanless ones, lower clocks aggressively when they get warm. Throughput numbers above are for the first ~2 minutes of inference. After 5-10 minutes of continuous generation, expect 30-50% lower tokens-per-second and a noticeably hot device. For long chats, take breaks or hold the phone in a way that lets the back radiate heat.

The recommended path on iOS

Option A (easiest): PocketPal AI

PocketPal AI is an open-source iOS (and Android) app that wraps llama.cpp with a clean chat UI and a built-in model browser that pulls directly from Hugging Face.

1. Install PocketPal AI from the App Store. Free, no account required.
2. Open it, tap "Models," then "Add Model." Search "Gemma" - the in-app browser shows community-quantized GGUF files from Hugging Face.
3. For an 8 GB iPhone (15 Pro, 16), download gemma-3-4b-it-Q4_K_M.gguf (~2.5 GB). For older or non-Pro iPhones, download gemma-3-1b-it-Q4_K_M.gguf (~700 MB).
4. Tap "Load" on the downloaded model, then "Chat." First response is a few seconds; subsequent responses are faster.

Option B (more polished): Private LLM or LLM Farm

Private LLM is a paid app (one-time purchase, no subscription) with a more polished interface and a curated model library that includes pre-tested Gemma builds. LLM Farm is open-source and very flexible but rougher around the edges. Both use llama.cpp under the hood and produce essentially the same throughput as PocketPal.

About Apple Intelligence: the on-device model that powers Apple's writing tools and Siri rewrite features is Apple's own foundation model, not Gemma. Apple Intelligence cannot run Gemma. If you want Gemma specifically, you need a third-party app from the list above.

The recommended path on Android

Option A (easiest): PocketPal AI

The Android version of PocketPal AI is essentially identical to the iOS version - same UI, same Hugging Face integration. The setup steps are the same as iOS above.

Option B (best raw speed on Pixel and Snapdragon): Google AI Edge / MediaPipe LLM Inference

Google publishes a sample app built on its MediaPipe LLM Inference API that runs Gemma natively through Android's GPU and NPU delegate paths. This is the same inference stack used internally by Google's own apps. On a Pixel, this is typically 30-60% faster than llama.cpp through PocketPal because it can target the Tensor chip's accelerators directly.

The trade-off is setup: there is no Play Store app, you build it yourself from the GitHub repo with Android Studio. Worth it if you have a Pixel 9 / 10 Pro and want maximum speed. Not worth it if you just want to chat.
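
There is no one-command install, but the build is a standard Android Studio / Gradle workflow. A minimal sketch, assuming the sample still lives in Google's mediapipe-samples repository and that you have adb and the Android SDK set up - the exact model filename and on-device path come from the sample's README, so treat the ones below as illustrative:

```
# Clone Google's AI Edge samples (repo path is an assumption - check the README)
git clone https://github.com/google-ai-edge/mediapipe-samples
cd mediapipe-samples/examples/llm_inference/android

# Push a Gemma model converted for MediaPipe to the device
# (filename and on-device path are whatever the sample's README specifies)
adb push gemma-3-1b-it-int4.task /data/local/tmp/llm/

# Build and install the debug APK on a connected phone
./gradlew installDebug
```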

Option C (paid, polished GUI): Layla

Layla is a paid Android app aimed at offline AI chat, with character-AI style features. It supports Gemma and many other GGUF models. Good if you want a polished experience and do not mind a one-time purchase.

Option D (for power users): Termux + llama.cpp

On Android you can install Termux (terminal emulator), then build llama.cpp from source and run llama-cli against a Gemma GGUF you downloaded yourself. This gives you full control over flags - context size, sampling, KV cache quantization (see our quantization guide) - and lets you script Gemma into shell pipelines on the phone. Slower than NPU paths but maximum flexibility.
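
A minimal sketch of that workflow (binary names and flags occasionally move between llama.cpp releases, so check its README if a step fails):

```
# Inside Termux: install a build toolchain
pkg update && pkg install git cmake clang

# Fetch and build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Chat against a Gemma GGUF you downloaded yourself (model path is illustrative).
# Recent llama-cli builds drop into interactive chat automatically when the
# model ships a chat template; -c sets the context window in tokens.
./build/bin/llama-cli -m ~/gemma-3-1b-it-Q4_K_M.gguf -c 4096
```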

When does running Gemma on a phone actually make sense?

Honest framing: on-phone inference is meaningfully slower and meaningfully harder on the battery than just using ChatGPT or Gemini in the cloud. The reasons it can still be the right answer:

- Privacy: prompts and responses never leave the device, which matters for health questions, private notes, and work material.
- Offline use: it works on a plane, on the subway, or anywhere without signal.
- No recurring cost: once the model is downloaded, there is no subscription and no per-token fee.

And the reasons it might not be the right answer:

- Capability: a 1B-4B model is nowhere near a frontier cloud model on hard reasoning, coding, or long documents.
- Battery and heat: sustained generation warms the phone and throttles within minutes (see the next section).
- Storage: each model is 0.7-2.5 GB that lives permanently on your phone.

Battery, heat, and storage - the practical reality

Continuous generation is one of the most demanding workloads a phone runs: expect a warm device within a few minutes and, as noted above, a 30-50% throughput drop once thermal throttling kicks in. Battery drain during active generation is substantial, so plug in for long sessions. On storage, budget ~700 MB for Gemma 3 1B and ~2.5 GB for 4B; you can delete a model from each app's model list whenever you want the space back.

Gemma 3 1B vs Gemma 3 4B: which to actually use

A quick guide:

- Pick Gemma 3 1B if your phone has less than 8 GB of RAM, or if you mainly want quick answers, rewording, and summaries - it runs on virtually anything and responds faster than you can read.
- Pick Gemma 3 4B if you have a flagship with 8 GB+ RAM from the last 2-3 years and want noticeably smarter, more coherent answers - it is the best quality you can comfortably run on a phone in 2026.
- Skip 9B and 12B unless you are deliberately testing your phone's limits; the RAM pressure and heat are not worth it for daily use.

What is changing in 2026

Three trends worth watching. First, Google's AICore APIs on Android (and Apple's equivalent Foundation Models framework on iOS) are increasingly exposing on-device models to third-party apps via system-level APIs - meaning many apps you already use will quietly start using on-device LLMs without you installing anything. Second, the smaller Gemma models keep getting smarter at the same parameter count, thanks to better training data and distillation from larger models. Third, mobile NPUs in 2026 chips (A19 Pro, Snapdragon 8 Elite Gen 2, Tensor G5) are roughly 2× faster at LLM inference than 2024-era chips - the speed numbers in this article will look conservative within a year.

For now, in May 2026, the right move is exactly what it was last year: install PocketPal AI, download Gemma 3 1B or 4B, and try it for an evening. If it fits how you actually use your phone, keep it. If it does not, you have lost 5 minutes and 700 MB.
