How to Run Gemma on Your Phone in 2026: Android & iOS Guide
What is Gemma, in one paragraph
Gemma is Google's family of open-weight models. Gemma 1 launched in February 2024, Gemma 2 in mid-2024, and Gemma 3 in early 2025 with first-class multimodal (vision) support. The headline reason Gemma matters for phones is the small sizes: Gemma 3 ships in 1B and 4B variants alongside the larger 12B and 27B models (the 1B is text-only; vision starts at 4B). The 1B and 4B were explicitly trained with mobile and edge deployment in mind - Google publishes them in formats that map cleanly to the inference engines available on Android and iOS phones.
The license is friendly enough for personal and commercial use (with a few use-case restrictions). Weights are hosted on Hugging Face and on Google's own Kaggle and AI Studio properties. If you are new to where models live, see our local AI tools explainer.
Can your phone actually run it?
The answer almost always boils down to RAM. Gemma at Q4_K_M quantization (the standard 4-bit format used by the llama.cpp family of mobile runtimes) takes roughly:
| Model | File size | Working RAM needed | Realistic phone fit |
|---|---|---|---|
| Gemma 3 1B | ~700 MB | ~1.2-1.5 GB | Any phone with 4 GB+ total RAM. |
| Gemma 2 2B | ~1.4 GB | ~2.0-2.4 GB | Any phone with 6 GB+ total RAM. |
| Gemma 3 4B | ~2.5 GB | ~3.5-4.2 GB | Phones with 8 GB+ RAM (most flagships from 2023 onward). |
| Gemma 2 9B | ~5.5 GB | ~6.5-7.5 GB | Only on 12-16 GB Android flagships, very tight. |
| Gemma 3 12B | ~7.0 GB | ~8.5-9.5 GB | 16 GB Android flagships only, will often crash other apps. |
The "working RAM" column matters more than the file size - you need room for the weights, the KV cache for the conversation, the chat app itself, and the OS. iOS in particular caps how much RAM a single app can consume (typically 50-60% of total system RAM), which is why an 8 GB iPhone behaves like a "4-5 GB usable" device for LLM apps. Android is more permissive but the OS will still kill backgrounded apps to free RAM for the model.
Realistic speed expectations
Tokens-per-second varies enormously by chip generation, runtime, and whether the inference is running on CPU, GPU, or NPU. Approximate sustained throughput from public benchmarks and our own measurements on Gemma at Q4_K_M:
| Phone | Chip | Gemma 3 1B | Gemma 2B-4B class |
|---|---|---|---|
| iPhone 15 Pro / 16 | A17 Pro / A18 | 50-80 tok/s | 20-30 tok/s |
| iPhone 17 Pro | A19 Pro | 70-110 tok/s | 30-45 tok/s |
| Pixel 9 Pro | Tensor G4 | 40-60 tok/s | 15-22 tok/s |
| Pixel 10 Pro | Tensor G5 | 55-80 tok/s | 22-32 tok/s |
| Galaxy S24 Ultra | Snapdragon 8 Gen 3 | 55-80 tok/s | 25-35 tok/s |
| Galaxy S25 Ultra | Snapdragon 8 Elite | 80-120 tok/s | 35-55 tok/s |
| Galaxy S26 Ultra | Snapdragon 8 Elite Gen 2 | 100-140 tok/s | 45-70 tok/s |
| Mid-range 2024 phone | Dimensity 7000-class | 15-25 tok/s | too slow to be pleasant |
Two patterns to notice. First, 1B is fast everywhere - even mid-range phones produce tokens faster than you can read. Second, on 4B the chip generation matters a lot: a 2024 flagship is roughly twice as fast as a 2022 one, because mobile NPUs and the inference engines that target them have improved meaningfully each year.
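If you want to check your own phone against this table, measure decode throughput with prompt processing excluded - that is what "sustained" means here. A minimal sketch, assuming a runtime that streams tokens through a callback (generateStreaming is a hypothetical stand-in for whatever engine you use):

```kotlin
// Measure sustained decode speed (tok/s): time only the window between the first
// and last streamed token, so prompt prefill is not counted.
// `generateStreaming` is a hypothetical stand-in for your runtime's streaming API.
fun measureDecodeTokensPerSecond(
    prompt: String,
    generateStreaming: (prompt: String, onToken: (String) -> Unit) -> Unit
): Double {
    var firstTokenAtNs = 0L
    var lastTokenAtNs = 0L
    var tokenCount = 0

    generateStreaming(prompt) { _ ->
        val now = System.nanoTime()
        if (tokenCount == 0) firstTokenAtNs = now
        lastTokenAtNs = now
        tokenCount++
    }

    val decodeSeconds = (lastTokenAtNs - firstTokenAtNs) / 1e9
    // The first token only opens the timing window, so it is not counted as output within it.
    return if (tokenCount > 1 && decodeSeconds > 0) (tokenCount - 1) / decodeSeconds else 0.0
}
```

Use a response of at least a couple of hundred tokens, and expect the number to sag once thermal throttling kicks in (more on that below).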
The recommended path on iOS
Option A (easiest): PocketPal AI
PocketPal AI is an open-source iOS (and Android) app that wraps llama.cpp with a clean chat UI and a built-in model browser that pulls directly from Hugging Face.
Install it from the App Store, open the built-in model browser, and search for Gemma. On an iPhone 15 Pro or newer, download gemma-3-4b-it-Q4_K_M.gguf (~2.5 GB). For older or non-Pro iPhones, download gemma-3-1b-it-Q4_K_M.gguf (~700 MB).
Option B (more polished): Private LLM or LLM Farm
Private LLM is a paid app (one-time purchase, no subscription) with a more polished interface and a curated library of pre-tested Gemma builds; it ships its own inference stack with pre-quantized models rather than loading GGUF files. LLM Farm is open-source and very flexible but rougher around the edges, and wraps llama.cpp the way PocketPal does. In practice, throughput on the same phone is in the same ballpark across all three.
The recommended path on Android
Option A (easiest): PocketPal AI
The Android version of PocketPal AI is essentially identical to the iOS version - same UI, same Hugging Face integration. Install it from the Play Store and follow the same download steps as on iOS above.
Option B (best raw speed on Pixel and Snapdragon): Google AI Edge / MediaPipe LLM Inference
Google's AI Edge stack includes the MediaPipe LLM Inference API, along with a sample Android app that demonstrates it, running Gemma natively through the GPU and NPU delegate paths rather than on the CPU. This is the same inference path used internally by Google's own apps. On a Pixel, it is typically 30-60% faster than llama.cpp through PocketPal because it can target the accelerator directly.
The trade-off is setup: there is no Play Store app, you build it yourself from the GitHub repo with Android Studio. Worth it if you have a Pixel 9 / 10 Pro and want maximum speed. Not worth it if you just want to chat.
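If you do build it and later want the same path inside your own app, the core API call is only a few lines. A minimal sketch based on the Google AI Edge LLM Inference API for Android - the model path and token budget are assumptions to adapt, and the Gemma weights must first be converted to the AI Edge task format and copied to the device:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal blocking call into the MediaPipe LLM Inference task.
// The model path is an assumption - point it at a Gemma build in AI Edge format.
fun runGemma(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-4b-it-int4.task")
        .setMaxTokens(512)                 // combined prompt + response budget
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse(prompt)    // streaming via generateResponseAsync also exists
}
```

For a chat app you would create the LlmInference instance once and keep it alive across prompts, since loading the model takes several seconds.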
Option C (paid, polished GUI): Layla
Layla is a paid Android app aimed at offline AI chat, with character-AI style features. It supports Gemma and many other GGUF models. Good if you want a polished experience and do not mind a one-time purchase.
Option D (for power users): Termux + llama.cpp
On Android you can install Termux (terminal emulator), then build llama.cpp from source and run llama-cli against a Gemma GGUF you downloaded yourself. This gives you full control over flags - context size, sampling, KV cache quantization (see our quantization guide) - and lets you script Gemma into shell pipelines on the phone. Slower than NPU paths but maximum flexibility.
When does running Gemma on a phone actually make sense?
Honest framing: on-phone inference is meaningfully slower and meaningfully harder on the battery than just using ChatGPT or Gemini in the cloud. The reasons it can still be the right answer:
- Privacy. The prompt and response never leave the device. There is no provider account, no telemetry, no risk of a future ToS change about training. For sensitive notes, journaling, medical questions, or work prompts your employer would not want sent to a third-party API, this is the whole game.
- True offline use. A flight, a subway, a hike, a cabin with no signal. The model works.
- Zero cost per token. For high-volume use - generating hundreds of completions during a coding session, running a personal agent loop overnight - on-device costs nothing beyond the battery it burns.
- Latency for short prompts. A 1B model on a recent phone responds in under 100 ms for short queries. No network round-trip.
- Curiosity and learning. There is something genuinely satisfying about realizing the supercomputer in your pocket can do this on its own.
And the reasons it might not be the right answer:
- You need the absolute latest, smartest model. Frontier-quality answers still require a 70B+ model running on a real machine - see our Apple Silicon vs RTX 4090 comparison.
- You will run the model continuously for hours. Battery and heat will turn the phone into a hand warmer.
- You need vision, audio, or tool use that only larger hosted models support reliably.
Battery, heat, and storage - the practical reality
- Battery: sustained inference on Gemma 4B drains roughly 8-15% per 10 minutes on a recent flagship, so a full battery is good for 1-2 hours of continuous chat. Generation that keeps running with the screen off drains at nearly the same rate, because the chip cannot drop into its idle states while the model is producing tokens.
- Heat: the phone will get noticeably warm within 5-10 minutes of heavy use. The OS will throttle the chip rather than let it cook itself - we have measured 30-50% throughput drops after sustained generation. Keep the phone uncovered (no thick case) and out of direct sun if you are doing long sessions.
- Storage: the model file itself is 0.7-3 GB, plus the app's own cache. Budget 5-8 GB of free storage if you plan to keep multiple model sizes installed for different use cases.
- RAM contention: the OS may close other apps to make room. Expect Safari / Chrome tabs and music apps to be evicted while the model is loaded.
Gemma 3 1B vs Gemma 3 4B: which to actually use
A quick guide:
- Use Gemma 3 1B when you want fast responses: short summaries, simple Q&A, language detection, or autocompletion-style help. It is quick on every phone and rarely "feels slow."
- Use Gemma 3 4B when you want noticeably better reasoning - longer answers, multi-step instructions, code snippets, more nuanced rewrites. It needs a flagship phone but the quality jump from 1B is large.
- Use Gemma 2 2B as a middle ground if 1B feels weak and 4B does not fit.
- Use Gemma 3 12B only if you specifically have a 16 GB Android flagship and want to feel the limit. It will fit, it will be slow, and it will eat your other apps. Cool to demonstrate, painful to live with.
What is changing in 2026
Three trends worth watching. First, Android's AICore APIs (and the equivalent first-party Apple Intelligence frameworks on iOS) are increasingly exposing on-device models to third-party apps via system-level APIs - meaning many apps you already use will quietly start using on-device LLMs without you having to install anything. Second, the smaller Gemma models keep getting smarter at the same parameter count, thanks to better training data and distillation from larger models. Third, mobile NPUs in 2026 chips (A19 Pro, Snapdragon 8 Elite Gen 2, Tensor G5) are roughly 2× faster at LLM inference than 2024-era chips - the speed numbers in this article will look conservative within a year.
For now, in May 2026, the right move is exactly what it was last year: install PocketPal AI, download Gemma 3 1B or 4B, and try it for an evening. If it fits how you actually use your phone, keep it. If it does not, you have lost 5 minutes and 700 MB.