How to Run Gemma on Your Phone in 2026: Android & iOS Guide
What is Gemma, in one paragraph
Gemma is Google's family of open-weight models. Gemma 1 launched in February 2024, Gemma 2 in mid-2024, and Gemma 3 in early 2025 with first-class multimodal (vision) support. The headline reason Gemma matters for phones is the small sizes: Gemma 3 ships in 1B and 4B variants alongside the larger 12B and 27B models (the 1B is text-only; vision starts at 4B). The 1B and 4B were explicitly trained with mobile and edge deployment in mind - Google publishes them in formats that map cleanly to the inference engines available on Android and iOS phones.
The license is friendly enough for personal and commercial use (with a few use-case restrictions). Weights are hosted on Hugging Face and on Google's own Kaggle and AI Studio properties. If you are new to where models live, see our local AI tools explainer.
Can your phone actually run it?
The answer almost always boils down to RAM. Gemma at Q4_K_M quantization (the standard 4-bit format used by the llama.cpp family of mobile runtimes) takes roughly:
| Model | File size | Working RAM needed | Realistic phone fit |
|---|---|---|---|
| Gemma 3 1B | ~700 MB | ~1.2-1.5 GB | Any phone with 4 GB+ total RAM. |
| Gemma 2 2B | ~1.4 GB | ~2.0-2.4 GB | Any phone with 6 GB+ total RAM. |
| Gemma 3 4B | ~2.5 GB | ~3.5-4.2 GB | Phones with 8 GB+ RAM (most flagships from 2023 onward). |
| Gemma 2 9B | ~5.5 GB | ~6.5-7.5 GB | Only on 12-16 GB Android flagships, very tight. |
| Gemma 3 12B | ~7.0 GB | ~8.5-9.5 GB | 16 GB Android flagships only, will often crash other apps. |
The "working RAM" column matters more than the file size - you need room for the weights, the KV cache for the conversation, the chat app itself, and the OS. iOS in particular caps how much RAM a single app can consume (typically 50-60% of total system RAM), which is why an 8 GB iPhone behaves like a "4-5 GB usable" device for LLM apps. Android is more permissive but the OS will still kill backgrounded apps to free RAM for the model.
Realistic speed expectations
Tokens-per-second varies enormously by chip generation, runtime, and whether the inference is running on CPU, GPU, or NPU. Approximate sustained throughput from public benchmarks and our own measurements on Gemma at Q4_K_M:
| Phone | Chip | Gemma 3 1B | Gemma 2B-4B class |
|---|---|---|---|
| iPhone 15 Pro / 16 | A17 Pro / A18 | 50-80 tok/s | 20-30 tok/s |
| iPhone 17 Pro | A19 Pro | 70-110 tok/s | 30-45 tok/s |
| Pixel 9 Pro | Tensor G4 | 40-60 tok/s | 15-22 tok/s |
| Pixel 10 Pro | Tensor G5 | 55-80 tok/s | 22-32 tok/s |
| Galaxy S24 Ultra | Snapdragon 8 Gen 3 | 55-80 tok/s | 25-35 tok/s |
| Galaxy S25 Ultra | Snapdragon 8 Elite | 80-120 tok/s | 35-55 tok/s |
| Galaxy S26 Ultra | Snapdragon 8 Elite Gen 2 | 100-140 tok/s | 45-70 tok/s |
| Mid-range 2024 phone | Dimensity 7000-class | 15-25 tok/s | too slow to be pleasant |
Two patterns to notice. First, 1B is fast everywhere - even mid-range phones produce tokens faster than you can read. Second, on 4B the chip generation matters a lot: a 2024 flagship is roughly twice as fast as a 2022 one, because mobile NPUs and the inference engines that target them have improved meaningfully each year.
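If you want to check your own phone against this table, measure decode throughput with prompt processing excluded - that is what "sustained" means here. A minimal sketch, assuming a runtime that streams tokens through a callback (generateStreaming is a hypothetical stand-in for whatever engine you use):

```kotlin
// Measure sustained decode speed (tok/s): time only the window between the first
// and last streamed token, so prompt prefill is not counted.
// `generateStreaming` is a hypothetical stand-in for your runtime's streaming API.
fun measureDecodeTokensPerSecond(
    prompt: String,
    generateStreaming: (prompt: String, onToken: (String) -> Unit) -> Unit
): Double {
    var firstTokenAtNs = 0L
    var lastTokenAtNs = 0L
    var tokenCount = 0

    generateStreaming(prompt) { _ ->
        val now = System.nanoTime()
        if (tokenCount == 0) firstTokenAtNs = now
        lastTokenAtNs = now
        tokenCount++
    }

    val decodeSeconds = (lastTokenAtNs - firstTokenAtNs) / 1e9
    // The first token only opens the timing window, so it is not counted as output within it.
    return if (tokenCount > 1 && decodeSeconds > 0) (tokenCount - 1) / decodeSeconds else 0.0
}
```

Use a response of at least a couple of hundred tokens, and expect the number to sag once thermal throttling kicks in (more on that below).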
The recommended path on iOS
Option A (easiest): PocketPal AI
PocketPal AI is an open-source iOS (and Android) app that wraps llama.cpp with a clean chat UI and a built-in model browser that pulls directly from Hugging Face.
Install it from the App Store, open the built-in model browser, and search for Gemma. On an iPhone 15 Pro or newer, download gemma-3-4b-it-Q4_K_M.gguf (~2.5 GB). For older or non-Pro iPhones, download gemma-3-1b-it-Q4_K_M.gguf (~700 MB).
Option B (more polished): Private LLM or LLM Farm
Private LLM is a paid app (one-time purchase, no subscription) with a more polished interface and a curated library of pre-tested Gemma builds; it ships its own inference stack with pre-quantized models rather than loading GGUF files. LLM Farm is open-source and very flexible but rougher around the edges, and wraps llama.cpp the way PocketPal does. In practice, throughput on the same phone is in the same ballpark across all three.
The recommended path on Android
Option A (easiest): PocketPal AI
The Android version of PocketPal AI is essentially identical to the iOS version - same UI, same Hugging Face integration. Install it from the Play Store and follow the same download steps as on iOS above.
Option B (best raw speed on Pixel and Snapdragon): Google AI Edge / MediaPipe LLM Inference
Google's AI Edge stack includes the MediaPipe LLM Inference API, along with a sample Android app that demonstrates it, running Gemma natively through the GPU and NPU delegate paths rather than on the CPU. This is the same inference path used internally by Google's own apps. On a Pixel, it is typically 30-60% faster than llama.cpp through PocketPal because it can target the accelerator directly.
The trade-off is setup: there is no Play Store app, you build it yourself from the GitHub repo with Android Studio. Worth it if you have a Pixel 9 / 10 Pro and want maximum speed. Not worth it if you just want to chat.
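If you do build it and later want the same path inside your own app, the core API call is only a few lines. A minimal sketch based on the Google AI Edge LLM Inference API for Android - the model path and token budget are assumptions to adapt, and the Gemma weights must first be converted to the AI Edge task format and copied to the device:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal blocking call into the MediaPipe LLM Inference task.
// The model path is an assumption - point it at a Gemma build in AI Edge format.
fun runGemma(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-4b-it-int4.task")
        .setMaxTokens(512)                 // combined prompt + response budget
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse(prompt)    // streaming via generateResponseAsync also exists
}
```

For a chat app you would create the LlmInference instance once and keep it alive across prompts, since loading the model takes several seconds.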
Option C (paid, polished GUI): Layla
Layla is a paid Android app aimed at offline AI chat, with character-AI style features. It supports Gemma and many other GGUF models. Good if you want a polished experience and do not mind a one-time purchase.
Option D (for power users): Termux + llama.cpp
On Android you can install Termux (terminal emulator), then build llama.cpp from source and run llama-cli against a Gemma GGUF you downloaded yourself. This gives you full control over flags - context size, sampling, KV cache quantization (see our quantization guide) - and lets you script Gemma into shell pipelines on the phone. Slower than NPU paths but maximum flexibility.
When does running Gemma on a phone actually make sense?
Honest framing: on-phone inference is meaningfully slower and meaningfully harder on the battery than just using ChatGPT or Gemini in the cloud. The reasons it can still be the right answer:
- Privacy. The prompt and response never leave the device. There is no provider account, no telemetry, no risk of a future ToS change about training. For sensitive notes, journaling, medical questions, or work prompts your employer would not want sent to a third-party API, this is the whole game.
- True offline use. A flight, a subway, a hike, a cabin with no signal. The model works.
- Zero cost per token. For high-volume use - generating hundreds of completions during a coding session, running a personal agent loop overnight - on-device costs nothing beyond the battery it burns.
- Latency for short prompts. A 1B model on a recent phone responds in under 100 ms for short queries. No network round-trip.
- Curiosity and learning. There is something genuinely satisfying about realizing the supercomputer in your pocket can do this on its own.
And the reasons it might not be the right answer:
- You need the absolute latest, smartest model. Frontier-quality answers still require a 70B+ model running on a real machine - see our Apple Silicon vs RTX 4090 comparison.
- You will run the model continuously for hours. Battery and heat will turn the phone into a hand warmer.
- You need vision, audio, or tool use that only larger hosted models support reliably.
Battery, heat, and storage - the practical reality
- Battery: sustained inference on Gemma 4B drains roughly 8-15% per 10 minutes on a recent flagship, so a full battery is good for 1-2 hours of continuous chat. Generation that keeps running with the screen off drains at nearly the same rate, because the chip cannot drop into its idle states while the model is producing tokens.
- Heat: the phone will get noticeably warm within 5-10 minutes of heavy use. The OS will throttle the chip rather than let it cook itself - we have measured 30-50% throughput drops after sustained generation. Keep the phone uncovered (no thick case) and out of direct sun if you are doing long sessions.
- Storage: the model file itself is 0.7-3 GB, plus the app's own cache. Budget 5-8 GB of free storage if you plan to keep multiple model sizes installed for different use cases.
- RAM contention: the OS may close other apps to make room. Expect Safari / Chrome tabs and music apps to be evicted while the model is loaded.
Gemma 3 1B vs Gemma 3 4B: which to actually use
A quick guide:
- Use Gemma 3 1B when you want fast responses: short summaries, simple Q&A, language detection, or autocompletion-style help. It is quick on every phone and rarely "feels slow."
- Use Gemma 3 4B when you want noticeably better reasoning - longer answers, multi-step instructions, code snippets, more nuanced rewrites. It needs a flagship phone but the quality jump from 1B is large.
- Use Gemma 2 2B as a middle ground if 1B feels weak and 4B does not fit.
- Use Gemma 3 12B only if you specifically have a 16 GB Android flagship and want to feel the limit. It will fit, it will be slow, and it will eat your other apps. Cool to demonstrate, painful to live with.
What is changing in 2026
Three trends worth watching. First, Android's AICore APIs (and the equivalent first-party Apple Intelligence frameworks on iOS) are increasingly exposing on-device models to third-party apps via system-level APIs - meaning many apps you already use will quietly start using on-device LLMs without you having to install anything. Second, the smaller Gemma models keep getting smarter at the same parameter count, thanks to better training data and distillation from larger models. Third, mobile NPUs in 2026 chips (A19 Pro, Snapdragon 8 Elite Gen 2, Tensor G5) are roughly 2× faster at LLM inference than 2024-era chips - the speed numbers in this article will look conservative within a year.
For now, in May 2026, the right move is exactly what it was last year: install PocketPal AI, download Gemma 3 1B or 4B, and try it for an evening. If it fits how you actually use your phone, keep it. If it does not, you have lost 5 minutes and 700 MB.