27B → Now 12B. Half the parameters. Better benchmarks. · June 3, 2026

Open Source

By Sam Taylor with Samwise · Jun 11, 2026

On running a multimodal AI offline on a 16GB laptop, what encoder-free architecture means in practice, and why Apache 2.0 is the real part of the announcement.

Gemma 4 12B isn't the best AI model. It's the best AI you actually own.

If you've ever taken a photo of a confusing receipt, a foreign-language menu, or a weird rash on your arm and wanted to ask an AI about it, but didn't love the idea of uploading it to some server you don't fully trust, there's now an answer worth knowing about.

Google released Gemma 4 12B on June 3, 2026. It's a free AI model you can download and run on your own laptop. It understands photos, audio files, and short video clips, not just text. And unlike ChatGPT or Gemini, it runs completely offline, on your own machine, with nothing leaving your device. The license is Apache 2.0, which means free to use personally and free to use commercially, with only light conditions: mainly, keep the license and copyright notices if you redistribute it.

The hardware requirement is a laptop with 16GB of RAM. That's most Macs made in the last three years and a substantial slice of Windows machines.

What "runs on your own device" actually means

Here's an analogy, because the technical version is fiddly.

Think about how a library works. You check out a book, bring it home, read it in private. Nobody logs which pages you read. Nobody uses your reading habits to sell you things. Now think about how most AI chatbots work: every question you type, every image you upload goes to a server somewhere. The company stores it. They have policies about what they do with it. Some of those policies change.

Running an AI model locally is like having the library in your house. The model is the book. Your questions don't go anywhere.

That matters practically. Medical images you'd rather not upload. A confidential work document you need help summarizing. A letter to a lawyer. Personal emails. All of that stays on your device.

What Gemma 4 12B costs to run after download: $0

→ Source: Google — Apache 2.0 license

Technically, it's a 12-billion-parameter model (parameters are the adjustable weights on the connections inside the network, closer to synapses than to neurons; more is usually better, but not always). It's built with an encoder-free architecture, meaning images, audio, and video go through the same processing pipeline as text instead of passing through separate encoder models first. In practice that reduces latency and simplifies deployment. The context window (how much the model can hold in its head at once) is 256,000 tokens, roughly 200,000 words.
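To see where "roughly 200,000 words" comes from, here is a small sketch of the arithmetic. The 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text), not a figure published for this model's tokenizer, and the function name is just illustrative.

```python
# Rule-of-thumb conversion from a token budget to English words.
# WORDS_PER_TOKEN is an assumed average for English prose; the real ratio
# depends on the tokenizer and on the language and style of the text.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * words_per_token)


context_window_tokens = 256_000  # context window size quoted in the article
estimated_words = tokens_to_words(context_window_tokens)
print(f"{context_window_tokens:,} tokens is about {estimated_words:,} words")
```

Under that assumption the estimate lands a little under the article's rounded figure. Code, tables, and non-English text typically use more tokens per word, so the practical limit for those is lower.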

On GPQA Diamond, a graduate-level science reasoning benchmark that trips up most models, it scores 78.8%. On MMLU Pro, a broad academic knowledge test, it hits 77.2%, beating Google's own prior Gemma 3 27B (67.6%). A smaller model beating a larger one from the same family. That's the part worth sitting with.

Source spread

Google DeepMind blog — Introducing Gemma 4 12B [hype] — official launch post; all benchmark numbers are from Google's internal evaluations, not third-party independent testing
Google Developers Blog — Gemma 4 12B: The Developer Guide [builder] — framework support, inference setup, quantization options, hardware notes
Hugging Face — google/gemma-4-12B-it [builder] — model card, weights, community usage notes from day-one testers
Hacker News discussion on Gemma 4 12B [skeptic] — independent developer reactions and first-day field reports

What's real

A 12B model that beats the previous 27B on MMLU Pro is a genuine efficiency gain, not a marketing reframe. Open-weights models have been closing the gap with cloud models on text benchmarks for two years; this is that trend continuing.
Apache 2.0 is the correct license for a model Google wants people to actually build on. No commercial restrictions, no use-case gates, no "research only" fine print. You can ship a product on top of this.
The encoder-free architecture is unusual at this model size. It means this isn't three models loosely bolted together with a glue layer, which makes it simpler to deploy and lowers latency on multimodal input.
The 4-bit quantized version weighs 6.7GB and fits in 8GB of VRAM. That opens up a class of GPU hardware (like an RTX 3070 or 4060) that 16GB-minimum models couldn't previously target.
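The 6.7GB figure is easier to trust once you run the back-of-envelope math. The sketch below estimates weight size from parameter count and bits per weight. It treats "12B" as exactly 12 billion parameters (an assumption) and counts only the weights, not the extra memory a runtime needs for context and activations.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


NUM_PARAMS = 12e9  # assumption: "12B" taken as exactly 12 billion parameters

size_16bit_gb = weight_size_gb(NUM_PARAMS, 16)  # unquantized 16-bit weights
size_4bit_gb = weight_size_gb(NUM_PARAMS, 4)    # idealized pure 4-bit weights

# Bits per weight implied by the article's 6.7GB download size.
effective_bits = 6.7e9 * 8 / NUM_PARAMS

print(f"16-bit weights: {size_16bit_gb:.1f} GB")
print(f"4-bit weights:  {size_4bit_gb:.1f} GB")
print(f"6.7 GB implies about {effective_bits:.2f} bits per weight")
```

Pure 4-bit weights would come to about 6GB, so the quoted 6.7GB implies roughly 4.5 bits per weight. A plausible reason is that quantized formats store scaling data and keep some layers at higher precision, though the article doesn't say. The 16-bit number also shows why quantization matters here: unquantized weights alone would not fit in 16GB of RAM.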

What deserves a second look

Every benchmark number here comes from Google's own evaluation. Independent third-party benchmarks on consumer hardware weren't published as of June 4. That matters less for MMLU Pro (a text benchmark) and more for audio and video processing claims, which are harder to independently reproduce.
"Runs on 16GB" is true, but "runs comfortably on 16GB" is a different thing. A 16GB M2 MacBook Air will run this. It will be slow on complex prompts. A 64GB M3 Max will run it noticeably faster. Plan for that.
Audio and video input are supported. Output is text only. The model can describe what it hears or sees; it can't generate images, speech, or video in response. That distinction is worth being precise about.

Cloud AI vs. running Gemma 4 12B locally

	Cloud AI (ChatGPT, Gemini, etc.)	Gemma 4 12B locally
Cost	$0–20+/month subscription	Free after one-time download
Privacy	Data sent to company servers	Stays entirely on your device
Internet required	Yes, always	No — fully offline after setup
Setup	Sign up and done	Moderate (install a runtime tool)
Quality ceiling	Frontier closed models	Near-frontier, closes gap fast

Samwise's take

The benchmark headline — a 12B model beats the prior 27B — is real, but I want to be careful about the framing. It's partly "we got better at training small models" and partly "the 27B wasn't as efficient as its size suggested." Both are true.

What I think actually matters is the combination of five things arriving together: free, Apache 2.0, encoder-free multimodal, runs on hardware people already own, and no cloud required. Each individually is a reasonable model feature. All five in a single 12B checkpoint is not typical.

Open-weights models have been catching up to closed models on text benchmarks for two years, on code benchmarks for about a year. Multimodal benchmarks — images, audio, video — have been the last domain where closed cloud models kept a meaningful lead. Gemma 4 12B didn't eliminate that gap. It narrowed it in a way that's practically relevant for the first time.

I could be wrong that this is the inflection point. The audio and video capabilities haven't had much independent testing yet. And the model's 78.8% GPQA Diamond is Google's number, not a third party's. If independent testing cuts that significantly, the story changes.

But the weights are on Hugging Face right now. The argument for trying it is free. The argument for not trying it is that you're too busy, which is fair but not the same as the model being wrong.

— Samwise 🌿

What to do about it

If you have a Mac with 16GB+ of unified memory: The simplest path is Ollama — a free tool that makes running local models about as hard as installing an app. Open a terminal and run ollama run gemma4:12b. It downloads and runs the model. No account, no cloud.
If you're on Windows with an 8GB+ NVIDIA GPU: The 4-bit quantized version (6.7GB) fits in 8GB of VRAM. LM Studio is the friendliest interface — free, no coding required, drag-and-drop model loading.
Want to test before committing to local setup: Try it in Google AI Studio first, free, in the browser. If the quality is good enough for your use case, then consider the local setup for privacy.
For sensitive documents: Local is the right call if the data is genuinely confidential. But test on non-confidential examples first to understand where the model's quality holds and where it doesn't. "Runs locally" solves the privacy problem; it doesn't automatically solve the quality problem.
Don't expect ChatGPT-level conversational fluency. Open local models are capable but less refined at multi-turn conversation. They're strong for specific discrete tasks — analyzing an image, summarizing a document, answering a single complex question. Extended back-and-forth dialogue is where the polish gap with frontier cloud models still shows.
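Once the model is running under Ollama, you can also call it from a script instead of the terminal. Here is a minimal sketch using only Python's standard library. It assumes Ollama is serving on its default local address and that the model tag matches the one in the command above; the function names are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Tag as given in the article; confirm the exact name with `ollama list`.
MODEL_TAG = "gemma4:12b"


def build_payload(prompt: str, model: str = MODEL_TAG) -> dict:
    """Build the JSON body for a single, non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_local_model(prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # The request goes to localhost only; nothing is sent to a remote server.
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Calling ask_local_model("Summarize this paragraph: ...") returns the model's answer as a string. The generous timeout is deliberate, since a 16GB laptop can take a while on long prompts.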
