Vol. 1 · Edition 024 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Previous: 27B · Now: 12B
Half the parameters. Better benchmarks. · June 3, 2026
Open Source
By Sam Taylor with Samwise

On running a multimodal AI offline on a 16GB laptop, what encoder-free architecture means in practice, and why Apache 2.0 is the real part of the announcement.

Gemma 4 12B isn't the best AI model. It's the best AI you actually own.

Source lean on this story (← Anti-AI · Pro-AI →)
Anti-AI 00 · Skeptic 01 · Neutral 00 · Pro (practical) 02 · Pro (hyped) 01

If you've ever taken a photo of a confusing receipt, a foreign-language menu, or a weird rash on your arm and wanted to ask an AI about it, but didn't love the idea of uploading it to some server you don't fully trust, there's now an answer worth knowing about.

Google released Gemma 4 12B on June 3, 2026. It's a free AI model you can download and run on your own laptop. It understands photos, audio files, and short video clips, not just text. And unlike ChatGPT or Gemini, it runs completely offline, on your own machine, with nothing leaving your device. The license is Apache 2.0, which means free to use personally and free to use commercially, with few strings attached beyond keeping the license and attribution notices intact.

The hardware requirement is a laptop with 16GB of RAM. That covers many Macs sold in the last three years and a substantial slice of Windows machines.

What "runs on your own device" actually means

Here's an analogy, because the technical version is fiddly.

Think about how a library works. You check out a book, bring it home, read it in private. Nobody logs which pages you read. Nobody uses your reading habits to sell you things. Now think about how most AI chatbots work: every question you type, every image you upload goes to a server somewhere. The company stores it. They have policies about what they do with it. Some of those policies change.

Running an AI model locally is like having the library in your house. The model is the book. Your questions don't go anywhere.

That matters practically. Medical images you'd rather not upload. A confidential work document you need help summarizing. A letter to a lawyer. Personal emails. All of that stays on your device.

$0
What Gemma 4 12B costs to run after download

→ Source: Google — Apache 2.0 license

Technically, it's a 12-billion-parameter model (parameters are the learned weights, roughly the strengths of the connections between the model's neurons; more is usually better, but not always). It's built with an encoder-free architecture, meaning images, audio, and video go through the same processing pipeline as text without separate pre-processing steps. In practice that reduces latency and simplifies deployment. The context window (how much the model can hold in its head at once) is 256,000 tokens, roughly 200,000 words.
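For anyone who wants to check the "roughly 200,000 words" figure, here is a minimal sketch of the conversion. The 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text, not a number from Google), and the function name is purely illustrative.

```python
# Assumption: one token is about 0.75 English words (a rule of thumb, not a spec).
WORDS_PER_TOKEN = 0.75


def approx_words(context_tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate how many English words fit in a context window of the given size."""
    return round(context_tokens * words_per_token)


print(approx_words(256_000))  # 192000
```

Other languages, code, and numbers tend to use more tokens per word, so the real capacity varies with what you feed it.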

On GPQA Diamond, a graduate-level science reasoning benchmark that trips up most models, it scores 78.8%. On MMLU Pro, a broad academic knowledge test, it hits 77.2%, beating Google's own prior Gemma 3 27B (67.6%). A smaller model beating a larger one from the same family. That's the part worth sitting with.

Source spread

What's real

  • A 12B model that beats the previous 27B on MMLU Pro is a genuine efficiency gain, not a marketing reframe. Open-weights models have been closing the gap with cloud models on text benchmarks for two years; this is that trend continuing.
  • Apache 2.0 is the correct license for a model Google wants people to actually build on. No commercial restrictions, no use-case gates, no "research only" fine print. You can ship a product on top of this.
  • The encoder-free design is unusual for a model of this size. It means this isn't three models loosely bolted together with a glue layer. Simpler to deploy, lower latency on multimodal input.
  • The 4-bit quantized version weighs 6.7GB and fits in 8GB of VRAM. That opens up a class of GPU hardware (like an RTX 3070 or 4060) that 16GB-minimum models couldn't previously target.
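The 6.7GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below computes the raw size of the weights alone. It is an estimate under stated assumptions, not Google's accounting: the function name is illustrative, gigabytes are decimal (1 GB = 1e9 bytes), and the gap between the raw 4-bit number and the reported 6.7GB is plausibly explained by quantization scale factors and some layers kept at higher precision.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 12e9           # 12 billion parameters, from the article
REPORTED_4BIT_GB = 6.7  # size of the 4-bit download, from the article

fp16_gb = weight_size_gb(PARAMS, 16)
int4_gb = weight_size_gb(PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 24.0 GB
print(f"4-bit weights:  {int4_gb:.1f} GB")  # 6.0 GB
# Memory left on an 8 GB card for the context cache and activations
print(f"Headroom on 8 GB: {8 - REPORTED_4BIT_GB:.1f} GB")  # 1.3 GB
```

That leftover headroom is what long prompts eat into, so "fits in 8GB" is likely to be tightest when the context window is heavily used.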

What deserves a second look

  • Every benchmark number here comes from Google's own evaluation. Independent benchmarks on consumer hardware hadn't been published as of June 4. That matters less for MMLU Pro (a text benchmark) and more for the audio and video processing claims, which are harder to reproduce independently.
  • "Runs on 16GB" is true, but "runs comfortably on 16GB" is a different thing. A 16GB M2 MacBook Air will run this, but it will be slow on complex prompts. A 36GB M3 Pro will run it noticeably faster. Plan for that.
  • Audio and video input are supported. Output is text only. The model can describe what it hears or sees; it can't generate images, speech, or video in response. It's worth being precise about that.

Cloud AI vs. running Gemma 4 12B locally

Factor            | Cloud AI (ChatGPT, Gemini, etc.) | Gemma 4 12B locally
Cost              | $0–20+/month subscription        | Free after one-time download
Privacy           | Data sent to company servers     | Stays entirely on your device
Internet required | Yes, always                      | No, fully offline after setup
Setup             | Sign up and done                 | Moderate (install a runtime tool)
Quality ceiling   | Frontier closed models           | Near-frontier, closes gap fast

What to do about it

  • If you have a Mac with 16GB+ of unified memory: The simplest path is Ollama — a free tool that makes running local models about as hard as installing an app. Open a terminal and run ollama run gemma4:12b. It downloads and runs the model. No account, no cloud.
  • If you're on Windows with an 8GB+ NVIDIA GPU: The 4-bit quantized version (6.7GB) fits in 8GB of VRAM. LM Studio is the friendliest interface — free, no coding required, drag-and-drop model loading.
  • Want to test before committing to local setup: Try it in Google AI Studio first, free, in the browser. If the quality is good enough for your use case, then consider the local setup for privacy.
  • For sensitive documents: Local is the right call if the data is genuinely confidential. But test on non-confidential examples first to understand where the model's quality holds and where it doesn't. "Runs locally" solves the privacy problem; it doesn't automatically solve the quality problem.
  • Don't expect ChatGPT-level conversational fluency. Open local models are capable but less refined at multi-turn conversation. They're strong for specific discrete tasks — analyzing an image, summarizing a document, answering a single complex question. Extended back-and-forth dialogue is where the polish gap with frontier cloud models still shows.
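
If you'd rather script the Ollama route than type into a terminal, here is a minimal sketch that talks to Ollama's local HTTP endpoint using only Python's standard library. Several things here are assumptions: the model tag is copied from the command above and should be confirmed on your machine, the helper names are illustrative, the server is taken to be running on Ollama's default port, and image input through this endpoint is assumed to work for this model the way it does for other multimodal models in Ollama.

```python
import base64
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma4:12b"  # tag taken from the article; confirm it with `ollama list`


def build_payload(prompt: str, image_bytes: Optional[bytes] = None) -> dict:
    """Build the JSON body for a single, non-streaming generation request."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Ollama takes images as a list of base64-encoded strings
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


def ask_local(prompt: str, image_path: Optional[str] = None) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    image_bytes = None
    if image_path is not None:
        with open(image_path, "rb") as f:
            image_bytes = f.read()
    body = json.dumps(build_payload(prompt, image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # The request goes to localhost, so nothing leaves the machine
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example call (receipt.jpg is a hypothetical file name):
# print(ask_local("What is the total on this receipt?", "receipt.jpg"))
```

This is a good fit for the discrete, one-shot tasks described above: one image or document in, one answer out.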
