[Draft] Let's talk about FunctionGemma
In my first post, I built a prototype for context-aware notifications using an on-device language model. In the second, I tried to make that model call tools — and hit a wall. Then, one week after publishing, Google released FunctionGemma: a Gemma 3 270M model specifically fine-tuned for function calling.
Finally, a small model designed to do exactly what I needed. So I did what any excited developer would do: I started looking for the perfect use case.
I spent weeks evaluating scenarios. Context-aware notifications. Hiking safety copilots. Cycling route re-planners. Voice-controlled outdoor companions. Each time, I’d get excited about the architecture, sketch out the tool definitions, and then arrive at the same uncomfortable conclusion:
Deterministic code would do this better.
This post is about that realization — and the decision tree I built to stop myself from repeating the same mistake.
FunctionGemma in 30 Seconds
FunctionGemma is a 270M parameter model built on Gemma 3, fine-tuned specifically for function calling. Google positions it as a local traffic controller: a lightweight model that runs on-device, handling frequent simple commands and only escalating complex requests to cloud models.
The key specs from the model card:
- Base accuracy: 58% on the Mobile Actions benchmark
- After fine-tuning: 85% on the same benchmark
- Size: 288 MB (dynamic int8 quantization)
- Latency: 0.3s time-to-first-token on a Samsung S25 Ultra
- Memory: ~551 MB peak RSS
Google provides two demo use cases: Tiny Garden (a voice-controlled gardening game) and Mobile Actions (translating commands like “turn on the flashlight” into system API calls). Both ship with fine-tuning recipes.
The model is explicitly not intended to be used out of the box. Google’s own documentation says you need to fine-tune for your specific task; that 58% → 85% jump isn’t optional, it’s the whole point.
What Google Gets Right
The framing is honest. Google doesn’t oversell FunctionGemma as a general-purpose agent. The documentation lists clear prerequisites:
- You have a defined API surface — a bounded set of actions
- You are ready to fine-tune — not just prompt engineering
- You prioritize local-first deployment — latency and privacy matter
- You are building compound systems — the small model handles simple routing, bigger models handle the rest
Both demo use cases fit this description perfectly. “Turn on the flashlight” maps to a single API call. “Plant sunflowers in the top row” maps to a game function with coordinates. These are bounded, local, failure-tolerant tasks.
But here’s where it gets interesting: those same tasks could be solved with a simple intent classifier. Or a regex. Or a dropdown menu.
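To make that concrete, here’s a minimal sketch of the deterministic alternative in Kotlin. The action names and patterns are hypothetical, not Google’s API:

```kotlin
// Hypothetical bounded action space, in the spirit of the Mobile Actions demo.
enum class DeviceAction { FLASHLIGHT_ON, FLASHLIGHT_OFF, BLUETOOTH_ON, UNKNOWN }

// A handful of regex rules standing in for a 270M-parameter model.
private val rules = listOf(
    Regex("""(turn on|enable|switch on).*(flashlight|torch)""") to DeviceAction.FLASHLIGHT_ON,
    Regex("""(turn off|disable|switch off).*(flashlight|torch)""") to DeviceAction.FLASHLIGHT_OFF,
    Regex("""(turn on|enable).*bluetooth""") to DeviceAction.BLUETOOTH_ON,
)

fun classify(utterance: String): DeviceAction =
    rules.firstOrNull { (pattern, _) -> pattern.containsMatchIn(utterance.lowercase()) }
        ?.second ?: DeviceAction.UNKNOWN

fun main() {
    println(classify("Hey, turn on the flashlight"))  // FLASHLIGHT_ON
}
```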
The language model adds only marginal value when the action space is this constrained. Google knows this — they’re betting that the model becomes more capable over time and that the fine-tuning recipe scales to more complex domains. The demos are a starting point, not the destination.
The question is: what’s the destination?
The Reality Check
I tried to answer that question by building real use cases. Here’s what happened.
Attempt 1: Context-aware notifications (Article 1)
The idea: use an on-device model to generate personalized meal reminders based on weather, time, location, and logging history. Instead of “Time for a snack?” you’d get “Nice pace today — a gazpacho would hit right in this heat 💪”.
Why FunctionGemma doesn’t fit: The model needs to generate creative, contextual text — that’s open-ended language generation, not function calling. You’d need tool calls to gather context (weather, meal history), but the actual value is in the generation step, which requires a larger model. And honestly, you could gather that context with deterministic code and just template the output.
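For the record, here’s what that deterministic version might look like. The context fields and the extra copy are invented for illustration:

```kotlin
// Hypothetical context gathered deterministically (weather API, clock, activity log).
data class MealContext(val tempCelsius: Int, val hour: Int, val wasActiveToday: Boolean)

// Template selection instead of open-ended generation: every string is reviewed copy.
fun mealReminder(ctx: MealContext): String = when {
    ctx.tempCelsius >= 28 && ctx.wasActiveToday ->
        "Nice pace today — a gazpacho would hit right in this heat 💪"
    ctx.tempCelsius >= 28 -> "It's hot out. Something cold and light for lunch?"
    ctx.hour >= 20 -> "Late one? Log a light dinner before bed."
    else -> "Time for a snack?"
}
```

You lose some variety, but you gain something a 270M model can’t give you: every notification that ships is a string someone has read.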
Attempt 2: Hiking safety copilot
The idea: combine GPS, weather, elevation, and pace data to proactively alert hikers about risks. The model would decide which data sources to query and synthesize alerts.
Why FunctionGemma doesn’t fit: Every part of this pipeline is better served by deterministic logic. Deciding which sensors to read? That’s a decision tree. Computing risk levels from sensor data? That’s arithmetic and thresholds. Presenting alerts to a hiker? They’re glancing at a watch while moving — they need a dashboard, not a conversation. And critically: an 85% accuracy rate is not acceptable when safety is at stake.
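Here’s a sketch of the deterministic version, with placeholder thresholds (not validated safety values):

```kotlin
enum class Risk { LOW, ELEVATED, HIGH }

// Hypothetical sensor snapshot; a real app would read these from platform APIs.
data class HikeState(
    val stormProbability: Double,  // 0.0..1.0, from a weather service
    val minutesToSunset: Int,
    val paceKmh: Double,
    val remainingKm: Double,
)

// Thresholds and arithmetic: testable, auditable, and correct every time for the
// rules you wrote, which is the bar a safety feature has to clear.
fun assessRisk(s: HikeState): Risk {
    val etaMinutes =
        if (s.paceKmh > 0) (s.remainingKm / s.paceKmh * 60).toInt() else Int.MAX_VALUE
    return when {
        s.stormProbability > 0.6 -> Risk.HIGH
        etaMinutes > s.minutesToSunset -> Risk.HIGH  // won't reach camp before dark
        s.stormProbability > 0.3 -> Risk.ELEVATED
        else -> Risk.LOW
    }
}
```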
Attempt 3: Cycling mid-route re-planning
The idea: a cyclist mid-ride says “I’m tired, find me a flatter way to the campsite” and the model interprets the constraint, calls a routing API, and presents alternatives.
Why FunctionGemma doesn’t fit — but it’s complicated: This is actually the most promising scenario. The language input is genuinely ambiguous (“flatter” is subjective, “tired” implies preference changes). The output is structured (a route). But the blocker is practical: route computation requires a routing engine, and running one locally on a phone isn’t feasible. The algorithms are complex, the map data is massive, and you need network access anyway. If you need the network for the hard part, do you really need an on-device model for the easy part (intent classification)?
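A sketch makes the asymmetry obvious: the local half collapses to a keyword check, while the hard half is a network call no matter what. The preference fields and the backend are hypothetical:

```kotlin
// Hypothetical preferences a routing backend might accept.
data class RoutePrefs(val avoidClimbs: Boolean = false, val maxGradePercent: Int = 12)

// The on-device "NLU" turns out to be a few lines of string matching.
fun prefsFrom(utterance: String): RoutePrefs {
    val t = utterance.lowercase()
    return if ("flatter" in t || "tired" in t || "easier" in t)
        RoutePrefs(avoidClimbs = true, maxGradePercent = 6)
    else RoutePrefs()
}

// The actual route search stays remote: the engine and the map data are too heavy
// for the phone, so the network round-trip happens anyway.
suspend fun requestRoute(destination: String, prefs: RoutePrefs): List<String> =
    TODO("POST $prefs to your routing backend and parse the alternatives")
```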
The pattern
Each attempt followed the same arc: exciting architecture → honest evaluation → deterministic code wins. The model is either doing something too simple (where a classifier suffices) or something too complex (where it can’t deliver reliably).
This isn’t a failure of FunctionGemma specifically. It’s a structural challenge with sub-1B models on mobile devices in early 2026.
The Decision Tree
After enough failed attempts, I built a decision tree to formalize the evaluation. It’s inspired by Google’s Compose Animation decision tree — a practical tool that helps developers pick the right approach instead of guessing.
```mermaid
flowchart TD
START([🤔 I have a mobile feature idea<br/>that could use AI]) --> Q1
Q1{Does the feature require<br/>natural language<br/>understanding?}
Q1 -->|No| D1_DESC[Rule engines, algorithms,<br/>and structured logic are<br/>more reliable and efficient] --> D_COMMON[✅ Use deterministic code]
Q1 -->|Yes| Q2
Q2{Is the input from<br/>the user in<br/>natural language?}
Q2 -->|No| D2_DESC[Structured inputs like sensors,<br/>GPS, buttons, or toggles<br/>don't need language models] --> D_COMMON
Q2 -->|Yes| Q3
Q3{Can the task be solved<br/>with a finite set of<br/>predefined actions?}
Q3 -->|No| D3[☁️ Use a cloud LLM<br/>Open-ended generation,<br/>complex reasoning, and<br/>creative tasks exceed<br/>on-device model capacity]
Q3 -->|Yes| Q4
Q4{Is failure tolerable?<br/>Can you retry or<br/>fall back gracefully?}
Q4 -->|No| D4_DESC[Safety-critical, financial,<br/>or medical decisions need<br/>100% reliability, not 85%] --> D_COMMON
Q4 -->|Yes| Q5
Q5{Can all processing<br/>happen on-device?}
Q5 -->|Yes| Q6
Q5 -->|No| Q7
Q6{Are you ready<br/>to fine-tune?}
Q6 -->|No| D5[⏸️ Wait or reconsider<br/>Base FunctionGemma scores<br/>58% without fine-tuning.<br/>Not reliable enough for<br/>production use]
Q6 -->|Yes| Q9{Is the task<br/>fully local?}
Q7{Does on-device routing<br/>add real value before<br/>the remote call?}
Q7 -->|Yes| Q6
Q7 -->|No| D7[☁️ Use a cloud LLM directly<br/>If you need the network anyway<br/>and routing is simple, skip<br/>the on-device overhead]
Q9 -->|Yes| D6[📱 FunctionGemma<br/>fine-tuned for your tools<br/>The sweet spot: bounded API,<br/>local-first, failure-tolerant]
Q9 -->|No| D8[📱☁️ FunctionGemma as router<br/>+ cloud backend<br/>Local intent classification,<br/>remote execution via<br/>agent orchestration]
classDef green fill:#c0f8d066,stroke:#c0f8d0,stroke-width:2px,color:black,font-size:20px
classDef blue fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black,font-size:20px
classDef yellow fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black,font-size:20px
classDef red fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black,font-size:20px
linkStyle default stroke:black,stroke-width:2px,color:black
class START red
class Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q9 blue
class D_COMMON,D1_DESC,D2_DESC,D4_DESC green
class D3,D7 yellow
class D6,D8 green
class D5 red
```
The tree has six decision points, each based on a real lesson (a plain-code transcription of the whole tree follows the list):
1. Does the feature require natural language understanding? If not, use deterministic code. Sensor processing, data pipelines, and rule engines don’t need language models. Example: hiking safety alerts exit here.
2. Is the input from the user in natural language? If not, use deterministic code. Structured inputs — buttons, toggles, GPS coordinates — are better handled by traditional UI patterns. Example: dashboard interfaces exit here.
3. Can the task be solved with a finite set of predefined actions? If not, consider a cloud LLM. Open-ended generation, creative writing, and complex reasoning exceed what a 270M model can deliver. Example: creative notifications exit here.
4. Is failure tolerable? If not, use deterministic code. Safety-critical, financial, or medical decisions need 100% reliability. An 85% accuracy rate means roughly 1 in 7 calls will be wrong. Example: safety-critical outdoor decisions exit here.
5. Can all processing happen on-device? This is the fork. If yes, you’re in FunctionGemma territory (with fine-tuning). If no, ask: does on-device routing add real value before the remote call? If the routing is simple enough that a few if statements could handle it, skip the on-device overhead and call the cloud directly. Example: cycling re-planning exits here — routing needs a server anyway.
6. Are you ready to fine-tune? If not, wait. The base model at 58% isn’t reliable enough for production. This isn’t a criticism — Google says the same thing. Fine-tuning is the product, the base model is just the starting material.
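Fittingly for a post that keeps landing on deterministic code, the whole tree transcribes to a few lines of Kotlin. The field and verdict names are mine; the logic is a literal reading of the diagram:

```kotlin
enum class Verdict { DETERMINISTIC_CODE, CLOUD_LLM, WAIT, FUNCTIONGEMMA_LOCAL, FUNCTIONGEMMA_ROUTER }

// One flag per question in the diagram (hypothetical names).
data class FeatureIdea(
    val needsLanguageUnderstanding: Boolean,  // Q1
    val inputIsNaturalLanguage: Boolean,      // Q2
    val finiteActionSet: Boolean,             // Q3
    val failureTolerable: Boolean,            // Q4
    val fullyOnDevice: Boolean,               // Q5 / Q9
    val routingAddsValue: Boolean,            // Q7
    val readyToFineTune: Boolean,             // Q6
)

fun evaluate(f: FeatureIdea): Verdict = when {
    !f.needsLanguageUnderstanding -> Verdict.DETERMINISTIC_CODE
    !f.inputIsNaturalLanguage -> Verdict.DETERMINISTIC_CODE
    !f.finiteActionSet -> Verdict.CLOUD_LLM
    !f.failureTolerable -> Verdict.DETERMINISTIC_CODE
    !f.fullyOnDevice && !f.routingAddsValue -> Verdict.CLOUD_LLM  // skip the overhead
    !f.readyToFineTune -> Verdict.WAIT
    f.fullyOnDevice -> Verdict.FUNCTIONGEMMA_LOCAL
    else -> Verdict.FUNCTIONGEMMA_ROUTER
}
```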
Where FunctionGemma lands
The tree has two green paths for FunctionGemma:
- 📱 Fully local: Bounded API surface, on-device processing, fine-tuned for your specific tools. This is Google’s sweet spot — Mobile Actions and Tiny Garden live here.
- 📱☁️ Router + cloud: Local intent classification dispatching to remote services. FunctionGemma parses the user’s request, Koog (or another framework) orchestrates the tool calls, and a cloud model handles anything complex.
Both paths require fine-tuning. Both require a clearly bounded action space. And both require that failure is tolerable — the system should retry or fall back gracefully, not crash.
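Stripped of framework specifics, the router pattern is three seams and a fallback. These interfaces are illustrative only; Koog’s real API looks different:

```kotlin
data class ToolCall(val name: String, val args: Map<String, String>)

// Hypothetical seams for the router pattern; real code would bind these to
// an on-device runtime and an agent framework like Koog.
interface IntentParser {               // FunctionGemma's job
    suspend fun parse(utterance: String): ToolCall?
}
interface ToolExecutor {               // the orchestrator's job
    suspend fun execute(call: ToolCall): String
}
interface CloudFallback {              // the big model's job
    suspend fun handle(utterance: String): String
}

class Router(
    private val parser: IntentParser,
    private val executor: ToolExecutor,
    private val fallback: CloudFallback,
) {
    // Try the cheap local path first; escalate anything it can't classify.
    suspend fun handle(utterance: String): String {
        val call = parser.parse(utterance) ?: return fallback.handle(utterance)
        return executor.execute(call)
    }
}
```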
The Uncomfortable Middle
Here’s what the tree reveals: the sweet spot for FunctionGemma on mobile is narrow.
The fully local path (Mobile Actions, Tiny Garden) works, but the use cases are ones where simpler approaches also work. “Turn on the flashlight” doesn’t need a language model. It needs a voice command parser, and those have existed for a decade.
The router path is more interesting architecturally, but it’s hard to find scenarios where on-device intent classification adds enough value over a simple API call to justify the 288 MB model download and 551 MB runtime memory.
I’m not saying the model is useless — I’m saying the ecosystem around it (runtimes, tooling, model capabilities) hasn’t matured enough to unlock the use cases where it would genuinely shine. Google’s own documentation hints at this: FunctionGemma is positioned as a foundation for further training, not a drop-in solution.
What I Learned
Deterministic code is underrated. In the AI hype cycle, it’s easy to forget that if/else statements are fast, reliable, testable, and free. Every use case should start with the question: “Can I solve this without a model?”
The constraint triangle is real. On-device AI use cases need three things simultaneously: natural language input (otherwise deterministic UI wins), local processing (otherwise why not use the cloud), and bounded complexity (otherwise the model can’t handle it). Finding all three in one feature is genuinely hard.
Infrastructure is ahead of use cases. My testing platform, the Koog integration, the model conversion pipeline — all of it works. The architecture is ready. What’s missing isn’t tooling; it’s a compelling reason to use it. And that’s okay. The same was true for GPS on phones before someone thought of ride-sharing.
Fine-tuning is not optional. Every experiment reinforced this. The base model’s 58% accuracy is a starting point for training, not a production baseline. If you’re evaluating FunctionGemma and not planning to fine-tune, you’re evaluating the wrong thing.
What’s Next
I’m going to build the router pattern anyway. Not because I’ve found the perfect use case, but because demonstrating the integration — FunctionGemma dispatching to Koog orchestrating tool calls across local and remote services — has value as a reference implementation. The testing platform I’ve been developing is the right place for this.
The ecosystem is moving fast. LiteRT-LM is stabilizing, FunctionGemma has fine-tuning recipes, and Google is actively developing the edge AI stack. When the models get more capable and the runtimes more stable, the patterns I’ve been documenting will be ready.
Or maybe the killer use case will be something none of us have thought of yet. That’s usually how it works.