Add Gemma 4 E2B recipes (MobiusBuilder + INT4) by justinchuby · Pull Request #381 · microsoft/olive-recipes

justinchuby · 2026-04-23T23:43:17Z

Olive recipes for google/gemma-4-E2B-it using the MobiusModelBuilder pass (Olive PR microsoft/Olive#2406).

Recipes

| Config | Pipeline | Output |
| --- | --- | --- |
| `gemma4_fp32_cpu.json` | MobiusModelBuilder (fp32) | 4 ONNX components (~5GB) |
| `gemma4_int4_cuda.json` | MobiusModelBuilder (fp16) → OnnxBlockWiseRtnQuantization (int4) | 4 quantized ONNX components (~2.8GB) |
| `gemma4_int4_kquant_cpu.json` | MobiusModelBuilder (fp32) → OnnxKQuantQuantization (int4) | 4 quantized ONNX components (k-quant, CPU) |

Prerequisites

pip install olive-ai[gpu] mobius-ai

Validated

INT4 CUDA pipeline tested end-to-end (~3.5 min):

- 97-98% of weights quantized (MatMulNBits + GatherBlockQuantized)
- 4 components: decoder (2.4G), audio (152M), embedding (199M), vision (89M)

Two Olive pipeline configs for google/gemma-4-E2B-it:

- gemma4_fp32_cpu.json: FP32 build for CPU
- gemma4_int4_cuda.json: FP16 build + OnnxBlockWiseRtnQuantization INT4

Requires olive-ai with MobiusModelBuilder pass (Olive PR #2406) and mobius-ai package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
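For readers unfamiliar with block-wise round-to-nearest (RTN) INT4 quantization, here is a minimal pure-Python sketch of the general idea. It is not the code of the OnnxBlockWiseRtnQuantization pass: the function names, the symmetric scheme, and the block size of 32 are assumptions chosen for illustration, and the real pass may use zero points, packed storage, or a different block size.

```python
from typing import List, Tuple


def quantize_rtn_int4(weights: List[float], block_size: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric block-wise round-to-nearest quantization to signed 4-bit codes.

    Hypothetical helper for illustration only; not part of Olive.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7 (signed int4 range is [-8, 7]).
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the int4 range
    return codes, scales


def dequantize_int4(codes: List[int], scales: List[float], block_size: int = 32) -> List[float]:
    """Reconstruct approximate weights: each code is multiplied by its block's scale."""
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


# Two blocks of 32 values each.
weights = [i / 10 for i in range(-32, 32)]
codes, scales = quantize_rtn_int4(weights)
restored = dequantize_int4(codes, scales)
```

Each block stores one floating-point scale plus 4-bit codes, so the per-element reconstruction error stays within half a scale step for that block.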

Copilot

Pull request overview

Adds Olive recipe configs for building and quantizing the HuggingFace google/gemma-4-E2B-it model using the MobiusModelBuilder pass, including an INT4 CUDA pipeline.

Changes:

- Added FP32 CPU build recipe (gemma4_fp32_cpu.json) using MobiusModelBuilder.
- Added FP16 build + INT4 RTN quantization CUDA recipe (gemma4_int4_cuda.json) using OnnxBlockWiseRtnQuantization.
- Added model-specific README documenting prerequisites and how to run both recipes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| google-gemma-4-E2B-it/gemma4_int4_cuda.json | New CUDA recipe: build FP16 with Mobius, then INT4 block-wise RTN quantization. |
| google-gemma-4-E2B-it/gemma4_fp32_cpu.json | New CPU recipe: build FP32 with Mobius. |
| google-gemma-4-E2B-it/README.md | Usage docs for running the two new recipes. |


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

devang-ml · 2026-04-23T23:58:19Z

Please add info.yml and LICENSE file.

justinchuby · 2026-04-24T00:00:46Z

@copilot Add info.yml and LICENSE file.

Agent-Logs-Url: https://github.com/microsoft/olive-recipes/sessions/5a084083-4534-4971-82c9-91d806abcd01 Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot · 2026-04-24T00:03:52Z

> @copilot Add info.yml and LICENSE file.

Added info.yml and LICENSE (Apache 2.0, matching the google-deepmind/gemma repo) to google-gemma-4-E2B-it/ in commit 58b231a.

> Please add info.yml and LICENSE file.

Added info.yml and LICENSE (Apache 2.0) to google-gemma-4-E2B-it/ in commit 58b231a.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

ai.google.dev
- Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

The MobiusModelBuilder pass now automatically generates ORT GenAI config files (genai_config.json, tokenizer.json, etc.) alongside the ONNX models. This commit updates the recipe documentation to reflect this feature.

xiaoyu-work · 2026-05-07T04:07:12Z

+        }
+    },
+    "passes": {
+        "mobius_build": { "type": "MobiusModelBuilder", "precision": "fp32" },


Is fp32 the default precision?

We can just use bf16 too. Is it well supported?

xiaoyu-work · 2026-05-07T04:10:39Z

+            "save_as_external_data": true
+        }
+    },
+    "engine": { "target": "local_system" },


remove engine, and move target to top level

Restructure the Gemma4 E2B recipe directory to follow the established olive-recipes pattern (Qwen-Qwen3.5-4B/builtin):

Directory structure:
- cpu/config.json: FP32 CPU baseline (MobiusModelBuilder)
- cuda/fp16/config.json: FP16 CUDA (MobiusModelBuilder)
- cuda/int4/config.json: FP16 + INT4 block-wise RTN quantization

New scripts:
- optimize.py: Orchestrator with output validation (checks all 4 ONNX components, genai_config.json sections, special tokens, tokenizer)
- eval.py: MMLU Pro benchmark via Olive's LMEvalORTGenAIEvaluator (lm-eval ortgenai backend)
- inference.py: ORT GenAI text inference with chat template formatting

Changes from original:
- Replaced flat JSON files with device/variant directory layout
- Removed k-quant CPU recipe (fp32 CPU baseline is more useful for debugging; k-quant can be re-added if needed)
- Added FP16 CUDA recipe as unquantized GPU baseline
- Added .gitignore, requirements.txt

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

+import time
+
+# Register Olive's ORT GenAI evaluator with lm-eval
+import olive.evaluator.lmeval_ort  # noqa: F401


- Add past_present_share_buffer=False to eval.py for correct KV cache handling, matching the setting already in inference.py
- Extract shared resolve_model_path into utils.py to deduplicate path resolution logic between eval.py and inference.py
- Fix YAML list syntax in info.yml for ep and device fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

xiaoyu-work · 2026-05-07T20:28:32Z

+import time
+
+# Register Olive's ORT GenAI evaluator with lm-eval
+import olive.evaluator.lmeval_ort  # noqa: F401


Since Olive has a built-in evaluator (https://github.com/microsoft/Olive/blob/main/olive/evaluator/lmeval_ort.py), can you add an evaluator section to the config file?

Copilot:

Looked into this — it doesn't work yet for these particular configs, blocked by an Olive-side gap:

google/gemma-4-E2B-it is multimodal, so mobius.build() selects the gemma4 task and returns 4 components (decoder / embedding / vision_encoder / audio_encoder). MobiusBuilder therefore returns a CompositeModelHandler.

olive/systems/local.py:52-53 explicitly does:
if model_config.type.lower() == "compositemodel": raise NotImplementedError
so as soon as we add evaluator to the config, the evaluate phase blows up.

LMEvaluator with model_class="ortgenai" assumes a single GenAI package and uses Path(model.model_path).parent to locate genai_config.json. For a composite, model.model_path points at one component's .onnx, so .parent ends up at e.g. cuda/int4/models/decoder/ — no genai_config.json there.
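To make the path problem concrete, here is a small sketch that reproduces it on a throwaway directory. The layout is an assumption: the file name `model.onnx` and the placement of `genai_config.json` at the package root (`cuda/int4/models/`) are illustrative guesses, and `find_genai_root` is a hypothetical helper, not an Olive API.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_genai_root(model_path: Path, max_levels: int = 3) -> Optional[Path]:
    """Walk up from a component's .onnx file until genai_config.json is found.

    Hypothetical helper; Olive does not provide this function.
    """
    current = model_path.parent
    for _ in range(max_levels):
        if (current / "genai_config.json").is_file():
            return current
        current = current.parent
    return None


# Build a temporary tree mimicking the assumed composite package layout.
tmp = Path(tempfile.mkdtemp())
root = tmp / "cuda" / "int4" / "models"
decoder_onnx = root / "decoder" / "model.onnx"
decoder_onnx.parent.mkdir(parents=True)
decoder_onnx.touch()
(root / "genai_config.json").write_text("{}")

# The single-package assumption: look next to the .onnx file.
naive_dir = decoder_onnx.parent
print((naive_dir / "genai_config.json").exists())  # False: the config is one level up

# Walking up from the component finds the package root instead.
print(find_genai_root(decoder_onnx) == root)  # True
```

Something along these lines is what "plumb the package root into LMEvaluator" would need to achieve, whether by walking up the tree or by passing the root explicitly.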

So today we have two options:

(a) Keep the standalone eval.py — it loads the GenAI package directly via og.Model(path) and side-steps the Olive evaluator dispatch. Works, just not integrated into the olive run lifecycle.

(b) Add CompositeModel evaluator support to Olive upstream — e.g. let LocalSystem.evaluate_model accept composite + plumb the package root into LMEvaluator. Then we can drop eval.py and add a proper evaluators section here.

For a text-only LLM recipe (single-component output) adding evaluators would Just Work — same pattern as the existing Qwen2.5-1.5B/olive/mixed.json or Phi-4-mini/olive/mixed.json configs in this repo.

Want me to file the Olive issue for composite evaluator support and follow up with a PR there, then come back and add the evaluators section here once it lands? Otherwise I'll leave eval.py as the entry point and note the rationale in the README.

I recommend keeping the current one and updating it later.

…tructure

- Rename MobiusModelBuilder to MobiusBuilder (class renamed in Olive)
- Move target from engine to root level per Olive convention
- Remove unnecessary engine block from CPU config (uses defaults)
- Fix unused variable warnings in inference.py
- Update README: add olive-ai install, fix builder link
- output_dir and log_severity_level at root level (not under engine)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
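The engine-to-root move described in this commit can be pictured as a small dictionary transformation. The sketch below is illustrative only: `hoist_engine_settings` is a hypothetical helper rather than an Olive utility, and it handles just the three keys named in the commit message.

```python
import copy
from typing import Any, Dict

# Keys the commit message says belong at the root rather than under "engine".
ROOT_KEYS = ("target", "output_dir", "log_severity_level")


def hoist_engine_settings(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with selected "engine" keys moved to the root.

    Hypothetical helper for illustration; not part of Olive.
    """
    result = copy.deepcopy(config)
    engine = result.pop("engine", {})
    for key in ROOT_KEYS:
        if key in engine:
            # A value already present at the root takes precedence.
            result.setdefault(key, engine.pop(key))
    if engine:
        # Keep any remaining engine settings untouched.
        result["engine"] = engine
    return result


old_config = {
    "passes": {"mobius_build": {"type": "MobiusBuilder", "precision": "fp32"}},
    "engine": {"target": "local_system"},
}
new_config = hoist_engine_settings(old_config)
```

After the call, `new_config` carries `"target": "local_system"` at the top level and has no `engine` block, while `old_config` is left unchanged.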

justinchuby · 2026-05-08T00:48:00Z

Let me rework this PR

- Replace OnnxBlockWiseRtnQuantization with OnnxKQuantQuantization (bits=4, block_size=32) on the CUDA INT4 recipe to match the mobius-standard k-quant (Q4_K_M) path used elsewhere in the project.
- Add a matching cpu/int4/config.json so CPU users get the same quant.
- Restructure cpu/ from a single flat config to cpu/fp32/ + cpu/int4/ so the layout mirrors cuda/fp16/ + cuda/int4/.
- Delete optimize.py: MobiusBuilder already runs ModelPackage.save() + write_ort_genai_config(), so the wrapper was just `olive run` plus file-existence checks. Users now invoke `olive run --config <path>` directly, matching the convention used by builtin-pass recipes in this repo.
- README, info.yml, inference.py, eval.py, utils.py updated for the new layout and to add the fp32 variant choice.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

The helper was an 8-line function shared by two scripts. Inlining the two copies is shorter than maintaining a separate module + import, and matches the layout of other olive-recipes recipes which keep each script self-contained.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

- Drop task: 'text-generation' from input_model. MobiusBuilder doesn't read it, and Olive's HfModel default (text-generation-with-past) is appropriate for downstream passes too.
- Drop log_severity_level: 1; keep Olive at its default log level so the configs match other builtin recipes.
- Fix cpu/fp32/config.json output_dir: was 'cpu/models' from the old flat layout, should be 'cpu/fp32/models' to match the cpu/fp32/ recipe directory.
- CPU configs already omit target (CPU+CPUExecutionProvider is the Olive default system), CUDA configs keep target at top level without the engine wrapper, both matching the convention xiaoyu-work asked for in PR #381.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
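Putting the conventions from these commits together, a CPU INT4 config might look roughly like the dictionary below. This is a sketch under stated assumptions, not the file from the PR: the pass key `kquant` is a made-up name, the fp32 build precision is carried over from the original k-quant CPU recipe in the description, `output_dir` is inferred by analogy with `cpu/fp32/models`, and the real config may contain further fields such as `save_as_external_data`.

```python
import json

# Hypothetical reconstruction of cpu/int4/config.json; values marked
# "assumed" are not confirmed anywhere in this PR.
config = {
    "input_model": {
        "type": "HfModel",
        "model_path": "google/gemma-4-E2B-it",
        # No "task" key: the commit drops it because MobiusBuilder does not read it.
    },
    "passes": {
        "mobius_build": {"type": "MobiusBuilder", "precision": "fp32"},  # precision assumed
        "kquant": {  # pass key name assumed
            "type": "OnnxKQuantQuantization",
            "bits": 4,
            "block_size": 32,
        },
    },
    # Root-level output_dir (not under "engine"); the path itself is assumed.
    "output_dir": "cpu/int4/models",
    # No "target" and no "log_severity_level": CPU configs rely on Olive defaults.
}

serialized = json.dumps(config, indent=4)
print(serialized)
```

Once written to disk, a file like this would be run with `olive run --config <path>`, as the commit message describes.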

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 23, 2026 23:43

Copilot started reviewing on behalf of justinchuby April 23, 2026 23:45

Copilot AI reviewed Apr 23, 2026


Comment thread google-gemma-4-E2B-it/gemma4_fp32_cpu.json Outdated

Comment thread google-gemma-4-E2B-it/gemma4_int4_cuda.json Outdated

justinchuby and others added 2 commits April 23, 2026 16:52

Update google-gemma-4-E2B-it/gemma4_fp32_cpu.json

de8dd94

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update google-gemma-4-E2B-it/gemma4_int4_cuda.json

8df41d2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot started work on behalf of justinchuby April 24, 2026 00:00

Add info.yml and LICENSE to google-gemma-4-E2B-it

58b231a


Copilot finished work on behalf of justinchuby April 24, 2026 00:04

justinchuby mentioned this pull request May 4, 2026

feat: add MobiusModelBuilder Olive pass for mobius-backed ONNX export microsoft/Olive#2406

Merged

justinchuby added 2 commits May 4, 2026 07:36

docs: update README to mention ORT GenAI config files in output

2f04b0e


Merge branch 'main' into justinchu/gemma4-mobius

0689728

justinchuby commented May 4, 2026


Comment thread google-gemma-4-E2B-it/gemma4_int4_cuda.json Outdated

justinchuby added 2 commits May 4, 2026 07:51

Apply suggestion from @justinchuby

0d866f7

feat: add Gemma4 INT4 k-quant CPU recipe

81fb245

justinchuby mentioned this pull request May 4, 2026

Olive feedback: excellent quantization experience with Gemma4 models microsoft/Olive#2440

Open

justinchuby requested review from jambayk and xiaoyu-work May 6, 2026 00:18