Add Gemma 4 E2B recipes (MobiusBuilder + INT4) #381
Two Olive pipeline configs for google/gemma-4-E2B-it:
- gemma4_fp32_cpu.json: FP32 build for CPU
- gemma4_int4_cuda.json: FP16 build + OnnxBlockWiseRtnQuantization INT4

Requires olive-ai with MobiusModelBuilder pass (Olive PR #2406) and mobius-ai package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull request overview
Adds Olive recipe configs for building and quantizing the HuggingFace google/gemma-4-E2B-it model using the MobiusModelBuilder pass, including an INT4 CUDA pipeline.
Changes:
- Added FP32 CPU build recipe (`gemma4_fp32_cpu.json`) using `MobiusModelBuilder`.
- Added FP16 build + INT4 RTN quantization CUDA recipe (`gemma4_int4_cuda.json`) using `OnnxBlockWiseRtnQuantization`.
- Added model-specific README documenting prerequisites and how to run both recipes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| google-gemma-4-E2B-it/gemma4_int4_cuda.json | New CUDA recipe: build FP16 with Mobius, then INT4 block-wise RTN quantization. |
| google-gemma-4-E2B-it/gemma4_fp32_cpu.json | New CPU recipe: build FP32 with Mobius. |
| google-gemma-4-E2B-it/README.md | Usage docs for running the two new recipes. |
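For readers unfamiliar with the term, the idea behind block-wise round-to-nearest (RTN) INT4 quantization can be sketched in plain Python. This is an illustration only: it assumes a symmetric scheme with one scale per block, and the function names are hypothetical. It is not the implementation of `OnnxBlockWiseRtnQuantization`, which may differ (for example by using zero points or a different packing).

```python
def quantize_block_rtn(block, bits=4):
    """Symmetric round-to-nearest quantization of one block of floats."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(x) for x in block)
    # One scale per block; fall back to 1.0 for an all-zero block.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [min(qmax, max(qmin, round(x / scale))) for x in block]
    return quantized, scale


def quantize_blockwise(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block_rtn(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (quantized values, scale) pairs."""
    return [q * scale for quantized, scale in blocks for q in quantized]
```

Because each block carries its own scale, an outlier weight only degrades the precision of the block it sits in, which is the usual motivation for quantizing block by block rather than per tensor.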
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Please add info.yml and LICENSE file.

@copilot Add info.yml and LICENSE file.
Agent-Logs-Url: https://github.com/microsoft/olive-recipes/sessions/5a084083-4534-4971-82c9-91d806abcd01 Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Added
The MobiusModelBuilder pass now automatically generates ORT GenAI config files (genai_config.json, tokenizer.json, etc.) alongside the ONNX models. This commit updates the recipe documentation to reflect this feature.
        }
    },
    "passes": {
        "mobius_build": { "type": "MobiusModelBuilder", "precision": "fp32" },
Is fp32 the default precision?
We can just use bf16 too. Is it well supported?
            "save_as_external_data": true
        }
    },
    "engine": { "target": "local_system" },
Remove `engine`, and move `target` to the top level.
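The requested change can be sketched as a small transformation on a parsed config. The helper below is hypothetical and not part of Olive. It assumes that keys nested under `engine` are accepted at the root of the config, as the later commit in this PR describes for `target`, `output_dir`, and `log_severity_level`.

```python
import copy


def hoist_engine_keys(config):
    """Return a copy of `config` with the keys under "engine" moved to the root.

    Hypothetical migration helper. Existing root-level keys win over
    engine-level ones, and the input dict is left untouched.
    """
    migrated = copy.deepcopy(config)
    engine = migrated.pop("engine", None)
    if isinstance(engine, dict):
        for key, value in engine.items():
            migrated.setdefault(key, value)
    return migrated


old_config = {
    "passes": {"mobius_build": {"type": "MobiusModelBuilder", "precision": "fp32"}},
    "engine": {"target": "local_system"},
}
new_config = hoist_engine_keys(old_config)
```

After the call, `new_config` has `"target": "local_system"` at the root and no `engine` block, while `old_config` is unchanged.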
Restructure the Gemma4 E2B recipe directory to follow the established olive-recipes pattern (Qwen-Qwen3.5-4B/builtin):

Directory structure:
- cpu/config.json: FP32 CPU baseline (MobiusModelBuilder)
- cuda/fp16/config.json: FP16 CUDA (MobiusModelBuilder)
- cuda/int4/config.json: FP16 + INT4 block-wise RTN quantization

New scripts:
- optimize.py: Orchestrator with output validation (checks all 4 ONNX components, genai_config.json sections, special tokens, tokenizer)
- eval.py: MMLU Pro benchmark via Olive's LMEvalORTGenAIEvaluator (lm-eval ortgenai backend)
- inference.py: ORT GenAI text inference with chat template formatting

Changes from original:
- Replaced flat JSON files with device/variant directory layout
- Removed k-quant CPU recipe (fp32 CPU baseline is more useful for debugging; k-quant can be re-added if needed)
- Added FP16 CUDA recipe as unquantized GPU baseline
- Added .gitignore, requirements.txt

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
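The kind of output validation described for optimize.py can be sketched roughly as follows. This is not the script from the PR. The function name is hypothetical, and the layout is an assumption: one subdirectory per component holding an `.onnx` file, with `genai_config.json` and `tokenizer.json` at the package root.

```python
from pathlib import Path

# Component names as listed in the review discussion for the gemma4 task.
COMPONENTS = ("decoder", "embedding", "vision_encoder", "audio_encoder")


def missing_outputs(package_root, components=COMPONENTS):
    """Return a list of expected build outputs that are absent.

    Assumed layout (not confirmed by the source):
        <package_root>/<component>/*.onnx
        <package_root>/genai_config.json
        <package_root>/tokenizer.json
    """
    root = Path(package_root)
    missing = []
    for name in components:
        component_dir = root / name
        has_onnx = component_dir.is_dir() and any(component_dir.glob("*.onnx"))
        if not has_onnx:
            missing.append(f"{name}/*.onnx")
    for filename in ("genai_config.json", "tokenizer.json"):
        if not (root / filename).is_file():
            missing.append(filename)
    return missing
```

An empty return value means every expected file was found; a caller could fail the build otherwise.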
    import time

    # Register Olive's ORT GenAI evaluator with lm-eval
    import olive.evaluator.lmeval_ort  # noqa: F401
- Add past_present_share_buffer=False to eval.py for correct KV cache handling, matching the setting already in inference.py
- Extract shared resolve_model_path into utils.py to deduplicate path resolution logic between eval.py and inference.py
- Fix YAML list syntax in info.yml for ep and device fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
    import time

    # Register Olive's ORT GenAI evaluator with lm-eval
    import olive.evaluator.lmeval_ort  # noqa: F401
Since Olive has a built-in evaluator (https://github.com/microsoft/Olive/blob/main/olive/evaluator/lmeval_ort.py), can you add an evaluator section to the config file?
Copilot:
Looked into this — it doesn't work yet for these particular configs, blocked by an Olive-side gap:
- `google/gemma-4-E2B-it` is multimodal, so `mobius.build()` selects the `gemma4` task and returns 4 components (decoder / embedding / vision_encoder / audio_encoder). MobiusBuilder therefore returns a `CompositeModelHandler`.
- `olive/systems/local.py:52-53` explicitly does:

      if model_config.type.lower() == "compositemodel":
          raise NotImplementedError

  so as soon as we add `evaluator` to the config, the evaluate phase blows up.
- `LMEvaluator` with `model_class="ortgenai"` assumes a single GenAI package and uses `Path(model.model_path).parent` to locate `genai_config.json`. For a composite, `model.model_path` points at one component's `.onnx`, so `.parent` ends up at e.g. `cuda/int4/models/decoder/`, where there is no `genai_config.json`.
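The path problem in the last point can be reproduced with `pathlib` alone. This is a sketch, not Olive code: `evaluator_config_dir` is a hypothetical stand-in for the lookup described above, and the file name `model.onnx` is an assumption. Only the directory `cuda/int4/models/decoder/` comes from the discussion.

```python
from pathlib import PurePosixPath


def evaluator_config_dir(model_path):
    """Mimic the described lookup: the directory containing the model file."""
    return PurePosixPath(model_path).parent


# Hypothetical component file inside the composite package.
composite_component = "cuda/int4/models/decoder/model.onnx"
# Assumed location of genai_config.json for the whole package.
package_root = PurePosixPath("cuda/int4/models")

looked_up = evaluator_config_dir(composite_component)
# The lookup lands one level too deep, inside the component directory.
lands_on_package_root = looked_up == package_root
```

Under these assumptions `lands_on_package_root` is false, and the package root is `looked_up.parent`, which is why option (b) talks about plumbing the package root into the evaluator.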
So today we have two options:
(a) Keep the standalone eval.py — it loads the GenAI package directly via og.Model(path) and side-steps the Olive evaluator dispatch. Works, just not integrated into the olive run lifecycle.
(b) Add CompositeModel evaluator support to Olive upstream — e.g. let LocalSystem.evaluate_model accept composite + plumb the package root into LMEvaluator. Then we can drop eval.py and add a proper evaluators section here.
For a text-only LLM recipe (single-component output) adding evaluators would Just Work — same pattern as the existing Qwen2.5-1.5B/olive/mixed.json or Phi-4-mini/olive/mixed.json configs in this repo.
Want me to file the Olive issue for composite evaluator support and follow up with a PR there, then come back and add the evaluators section here once it lands? Otherwise I'll leave eval.py as the entry point and note the rationale in the README.
I recommend keeping the current one and updating it later.
…tructure

- Rename MobiusModelBuilder to MobiusBuilder (class renamed in Olive)
- Move target from engine to root level per Olive convention
- Remove unnecessary engine block from CPU config (uses defaults)
- Fix unused variable warnings in inference.py
- Update README: add olive-ai install, fix builder link
- output_dir and log_severity_level at root level (not under engine)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Let me rework this PR
- Replace OnnxBlockWiseRtnQuantization with OnnxKQuantQuantization (bits=4, block_size=32) on the CUDA INT4 recipe to match the mobius-standard k-quant (Q4_K_M) path used elsewhere in the project.
- Add a matching cpu/int4/config.json so CPU users get the same quant.
- Restructure cpu/ from a single flat config to cpu/fp32/ + cpu/int4/ so the layout mirrors cuda/fp16/ + cuda/int4/.
- Delete optimize.py: MobiusBuilder already runs ModelPackage.save() + write_ort_genai_config(), so the wrapper was just `olive run` plus file-existence checks. Users now invoke `olive run --config <path>` directly, matching the convention used by builtin-pass recipes in this repo.
- README, info.yml, inference.py, eval.py, utils.py updated for the new layout and to add the fp32 variant choice.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
The helper was an 8-line function shared by two scripts. Inlining the two copies is shorter than maintaining a separate module + import, and matches the layout of other olive-recipes recipes which keep each script self-contained. Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
- Drop task: 'text-generation' from input_model. MobiusBuilder doesn't read it, and Olive's HfModel default (text-generation-with-past) is appropriate for downstream passes too.
- Drop log_severity_level: 1; keep Olive at its default log level so the configs match other builtin recipes.
- Fix cpu/fp32/config.json output_dir: was 'cpu/models' from the old flat layout, should be 'cpu/fp32/models' to match the cpu/fp32/ recipe directory.
- CPU configs already omit target (CPU+CPUExecutionProvider is the Olive default system), and CUDA configs keep target at top level without the engine wrapper. Both match the convention xiaoyu-work asked for in PR #381.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
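A stale `output_dir` like the one fixed above is easy to catch mechanically. The sketch below assumes the convention that each `<device>/<variant>/config.json` writes to `<device>/<variant>/models`, which is inferred from the `cpu/fp32/models` fix and is not a documented Olive rule. Both function names are hypothetical.

```python
from pathlib import PurePosixPath


def expected_output_dir(config_path):
    """Assumed convention: outputs live in a models/ folder next to the config."""
    return str(PurePosixPath(config_path).parent / "models")


def find_mismatches(configs):
    """Map each offending config path to (actual, expected) output_dir.

    `configs` maps a config path to its already-parsed JSON dict.
    """
    mismatches = {}
    for path, config in configs.items():
        expected = expected_output_dir(path)
        actual = config.get("output_dir")
        if actual != expected:
            mismatches[path] = (actual, expected)
    return mismatches


configs = {
    "cpu/fp32/config.json": {"output_dir": "cpu/models"},        # stale value
    "cuda/int4/config.json": {"output_dir": "cuda/int4/models"},  # consistent
}
stale = find_mismatches(configs)
```

With the two sample entries, only the `cpu/fp32/config.json` config is reported, paired with the expected value `cpu/fp32/models`.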
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Olive recipes for google/gemma-4-E2B-it using the MobiusModelBuilder pass (Olive PR microsoft/Olive#2406).
Recipes
- gemma4_fp32_cpu.json
- gemma4_int4_cuda.json
- gemma4_int4_kquant_cpu.json

Prerequisites
Validated
INT4 CUDA pipeline tested end-to-end (~3.5 min):