Add Gemma 4 E2B recipes (MobiusBuilder + INT4)#381

Open
justinchuby wants to merge 16 commits into main from justinchu/gemma4-mobius

@justinchuby justinchuby commented Apr 23, 2026

Olive recipes for google/gemma-4-E2B-it using the MobiusModelBuilder pass (Olive PR microsoft/Olive#2406).

Recipes

| Config | Pipeline | Output |
| --- | --- | --- |
| gemma4_fp32_cpu.json | MobiusModelBuilder (fp32) | 4 ONNX components (~5GB) |
| gemma4_int4_cuda.json | MobiusModelBuilder (fp16) → OnnxBlockWiseRtnQuantization (int4) | 4 quantized ONNX components (~2.8GB) |
| gemma4_int4_kquant_cpu.json | MobiusModelBuilder (fp32) → OnnxKQuantQuantization (int4) | 4 quantized ONNX components (k-quant, CPU) |

Prerequisites

pip install olive-ai[gpu] mobius-ai

Validated

INT4 CUDA pipeline tested end-to-end (~3.5 min):

  • 97-98% of weights quantized (MatMulNBits + GatherBlockQuantized)
  • 4 components: decoder (2.4G), audio (152M), embedding (199M), vision (89M)
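
As a quick sanity check, the per-component sizes listed above can be summed and compared with the ~2.8GB figure in the recipes table. This is a minimal sketch that assumes "2.4G" means 2400 MB in decimal units; the dictionary keys are just labels taken from the list above.

```python
# Component sizes as reported in the validation notes, converted to MB.
# Assumption: "2.4G" is read as 2400 MB (decimal units, not GiB).
component_sizes_mb = {
    "decoder": 2400,
    "audio": 152,
    "embedding": 199,
    "vision": 89,
}

total_mb = sum(component_sizes_mb.values())
total_gb = total_mb / 1000

# Fraction of the package taken up by the decoder alone.
decoder_share = component_sizes_mb["decoder"] / total_mb

print(f"total: {total_gb:.2f} GB, decoder share: {decoder_share:.0%}")
```

Under that assumption the total comes to about 2.84 GB, consistent with the ~2.8GB reported for the INT4 CUDA output, and the decoder accounts for most of it.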

Two Olive pipeline configs for google/gemma-4-E2B-it:
- gemma4_fp32_cpu.json: FP32 build for CPU
- gemma4_int4_cuda.json: FP16 build + OnnxBlockWiseRtnQuantization INT4

Requires olive-ai with MobiusModelBuilder pass (Olive PR #2406)
and mobius-ai package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 23:43
Contributor

Copilot AI left a comment

Pull request overview

Adds Olive recipe configs for building and quantizing the HuggingFace google/gemma-4-E2B-it model using the MobiusModelBuilder pass, including an INT4 CUDA pipeline.

Changes:

  • Added FP32 CPU build recipe (gemma4_fp32_cpu.json) using MobiusModelBuilder.
  • Added FP16 build + INT4 RTN quantization CUDA recipe (gemma4_int4_cuda.json) using OnnxBlockWiseRtnQuantization.
  • Added model-specific README documenting prerequisites and how to run both recipes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| google-gemma-4-E2B-it/gemma4_int4_cuda.json | New CUDA recipe: build FP16 with Mobius, then INT4 block-wise RTN quantization. |
| google-gemma-4-E2B-it/gemma4_fp32_cpu.json | New CPU recipe: build FP32 with Mobius. |
| google-gemma-4-E2B-it/README.md | Usage docs for running the two new recipes. |

Comment thread google-gemma-4-E2B-it/gemma4_fp32_cpu.json Outdated
Comment thread google-gemma-4-E2B-it/gemma4_int4_cuda.json Outdated
justinchuby and others added 2 commits April 23, 2026 16:52
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@devang-ml
Contributor

Please add info.yml and LICENSE file.

@justinchuby
Author

@copilot Add info.yml and LICENSE file.

Agent-Logs-Url: https://github.com/microsoft/olive-recipes/sessions/5a084083-4534-4971-82c9-91d806abcd01

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Contributor

Copilot AI commented Apr 24, 2026

> @copilot Add info.yml and LICENSE file.

Added info.yml and LICENSE (Apache 2.0, matching the google-deepmind/gemma repo) to google-gemma-4-E2B-it/ in commit 58b231a.

> Please add info.yml and LICENSE file.

Added info.yml and LICENSE (Apache 2.0) to google-gemma-4-E2B-it/ in commit 58b231a.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • ai.google.dev
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

The MobiusModelBuilder pass now automatically generates ORT GenAI config
files (genai_config.json, tokenizer.json, etc.) alongside the ONNX models.
This commit updates the recipe documentation to reflect this feature.
Comment thread google-gemma-4-E2B-it/gemma4_int4_cuda.json Outdated
}
},
"passes": {
"mobius_build": { "type": "MobiusModelBuilder", "precision": "fp32" },
Contributor

Is fp32 the default precision?

Author

We can just use bf16 too. Is it well supported?

Comment thread google-gemma-4-E2B-it/gemma4_int4_kquant_cpu.json Outdated
Comment thread google-gemma-4-E2B-it/gemma4_int4_kquant_cpu.json Outdated
"save_as_external_data": true
}
},
"engine": { "target": "local_system" },
Contributor

Remove engine and move target to the top level.
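
The requested change can be illustrated with a small transformation on the config dictionary. This is only a sketch: `hoist_engine_settings` is a hypothetical helper written for this example, not part of Olive, and the `legacy` dictionary reproduces just the fragments quoted in the review threads rather than the full recipe file.

```python
import copy


def hoist_engine_settings(config: dict) -> dict:
    """Return a copy of config with the keys of its 'engine' block moved to the top level.

    Hypothetical helper for illustration; existing top-level keys win on conflict.
    """
    result = copy.deepcopy(config)
    engine = result.pop("engine", {})
    for key, value in engine.items():
        result.setdefault(key, value)
    return result


# Fragment modeled on the snippets quoted in the review comments.
legacy = {
    "passes": {
        "mobius_build": {"type": "MobiusModelBuilder", "precision": "fp32"},
    },
    "engine": {"target": "local_system"},
}

flat = hoist_engine_settings(legacy)
print(flat)
```

The input dictionary is left untouched, so the old and new layouts can be compared side by side.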

Comment thread google-gemma-4-E2B-it/README.md Outdated
Comment thread google-gemma-4-E2B-it/README.md Outdated
Restructure the Gemma4 E2B recipe directory to follow the established
olive-recipes pattern (Qwen-Qwen3.5-4B/builtin):

Directory structure:
- cpu/config.json: FP32 CPU baseline (MobiusModelBuilder)
- cuda/fp16/config.json: FP16 CUDA (MobiusModelBuilder)
- cuda/int4/config.json: FP16 + INT4 block-wise RTN quantization

New scripts:
- optimize.py: Orchestrator with output validation (checks all 4 ONNX
  components, genai_config.json sections, special tokens, tokenizer)
- eval.py: MMLU Pro benchmark via Olive's LMEvalORTGenAIEvaluator
  (lm-eval ortgenai backend)
- inference.py: ORT GenAI text inference with chat template formatting

Changes from original:
- Replaced flat JSON files with device/variant directory layout
- Removed k-quant CPU recipe (fp32 CPU baseline is more useful for
  debugging; k-quant can be re-added if needed)
- Added FP16 CUDA recipe as unquantized GPU baseline
- Added .gitignore, requirements.txt

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
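
The output validation described for optimize.py in the commit message above might look roughly like the following. This is a sketch under stated assumptions, not the script from the PR: the one-subdirectory-per-component layout, the `model` and `search` section names, and the `validate_output` function itself are illustrative guesses, and the special-token checks are omitted.

```python
import json
import tempfile
from pathlib import Path

# Component names as listed later in the thread; the directory layout is assumed.
EXPECTED_COMPONENTS = ("decoder", "embedding", "vision_encoder", "audio_encoder")
# Assumed top-level sections of genai_config.json.
EXPECTED_SECTIONS = ("model", "search")


def validate_output(package_dir: Path) -> list[str]:
    """Return a list of human-readable problems found in a built model package."""
    problems = []
    for name in EXPECTED_COMPONENTS:
        component_dir = package_dir / name
        if not (component_dir.is_dir() and any(component_dir.glob("*.onnx"))):
            problems.append(f"missing ONNX file for component '{name}'")

    config_path = package_dir / "genai_config.json"
    if not config_path.is_file():
        problems.append("missing genai_config.json")
    else:
        config = json.loads(config_path.read_text())
        for section in EXPECTED_SECTIONS:
            if section not in config:
                problems.append(f"genai_config.json lacks section '{section}'")

    if not (package_dir / "tokenizer.json").is_file():
        problems.append("missing tokenizer.json")
    return problems


# Demo on a deliberately incomplete package built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp)
    (package / "decoder").mkdir()
    (package / "decoder" / "model.onnx").touch()
    (package / "genai_config.json").write_text(json.dumps({"model": {}}))
    problems = validate_output(package)

for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first failure lets a caller report everything that is missing in one pass.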
Comment thread google-gemma-4-E2B-it/inference.py Fixed
Comment thread google-gemma-4-E2B-it/inference.py Fixed
import time

# Register Olive's ORT GenAI evaluator with lm-eval
import olive.evaluator.lmeval_ort # noqa: F401
- Add past_present_share_buffer=False to eval.py for correct KV cache
  handling, matching the setting already in inference.py
- Extract shared resolve_model_path into utils.py to deduplicate
  path resolution logic between eval.py and inference.py
- Fix YAML list syntax in info.yml for ep and device fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Comment thread google-gemma-4-E2B-it/inference.py Fixed
Comment thread google-gemma-4-E2B-it/inference.py Fixed
import time

# Register Olive's ORT GenAI evaluator with lm-eval
import olive.evaluator.lmeval_ort # noqa: F401
Contributor

Since Olive has a built-in evaluator (https://github.com/microsoft/Olive/blob/main/olive/evaluator/lmeval_ort.py), can you add an evaluator section to the config file?

Author

@justinchuby justinchuby May 22, 2026

Copilot:

Looked into this — it doesn't work yet for these particular configs, blocked by an Olive-side gap:

  1. google/gemma-4-E2B-it is multimodal, so mobius.build() selects the gemma4 task and returns 4 components (decoder / embedding / vision_encoder / audio_encoder). MobiusBuilder therefore returns a CompositeModelHandler.
  2. olive/systems/local.py:52-53 explicitly does:
    if model_config.type.lower() == "compositemodel":
        raise NotImplementedError
    so as soon as we add evaluator to the config, the evaluate phase blows up.
  3. LMEvaluator with model_class="ortgenai" assumes a single GenAI package and uses Path(model.model_path).parent to locate genai_config.json. For a composite, model.model_path points at one component's .onnx, so .parent ends up at e.g. cuda/int4/models/decoder/ — no genai_config.json there.
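
Point 3 can be reproduced without Olive using only pathlib. The sketch below assumes genai_config.json sits at the package root (here cuda/int4/models/) with each component in its own subdirectory; the file name model.onnx and the helper `find_package_root` are hypothetical and are not Olive APIs.

```python
import tempfile
from pathlib import Path


def find_package_root(component_path: Path, marker: str = "genai_config.json"):
    """Walk up from a component file until a directory containing the marker is found.

    Hypothetical helper; returns None when no ancestor holds the marker file.
    """
    for parent in component_path.parents:
        if (parent / marker).is_file():
            return parent
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Assumed layout: <root>/genai_config.json and <root>/decoder/model.onnx
    root = Path(tmp) / "cuda" / "int4" / "models"
    decoder = root / "decoder" / "model.onnx"
    decoder.parent.mkdir(parents=True)
    decoder.touch()
    (root / "genai_config.json").write_text("{}")

    # The single-package assumption: look next to the .onnx file.
    naive_found = (decoder.parent / "genai_config.json").exists()

    # Walking up the tree locates the package root instead.
    resolved = find_package_root(decoder)
    resolved_ok = resolved == root

print(f"found next to component: {naive_found}, found by walking up: {resolved_ok}")
```

Something along these lines is what "plumb the package root into LMEvaluator" would need to supply, whether by searching upward as here or by passing the root explicitly.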

So today we have two options:

(a) Keep the standalone eval.py — it loads the GenAI package directly via og.Model(path) and side-steps the Olive evaluator dispatch. Works, just not integrated into the olive run lifecycle.

(b) Add CompositeModel evaluator support to Olive upstream — e.g. let LocalSystem.evaluate_model accept composite + plumb the package root into LMEvaluator. Then we can drop eval.py and add a proper evaluators section here.

For a text-only LLM recipe (single-component output) adding evaluators would Just Work — same pattern as the existing Qwen2.5-1.5B/olive/mixed.json or Phi-4-mini/olive/mixed.json configs in this repo.

Want me to file the Olive issue for composite evaluator support and follow up with a PR there, then come back and add the evaluators section here once it lands? Otherwise I'll leave eval.py as the entry point and note the rationale in the README.

Author

I recommend keeping the current one and updating later.

Comment thread google-gemma-4-E2B-it/cuda/int4/config.json Outdated
@justinchuby justinchuby changed the title Add Gemma 4 E2B recipes (MobiusModelBuilder + INT4) Add Gemma 4 E2B recipes (MobiusBuilder + INT4) May 7, 2026
…tructure

- Rename MobiusModelBuilder to MobiusBuilder (class renamed in Olive)
- Move target from engine to root level per Olive convention
- Remove unnecessary engine block from CPU config (uses defaults)
- Fix unused variable warnings in inference.py
- Update README: add olive-ai install, fix builder link
- output_dir and log_severity_level at root level (not under engine)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
@justinchuby justinchuby marked this pull request as draft May 7, 2026 23:57
@justinchuby
Author

Let me rework this PR

@justinchuby justinchuby self-assigned this May 9, 2026
- Replace OnnxBlockWiseRtnQuantization with OnnxKQuantQuantization (bits=4,
  block_size=32) on the CUDA INT4 recipe to match the mobius-standard
  k-quant (Q4_K_M) path used elsewhere in the project.
- Add a matching cpu/int4/config.json so CPU users get the same quant.
- Restructure cpu/ from a single flat config to cpu/fp32/ + cpu/int4/ so
  the layout mirrors cuda/fp16/ + cuda/int4/.
- Delete optimize.py: MobiusBuilder already runs ModelPackage.save() +
  write_ort_genai_config(), so the wrapper was just `olive run` plus
  file-existence checks. Users now invoke `olive run --config <path>`
  directly, matching the convention used by builtin-pass recipes in
  this repo.
- README, info.yml, inference.py, eval.py, utils.py updated for the
  new layout and to add the fp32 variant choice.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
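
For orientation, a CUDA INT4 config along the lines described in the commit message above could be assembled as follows. The actual cuda/int4/config.json is not shown in this conversation, so this is a sketch only: the pass names `mobius_build` and `quantize`, the `systems` block, and the `build_int4_config` helper are assumptions, while the pass types, `bits=4`, `block_size=32`, and the top-level `target` and `output_dir` follow what the thread states.

```python
import json


def build_int4_config(model_id: str, output_dir: str) -> dict:
    """Assemble an illustrative Olive config: FP16 Mobius build, then INT4 k-quant.

    Hypothetical helper; key names not mentioned in the thread are assumptions.
    """
    return {
        "input_model": {"type": "HfModel", "model_path": model_id},
        # Assumed system definition for a local CUDA target.
        "systems": {
            "local_system": {
                "type": "LocalSystem",
                "accelerators": [
                    {"device": "gpu", "execution_providers": ["CUDAExecutionProvider"]}
                ],
            }
        },
        # Passes run in insertion order: build first, then quantize.
        "passes": {
            "mobius_build": {"type": "MobiusBuilder", "precision": "fp16"},
            "quantize": {"type": "OnnxKQuantQuantization", "bits": 4, "block_size": 32},
        },
        # target and output_dir sit at the top level, with no engine wrapper.
        "target": "local_system",
        "output_dir": output_dir,
    }


config = build_int4_config("google/gemma-4-E2B-it", "cuda/int4/models")
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and passed to `olive run --config <path>`, as the commit message describes, once the assumed keys are checked against the real recipe.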
The helper was an 8-line function shared by two scripts. Inlining the
two copies is shorter than maintaining a separate module + import,
and matches the layout of other olive-recipes recipes which keep
each script self-contained.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
- Drop task: 'text-generation' from input_model. MobiusBuilder doesn't
  read it, and Olive's HfModel default (text-generation-with-past) is
  appropriate for downstream passes too.
- Drop log_severity_level: 1 — keep Olive at its default log level so
  the configs match other builtin recipes.
- Fix cpu/fp32/config.json output_dir: was 'cpu/models' from the old
  flat layout, should be 'cpu/fp32/models' to match the cpu/fp32/
  recipe directory.
- CPU configs already omit target (CPU+CPUExecutionProvider is the
  Olive default system), CUDA configs keep target at top level
  without the engine wrapper — both matching the convention
  xiaoyu-work asked for in PR #381.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
@justinchuby justinchuby marked this pull request as ready for review May 22, 2026 23:01
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
6 participants