Skip to content

fix(providers): honor llama.cpp per-model context windows#3311

Merged
can1357 merged 4 commits into
mainfrom
farm/b1c3c65a/fix-llama-cpp-context-window
Jun 24, 2026
Merged

fix(providers): honor llama.cpp per-model context windows#3311
can1357 merged 4 commits into
mainfrom
farm/b1c3c65a/fix-llama-cpp-context-window

Conversation

@roboomp

@roboomp roboomp commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Repro

A focused bun -e "$REPRO_SCRIPT" probe called discoverLlamaCppModels() with /v1/models returning ctx-88k as meta.n_ctx: 88832 and ctx-65k as meta.n_ctx: 65536 while /props returned n_ctx: 0; before the fix both discovered models reported contextWindow=128000.

Cause

packages/coding-agent/src/config/model-discovery.ts typed llama.cpp /v1/models entries as { id: string }, discarding data[].meta.n_ctx and meta.n_ctx_train, then built every model from the single /props server context window. packages/coding-agent/src/config/model-registry.ts also treated fresh llama.cpp discovery cache rows as reusable for 24h, so a restarted server could keep stale context metadata until manual cache deletion.

Fix

  • Parse llama.cpp /v1/models entries with meta.n_ctx and meta.n_ctx_train, preferring those per-model values over /props and falling back to /props only for unloaded models.
  • Add selected-model metadata refresh so a lazy-loaded/sleeping llama.cpp model patches its registry contextWindow and discovery-derived maxTokens when meta.n_ctx appears.
  • Bypass fresh-cache reuse for llama.cpp refreshes so normal refresh paths re-query the local server after restarts, while retaining cached fallback on discovery failure.
  • Add regression coverage for per-model context discovery, lazy selected-model patching, and fresh-cache bypass.

Verification

Ran bun -e "$REPRO_SCRIPT" and observed ctx-88k contextWindow=88832 plus ctx-65k contextWindow=65536; ran bun test packages/coding-agent/test/model-discovery.test.ts with 22 passing tests; ran bun check successfully. Fixes #3310

Read per-model llama.cpp meta.n_ctx values during discovery, refresh selected models after lazy load, and bypass fresh cache reuse for llama.cpp refreshes so server restarts update context windows.\n\nFixes #3310

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24adad2890

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +778 to +781
const override = this.#resolveLiveModelOverride(current);
const patch: ModelPatch = {};
if (override?.contextWindow === undefined && current.contextWindow !== contextWindow) {
patch.contextWindow = contextWindow;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect custom llama.cpp model limits

When a llama.cpp provider uses discovery and also defines a same-id custom model in models.yml, user-pinned contextWindow/maxTokens from that custom model are merged via #mergeCustomModels, not stored in #modelOverrides. This guard only checks modelOverrides (and the maxTokens guard below does the same), so simply selecting the model can replace the configured limits with /models metadata and mutate the registry until reload; please treat matching custom model fields as overrides here as well.

Useful? React with 👍 / 👎.

Treat same-id custom llama.cpp model contextWindow and maxTokens fields as pinned limits during selected-model metadata refresh.\n\nFixes #3310
@roboomp

roboomp commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Updated #3311:

  • Selected-model llama.cpp metadata refresh now treats same-id custom models.yml contextWindow and maxTokens as pinned limits, matching modelOverrides behavior.
  • Added a regression test covering a discovered llama.cpp provider with a same-id custom model whose limits must survive meta.n_ctx refresh.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15d0e95d6f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if (!isLlamaCppDiscovery) {
return model;
}
const contextWindow = await discoverLlamaCppModelContextWindow(model, this.#discoveryContext());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid resolving keys while switching llama.cpp models

When the selected llama.cpp provider is configured with apiKey: "!..." (or an OAuth credential that needs refresh), this call goes through #discoveryContext(), whose resolver calls getApiKeyForProvider; command-backed keys run execSync with a 10s timeout before the model switch completes. setModel intentionally used hasConfiguredAuth so /model and role switches stayed non-blocking and credentials were resolved lazily per request, so selecting a protected llama.cpp model can now freeze the UI just to refresh metadata. Use a non-resolving auth probe here, or defer this metadata refresh to the request path where the key is already being resolved.

Useful? React with 👍 / 👎.

Comment thread packages/coding-agent/CHANGELOG.md Outdated

### Fixed

- Fixed llama.cpp discovery to prefer per-model `/v1/models` `meta.n_ctx`/`meta.n_ctx_train` values, refresh selected models after lazy load, and bypass fresh-cache reuse so server restarts update context windows. ([#3310](https://github.com/can1357/oh-my-pi/issues/3310))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move the changelog entry to Unreleased

This new entry is under the versioned ## [16.1.16] - 2026-06-23 section while ## [Unreleased] is empty. The root AGENTS.md says new entries must go under Unreleased and released sections are immutable, so the next release notes will omit this fix while also rewriting historical changelog content.

Useful? React with 👍 / 👎.

Use a non-resolving discovery context for selected-model llama.cpp metadata refresh so command-backed and OAuth credentials stay lazy during model switches.\n\nFixes #3310
@roboomp

roboomp commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Updated #3311:

  • Selected-model llama.cpp metadata refresh now uses a non-resolving discovery context, so /model switches do not execute command-backed keys or refresh OAuth just to read /v1/models metadata.
  • Added a regression test proving command-backed apiKey: "!..." is not invoked during selected-model metadata refresh.

Move the llama.cpp per-model context window entry from the released 16.1.16 section to [Unreleased] so the next release notes carry it.\n\nFixes #3310
@roboomp

roboomp commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Updated #3311:

  • Moved the llama.cpp per-model context window changelog entry from the released [16.1.16] section to [Unreleased] so the next release carries it.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e431f5caa6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if (!isLlamaCppDiscovery) {
return model;
}
const contextWindow = await discoverLlamaCppModelContextWindow(model, this.#nonResolvingDiscoveryContext());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve llama.cpp auth for metadata refresh

When a llama.cpp server is protected by a configured or stored API key, this new selected-model refresh path uses #nonResolvingDiscoveryContext, whose resolver always returns undefined, so discoverLlamaCppModelContextWindow calls /models without Authorization. A protected endpoint returns 401 and the helper silently falls back to the cached/default model, leaving the loaded model's context window unpatched; avoid resolving command/OAuth credentials without dropping already available bearer credentials.

Useful? React with 👍 / 👎.

@can1357 can1357 merged commit 94060e8 into main Jun 24, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

vouched Passed the vouch gate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

llama.cpp: context window stale and shared across all models — ignores per-model meta.n_ctx from /v1/models

2 participants