Skip to content

Support Gateways, Simplify subdomain logic and allow just using --site or DD_SITE (breaking: removes --subdomain)#576

Draft
platinummonkey wants to merge 10 commits into
mainfrom
feat/site-host-management
Draft

Support Gateways, Simplify subdomain logic and allow just using --site or DD_SITE (breaking: removes --subdomain)#576
platinummonkey wants to merge 10 commits into
mainfrom
feat/site-host-management

Conversation

@platinummonkey

@platinummonkey platinummonkey commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

Removes the --subdomain login flag and makes a single site (from --site/DD_SITE) the one host pup records. Canonical Datadog sites still derive api./app. subdomains; everything else (vanity domains, custom gateways) is used verbatim for API requests and OAuth. Fixes #574.

Changes

  • One site value. Drop --subdomain; pass the full host directly (pup auth login --site mycompany.datadoghq.com). KNOWN_SITES marks canonical Datadog endpoints (which keep api./app. derivation); anything else is a literal host used verbatim (src/config.rs, src/commands/auth.rs, src/main.rs).
  • Single SDK routing path. make_dd_config routes the SDK at cfg.api_base_url() for every case (mock, canonical, literal), replacing the prior three-way server_index branch (src/client.rs).
  • Shared host classification. uses_datadog_subdomains() is the one predicate behind api_host_for/auth_host_for and login site resolution, so the API host, auth host, and token-exchange host can't diverge (src/config.rs).
  • Gateway logins stay pinned. For literal hosts the OAuth callback's domain is ignored so the token, session, and DCR credential stay keyed to the user's host. For canonical sites the callback region is adopted only if it's a Datadog-owned host (*.datadoghq.com, *.datadoghq.eu, *.ddog-gov.com, *.datad0g.com) — closing a pre-existing gap where an attacker-influenceable callback could redirect the code exchange (src/commands/auth.rs, src/config.rs).
  • Validation. All site sources are run through validate_site (rejects URL-smuggling, allows a DNS host + optional :port); docs updated for the DD_SITE literal-host form (docs/OAUTH2.md, docs/TROUBLESHOOTING.md).

Behavior changes

  • mycompany.datadoghq.com is now used verbatim instead of being routed to api.mycompany.datadoghq.com.
  • --site datadoghq.com --subdomain mycompany becomes --site mycompany.datadoghq.com.

Testing

  • cargo test --bin pup -- --test-threads=1 (1321 pass), cargo clippy -- -D warnings, cargo fmt --check
  • Positive + negative coverage for site classification, host routing, callback-domain adoption, and URL-smuggling rejection.

Closes #574


🤖 Generated with Claude Code

platinummonkey and others added 5 commits June 11, 2026 08:59
Fixes issue #574 and cleans up the SAML/SSO flag surface:

- Remove the `--subdomain` login-only flag. Users now pass the full
  vanity host directly: `pup auth login --site mycompany.datadoghq.com`
  (or `DD_SITE=mycompany.datadoghq.com pup ...` for API calls).

- Introduce site classification: `KNOWN_SITES` enumerates the canonical
  Datadog endpoints (datadoghq.com, us3/us5/ap1/ap2, datadoghq.eu,
  ddog-gov.com, datad0g.com staging). Canonical sites keep their
  `api.`/`app.` prefix behavior unchanged. Everything else is treated as
  a **literal host** used verbatim for both API requests and OAuth flows.

- `DcrClient` now derives `api_host`/`auth_host` via the shared free
  functions (`api_host_for`/`auth_host_for`) so literal hosts are honored
  in DCR registration, token exchange, and the authorize URL — fixing the
  token-refresh path through a gateway too.

- `make_dd_config` routes literal hosts through the SDK's mock-style
  template (server_index=1, `{protocol}://{name}`) so the SDK targets the
  host verbatim without prepending `api.`. Canonical sites use index 2
  (no enum restriction, same `https://api.{site}` template).

- `set_site_explicit` and `from_env` validate the normalized site with
  `validate_site`, which rejects URL-smuggling chars (`/`, `#`, `@`, etc.)
  while permitting DNS hostnames with an optional `:port` suffix.

- Fix TROUBLESHOOTING.md: document `DD_SITE` as the override mechanism
  (with literal-host and port examples); note `DD_HOST` is not recognized.

Behavior changes:
- Vanity `mycompany.datadoghq.com` was routed to `api.mycompany.datadoghq.com`;
  now used verbatim (app. and api. are the same Datadog infrastructure).
- The old `--site datad0g.com --subdomain dd` becomes `--site dd.datad0g.com`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Use cfg.auth_host() in acp.rs and bits.rs instead of hardcoding
  format!("https://app.{}", cfg.site"), so vanity/literal hosts are
  respected by the ACP and Bits commands.

- Fix set_site_explicit: validate raw input before normalize_site so
  an empty --site "" is rejected rather than silently falling through to
  "datadoghq.com" via normalize_site's empty-string fallback.

- Add explanatory comment on api_base_url documenting the intentional
  asymmetry with api_host for PUP_MOCK_SERVER (full URL vs. bare host).

- Add tests: set_site_explicit rejects smuggling values and empty string
  (site/site_explicit unchanged on failure); from_env rejects invalid
  DD_SITE; DcrClient api_host/auth_host fields verified for canonical,
  vanity, gateway, and staging sites.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ble bleed

Address issues found in the second review pass:

- Move validate_site to cover all site sources in from_env (explicit,
  session-derived, and default) instead of only the explicit DD_SITE/
  config-file branch — prevents a tampered sessions file from routing
  to an attacker-controlled host without any validation.
- Add validate_site with warn-and-skip in apply_org_override so
  session-derived sites are also checked when --org resolves a site
  from the session registry.
- Fix literal-host make_dd_config branch: remove the "site" server
  variable (set by Configuration::new() from DD_SITE env) before
  switching to server_index=1 so it cannot bleed into the
  {protocol}://{name} template. New test assertions confirmed the bug.
- Update OAUTH2.md: replace stale --subdomain example with --site
  literal-host form.
- Add from_params comment explaining the trusted browser caller context
  (why validate_site is intentionally omitted there).
- Add is_canonical_site comment documenting the oncall substring
  convention and its effect on prefix handling.
- Add tests: whitespace-only DD_SITE behavior, site variable absence
  for both literal-host make_dd_config paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… test

- Change apply_org_override to Result<()>: bail on invalid session site
  rather than warn-and-skip, matching from_env's behavior and preventing
  silent routing to the wrong (or attacker-controlled) endpoint when the
  sessions file is tampered. Update all callers (main.rs, extensions/mod.rs)
  with ? propagation; apply_to also becomes Result<()>.
- Reject whitespace-only DD_SITE in from_env (consistent with set_site_explicit
  which already trims-and-rejects): without this, DD_SITE="  " would silently
  normalize to "datadoghq.com" with site_explicit=true, blocking --org from
  correcting the site. Update test from "falls back" to "rejects".
- Add apply_org_override negative test: saves a malformed session site and
  asserts the function errors and leaves cfg.site unchanged.
- Add test_make_dd_config_literal_host_dd_site_env_does_not_bleed: sets
  DD_SITE in the environment before calling make_dd_config with a literal
  host, confirming the server_variables.remove("site") actually fires (the
  two prior tests called remove_var first, making the assertion vacuous).
- Add debug_assert on server_variables.remove("site") return value to catch
  a future SDK version that stops pre-populating this key.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dd-octo-sts-019303

Copy link
Copy Markdown

🐑 PR Shepherd is maintaining this PR

I watch your PR and automatically fix CI failures, rebase your branch, handle flaky tests, and push it to the merge queue when it's ready.

More about what I do → Guide

To pause me on this PR, add the flow-skip label.

Base automatically changed from revert/remove-direct-main-site-commits to main June 11, 2026 14:01
@datadog-datadog-prod-us1

This comment has been minimized.

@platinummonkey platinummonkey marked this pull request as draft June 11, 2026 14:07
@platinummonkey platinummonkey changed the title feat(auth): remove --subdomain, support literal hosts in --site/DD_SITE Support Gateways, Simplify subdomain logic and allow just using --site or DD_SITE (breaking: removes --subdomain) Jun 11, 2026
CI clippy (Rust 1.95) flags `get("site").is_none()` in the make_dd_config
tests. Replace with `!contains_key("site")` as the lint suggests; no
behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@srosenthal-dd

Copy link
Copy Markdown
Member

I'm working on reviewing this. So far I ran and tested it locally, and everything works, but it doesn't send all requests to the custom site, which I think was a main goal (besides simplification).

We'll need to update README.md:260-263 which still reflects the old way.

I'm looking into if there is a good way to get all traffic to go through the specified host.

@srosenthal-dd

Copy link
Copy Markdown
Member

Key question seems to be what you asked in #574, so let's wait a bit to see?

Independent of that particular user request, this does seem to simplify things, so I'm also OK with shipping this and following up about what they really want. This seems to work fine for regular Datadog subdomains (custom.datadoghq.com), but I'm not so sure about fully custom proxies (datadog-proxy.company.internal)

Which one(s) are we trying to support...
a) a transparent proxy that also includes the OAuth flow (it currently ends back at datadoghq.com)
b) a credential injecting proxy? Might need to give pup support for non-allowlisted hosts and an option skip validation of credentials
c) something else

@platinummonkey platinummonkey marked this pull request as ready for review June 12, 2026 17:30
platinummonkey and others added 2 commits June 12, 2026 12:32
Simplify the site/host model so one recorded `site` (from --site/DD_SITE)
drives everything, with derivation kept only where Datadog's topology
genuinely requires it:

- make_dd_config: replace the three-way server_index branch (mock /
  literal-host index 1 with the server_variables.remove("site") guard /
  canonical index 2) with a single path that routes the SDK at
  cfg.api_base_url() — split on "://" into protocol+name, always
  server_index=1. api_base_url() already encapsulates the mock override,
  api.{site} derivation for canonical sites, and verbatim literal hosts.
  Drop the now-dead Config::is_literal_host().

- Fix gateway login site: resolve_effective_site() keeps the user's
  --site/DD_SITE host for verbatim hosts (vanity/gateway/oncall) instead
  of adopting the OAuth callback's reported backend region, so the
  exchanged token, saved session, and DCR credential all stay keyed to
  the single host the user pinned. Canonical Datadog sites still adopt
  the callback region (datadoghq.com -> us3.datadoghq.com).

- Extract the shared uses_datadog_subdomains() predicate so api_host_for,
  auth_host_for, and resolve_effective_site classify hosts identically —
  fixing an oncall inconsistency and preventing future drift.

Tests rewritten to assert the resolved host (api.datadoghq.com,
verbatim gateway/vanity/oncall) rather than SDK internals, plus a
scheme-less mock case and resolve_effective_site coverage for canonical,
gateway, vanity, and oncall inputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@platinummonkey platinummonkey marked this pull request as draft June 12, 2026 20:28
platinummonkey and others added 2 commits June 12, 2026 15:33
Harden a pre-existing trust gap in the login flow: the OAuth callback's
`domain` (attacker-influenceable via the stdin-paste fallback) was used
as the token-exchange host and storage key with no validation. On a
canonical-site login a tampered `domain=evil.com` would send the auth
code, PKCE verifier, and DCR client credentials to that host.

resolve_effective_site now adopts the callback domain only when it is a
Datadog-owned host (new config::is_datadog_owned_host), otherwise falling
back to the user's already-validated pinned site. The check is a
suffix-based allowlist over the Datadog parent domains (*.datadoghq.com,
*.datadoghq.eu, *.ddog-gov.com, *.datad0g.com) so new regions keep
working without a code change. A plain syntactic check (validate_site)
was insufficient — a well-formed evil.com passes it.

Tests: is_datadog_owned_host (parent + subdomain of each base owned;
foreign/look-alike/smuggling rejected); resolve_effective_site rejects a
well-formed foreign callback domain, not just malformed ones.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The suffix-only ownership check was bypassable: a callback domain like
`evil.com/x.datadoghq.com` survives strip_suffix("datadoghq.com") as
"evil.com/x." (ends with '.'), so it passed the allowlist — yet parses as
the URL host `evil.com` with the rest as path, re-opening the exact
credential-exfiltration vector. Folded validate_site into
is_datadog_owned_host as a mandatory first gate so the check is
self-contained: anything that isn't a bare DNS hostname (contains /, @,
#, whitespace, schemes) is rejected before the suffix allowlist runs.

Tests: added the smuggling-prefix payloads (evil.com/x.datadoghq.com,
foo@evil.com/.datadoghq.com, evil.com#x.datadoghq.com) to both the
is_datadog_owned_host and resolve_effective_site rejection cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@platinummonkey

Copy link
Copy Markdown
Collaborator Author

this PR is designed around a) a transparent proxy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Does not Respect DD_HOST per the docs

2 participants