Skip to content

TCP CRR workload#1906

Open
usamasaqib wants to merge 13 commits into
usama.saqib/rr-refactorfrom
usamasaqib/tcp-crr
Open

TCP CRR workload#1906
usamasaqib wants to merge 13 commits into
usama.saqib/rr-refactorfrom
usamasaqib/tcp-crr

Conversation

@usamasaqib

@usamasaqib usamasaqib commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds a new workload tcp_crr. This is similar to tcp_rr but with the difference that the client initiates a new connection for every request-response.

The workload stresses the tcp connection creation code path.

Motivation

Related issues

Additional Notes

PR stack
[1] #1905
[2] #1906 <-- This

@datadog-prod-us1-4

This comment has been minimized.

@usamasaqib usamasaqib changed the base branch from main to usama.saqib/rr-refactor June 2, 2026 10:08
@usamasaqib usamasaqib changed the title Usamasaqib/tcp crr TCP CRR workload Jun 2, 2026
@usamasaqib usamasaqib marked this pull request as ready for review June 2, 2026 10:14
@usamasaqib usamasaqib requested a review from a team as a code owner June 2, 2026 10:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9443fca431

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

no_delay: self.config.no_delay,
backlog: self.config.backlog,
};
rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reuse server flow slots for CRR

For sustained tcp_crr runs this delegates the blackhole to the RR server path, but each CRR transaction creates a new server-side connection while rr::server_thread_main assigns a fresh monotonically increasing token on every accept and FlowMap explicitly never reuses removed slots (lading/src/neper/rr.rs:753-761, lading/src/neper/flow.rs:31-50). As a result the server's backing Vec<Option<Flow<_>>> grows by one slot per request/response cycle even after connections close, so a high-rate CRR experiment will steadily consume memory until lading itself becomes the bottleneck or OOMs.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's double check this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fixed here a954228

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 048d10ffb0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lading/src/neper/flow.rs Outdated
Comment on lines +58 to +61
if self.inner.len() < idx {
self.inner.resize_with(idx + 1, || None);
}
self.inner[idx] = Some(flow);
if self.inner[idx].is_none() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Grow the vector before indexing the first slot

With any positive flow count, FlowMap::new only reserves capacity and leaves inner.len() == 0; on the first insert idx is 0, this condition is false, and the next line indexes self.inner[0], panicking before either tcp_rr or tcp_crr can establish its first flow. The resize check needs to include the idx == len case.

Useful? React with 👍 / 👎.

Comment thread lading/src/neper/rr.rs Outdated
/// Total number of TCP flows the client should open. Sent to the client
/// over the control connection during startup.
pub(crate) flows: u16,
pub(crate) flows: u32,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the control handshake width consistent

Changing flows to u32 makes the server write four big-endian bytes via params.flows.to_be_bytes(), but the client still reads HANDSHAKE_LEN = 2 into a u16. For normal configurations such as the default flows: 1, the generator reads the high two bytes as 0 and then rejects threads > flows, so the new/default RR and CRR workloads fail to start.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf47a59ecf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lading/src/neper/rr.rs Outdated
Err(e) => {
trace!("reconnect to {addr} failed: {e}");
metrics.connections_failed.add(1);
let _ = flows.remove(token);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep CRR flows alive after transient reconnect errors

In CRR mode, immediate reconnect errors such as EADDRNOTAVAIL during ephemeral-port exhaustion are expected transient conditions, and the module docs explicitly call out that sustained CRR runs can hit this limit. Removing the flow here permanently reduces the configured active flow count; after enough transient failures the generator drains to zero flows and silently stops producing CRR load instead of retrying when ports become available again.

Useful? React with 👍 / 👎.

@usamasaqib usamasaqib Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We partially mitigate this by increasing the local port range for all sockets created by the generator in 69e43c0

This is not a complete solution but addressing this edge cases add more code than is needed right now. We can evaluate later whether users actually want to create a number of flows which can cause this to even occur.

@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from cf47a59 to 4ca89fa Compare June 2, 2026 11:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ca89fa859

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lading/src/neper/flow.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac7aa5c284

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lading/src/neper/rr.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 301c265c80

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

///
/// # Panics
///
/// Panics if `addr` cannot be parsed as an IP address.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return an error for invalid tcp_crr addresses

For any tcp_crr generator config where addr is malformed, this expect panics the whole lading process before spin can return its Result, so callers cannot surface a normal configuration error or shut down cleanly. This new generator should parse the address into an error variant instead of aborting on user-provided config input.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21fc427004

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@@ -0,0 +1,131 @@
//! TCP connect/request/response (`tcp_crr`) generator — the client side.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Replace non-ASCII documentation characters

/workspace/lading/AGENTS.md requires documentation to be US-ASCII only ("no Unicode characters in code or documentation"), but this added doc comment uses a Unicode em dash; the same pattern appears in several other new comments in this change, so the repository validation can reject the patch until these are replaced with ASCII punctuation.

Useful? React with 👍 / 👎.

@usamasaqib usamasaqib marked this pull request as draft June 3, 2026 06:44
In order to avoid EADDRNOTAVAIL failures due to race conditions when
creating new connections, we assign a dedicated port to each flow. This
prevents port collisions when multiple threads are attempting to create
a connection at the same time. To ensure that the port has been freed
before it is reused in the ClientAction::Reconnect pathway, we
forcefully trigger a RST packet to be sent so that the generator's
kernel frees the port when the tcp connection is dropped. This way the
upcoming new connection for the same flow will not fail due to
EADDRINUSE. To accomplish this we use a trick where we set the SO_LINGER
socket option to 0 to force an abortive close on close(fd). At the
application protocol level, when the client initiates an abortive close
the server has already transitioned into CLOSESTREAM state where it will
close drop the connection from its end anyway.
pinning ports to flows was changing the behavior on the bpf side since
the 5-tuple was becoming less variable. This meant that the effect of
flow tracking in bpf was different due to less overhead from managing
more new flows.
@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from 132cadd to d4e7bf4 Compare June 4, 2026 11:36
@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from 96db1de to d4e7bf4 Compare June 8, 2026 09:50
@usamasaqib usamasaqib marked this pull request as ready for review June 8, 2026 09:50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4e7bf424e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lading/src/neper/rr.rs
// fall through
}
Ok(Some(e)) => {
warn!("connect failed: {e}");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid warning on every failed CRR reconnect

In tcp_crr mode, transient connect failures can happen repeatedly under the exact high-churn conditions this workload is meant to exercise (for example backlog pressure, refused connects, or ephemeral-port exhaustion). Emitting a warn! from the reconnect path for every failed nonblocking connect completion can make logging dominate the event loop and distort the experiment; this should stay at trace/debug level or be rate-limited while the counter records the failures.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not stress testing the receiver. Connect should not fail excessively.

@preinlein preinlein left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these new generators/blackholes, would it be possible to add any tests?

Could we also scan thru and make sure we do error bubbling up 🙏

no_delay: self.config.no_delay,
backlog: self.config.backlog,
};
rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's double check this.

Comment thread lading/src/neper/flow.rs
self.inner.get_mut(token.0).and_then(|slot| slot.as_mut())
let idx = token.0 % self.inner.capacity();
let flow = self.inner.get_mut(idx).and_then(|slot| slot.as_mut())?;
assert_eq!(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we asserting in non-test code?

This is going to cause some awkward panics. Can we bubble up the error instead?

@usamasaqib usamasaqib Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an invariant so this condition should never ever happen. Making it an error condition bakes it in the API as something which is possible.

This is why I am using an assert here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, asserts are dropped in release builds so this becomes a no-op.
We could use a panic here but I do believe we're trying to drive down the number of panics to 0 for lading.

@goxberry is currently working on something to put a linting rule around expect: #1882 (huge stack that removes EVERY expect)

We just don't want lading to crash in unexpected ways. Having an error here bubble up doesn't change anything apart from graceful handling. You can still have the invariant be present.

If possible, it would be really neat to have some tests that exercise the invariant. This might be too difficult to do but it would be good to try.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this standard is suitable for workloads in lading. It definitely make sense for the core, but in this case I just want a piece of software that does a very specific operation. If it cannot do it, then it should just blow up. I do not want it try to be graceful or fallback on anything.

It just seems that the effort of plumbing errors throughout the codebase is not worth it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know about assert though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this standard is suitable for workloads in lading. It definitely make sense for the core, but in this case I just want a piece of software that does a very specific operation. If it cannot do it, then it should just blow up. I do not want it try to be graceful or fallback on anything.

It just seems that the effort of plumbing errors throughout the codebase is not worth it.

What makes it unsuitable for these operations/generators/blackholes?

From what I've seen, you have a very clear entry point to the generator/blackhole execution that already often already follows the Result<(), Error> interface. What makes calling the logic inside of these not suitable for bubbling errors up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suitable may not be the right word here I guess. What I was trying to point out was that as per my understanding this is just a stylistic concern? The panics are irrecoverable conditions as far as I am concerned so they should never happen. There will not be any code in lading for handling these errors either.

But if it is going to be a project wide lint, then I can remove these cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit stylistic but it's also that in general, chaotically exiting the program is not desirable. Even if "it should never happen", doesn't mean it will unless we guarantee the invariants. Any future code change, could expose this code path as problematic without us realizing.

And then if we contrast graceful handling vs a panic.

  • If the panic is in the wrong place, Tokio can just swallow it. The program will continue to execute but in a weird state.
  • If we panic, it means normal exit code (any kind of cleanup/post execution code) won't run. That could lead the system to being in an expected state and it could make debugging the problem much harder.

Vs gracefully handling that error and exiting lading, we'd see lading exit with a failure/error code but at least we have guarantees that it does exit and it does cleanup after itself.


And then personally, I think that we need to be cognizant of the unknown unknowns - we don't know what we don't know. There's a possibility where we realize there are conditions where this invariant does not hold and/or there could be ways to recover. By wiring up error bubbling, it makes it so that the structure is in place already.

I would much prefer to realize "oh cool, there's something I need to do here and there's already error handling in place and I can add the logic there" vs "oh shoot, error bubbling isn't done, I need to add that first and add the error handling at the right place to fix the issue".


Having said all that, I'm sorry I dinged you on this one. We have traditionally "allowed" panics in lading code and it's just recently that we've determined that this is undesirable and we shouldn't be allowing that.

Comment thread lading/src/neper/rr.rs

/// Which neper-style protocol the client is driving.
#[derive(Clone, Copy, Debug)]
pub(crate) enum Mode {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than set a mode and then pass that across the whole code path, can we instead define different implementations for RR vs CRR for the behaviors that differ?

We could use a trait that clearly defines where these modes differ; the methods would have different implementations and the interface on the trait would inform the next engineer what it means for a mode to be different.

I also think this partially defeats the purpose of the refactor. We extracted "common" logic out and now we're putting exceptions back in. If we end up having more "mode" specific functionality, it will be much easier to reason about it if there's a trait there already.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I can try to do that.

Comment thread lading/src/neper/rr.rs
///
/// Returns an error if configuration is invalid, the blackhole control port
/// is never reachable, or a worker thread panics.
#[allow(clippy::too_many_lines)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't reach for clippy ignore comments unless we really need to. What's preventing us from breaking this function into smaller pieces?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll break it up

Comment thread lading/src/neper/rr.rs
let request_buf = vec![0u8; request_size];
let mut response_buf = vec![0u8; response_size];
let mut flows: FlowMap<ClientState> = FlowMap::new();
let mut flows: FlowMap<ClientState> = FlowMap::new(num_flows as usize);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to cast this to usize instead of making FlowMap only accept u16?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants