TCP CRR workload by usamasaqib · Pull Request #1906 · DataDog/lading

usamasaqib · 2026-06-02T09:48:44Z

What does this PR do?

This PR adds a new workload tcp_crr. This is similar to tcp_rr but with the difference that the client initiates a new connection for every request-response.

The workload stresses the tcp connection creation code path.

Motivation

Related issues

Additional Notes

PR stack
[1] #1905
[2] #1906 <-- This

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9443fca431

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T10:16:42Z

+            no_delay: self.config.no_delay,
+            backlog: self.config.backlog,
+        };
+        rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;


Reuse server flow slots for CRR

For sustained tcp_crr runs this delegates the blackhole to the RR server path, but each CRR transaction creates a new server-side connection while rr::server_thread_main assigns a fresh monotonically increasing token on every accept and FlowMap explicitly never reuses removed slots (lading/src/neper/rr.rs:753-761, lading/src/neper/flow.rs:31-50). As a result the server's backing Vec<Option<Flow<_>>> grows by one slot per request/response cycle even after connections close, so a high-rate CRR experiment will steadily consume memory until lading itself becomes the bottleneck or OOMs.

Useful? React with 👍 / 👎.

Let's double check this.

It is fixed here a954228

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 048d10ffb0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T11:38:53Z

+        if self.inner.len() < idx {
            self.inner.resize_with(idx + 1, || None);
        }
-        self.inner[idx] = Some(flow);
+        if self.inner[idx].is_none() {


Grow the vector before indexing the first slot

With any positive flow count, FlowMap::new only reserves capacity and leaves inner.len() == 0; on the first insert idx is 0, this condition is false, and the next line indexes self.inner[0], panicking before either tcp_rr or tcp_crr can establish its first flow. The resize check needs to include the idx == len case.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-02T11:38:53Z

    /// Total number of TCP flows the client should open. Sent to the client
    /// over the control connection during startup.
-    pub(crate) flows: u16,
+    pub(crate) flows: u32,


Keep the control handshake width consistent

Changing flows to u32 makes the server write four big-endian bytes via params.flows.to_be_bytes(), but the client still reads HANDSHAKE_LEN = 2 into a u16. For normal configurations such as the default flows: 1, the generator reads the high two bytes as 0 and then rejects threads > flows, so the new/default RR and CRR workloads fail to start.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf47a59ecf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T11:42:48Z

+                Err(e) => {
+                    trace!("reconnect to {addr} failed: {e}");
+                    metrics.connections_failed.add(1);
+                    let _ = flows.remove(token);


Keep CRR flows alive after transient reconnect errors

In CRR mode, immediate reconnect errors such as EADDRNOTAVAIL during ephemeral-port exhaustion are expected transient conditions, and the module docs explicitly call out that sustained CRR runs can hit this limit. Removing the flow here permanently reduces the configured active flow count; after enough transient failures the generator drains to zero flows and silently stops producing CRR load instead of retrying when ports become available again.

Useful? React with 👍 / 👎.

We partially mitigate this by increasing the local port range for all sockets created by the generator in 69e43c0

This is not a complete solution but addressing this edge cases add more code than is needed right now. We can evaluate later whether users actually want to create a number of flows which can cause this to even occur.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ca89fa859

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac7aa5c284

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

…lls may fail with EADDRNOTAVAIL due to port exhaustion. We reduce chances of this by increase the local port range for sockets created by the generator.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 301c265c80

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T16:24:03Z

+    ///
+    /// # Panics
+    ///
+    /// Panics if `addr` cannot be parsed as an IP address.


Return an error for invalid tcp_crr addresses

For any tcp_crr generator config where addr is malformed, this expect panics the whole lading process before spin can return its Result, so callers cannot surface a normal configuration error or shut down cleanly. This new generator should parse the address into an error variant instead of aborting on user-provided config input.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21fc427004

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T16:30:33Z

@@ -0,0 +1,131 @@
+//! TCP connect/request/response (`tcp_crr`) generator — the client side.


Replace non-ASCII documentation characters

/workspace/lading/AGENTS.md requires documentation to be US-ASCII only ("no Unicode characters in code or documentation"), but this added doc comment uses a Unicode em dash; the same pattern appears in several other new comments in this change, so the repository validation can reject the patch until these are replaced with ASCII punctuation.

Useful? React with 👍 / 👎.

In order to avoid EADDRNOTAVAIL failures due to race conditions when creating new connections, we assign a dedicated port to each flow. This prevents port collisions when multiple threads are attempting to create a connection at the same time. To ensure that the port has been freed before it is reused in the ClientAction::Reconnect pathway, we forcefully trigger a RST packet to be sent so that the generator's kernel frees the port when the tcp connection is dropped. This way the upcoming new connection for the same flow will not fail due to EADDRINUSE. To accomplish this we use a trick where we set the SO_LINGER socket option to 0 to force an abortive close on close(fd). At the application protocol level, when the client initiates an abortive close the server has already transitioned into CLOSESTREAM state where it will close drop the connection from its end anyway.

pinning ports to flows was changing the behavior on the bpf side since the 5-tuple was becoming less variable. This meant that the effect of flow tracking in bpf was different due to less overhead from managing more new flows.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4e7bf424e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-08T09:55:26Z

+                // fall through
+            }
+            Ok(Some(e)) => {
+                warn!("connect failed: {e}");


Avoid warning on every failed CRR reconnect

In tcp_crr mode, transient connect failures can happen repeatedly under the exact high-churn conditions this workload is meant to exercise (for example backlog pressure, refused connects, or ephemeral-port exhaustion). Emitting a warn! from the reconnect path for every failed nonblocking connect completion can make logging dominate the event loop and distort the experiment; this should stay at trace/debug level or be rate-limited while the counter records the failures.

Useful? React with 👍 / 👎.

We are not stress testing the receiver. Connect should not fail excessively.

preinlein

With these new generators/blackholes, would it be possible to add any tests?

Could we also scan thru and make sure we do error bubbling up 🙏

preinlein · 2026-06-16T14:16:25Z

+            no_delay: self.config.no_delay,
+            backlog: self.config.backlog,
+        };
+        rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;


Let's double check this.

preinlein · 2026-06-16T14:18:44Z

-        self.inner.get_mut(token.0).and_then(|slot| slot.as_mut())
+        let idx = token.0 % self.inner.capacity();
+        let flow = self.inner.get_mut(idx).and_then(|slot| slot.as_mut())?;
+        assert_eq!(


Why are we asserting in non-test code?

This is going to cause some awkward panics. Can we bubble up the error instead?

This is an invariant so this condition should never ever happen. Making it an error condition bakes it in the API as something which is possible.

This is why I am using an assert here.

Well, asserts are dropped in release builds so this becomes a no-op.
We could use a panic here but I do believe we're trying to drive down the number of panics to 0 for lading.

@goxberry is currently working on something to put a linting rule around expect: #1882 (huge stack that removes EVERY expect)

We just don't want lading to crash in unexpected ways. Having an error here bubble up doesn't change anything apart from graceful handling. You can still have the invariant be present.

If possible, it would be really neat to have some tests that exercise the invariant. This might be too difficult to do but it would be good to try.

I don't know if this standard is suitable for workloads in lading. It definitely make sense for the core, but in this case I just want a piece of software that does a very specific operation. If it cannot do it, then it should just blow up. I do not want it try to be graceful or fallback on anything.

It just seems that the effort of plumbing errors throughout the codebase is not worth it.

Good to know about assert though.

I don't know if this standard is suitable for workloads in lading. It definitely make sense for the core, but in this case I just want a piece of software that does a very specific operation. If it cannot do it, then it should just blow up. I do not want it try to be graceful or fallback on anything.

It just seems that the effort of plumbing errors throughout the codebase is not worth it.

What makes it unsuitable for these operations/generators/blackholes?

From what I've seen, you have a very clear entry point to the generator/blackhole execution that already often already follows the Result<(), Error> interface. What makes calling the logic inside of these not suitable for bubbling errors up?

Suitable may not be the right word here I guess. What I was trying to point out was that as per my understanding this is just a stylistic concern? The panics are irrecoverable conditions as far as I am concerned so they should never happen. There will not be any code in lading for handling these errors either.

But if it is going to be a project wide lint, then I can remove these cases.

It's a bit stylistic but it's also that in general, chaotically exiting the program is not desirable. Even if "it should never happen", doesn't mean it will unless we guarantee the invariants. Any future code change, could expose this code path as problematic without us realizing.

And then if we contrast graceful handling vs a panic.

If the panic is in the wrong place, Tokio can just swallow it. The program will continue to execute but in a weird state.

If we panic, it means normal exit code (any kind of cleanup/post execution code) won't run. That could lead the system to being in an expected state and it could make debugging the problem much harder.

Vs gracefully handling that error and exiting lading, we'd see lading exit with a failure/error code but at least we have guarantees that it does exit and it does cleanup after itself.

And then personally, I think that we need to be cognizant of the unknown unknowns - we don't know what we don't know. There's a possibility where we realize there are conditions where this invariant does not hold and/or there could be ways to recover. By wiring up error bubbling, it makes it so that the structure is in place already.

I would much prefer to realize "oh cool, there's something I need to do here and there's already error handling in place and I can add the logic there" vs "oh shoot, error bubbling isn't done, I need to add that first and add the error handling at the right place to fix the issue".

Having said all that, I'm sorry I dinged you on this one. We have traditionally "allowed" panics in lading code and it's just recently that we've determined that this is undesirable and we shouldn't be allowing that.

preinlein · 2026-06-16T14:23:48Z


+/// Which neper-style protocol the client is driving.
+#[derive(Clone, Copy, Debug)]
+pub(crate) enum Mode {


Rather than set a mode and then pass that across the whole code path, can we instead define different implementations for RR vs CRR for the behaviors that differ?

We could use a trait that clearly defines where these modes differ; the methods would have different implementations and the interface on the trait would inform the next engineer what it means for a mode to be different.

I also think this partially defeats the purpose of the refactor. We extracted "common" logic out and now we're putting exceptions back in. If we end up having more "mode" specific functionality, it will be much easier to reason about it if there's a trait there already.

That is a good point. I can try to do that.

preinlein · 2026-06-16T14:24:30Z

 ///
 /// Returns an error if configuration is invalid, the blackhole control port
 /// is never reachable, or a worker thread panics.
+#[allow(clippy::too_many_lines)]


We shouldn't reach for clippy ignore comments unless we really need to. What's preventing us from breaking this function into smaller pieces?

I'll break it up

preinlein · 2026-06-16T14:25:55Z

    let request_buf = vec![0u8; request_size];
    let mut response_buf = vec![0u8; response_size];
-    let mut flows: FlowMap<ClientState> = FlowMap::new();
+    let mut flows: FlowMap<ClientState> = FlowMap::new(num_flows as usize);


Does it make sense to cast this to usize instead of making FlowMap only accept u16?

introduce tcp crr

9443fca

This comment has been minimized.

Sign in to view

usamasaqib added the no-changelog label Jun 2, 2026

usamasaqib changed the base branch from main to usama.saqib/rr-refactor June 2, 2026 10:08

usamasaqib changed the title ~~Usamasaqib/tcp crr~~ TCP CRR workload Jun 2, 2026

usamasaqib marked this pull request as ready for review June 2, 2026 10:14

usamasaqib requested a review from a team as a code owner June 2, 2026 10:14

usamasaqib mentioned this pull request Jun 2, 2026

Split TCP RR infrastructure from the workload #1905

Open

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

fix flow management so that FlowMap is a ring buffer and reuses slots

a954228

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

fix off-by-one

4ca89fa

usamasaqib force-pushed the usamasaqib/tcp-crr branch from cf47a59 to 4ca89fa Compare June 2, 2026 11:47

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread lading/src/neper/flow.rs Outdated

use warn to report errors

ac7aa5c

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread lading/src/neper/rr.rs

usamasaqib added 4 commits June 2, 2026 17:33

increment connections closed when connection closes cleanly

cacb587

guard against token mismatch is FlowMap

e883997

for tcp_crr, when a very large number of flows are created connect ca…

69e43c0

…lls may fail with EADDRNOTAVAIL due to port exhaustion. We reduce chances of this by increase the local port range for sockets created by the generator.

remove comment about increasing local port range with sysctl

301c265

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

address clippy issues

21fc427

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

remove non ascii chars

935256d

usamasaqib marked this pull request as draft June 3, 2026 06:44

usamasaqib added 3 commits June 3, 2026 14:47

format

5d0b2fd

remove manual port management for flows

d4e7bf4

pinning ports to flows was changing the behavior on the bpf side since the 5-tuple was becoming less variable. This meant that the effect of flow tracking in bpf was different due to less overhead from managing more new flows.

usamasaqib force-pushed the usamasaqib/tcp-crr branch from 132cadd to d4e7bf4 Compare June 4, 2026 11:36

usamasaqib force-pushed the usamasaqib/tcp-crr branch from 96db1de to d4e7bf4 Compare June 8, 2026 09:50

usamasaqib marked this pull request as ready for review June 8, 2026 09:50

chatgpt-codex-connector Bot reviewed Jun 8, 2026

View reviewed changes

preinlein reviewed Jun 16, 2026

View reviewed changes

		@@ -0,0 +1,131 @@
		//! TCP connect/request/response (`tcp_crr`) generator — the client side.

Conversation

usamasaqib commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Motivation

Related issues

Additional Notes

Uh oh!

This comment has been minimized.

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

usamasaqib Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

preinlein left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

usamasaqib commented Jun 2, 2026 •

edited

Loading

usamasaqib Jun 2, 2026 •

edited

Loading

usamasaqib Jun 16, 2026 •

edited

Loading