Feature "async" - "slave" execution of HW Components enabling synchronization to a robot controller and therefore better communication stability. #473

Open: destogl wants to merge 7 commits into ros-controls:master from b-robotized-forks:feat/async-slave

@destogl

destogl commented Dec 23, 2025

Member

We need something that lets us synchronize the HW loops with the robot. Many robot RT interfaces use a "blocking read": the robot sends data at the moment it expects new data in return, and the read blocks until then. This PR is one approach to tackling this.

Note

Will be rebased on master once the testing on HW has finished successfully.

@mergify

mergify Bot commented Dec 23, 2025

This pull request is in conflict. Could you fix it @destogl?

@destogl destogl marked this pull request as draft December 23, 2025 13:06
@destogl destogl changed the base branch from master to jazzy December 23, 2025 13:09
@mergify

mergify Bot commented Jan 8, 2026

This pull request is in conflict. Could you fix it @destogl?

@Nibanovic

Problem description

Many robots have an internal clock and send a "heartbeat" signal signifying their control loop cycle has started and we can read/update/write.

In hardware interfaces, this manifests as a blocking read() call, which returns only once a heartbeat UDP packet has arrived.

This conflicts with the current implementation of ros2_control_node, where we ONLY use sleep() to schedule the control cycle using a software clock.
Even if the hardware is async, we only have a choice between synchronized mode (which uses the main cm software clock) and detached mode (where the hw interface thread just has its own, separate, software clock).

Examples in robot hardware interfaces:

  • Franka: read() is blocking. They integrate with ROS 2 using a custom controller manager where there is no sleep(); instead, cm relies on the blocking read() from Franka to schedule the control cycle.
  • Kassow: blocking read similar to Franka, but with no mechanism to help with this.
  • KUKA: the fzi kuka_rsi_driver has a solution that works well: it runs the UDP listener in a separate thread (achieving the same thing as Franka), and it has a kind of interpolation mechanism to reconcile the latest ROS commands with the current time on the RSI clock.
  • UR: the hardware interface has a non-blocking mode, where read() is effectively skipped for that loop if no new data was popped from the queue at that time.

Solution proposal: synchronization_policy="slave"

In this PR we propose a solution by introducing a "slave mode" for async hardware interfaces.

The basic solution is simple: the slave hardware interface does not sleep when executing its callback.

This is basically what Franka does in their custom controller manager; we'd just fold that functionality into async_function_handler.
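
The difference between the two loop shapes can be sketched as follows. This is only a minimal illustration, not the code in async_function_handler: LoopPolicy, LoopHooks and run_cycles are hypothetical names, and the sleep is injected as a callback so that the sketch does not depend on a real clock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical names for illustration only; not the realtime_tools API.
enum class LoopPolicy { SOFTWARE_CLOCK, ROBOT_CLOCK };

struct LoopHooks
{
  std::function<void()> read;   // in ROBOT_CLOCK mode, assumed to block on the heartbeat
  std::function<void()> write;
  std::function<void(std::chrono::nanoseconds)> sleep_until;  // deadline relative to loop start
};

// Runs `cycles` iterations and returns how many times the loop slept.
inline int run_cycles(
  LoopPolicy policy, const LoopHooks & hooks, std::chrono::nanoseconds period, int cycles)
{
  assert(period.count() > 0);
  std::chrono::nanoseconds next_deadline{0};
  int sleeps = 0;
  for (int i = 0; i < cycles; ++i) {
    hooks.read();
    hooks.write();
    if (policy == LoopPolicy::SOFTWARE_CLOCK) {
      // Software clock: pace the loop by sleeping until the next period boundary.
      next_deadline += period;
      hooks.sleep_until(next_deadline);
      ++sleeps;
    }
    // ROBOT_CLOCK: no sleep at all; the next blocking read() paces the loop.
  }
  return sleeps;
}
```

In the ROBOT_CLOCK branch the period argument is never used for pacing, which matches the idea above that the robot, not a software clock, decides when the next cycle starts.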

@Nibanovic

Validation

We measure two things: drift and jitter.

Drift:

  • rate of change in phase_offset, that is, in the time difference between when read() starts and when the UDP read unblocks (when the robot cycle starts + network latency)

In essence, the read() execution time is a good proxy for this. If it is constant, that is good; if it is monotonically increasing or decreasing, the robot cycle and cm cycle are drifting apart.

Jitter:

  • time between two consecutive read() calls
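
As a rough sketch of how these two quantities could be computed from logged data: the function names below are hypothetical, the timestamps are assumed to be in microseconds, and drift is approximated by a least-squares slope of the read() duration over the cycle index. This is not necessarily how the plots in this comment were produced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Jitter proxy: time between consecutive read() start timestamps (microseconds).
inline std::vector<std::int64_t> read_intervals_us(const std::vector<std::int64_t> & read_start_us)
{
  std::vector<std::int64_t> intervals;
  for (std::size_t i = 1; i < read_start_us.size(); ++i) {
    intervals.push_back(read_start_us[i] - read_start_us[i - 1]);
  }
  return intervals;
}

// Drift proxy: least-squares slope of the read() duration over the cycle index.
// A slope near zero means the two loops are not drifting apart.
inline double drift_per_cycle(const std::vector<double> & read_duration_us)
{
  assert(read_duration_us.size() >= 2);
  const double n = static_cast<double>(read_duration_us.size());
  const double mean_x = (n - 1.0) / 2.0;
  double mean_y = 0.0;
  for (const double y : read_duration_us) {
    mean_y += y;
  }
  mean_y /= n;
  double num = 0.0;
  double den = 0.0;
  for (std::size_t i = 0; i < read_duration_us.size(); ++i) {
    const double dx = static_cast<double>(i) - mean_x;
    num += dx * (read_duration_us[i] - mean_y);
    den += dx * dx;
  }
  return num / den;
}
```

A single global slope is only meaningful within one ramp of a sawtooth; across a wrap-around it would have to be evaluated piecewise.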

Default jazzy controller: 4.42.1

Franka, 1000 Hz:

image

Kassow, 250 Hz:

image

Here we see the characteristic sawtooth pattern that appears when update loops are drifting.
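
To illustrate where such a sawtooth comes from, here is an idealized simulation. It rests on assumptions that are not measurements from this PR: the robot heartbeat arrives at exact multiples of its period, there is no network latency, and the software loop period is slightly off (for example 1001 µs against a 1000 µs robot cycle). All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Next heartbeat at or after time t (heartbeats are at multiples of the robot period).
inline std::int64_t next_heartbeat_us(std::int64_t t_us, std::int64_t robot_period_us)
{
  return ((t_us + robot_period_us - 1) / robot_period_us) * robot_period_us;
}

// Software-clocked loop: read() starts at k * loop_period_us.
// Returns how long read() blocks in each cycle.
inline std::vector<std::int64_t> blocking_times_software_clock(
  std::int64_t robot_period_us, std::int64_t loop_period_us, int cycles)
{
  assert(robot_period_us > 0 && loop_period_us > 0);
  std::vector<std::int64_t> out;
  for (int k = 0; k < cycles; ++k) {
    const std::int64_t start = k * loop_period_us;
    out.push_back(next_heartbeat_us(start, robot_period_us) - start);
  }
  return out;
}

// Robot-clocked loop: read() is re-entered work_us after the previous heartbeat.
inline std::vector<std::int64_t> blocking_times_robot_clock(
  std::int64_t robot_period_us, std::int64_t work_us, int cycles)
{
  assert(robot_period_us > 0 && work_us > 0 && work_us < robot_period_us);
  std::vector<std::int64_t> out;
  std::int64_t t = work_us;  // first read() starts after one round of processing
  for (int k = 0; k < cycles; ++k) {
    const std::int64_t heartbeat = next_heartbeat_us(t, robot_period_us);
    out.push_back(heartbeat - t);
    t = heartbeat + work_us;
  }
  return out;
}
```

With a 1000 µs robot period and a 1001 µs software period, the blocking time shrinks by 1 µs per cycle and wraps around after 1000 cycles, which is the sawtooth shape. In the robot-clocked variant the blocking time stays at robot_period_us - work_us.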

Slave Async solution:

Franka, 1000 Hz:

image

Kassow, 250 Hz:

image

Here we see that letting the robot synchronize the update loop eliminates the drift (the drift line is constant or step-like).

Slave Async on two Kassow robots

One is real and the other is simulated. There is a lot of network jitter, but the execution times are constant, meaning there is no drift.
image

@saikishor

saikishor commented Jan 14, 2026

Member

@Nibanovic What's the exact change here? I'm going crazy with the amount of changes. Maybe you want to target it to master?

@Nibanovic

Open questions

How do controller update() loops fit in here?

For example, if we're running update() for a JTC that targets an async_slave Franka, and we have another robot running in sync with the controller manager, the update() of the controller is on a different clock than the read/write.

What are the knock-on effects of this, and how do we address them? How do we establish a test case where the problems the controller causes are clear? This is what I'm currently working on.

What first comes to mind is to run the controller as an async controller. We then configure the controller to target a specific hardware interface, and at the end of read() the hardware interface signals the async controller to execute its update(). This would be the simplest approach to implement.
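
A minimal sketch of such a signal, assuming a plain mutex and condition variable are acceptable for the threads involved. CycleSignal is a hypothetical helper, not something that exists in ros2_control or realtime_tools.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical helper for illustration; not part of ros2_control or realtime_tools.
class CycleSignal
{
public:
  // Called by the hardware thread once the blocking read() has returned.
  void notify_cycle()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++cycle_;
    }
    cv_.notify_all();
  }

  // Called by the controller thread: blocks until a cycle newer than
  // `last_seen` exists, then returns the latest cycle number.
  std::uint64_t wait_for_cycle(std::uint64_t last_seen)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    assert(last_seen <= cycle_);  // the caller cannot have seen a future cycle
    cv_.wait(lock, [&] { return cycle_ > last_seen; });
    return cycle_;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::uint64_t cycle_ = 0;
};
```

The controller thread would loop on wait_for_cycle() and run update() each time it returns. If the returned number jumps by more than one, the controller has missed a hardware cycle, which could serve as a starting point for the test case asked about above.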

In the end, this could cause a mess of threads for a bunch of controllers, so it may be useful to reframe the async approach from "we spawn a thread for each async component" to "we spawn a thread for each clock source that exists in our system".

We'd then pin each controller's update() to be executed after a specific "clock source group" of hardware. This is a more extensive change and I'm not so sure about it, but it's just an idea.

Phase-matching between the ROS side and the robot side, like in kuka_rsi_driver

Even if we solve the above mentioned problem, we still cannot handle controllers that target multiple hardware running on different clocks.

For this, we could take the existing approach in kuka_rsi_driver for phase matching. This is more extensive and deserves a separate PR.

Housekeeping

  • Review overrun warnings. We probably want to suppress them in cm or in async_function_handler, as the robot clock we receive is subject to network jitter, depending on the network quality. This might also affect the total_time of execution we measure, as read() is basically blocking for the remainder of the total cycle time.
  • Currently we still need expected update_rate defined for async slave hardware interfaces, even though cm is not scheduling it, for statistics calculations. This seems acceptable.

}
}

void execute_slave_callback()

Member Author

Isn't this the same as detached? Do we really need separate method?

  • Synchronized is scheduled by the controller manager cycle thread with sleep().
  • Detached has its own, separate sleep in its thread, almost like running a ros2_control_node for that specific hardware interface. So it has its own clock, but still a software clock, so it does not spinlock.
  • Slave is completely detached from clocks; it relies on the blocking read() and uses no sleep().

Yes, I overlooked this. I see now that we can just skip the sleep in the detached callback when we're in slave mode, instead of adding a separate method.

@codecov-commenter

codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 63.63636% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.25%. Comparing base (f2bc7ef) to head (1c655f9).
⚠️ Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
.../include/realtime_tools/async_function_handler.hpp 63.63% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
- Coverage   85.51%   85.25%   -0.26%     
==========================================
  Files          17       17              
  Lines        1408     1424      +16     
  Branches      128      128              
==========================================
+ Hits         1204     1214      +10     
- Misses        121      125       +4     
- Partials       83       85       +2     
Flag Coverage Δ
unittests 85.25% <63.63%> (-0.26%) ⬇️

Files with missing lines Coverage Δ
.../include/realtime_tools/async_function_handler.hpp 73.54% <63.63%> (-0.98%) ⬇️

... and 1 file with indirect coverage changes

@saikishor saikishor left a comment

Member

I would recommend going with the DETACHED methods, as they have pretty much the same code. Instead, you can add a parameter such as skip_sleep or function_handle_rate to the following AsyncFunctionHandlerParams, so it is more maintainable IMO.

int thread_priority = 50;
std::vector<int> cpu_affinity_cores = {};
AsyncSchedulingPolicy scheduling_policy = AsyncSchedulingPolicy::SYNCHRONIZED;
unsigned int exec_rate = 0u;
rclcpp::Clock::SharedPtr clock = nullptr;
rclcpp::Logger logger = rclcpp::get_logger("AsyncFunctionHandler");
std::function<bool()> trigger_predicate = []() { return true; };
bool wait_until_initial_trigger = true;
bool print_warnings = true;

@github-actions

github-actions Bot commented Mar 9, 2026

Contributor

This PR is stale because it has been open for 45 days with no activity. Please tag a maintainer for help on completing this PR, or close it if you think it has become obsolete.

@github-actions github-actions Bot added the stale label Mar 9, 2026
destogl and others added 3 commits March 31, 2026 10:15
* now this hardware's update loop is synchronized by the blocking read() of a robot it is targeting
merge with detached_callback
@Nibanovic

I would recommend going with the DETACHED methods, as they have pretty much the same code. Instead, you can add a parameter such as skip_sleep or function_handle_rate to the following AsyncFunctionHandlerParams, so it is more maintainable IMO.

I'm looking into deduplicating the code for clarity, but I'd retain SLAVE mode as the main designator for this way of operating a hardware interface.

For configuration it is very clear that SLAVE is different from DETACHED, and asking people to configure DETACHED plus a skip-sleep parameter seems very unintuitive.

A few more notes:

  • I have now opted for a switch/case on AsyncSchedulingPolicy in start_thread; I find it clearer than if/else with long conditions here.
  • In SLAVE mode, we also skip the overrun calculation. Because of the way statistics are currently calculated, we'd always have read_duration ~= rw_rate, since we wait in the blocking read() for most of rw_rate, so the overrun would always trigger. This is noted in the follow-up controller-focused PR, Feature "async" - "slave" controller getting synchronized with a hardware #478 (comment), under "Open Questions".
  • After the implementation is confirmed, I'll push a commit fleshing out the docs to explain the difference between the three modes of hardware interface scheduling, for completeness.
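
The overrun point can be sketched with a switch over the three modes. The enum below is a local stand-in for AsyncSchedulingPolicy, and is_overrun is a hypothetical helper; the comparison against the period is an assumption and not necessarily how realtime_tools computes its statistics.

```cpp
#include <cassert>
#include <chrono>

// Local stand-in mirroring the three modes discussed in this PR.
enum class SchedulingPolicy { SYNCHRONIZED, DETACHED, SLAVE };

// Hypothetical helper: decides whether a cycle counts as an overrun.
inline bool is_overrun(
  SchedulingPolicy policy, std::chrono::nanoseconds execution_time,
  std::chrono::nanoseconds period)
{
  assert(period.count() > 0);
  switch (policy) {
    case SchedulingPolicy::SLAVE:
      // read() blocks for most of the period, so execution_time ~= period by
      // design and jitters around it; a check here would fire constantly.
      return false;
    case SchedulingPolicy::SYNCHRONIZED:
    case SchedulingPolicy::DETACHED:
      return execution_time > period;
  }
  return false;  // not reached for valid enum values; keeps compilers quiet
}
```

A possible refinement, not part of this PR as far as the discussion shows, would be to measure only the time spent after read() unblocks and compare that against the period.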

@mergify

mergify Bot commented Mar 31, 2026

This pull request is in conflict. Could you fix it @destogl?

@destogl destogl changed the base branch from jazzy to master March 31, 2026 12:42
@github-actions github-actions Bot removed the stale label Mar 31, 2026
@urfeex

urfeex commented Apr 2, 2026

@Nibanovic Is there a reason this is still in draft mode? Now that it is targeting master, it seems like this is ready to review?

I have tested this locally (version deae972 on jazzy) together with #478 and ros-controls/ros2_control#2971 on a UR, where it seems to be working fantastically. I counted the control loops on the robot in which it had not received a new control package since the previous loop. With both the robot and ros2_control running their own loops at 500 Hz, I get occasional bursts where the phase shift between the two loops results in a lot of missed commands:
ros_sync_wo_preempt_rt

With the changes from this PR and the other two mentioned, the same graph shows a flat 0.

The refactored deduplication seems to make a lot of sense and I would like to see this moving forward by giving it a proper review, once it's ready :-)

@destogl destogl requested a review from saikishor April 2, 2026 09:34
@destogl

destogl commented Apr 2, 2026

Member Author

@urfeex this is ready :)

@destogl destogl changed the title from Draft: Feat/async slave to Feature "async" - "slave" execution of HW Components enabling synchronization to a robot controller and therefore better communication stability. Apr 2, 2026
* **Scheduling Control:** Internal software clock.
* **Sleep Mechanism:** Calculates the elapsed time and explicitly calls ``std::this_thread::sleep_until()``, same as ``ros2_control_node`` executable.

SLAVE

Contributor

I am a bit unsure whether the term "slave" is that catchy. At least my intuition is that there is a "master" thread probably running in the controller_manager that handles scheduling. I think, a more descriptive name like "hardware_sync" or "blocking_read" would make it more clear.

I am fine with "slave", though, as well. It might just be my intuition.

I like it very much that you highlighted the two aspects of scheduling control and sleep mechanism for each policy.

Yep, this is a good question. I'm already used to "slave". I'm mainly leaning towards it as it is descriptive (the loop is slave to a hardware clock), and the name is different enough to "stay" in your mind.

I'd love to hear others' perspectives on this.

Comment thread realtime_tools/doc/async_scheduling_policy.rst Outdated
Co-authored-by: Felix Exner (fexner) <git@fexner.de>

@saikishor saikishor left a comment

Member

I'm skeptical about using a mode called SLAVE. IMHO, the async function handler should be independent.

Given that it does exactly the same as DETACHED mode, just without maintaining any sleep cycle, why not simply add a parameter called maintain_cycle_rate or something similar and then parse this from the other end?

@Nibanovic

Given that it does exactly the same as DETACHED mode, just without maintaining any sleep cycle, why not simply add a parameter called maintain_cycle_rate or something similar and then parse this from the other end?

My main point is that we need a named AsyncSchedulingPolicy for this feature, as when we use it in #478 and ros-controls/ros2_control#2971 to schedule controller updates in relation to blocking read, this enum is crucial.

I'd argue the users don't care if they are similar in implementation, they care about what it does for their hardware interface.

We could maybe rename SLAVE:
If SYNCHRONIZED is scheduled by the main thread and DETACHED is scheduled by the hw-interface thread, we could call it something like:

  • EXTERNAL - scheduled by an external source
  • BLOCKING - scheduled by the blocking read of the hardware interface
  • HARDWARE_DRIVEN - self-explanatory

@destogl destogl requested a review from saikishor April 17, 2026 13:30
@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

This PR is stale because it has been open for 45 days with no activity. Please tag a maintainer for help on completing this PR, or close it if you think it has become obsolete.

@github-actions github-actions Bot added stale and removed stale labels Jun 1, 2026
7 participants