
CHDR control endpoint consumes a CPU core polling a socket with no timeout #514

Open
benjaminogles opened this issue Oct 25, 2021 · 1 comment

@benjaminogles

Issue Description

The X300 uses a thread owned by the chdr_ctrl_endpoint class to poll for control ACKs and asynchronous command responses.

void recv_worker()

Calls to receive UDP packets eventually reach a function that uses recv(..., MSG_DONTWAIT) to check for a packet and then poll(..., timeout_ms) to wait for one if none was available.

UHD_INLINE size_t recv_udp_packet(
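
For context, that receive path boils down to something like the following. This is a minimal, simplified sketch of the pattern, not the actual UHD implementation; recv_udp_packet_sketch is a placeholder name.

// Sketch of the non-blocking-recv-then-poll pattern described above
// (simplified; not the real recv_udp_packet). With timeout_ms == 0 the
// poll() fallback also returns immediately, so the call never blocks.
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>

static size_t recv_udp_packet_sketch(int sock_fd, void* buf, size_t len, int timeout_ms)
{
    // Fast path: grab a packet if one is already queued
    ssize_t ret = ::recv(sock_fd, buf, len, MSG_DONTWAIT);
    if (ret > 0) {
        return static_cast<size_t>(ret);
    }

    // Slow path: wait until the socket is readable, up to timeout_ms.
    // A timeout of 0 makes this return right away instead of blocking.
    pollfd pfd{};
    pfd.fd     = sock_fd;
    pfd.events = POLLIN;
    if (::poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
        ret = ::recv(sock_fd, buf, len, MSG_DONTWAIT);
        if (ret > 0) {
            return static_cast<size_t>(ret);
        }
    }
    return 0; // no packet within the timeout
}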

However, the thread always passes a timeout of 0, which causes poll to return as quickly as possible. If no packets are received, the thread attempts to sleep.

boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
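
In other words, the worker loop has roughly the following shape. This is a simplified, self-contained sketch; stop_requested, send_recv_mutex, and try_receive_and_process are placeholder names, and only MIN_DUR (whose value I'm not quoting here) comes from the UHD source.

// Rough sketch of the recv_worker loop shape (not the real chdr_ctrl_endpoint
// code). Because the receive call uses a timeout of 0, the only thing
// throttling the loop is the nanosecond-scale sleep at the bottom.
#include <atomic>
#include <mutex>
#include <boost/thread/thread.hpp>
#include <boost/chrono.hpp>

namespace {
constexpr long MIN_DUR = 10; // nanoseconds; placeholder for the UHD constant
std::atomic<bool> stop_requested{false};
std::mutex send_recv_mutex; // shared with threads that send commands

void try_receive_and_process(int /*timeout_ms*/)
{
    // placeholder: non-blocking receive + packet dispatch would live here
}
} // namespace

void recv_worker_sketch()
{
    while (!stop_requested.load()) {
        {
            std::lock_guard<std::mutex> lock(send_recv_mutex);
            try_receive_and_process(/*timeout_ms=*/0); // returns immediately if idle
        }
        // If the scheduler cannot honor such a short sleep under load, the
        // loop degenerates into busy polling and pins a core.
        boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
    }
}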

If the system is under load, the kernel may not be able to sleep this thread in the time allotted (I'm guessing; see Additional Information), which leaves this thread tying up a CPU core polling for UDP packets that arrive relatively infrequently.

The "right" solution probably consists of passing a non-zero timeout to poll that will let the kernel block this thread until data arrives or the timeout expires. But currently, poll is called while holding a mutex shared between other threads that need to communicate with the device, including those sending commands the receiving thread needs to respond to. Thus, passing a non-zero timeout causes device initialization, for example, to take several minutes. The mutex is owned by this class.

Comments throughout the code refer to a "threaded_io_service" that needs to be developed to solve this problem. Right now, the only workaround I have found is to patch UHD to sleep this thread for a longer amount of time. Using a sleep time of 100us has worked for me and doesn't seem to affect functional behavior, though I'm not 100% sure it has no side effects. See the sketch of the patch below.
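
For reference, my workaround is essentially a one-line change along these lines (the exact location depends on the UHD version):

// original line in the worker loop:
//     boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
// patched to sleep for 100us instead:
boost::this_thread::sleep_for(boost::chrono::microseconds(100));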

Setup Details

  • UHD v4.1.0.3
  • X300
  • UBX160
  • CentOS 8.4.2105
  • Linux 4.18.0-305.19.1.el8_4.x86_64
  • Intel(R) Xeon(R) W-10885M CPU (8 cores)

Expected Behavior

The thread named uhd_ctrl_ep_<id> should consume very little CPU time when executing the benchmark_rate example program.

Actual Behaviour

The thread named uhd_ctrl_ep_<id> consumes up to 99% of CPU time when executing the benchmark_rate example program (and other applications).

Steps to reproduce the problem

  • Install a single UBX160 card in the X300
  • Connect host to X300 on SFP port 1
  • Run top -H
  • In another terminal, run benchmark_rate --args "addr=192.168.40.2" --rx_rate 200e6 --duration 60

Additional Information

I was able to reproduce the problem at lower sampling rates as well on the system described above. But on another system (Ubuntu 20, Linux 5.11, Intel i9-9880H CPU, 16 cores) I was unable to reproduce the issue. This is the basis for my guess above that the kernel is unable to sleep the thread in time on the "smaller" system when it is under load.

Questions

  • Is the right answer to develop this "threaded_io_service"?
  • What exactly needs to be "threaded"?
@benjaminogles
Author

Actually, after going over the code once more, I believe this problem affects all RFNoC devices, since the chdr_ctrl_endpoint is instantiated by the class that manages connections to nodes in an RFNoC graph.

chdr_ctrl_endpoint::make(_ctrl_xport, _pkt_factory, _my_mgmt_ctrl_epid);

This makes more sense, too, because on the Ubuntu system I mentioned I have seen this issue while streaming two channels with 160 MHz of bandwidth from an N320 (XG firmware). It also supports my guess that the control recv_worker isn't able to sleep when the system is under heavy load. I don't know how much the wasted CPU time matters relative to the load that induces this problem, but it would be nice not to waste it either way.
