
CHDR control endpoint consumes a CPU core polling a socket with no timeout #514

Open
benjaminogles opened this issue Oct 25, 2021 · 1 comment

@benjaminogles

Issue Description

The X300 uses a thread owned by the chdr_ctrl_endpoint class to poll for control ACKs and asynchronous command responses.

void recv_worker()

Calls to receive UDP packets eventually reach a function that uses recv(..., MSG_DONTWAIT) to check for a packet and then poll(..., timeout_ms) to wait for one if none was available.

UHD_INLINE size_t recv_udp_packet(
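
For context, that receive path boils down to something like the following. This is a minimal, simplified sketch of the pattern, not the actual UHD implementation; recv_udp_packet_sketch is a placeholder name.

// Sketch of the non-blocking-recv-then-poll pattern described above
// (simplified; not the real recv_udp_packet). With timeout_ms == 0 the
// poll() fallback also returns immediately, so the call never blocks.
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>

static size_t recv_udp_packet_sketch(int sock_fd, void* buf, size_t len, int timeout_ms)
{
    // Fast path: grab a packet if one is already queued
    ssize_t ret = ::recv(sock_fd, buf, len, MSG_DONTWAIT);
    if (ret > 0) {
        return static_cast<size_t>(ret);
    }

    // Slow path: wait until the socket is readable, up to timeout_ms.
    // A timeout of 0 makes this return right away instead of blocking.
    pollfd pfd{};
    pfd.fd     = sock_fd;
    pfd.events = POLLIN;
    if (::poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
        ret = ::recv(sock_fd, buf, len, MSG_DONTWAIT);
        if (ret > 0) {
            return static_cast<size_t>(ret);
        }
    }
    return 0; // no packet within the timeout
}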

However, the thread always passes a timeout of 0, which causes poll to return as quickly as possible. If no packets are received, the thread attempts to sleep.

boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
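
In other words, the worker loop has roughly the following shape. This is a simplified, self-contained sketch; stop_requested, send_recv_mutex, and try_receive_and_process are placeholder names, and only MIN_DUR (whose value I'm not quoting here) comes from the UHD source.

// Rough sketch of the recv_worker loop shape (not the real chdr_ctrl_endpoint
// code). Because the receive call uses a timeout of 0, the only thing
// throttling the loop is the nanosecond-scale sleep at the bottom.
#include <atomic>
#include <mutex>
#include <boost/thread/thread.hpp>
#include <boost/chrono.hpp>

namespace {
constexpr long MIN_DUR = 10; // nanoseconds; placeholder for the UHD constant
std::atomic<bool> stop_requested{false};
std::mutex send_recv_mutex; // shared with threads that send commands

void try_receive_and_process(int /*timeout_ms*/)
{
    // placeholder: non-blocking receive + packet dispatch would live here
}
} // namespace

void recv_worker_sketch()
{
    while (!stop_requested.load()) {
        {
            std::lock_guard<std::mutex> lock(send_recv_mutex);
            try_receive_and_process(/*timeout_ms=*/0); // returns immediately if idle
        }
        // If the scheduler cannot honor such a short sleep under load, the
        // loop degenerates into busy polling and pins a core.
        boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
    }
}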

If the system is under load, the kernel may not be able to sleep this thread in the time allotted (I'm guessing; see Additional Information), which leaves this thread tying up a CPU core polling for UDP packets that arrive relatively infrequently.

The "right" solution probably consists of passing a non-zero timeout to poll that will let the kernel block this thread until data arrives or the timeout expires. But currently, poll is called while holding a mutex shared between other threads that need to communicate with the device, including those sending commands the receiving thread needs to respond to. Thus, passing a non-zero timeout causes device initialization, for example, to take several minutes. The mutex is owned by this class.

Comments throughout the code refer to a "threaded_io_service" that needs to be developed to solve this problem. Right now, the only workaround I have found is to patch UHD to sleep this thread for a longer amount of time. Using a sleep time of 100us has worked for me and doesn't seem to affect functional behavior, though I'm not 100% sure it has no side effects. See the sketch of the patch below.
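
For reference, my workaround is essentially a one-line change along these lines (the exact location depends on the UHD version):

// original line in the worker loop:
//     boost::this_thread::sleep_for(boost::chrono::nanoseconds(MIN_DUR));
// patched to sleep for 100us instead:
boost::this_thread::sleep_for(boost::chrono::microseconds(100));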

Setup Details

  • UHD v4.1.0.3
  • X300
  • UBX160
  • CentOS 8.4.2105
  • Linux 4.18.0-305.19.1.el8_4.x86_64
  • Intel(R) Xeon(R) W-10885M CPU (8 cores)

Expected Behavior

The thread named uhd_ctrl_ep_<id> should consume very little CPU time when executing the benchmark_rate example program.

Actual Behaviour

The thread named uhd_ctrl_ep_<id> consumes up to 99% of CPU time when executing the benchmark_rate example program (and other applications).

Steps to reproduce the problem

  • Install a single UBX160 card in the X300
  • Connect host to X300 on SFP port 1
  • Run top -H
  • In another terminal, run benchmark_rate --args "addr=192.168.40.2" --rx_rate 200e6 --duration 60

Additional Information

I was able to reproduce the problem at lower sampling rates as well on the system described above. But on another system (Ubuntu 20, Linux 5.11, Intel i9-9880H CPU, 16 cores) I was unable to reproduce the issue. This is the basis for my guess above that the kernel is unable to sleep the thread in time on the "smaller" system when it is under load.

Questions

  • Is the right answer to develop this "threaded_io_service"?
  • What exactly needs to be "threaded"?
@benjaminogles
Author

Actually, after going over the code once more, I believe this problem affects all RFNoC devices, since the chdr_ctrl_endpoint is instantiated by the class that manages connections to nodes in an RFNoC graph.

chdr_ctrl_endpoint::make(_ctrl_xport, _pkt_factory, _my_mgmt_ctrl_epid);

This makes more sense, too, because on the Ubuntu system I mentioned I have seen this issue while streaming two channels with 160 MHz of bandwidth from an N320 (XG firmware). It also supports my guess that the control recv_worker isn't able to sleep when the system is under heavy load. I don't know how much the wasted CPU time matters relative to the load that induces this problem, but it would be nice not to waste it either way.
