docs: Add additional interesting kernel parameters, explanations for …

…system tick rate and debugging system management interrupts
2b-t · Nov 5, 2023 · 365fcda · 365fcda
1 parent 38b495d
commit 365fcda
Showing 1 changed file with 78 additions and 5 deletions.
diff --git a/doc/RealTimeOptimizations.md b/doc/RealTimeOptimizations.md
@@ -6,21 +6,42 @@ Author: [Tobit Flatscher](https://github.com/2b-t) (August 2022 - February 2023)
 
 ## 1. Introduction
 
-The following sections will outline a few things to consider when setting up a real-time capable systems and optimizations that should help improve its real-time performance significantly. This guide is largely based on the exhaustive [**Red Hat optimisation guide**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html-single/optimizing_rhel_8_for_real_time_for_low_latency_operation/index), focusing on particular aspects of it. Additional information can be found in the [Ubuntu real-time kernel tuning guide](https://ubuntu.com/blog/real-time-kernel-tuning).
+The following sections will outline a few things to consider when setting up a real-time capable systems and optimizations that should help improve its real-time performance significantly and different techniques for benchmarking the real-time performance of a system. This guide is largely based on the exhaustive [**Red Hat optimisation guide**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html-single/optimizing_rhel_8_for_real_time_for_low_latency_operation/index), focusing on particular aspects of it. Additional information can be found in the [Ubuntu real-time kernel tuning guide](https://ubuntu.com/blog/real-time-kernel-tuning).
 
 ## 2. Selecting hardware
 
 The latency on most computers that are optimised for energy efficiency - like laptops - will be a magnitude or two larger than the one of a desktop system as can also clearly be seen [browsing the OSADL latency plots](https://www.osadl.org/Latency-plots.latency-plots.0.html). **It is therefore generally not advisable to use laptops for real-time tests** (with and without a Docker): Some might have decent performance most of the time but might result in huge latencies in the magnitude of milliseconds when run over an extended period. Some single-board computers like the Raspberry Pi, [seem to be surprisingly decent](https://metebalci.com/blog/latency-of-raspberry-pi-3-on-standard-and-real-time-linux-4.9-kernel/) but still can't compete with a desktop computer. The Open Source Automation Development Lab eG (OSADL) performs long-term tests on several systems that can be inspected [on their website](https://www.osadl.org/OSADL-QA-Farm-Real-time.linux-real-time.0.html) in case you want to compare the performance of your system to others.
 
 ## 3. Kernel parameters
 
-Real-time performance might be improved by **changing kernel parameters when recompiling the kernel**. On the [OSADL long-term test farm website](https://www.osadl.org/Real-time-optimization.qa-farm-latency-optimization.0.html) you can inspect and download the kernel configuration for a particular system (e.g. [here](https://www.osadl.org/?id=1312#kernel)). Important settings that might help reduce latency include disabling all irrelevant debugging feature. The parameters of your current configuration can be displayed with [`$ cat /boot/config-$(uname -r)`](https://www.baeldung.com/linux/kernel-config). One of the most significant parameters is arguably [setting the timer interrupt to 1000Hz](https://ubuntu.com/blog/industrial-embedded-systems-ii) with the options:
+Real-time performance might be improved by **changing kernel parameters when recompiling the kernel**. On the [OSADL long-term test farm website](https://www.osadl.org/Real-time-optimization.qa-farm-latency-optimization.0.html) you can inspect and download the kernel configuration for a particular system (e.g. [here](https://www.osadl.org/?id=1312#kernel)). Important settings that might help reduce latency include disabling all irrelevant debugging feature. The parameters of your current configuration can be displayed with [`$ cat /boot/config-$(uname -r)`](https://www.baeldung.com/linux/kernel-config).
+
+The operating system uses [system tick time](https://issuu.com/hibadweib/docs/open_source_for_you_-_october_2012/s/13663273) to generate interrupt requests for scheduling events that can cause a context switch for time-sliced/round-robin scheulding of tasks with the same priority. The resolution of this timer ticks can be adjusted improving the accuracy of timed events.
+
+For Linux the currently highest supported **tick rate** is **1000Hz** and can be set with the [kernel parameter](https://ubuntu.com/blog/industrial-embedded-systems-ii):
 
 ```shell
 CONFIG_HZ_1000=y # Requires to comment the other entries such as CONFIG_HZ=250
 CONFIG_HZ=1000
 ```
 
+In order to be more performant one can also stop the periodic ticks on idle and further omit the scheduling ticks on CPUs that only have one runnable task as described [here](https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt) as well as [here](https://www.suse.com/c/cpu-isolation-introduction-part-1/):
+
+```shell
+CONFIG_NO_HZ_IDLE=y
+CONFIG_NO_HZ_FULL=y
+```
+
+Latter allows to improve worst-case latency by the duration of the scheduling-clock interrupt. One has to explicitly mark the CPUs that this should be applied to with `nohz_full=6-8` but it is not possible to mark all of the CPUs as adaptive-tick CPUs! By using the kernel parameter `CONFIG_NO_HZ_FULL_ALL=y` one can activate this mode for all CPUs except the boot CPU.
+
+For lock-less programming Linux has a [read-copy update (RCU)](https://en.wikipedia.org/wiki/Read-copy-update) synchronization mechanism that avoids lock-based primitives when multiple threads concurrently read and update elements that are linked through pointers and belong to shared data structures avoiding inconsistencies. This mechanisms sometimes [queues callbacks on CPUs to be performed at a future moment](https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/offloading_rcu_callbacks). One can exclude certain CPUs from running RCU callbacks by compiling the kernel with
+
+```shell
+CONFIG_RCU_NOCB_CPU=y
+```
+
+and later specifying a list of CPUs that this should be applied to with e.g. `rcu_nocbs=6-8`.
+
 ## 4. CPU frequency scaling and sleep states
 
 During operation the **operating system may scale the CPU frequency up and down** in order to improve performance or save energy. Responsible for this are
@@ -93,7 +114,7 @@ Additionally there is [`cpupower`](https://manpages.ubuntu.com/manpages/impish/m
 
 ### 4.2 Cyclictest
 
-An important test for **evaluating the latencies of a real-time system caused by hardware, firmware and operating system as a whole** is the [`cyclictest`](https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest/start). It repeatedly measures the difference between a thread's intended and actual wake-up time (see [here](https://events.static.linuxfound.org/sites/events/files/slides/cyclictest.pdf) for more details). For real-time purposes the **worst-case latency** is actually far more important than the average latency as any violation of a deadline might be fatal.
+An important test for **evaluating the latencies of a real-time system caused by hardware, firmware and operating system as a whole** is the [`cyclictest`](https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest/start). It repeatedly measures the difference between a thread's intended and actual wake-up time (see [here](https://events.static.linuxfound.org/sites/events/files/slides/cyclictest.pdf) for more details). For real-time purposes the **worst-case latency** is actually far more important than the average latency as any violation of a deadline might be fatal. Ideally we would want to test it in combination with a **stress test** such as [`$ stress-ng -c $(nproc)`](https://wiki.ubuntu.com/Kernel/Reference/stress-ng) and run it for an **extended period of times** (hours, days or even weeks).
 
 The effectiveness of turning the C-states off is evaluated for two systems, a [**Lenovo ThinkStation P500 tower**](https://www.getech.co.uk/pdf/p5001.pdf) (with an [Intel(R) Xeon(R) E5-2680 v3 @ 2.50GHz twelve-core CPU](https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e52680-v3-30m-cache-2-50-ghz.html) and 32GB of ECC-DDR4-RAM) and a [**Lenovo ThinkPad T14 Gen1 notebook**](https://psref.lenovo.com/syspool/Sys/PDF/ThinkPad/ThinkPad_T14_Gen_1_Intel/ThinkPad_T14_Gen_1_Intel_Spec.PDF) (with an [Intel(R) Core(TM) i5-10210U quad-core CPU](https://www.intel.co.uk/content/www/uk/en/products/sku/195436/intel-core-i510210u-processor-6m-cache-up-to-4-20-ghz/specifications.html) and 16GB of DDR4-RAM). In both cases a `cyclictest` with the following parameters is performed:
 
@@ -117,10 +138,62 @@ Both systems use the same self-compiled version of an **Ubuntu 22.04 5.15.86-rt5
 
 While latencies in the sub 100us region are generally far than enough for robotics, we can see that the **latency is greatly reduced for both systems when de-activating the C-states**, even halfed in the case of the Lenovo ThinkStation P500. These differences are also visible when using a Docker, as the latencies from real-time capable code ran inside a Docker are virtually indistinguishable from processes ran on the host system. It should be mentioned that the T14 has a remarkable performance for a notebook (several notebooks I used suffered from severe and regular latency spikes in the 500us region) but when running the test for extended periods rare latency spikes of up to two milliseconds were observed.
 
+### 4.3 Hwlatdetect
+
+For detecting **system management interrupts** we can use the [`hwlatdetect` tool](https://manpages.ubuntu.com/manpages/xenial/en/man8/hwlatdetect.8.html). It works by consuming all the processor ("hogging", leaving other processes unable to run!), polling the CPU time stamp counter and looking for gaps which then is an indicator that in this time frame an SMI was processed.
+
+You can run it with the following command as described [here](https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux_for_real_time/8/html/optimizing_rhel_8_for_real_time_for_low_latency_operation/assembly_running-and-interpreting-hardware-and-firmware-latency-tests_optimizing-rhel8-for-real-time-for-low-latency-operation):
+
+```bash
+$ sudo hwlatdetect --duration=60s
+```
 
+It will then output a report at the end.
 
 ## 5. Isolating CPUs
 
-Typically one will reserve a CPU core for a particular real-time task as described for Grub and Linux [here](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime) and [here](http:https://doc.dpdk.org/spp-18.02/setup/performance_opt.html). If [hyper-threading](https://www.xmodulo.com/check-hyper-threading-enabled-linux.html) is activated in the BIOS also the corresponding virtual core has to be isolated. The indices of the second virtual cores follow the physical ones.
+Typically one will reserve a CPU core for a particular real-time task as described for Grub and Linux [here](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime) and [here](http:https://doc.dpdk.org/spp-18.02/setup/performance_opt.html). If [hyper-threading](https://www.xmodulo.com/check-hyper-threading-enabled-linux.html) is activated in the BIOS (not recommended) also the corresponding virtual core has to be isolated. The indices of the second virtual cores follow the physical ones.
+
+You can **isolate the CPUs** by adding the following option (the numbers correspond to the list of CPUs we want to isolate):
+
+```shell
+GRUB_CMDLINE_LINUX_DEFAULT=“isolcpus=1-3,5,7”
+```
+
+to `/etc/default/grub` and then update grub with `$ sudo update-grub` and then reboot.
+
+Additionally it makes sense to also **isolate the corresponding [hardware interrupts (IRQs)](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))** (on a dedicated CPU), disabling the irqbalance daemon and binding the process to a particular CPU as described [here](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html-single/optimizing_rhel_8_for_real_time_for_low_latency_operation/index#assembly_binding-interrupts-and-processes_optimizing-RHEL8-for-real-time-for-low-latency-operation). This is in particular crucial for applications that include network and EtherCAT communication.
+
+For this disable irqbalance with:
+
+```bash
+$ systemctl disable irqbalance
+$ systemctl stop irqbalance
+```
+
+Then check the interrupts with
+
+```bash
+$ cat /proc/interrupts
+```
+
+Note down the IRQ number at the beginning of the relevant line and then you can place the interrupt of the corresponding IRQ onto the core of your choice:
+
+```bash
+$ echo <cpu_core> > /proc/irq/<irq_id>/smp_affinity
+```
+
+Finally we can also activate **full dynticks** internals and **prohibit RCUs from running** on our isolated cores (as described [here](https://www.suse.com/c/cpu-isolation-full-dynticks-part2/) and [here](https://www.suse.com/c/cpu-isolation-nohz_full-part-3/)) by adding
+
+```shell
+nohz_full=1-3,5,7 rcu_nocbs=1-3,5,7
+```
+
+where the number corresponds to the CPUs we would like to isolate.
+
+Putting this together with the previous Grub configuration options this leaves us with:
+
+```shell
+GRUB_CMDLINE_LINUX_DEFAULT=“isolcpus=1-3,5,7 nohz_full=1-3,5,7 rcu_nocbs=1-3,5,7”
+```
 
-Additionally it makes sense to also **isolate the corresponding [hardware interrupts (IRQs)](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))** (on the same CPU), disabling the irqbalance daemon and binding the process to a particular CPU as described [here](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html-single/optimizing_rhel_8_for_real_time_for_low_latency_operation/index#assembly_binding-interrupts-and-processes_optimizing-RHEL8-for-real-time-for-low-latency-operation). This is in particular crucial for applications that include network and EtherCAT communication.