Prepared by Alison Chaiken and offered under
Semi-transcript of the talk.
Some takeaways: I've been working to resolve the issue and to add more fine-grained locking to soft interrupts, so that instead of one global lock per CPU we identify the resources which are shared and require explicit locking. . . . If you have one networking card that is more important and needs to do things regularly, and another networking card which is handling bulk traffic like an ssh copy, and those run on the same CPU, then what basically happens is that the periodic traffic from the high-importance task gets blocked until the bulk traffic is completed. . . . The first version was not what the community expected, and now I'm working on a rework which should be more or less what they asked for in the last review.
Sequence counters are a lockless pattern which notifies readers of infrequently modified data that they should briefly wait. As always, multiple writers must be deconflicted with an actual lock, but any number of readers can simply check whether the sequence counter is even or odd, and retry in the latter case.
One adaptation necessary for RT is to disable preemption of writers so that the potentially many readers need not wait. The second problem is common to all lockless synchronization patterns: lockdep analysis cannot tell the scheduler that a high-priority reader should never preempt the writer.
The solution was simply to add a set of write-serialization locks, used only for PREEMPT_RT. Readers try to take the lock, so that writers can inherit the readers' priority. A further problem is NMIs, which interrupt even when preemption is disabled. The solution there is the seqcount latch, which essentially means maintaining two copies of the data, not unlike RCU.
The existing tasklet API had potential deadlock problems not unlike those of del_timer_sync(), as described by Siewior above. Just as a thread which wants to delete a timer may preempt that timer, code which wants to disable or kill a tasklet may preempt a running instance, resulting in a busy-loop that runs forever.
Linutronix surveyed over 400 call-sites of tasklet_kill() or tasklet_disable(). Of these, 10 were instances where tasklet_disable() was invoked in non-preemptible atomic context. To handle those cases, Linutronix introduced the new tasklet_disable_in_atomic(), which, a bit comically, simply enables and then disables softirqs. Then Peter Zijlstra fixed all the call-sites which formerly spun waiting for a running tasklet to exit, making them rely on condition variables instead. We hope that tasklets can be replaced entirely in the future.
Q.: Who is replacing tasklets?
A. by Siewior: At Linus' suggestion, Tejun Heo created a "low-latency atomic workqueue" by relabelling tasklets as "bottom-half workqueues." They don't rely on the existing buggy tasklet API, but are still run by ksoftirqd.
Darwish money quote: The most important thing about this talk is that, if you really use the standard kernel-locking APIs, and if you use them the way they should be used, without depending on implementation details of the locking APIs, then we do all the hard work for you. We only had problems when drivers were overly smart, doing something that they should not be doing. During the sequence counter work, for example, there were a lot of drivers that open-coded certain locking because they thought that their own implementation was better than the standard kernel locking mechanisms, which is false not only from a performance perspective, but also from a lockdep perspective. When any driver breaks on RT, we hope that driver will not be merged, because it will not be respecting the kernel locking APIs. The i915 graphics driver and DRM code still contain open-coded locks.
A third major difficulty with drivers was misuse of context-detection macros like preemptible() and in_atomic(). The behavior of these macros depends on details of the kernel configuration. They are fine in core scheduler code, which is less dependent on configuration details and whose developers understand execution context well, but they make no sense in device drivers.
The last major part of driver cleanup for RT involves printk and atomic consoles. When those changes merge to mainline, there will no longer be an "RT patchset" for device drivers.
Q.: Is Coccinelle or another static-analysis tool applicable to find problems like open-coded locks, or are the various weird reimplementations too different from one another?
A. by Darwish: Coccinelle was useful for certain tasks, although the tasklet survey was completely manual, since it was necessary to determine whether each tasklet_kill() or tasklet_disable() call-site was preemptible or not. The sequence-lock bugs could be caught only by dynamic analysis.
A. by Siewior: If in_interruptible() and friends are invoked in code where the chosen preemption model makes them meaningless, compilation will now fail.
A. by Darwish: Unfortunately, the dynamic behavior of seqlocks means that compile-time rejection of error-prone code is not possible.
Q.: As PREEMPT_RT moves into mainline, how should users who currently move all the RT work onto a separate core running Zephyr decide whether that is still necessary?
A. by Darwish: The Jailhouse hypervisor is quite suitable for partitioning systems. Whether or not it makes sense to partition depends on the use case and the latency requirements.
A. by Daniel Bristot de Oliveira: Let me add just one remark on top of this. When we talk about partitioning the system, an orthogonal Linux feature that helps is CPU isolation. Isolcpus can deliver almost full isolation.
A. by Darwish: Isolcpus is not only for performance, but also for safety and standards compliance: there are multiple safety certifications, and isolation benefits safety validation as well.
Q.: What about the common use case of audio, where partitioning is still a common solution?
A.: Audio is a classic use case for realtime Linux. Audio developers who run PREEMPT_RT systems often post bug reports to the linux-rt-users mailing list.
The presentation reported on a heroic effort to track down causes of latency as measured by the standard cyclictest tool. The goal of the work was to support development of IoT endpoint devices connected by ultralow latency 5G communication. The role of RT Linux would be to run on a controller for edge devices like sensors.
The investigation compared cyclictest results among an ARM Ampere Altra 80-core system, an Altra 24-core system, and an Intel Cascade Lake 24-core system, all running the Ubuntu 5.15-RT kernel. Initial measurements indicated classically good RT performance for the Intel test system, but many latency measurements up to 125 µs for the ARM hosts, with outliers up to 240 µs, 134 µs and 31 µs for the 80-core ARM host, the 24-core ARM host and the Intel host respectively. What was wrong on ARM?
Huang presented the kernel tuning options that his team tested, many of which, like pinning IRQs and workqueues, disabling the timer tick, and setting isolcpus, are common in RT systems. With the optimal choice of parameters, they reduced the maximum latency to 8 µs for 8 cores, 12 µs for 48 cores, and 143 µs for 64 cores. Finally, they set the cmdline parameter randomize_kstack_offset=0, which made the latency independent of core count. Why did this final tuning parameter have such a large impact?
Randomizing the stack offset requires calling _get_random_bytes(), which in turn calls crng_make_state(), which calls crng_reseed() if the entropy pool is currently exhausted; crng_reseed() disables IRQs and takes a spinlock. As LWN noted:
“The existing arm64 stack randomization uses the kernel rng to acquire 5 bits of address space randomization. This is problematic because it creates non- determinism in the syscall path when the rng needs to be generated or reseeded. This shows up as large tail latencies in some benchmarks and directly affects the minimum RT latencies as seen by cyclictest.”