Prepared by Alison Chaiken and offered under
Semi-transcript of the talk.
Some takeaways: I've been working to resolve the issue and to add more fine-grained locking to soft interrupts, so that instead of one global lock per CPU we identify the resources which are shared and require explicit locking. . . . If you have one networking card that is more important and needs to do things regularly, and another networking card which is handling bulk traffic like an ssh copy, and those run on the same CPU, then what basically happens is that the periodic traffic from the high-importance task gets blocked until the bulk traffic is completed. . . . The first version was not what the community expected, and now I'm working on a rework which should be more or less what they asked for in the last review.
Sequence counters are a lockless pattern which notifies readers of infrequently modified data that they should briefly wait. As always, multiple writers must be deconflicted with an actual lock, but any number of readers can simply check whether the sequence counter is even or odd, and retry in the latter case.
One adaptation necessary for RT is to disable preemption of writers so that the potentially many readers need not wait. The second problem is common to all lockless synchronization patterns: lockdep analysis cannot tell the scheduler that a high-priority reader should never preempt the writer.
The solution was simply to add a set of write-serialization locks, used only for PREEMPT_RT. Readers try to take the lock, so that writers can inherit the readers' priority. A further problem is NMIs, which interrupt even when preemption is disabled. The solution there is the seqcount latch, which essentially means maintaining two copies of the data, not unlike RCU.
The existing tasklet API had potential deadlock problems not unlike those of del_timer_sync(), as described by Siewior above. Just as a thread which wants to delete a timer may preempt that timer, code which wants to disable or kill a tasklet may preempt a running instance, resulting in a busy-loop that runs forever.
Linutronix surveyed over 400 call-sites of tasklet_kill() or tasklet_disable(). Of these, 10 were instances where tasklet_disable() was invoked in non-preemptible atomic context. To handle those cases, Linutronix introduced the new tasklet_disable_in_atomic(), which, a bit comically, simply enables and then disables softirqs. Then Peter Zijlstra fixed all the call-sites which formerly spun waiting for a running tasklet to exit, making them rely on condition variables instead. We hope that tasklets can be replaced entirely in the future.
Q.: Who is replacing tasklets?
A. by Siewior: At Linus' suggestion, Tejun Heo created a "low-latency atomic workqueue" by relabelling tasklets as "bottom-half workqueues." They don't rely on the existing buggy tasklet API, but are still run by ksoftirqd.
Darwish money quote: The most important thing about this talk is that, if you really use the standard kernel-locking APIs, and if you use them the way they should be used, without depending on implementation details of the locking APIs, then we do all the hard work for you. We only had problems when drivers were overly smart, doing something that they should not be doing. During the sequence counter work, for example, there were a lot of drivers that open-coded certain locking because they thought that their own implementation was better than the standard kernel locking mechanisms, which is false not only from a performance perspective, but also from a lockdep perspective. When any driver breaks on RT, we hope that driver will not be merged, because it will not be respecting the kernel locking APIs. The i915 graphics driver and DRM code still contain open-coded locks.
A third major difficulty with drivers was misuse of context-detection macros like preemptible() and in_atomic(). The behavior of these macros depends on details of the kernel configuration. They are fine in core scheduler code, which is less dependent on configuration details and whose developers understand execution context well, but they make no sense in device drivers.
The last major part of driver cleanup for RT involves printk and atomic consoles. When those changes merge to mainline, there will no longer be an "RT patchset" for device drivers.
Q.: Is Coccinelle or another static-analysis tool applicable to find problems like open-coded locks, or are the various weird reimplementations too different from one another?
A. by Darwish: Coccinelle was useful for certain tasks, although the tasklet survey was completely manual, since it was necessary to determine whether each tasklet_kill() or tasklet_disable() call-site was preemptible or not. The sequence-lock bugs could be caught only by dynamic analysis.
A. by Siewior: If in_interruptible() and friends are invoked in code where the chosen preemption model makes them meaningless, compilation will now fail.
A. by Darwish: Unfortunately, the dynamic behavior of seqlocks means that compile-time rejection of error-prone code is not possible.
Q.: As PREEMPT_RT moves into mainline, how should users who currently move all the RT work onto a separate core running Zephyr decide whether that is still necessary?
A. by Darwish: The Jailhouse hypervisor is quite suitable for partitioning systems. Whether or not it makes sense to partition depends on the use case and the latency requirements.
A. by Daniel Bristot de Oliveira: Let me add just one remark on top of this. When we talk about partitioning the system, an orthogonal Linux feature that helps is CPU isolation. Isolcpus can deliver almost full isolation.
A. by Darwish: Isolcpus is not only for performance, but also for safety and standards compliance: there are multiple safety certifications, and isolation benefits safety validation as well.
Q.: What about the common use case of audio, where partitioning is still a common solution?
A.: Audio is a classic use case for realtime Linux. Audio developers who run PREEMPT_RT systems often post bug reports to the linux-rt-users mailing list.
The presentation reported on a heroic effort to track down causes of latency as measured by the standard cyclictest tool. The goal of the work was to support development of IoT endpoint devices connected by ultralow latency 5G communication. The role of RT Linux would be to run on a controller for edge devices like sensors.
The investigation compared cyclictest results among an ARM Ampere Altra 80-core system, an Altra 24-core system, and an Intel Cascade Lake 24-core system, all running the Ubuntu 5.15-RT kernel. Initial measurements indicated classically good RT performance for the Intel test system, but many latency measurements up to 125 µs for the ARM hosts, with outliers up to 240 µs, 134 µs and 31 µs for the 80-core ARM host, the 24-core ARM host and the Intel host respectively. What was wrong on ARM?
Huang presented the kernel tuning options that his team tested, many of which, like pinning IRQs and workqueues, disabling the timer tick, and setting isolcpus, are common in RT systems. With the optimal choice of parameters, they reduced the maximum latency to 8 µs for 8 cores, 12 µs for 48 cores, and 143 µs for 64 cores. Finally, they set the cmdline parameter randomize_kstack_offset=0, which made the latency independent of core count. Why did this final tuning parameter have such a large impact?
Randomizing the stack offset requires calling _get_random_bytes(), which in turn calls crng_make_state(), which calls crng_reseed() if the entropy pool is currently exhausted; crng_reseed() disables IRQs and takes a spinlock. As LWN noted:
“The existing arm64 stack randomization uses the kernel rng to acquire 5 bits of address space randomization. This is problematic because it creates non- determinism in the syscall path when the rng needs to be generated or reseeded. This shows up as large tail latencies in some benchmarks and directly affects the minimum RT latencies as seen by cyclictest.”