Thread Scheduling
Modern OSs execute multiple processes concurrently by running each for a brief burst, then switching to the next: a feature called multiprogramming. Modern processors include multiple cores, each of which can run its own thread: a feature known as multiprocessing. When these two features are combined, a new engineering challenge emerges: when should a thread run? How long should it run, and on what processor? Choosing a thread scheduling strategy is a complex problem, and the choice can have a significant effect on performance. In particular, threads that don't get scheduled to run can suffer starvation, which can adversely affect user-visible latencies.
In an ideal system, a simple strategy of assigning chunks of CPU-time to threads in a round-robin manner would maximize fairness by ensuring all threads are equally starved. But, of course, real systems are far from ideal, and this view of fairness may not be an appropriate performance goal. Here are a few factors that make scheduling tricky:
- Not all threads are equally important. Each thread has a priority that specifies its importance relative to other threads. Thread priorities must be selected carefully, and the scheduler must honor those selections.
- Not all cores are equal. The structure of the memory hierarchy can make it costly to shift a thread from one core to another, especially if that shift moves it to a new NUMA node. Users can explicitly pin threads to a CPU or a set of CPUs, or can exclude threads from specific CPUs, using features like sched_setaffinity or cgroups. But such restrictions also make scheduling even tougher.
- Not all threads want to run all the time. Threads may sleep waiting for some event, yielding their core to other execution. When the event occurs, pending threads should be scheduled quickly.
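As a concrete illustration of the pinning mentioned in the list above, here is a minimal sketch using Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which wrap the Linux system calls of the same name. The helper name `pinned_to` is made up for this example, and the code assumes a Linux host where the chosen CPU is in the caller's allowed set.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_to(cpus):
    """Temporarily restrict the caller to `cpus`, then restore the old mask.

    `pinned_to` is a hypothetical helper for illustration; pid 0 means
    "the caller" for both affinity calls.
    """
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield previous
    finally:
        # Always give the scheduler its freedom back, even on error.
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    # Pick a CPU we are already allowed to run on.
    one_cpu = {min(os.sched_getaffinity(0))}
    with pinned_to(one_cpu):
        print("restricted to CPUs:", sorted(os.sched_getaffinity(0)))
    print("restored to CPUs:", sorted(os.sched_getaffinity(0)))
```

Restoring the previous mask on exit keeps the restriction as narrow as possible, since every pinned thread removes an option from the scheduler.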
Tracepoints and Kernel Tracing
The Linux kernel is instrumented with hooks called tracepoints; when certain actions occur, any code hooked to the relevant tracepoint is called with arguments that describe the action. The kernel also provides a debug feature that can trace this data and stream it to a buffer for later analysis.
Hundreds of different tracepoints exist, arranged into families of related functionality. The sched family includes tracepoints that can reconstruct thread scheduling behavior: when threads switched in, blocked on some event, or migrated between cores. These sched tracepoints provide fine-grained and comprehensive detail about thread scheduling behavior over a short period of traced execution.
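To make the reconstruction idea concrete, here is a small sketch that parses `sched_switch` events in the kernel's human-readable trace text format and totals how long each thread stayed on a CPU. The sample lines are invented for illustration, the regular expression assumes the common layout with a flags column and microsecond timestamps, and this is not how SchedViz itself ingests traces.

```python
import re
from collections import defaultdict

# Two invented sched_switch lines in the ftrace text layout (an assumption).
SAMPLE = [
    "  <idle>-0     [001] d..3  1000.000100: sched_switch: prev_comm=swapper/1 "
    "prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=worker next_pid=4321 next_prio=120",
    "  worker-4321  [001] d..3  1000.004100: sched_switch: prev_comm=worker "
    "prev_pid=4321 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120",
]

SWITCH_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+\S+\s+(?P<sec>\d+)\.(?P<usec>\d{6}):\s+sched_switch:\s+"
    r"prev_comm=.+?\s+prev_pid=(?P<prev_pid>\d+)\s+prev_prio=\d+\s+"
    r"prev_state=(?P<prev_state>\S+)\s+==>\s+"
    r"next_comm=.+?\s+next_pid=(?P<next_pid>\d+)\s+next_prio=\d+"
)


def parse_switches(lines):
    """Return (cpu, timestamp_us, prev_pid, next_pid) for each sched_switch line."""
    events = []
    for line in lines:
        m = SWITCH_RE.search(line)
        if m is None:
            continue  # Not a sched_switch line, or an unexpected layout.
        ts_us = int(m["sec"]) * 1_000_000 + int(m["usec"])
        events.append((int(m["cpu"]), ts_us, int(m["prev_pid"]), int(m["next_pid"])))
    return events


def on_cpu_time_us(events):
    """Sum, per PID, the time between switching in and switching out."""
    running_since = {}  # cpu -> (pid, switch-in timestamp)
    totals = defaultdict(int)
    for cpu, ts_us, prev_pid, next_pid in sorted(events, key=lambda e: e[1]):
        started = running_since.get(cpu)
        if started is not None and started[0] == prev_pid:
            totals[prev_pid] += ts_us - started[1]
        running_since[cpu] = (next_pid, ts_us)
    return dict(totals)


if __name__ == "__main__":
    print(on_cpu_time_us(parse_switches(SAMPLE)))  # {4321: 4000}
```

Integer microseconds are used instead of floats so that interval arithmetic stays exact.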
SchedViz: Visualize Thread Scheduling Over Time
SchedViz provides an easy way to gather kernel scheduling traces from hosts, and visualize those traces over time. Tracing is simple:
$ sudo ./trace.sh -capture_seconds 5 -out ~/traces
Then, importing the resulting collection into SchedViz takes just one click.
Once imported, a collection will always be available for later viewing, until you delete it.
The SchedViz UI displays collections in several ways. A zoomable and pannable heatmap shows system cores on the y-axis, and the trace duration on the x-axis. Each core in the system has a swim-lane, and each swim-lane shows CPU utilization (when that CPU is being kept busy) and wait-queue depth (how many threads are waiting to run on that CPU). The UI also includes a thread list that displays which threads were active in the heatmap, along with how long they ran, waited to run, and blocked on some event, and how many times they woke up or migrated between cores. Individual threads can be selected to show their behavior over time, or expanded to see their details.
Using SchedViz to Identify Antagonisms: Not all threads are equally important
Antagonism describes the situation in which a victim thread is ready to run on some CPU, while another antagonist thread runs on that same CPU. Long antagonisms, or high cumulative duration of antagonisms, can degrade user experience or system efficiency by making a critical process unavailable at critical times.
Antagonist analysis is useful when threads are meant to have exclusive access to some core but don’t get it. In SchedViz, such antagonisms are listed in each thread’s summary, as well as being immediately visible as breaks in the victim thread's running bar. Zooming in reveals exactly what work is interfering.
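The definition of antagonism above reduces to an interval-overlap computation. The sketch below is one way to express it, assuming the trace has already been reduced to per-CPU waiting and running intervals; the `Interval` type and the `antagonists` function are hypothetical names for this example, not part of SchedViz.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    """A span of time a thread spent on one CPU, either running or waiting."""
    cpu: int
    pid: int
    start_us: int
    end_us: int


def antagonists(victim_waits, running):
    """Total, per antagonist PID, the time it ran while the victim waited
    on the same CPU."""
    totals = defaultdict(int)
    for wait in victim_waits:
        for run in running:
            if run.cpu != wait.cpu or run.pid == wait.pid:
                continue  # Different CPU, or the victim itself.
            overlap = min(wait.end_us, run.end_us) - max(wait.start_us, run.start_us)
            if overlap > 0:
                totals[run.pid] += overlap
    return dict(totals)


# Invented data: PID 10 waits on CPU 0 from 100us to 400us.
VICTIM_WAITS = [Interval(cpu=0, pid=10, start_us=100, end_us=400)]
RUNNING = [
    Interval(cpu=0, pid=20, start_us=0, end_us=250),
    Interval(cpu=0, pid=30, start_us=250, end_us=500),
    Interval(cpu=1, pid=40, start_us=0, end_us=500),  # Other CPU: not an antagonist.
]

if __name__ == "__main__":
    print(antagonists(VICTIM_WAITS, RUNNING))  # {20: 150, 30: 150}
```

The nested loop is quadratic and only suitable for small inputs; a real trace would call for sorted intervals and a sweep.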
In SchedViz, round-robin scheduling appears as a sequence of fixed-size intervals in which the running thread, and the set of waiting threads, changes with each interval. The SchedViz UI makes it easy to better understand what caused this phenomenon.
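The pattern described above can be reproduced with a toy single-CPU simulation. This is a simplified sketch, assuming a fixed time quantum and threads that never block; the function name and the workload are invented, and real schedulers such as the Linux one are considerably more involved.

```python
from collections import deque


def round_robin(demands_us, quantum_us):
    """Simulate round-robin on one CPU.

    `demands_us` maps thread name -> CPU time it still needs.
    Returns a list of (start_us, end_us, running, waiting) intervals.
    """
    queue = deque(demands_us.items())
    now = 0
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run_for = min(quantum_us, remaining)
        waiting = [other for other, _ in queue]
        timeline.append((now, now + run_for, name, waiting))
        now += run_for
        if remaining > run_for:
            # Unfinished threads go to the back of the queue.
            queue.append((name, remaining - run_for))
    return timeline


if __name__ == "__main__":
    for start, end, running, waiting in round_robin({"A": 20, "B": 10, "C": 20}, 10):
        print(f"{start:3d}-{end:3d}us  running={running}  waiting={waiting}")
```

Each printed row corresponds to one fixed-size interval: the running thread changes every quantum, and the waiting set rotates along with it.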
Using SchedViz to Identify NUMA Issues: Not all cores are equal
Larger servers often have several NUMA nodes; a CPU can access a subset of memory (the DRAM local to its NUMA node) more quickly than other memory (other nodes' DRAMs). This non-uniformity is a practical consequence of growing core count, but it brings challenges.
On the one hand, a thread migrated away from the DRAM that holds most of its state will suffer, since it will then have to pay an extra tax for each DRAM access. SchedViz can help identify cases like this, making it clear when a thread has had to migrate across NUMA boundaries.
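Spotting a cross-node migration amounts to comparing the NUMA node of each CPU a thread ran on with the node of the next one. The following sketch assumes the thread's CPU history has already been extracted from a trace and that a CPU-to-node map is available (on Linux it can be read from `/sys/devices/system/node/node*/cpulist`); the two-node topology and the function name are made up for the example.

```python
def cross_node_migrations(cpu_history, cpu_to_node):
    """Return (from_cpu, to_cpu) pairs where a thread moved to a CPU
    on a different NUMA node."""
    crossings = []
    for prev_cpu, next_cpu in zip(cpu_history, cpu_history[1:]):
        if cpu_to_node[prev_cpu] != cpu_to_node[next_cpu]:
            crossings.append((prev_cpu, next_cpu))
    return crossings


# Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1.
CPU_TO_NODE = {cpu: 0 if cpu < 4 else 1 for cpu in range(8)}

if __name__ == "__main__":
    history = [0, 1, 1, 5, 4, 2]  # CPUs one thread ran on, in order.
    print(cross_node_migrations(history, CPU_TO_NODE))  # [(1, 5), (4, 2)]
```

Migrations within a node, such as CPU 0 to CPU 1 here, are deliberately ignored because the thread stays close to the same DRAM.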
On the other hand, it is important to ensure that all NUMA nodes in a system are well-balanced, so that no part of the machine is overloaded while another part sits idle.