Tag Archives: kernel

Understanding Scheduling Behavior with SchedViz

Linux kernel scheduling behavior can be a key factor in application responsiveness and system utilization. Today, we’re announcing SchedViz, a new tool for visualizing Linux kernel scheduling behavior. We’ve used it inside Google to discover many opportunities for better scheduling choices and to root-cause many latency issues.

Thread Scheduling

Modern OSs execute multiple processes concurrently, by running each for a brief burst, then switching to the next: a feature called multiprogramming. Modern processors include multiple cores, each of which can run its own thread, known as multiprocessing. When these two features are combined, a new engineering challenge emerges: when should a thread run? How long should it run, and on what processor? This thread scheduling strategy is a complex problem, and can have a significant effect on performance. In particular, threads that don't get scheduled to run can suffer starvation, which can adversely affect user-visible latencies.

In an ideal system, a simple strategy of assigning chunks of CPU-time to threads in a round-robin manner would maximize fairness by ensuring all threads are equally starved. But, of course, real systems are far from ideal, and this view of fairness may not be an appropriate performance goal. Here are a few factors that make scheduling tricky:
  • Not all threads are equally important. Each thread has a priority that specifies its importance relative to other threads. Thread priorities must be selected carefully, and the scheduler must honor those selections.
  • Not all cores are equal. The structure of the memory hierarchy can make it costly to shift a thread from one core to another, especially if that shift moves it to a new NUMA node. Users can explicitly pin threads to a CPU or a set of CPUs, or can exclude threads from specific CPUs, using features like sched_setaffinity or cgroups. But such restrictions also make scheduling even tougher.
  • Not all threads want to run all the time. Threads may sleep waiting for some event, yielding their core to other execution. When the event occurs, pending threads should be scheduled quickly.
SchedViz permits you to observe real scheduling behavior. Comparing this with the expected or desired behavior can point to specific problems and possible solutions.

Tracepoints and Kernel Tracing

The Linux kernel is instrumented with hooks called tracepoints; when certain actions occur, any code hooked to the relevant tracepoint is called with arguments that describe the action. The kernel also provides a debug feature that can trace this data and stream it to a buffer for later analysis.

Hundreds of different tracepoints exist, arranged into families of related function. The sched family includes tracepoints that can reconstruct thread scheduling behavior—when threads switched in, blocked on some event, or migrated between cores. These sched tracepoints provide fine-grained and comprehensive detail about thread scheduling behavior over a short period of traced execution.

SchedViz: Visualize Thread Scheduling Over Time

SchedViz provides an easy way to gather kernel scheduling traces from hosts, and visualize those traces over time. Tracing is simple:
$ sudo ./trace.sh -capture_seconds 5 -out ~/traces
Then, importing the resulting collection into SchedViz takes just one click.


Once imported, a collection will always be available for later viewing, until you delete it.

The SchedViz UI displays collections in several ways. A zoomable and pannable heatmap shows system cores on the y-axis, and the trace duration on the x-axis. Each core in the system has a swim-lane, and each swim-lane shows CPU utilization (when that CPU is being kept busy) and wait-queue depth (how many threads are waiting to run on that CPU.) The UI also includes a thread list that displays which threads were active in the heatmap, along with how long they ran, waited to run, and blocked on some event, and how many times they woke up or migrated between cores. Individual threads can be selected to show their behavior over time, or expanded to see their details.

Using SchedViz to Identify Antagonisms: Not all threads are equally important

Antagonism describes the situation in which a victim thread is ready to run on some CPU, while another antagonist thread runs on that same CPU. Long antagonisms, or high cumulative duration of antagonisms, can degrade user experience or system efficiency by making a critical process unavailable at critical times.

Antagonist analysis is useful when threads are meant to have exclusive access to some core but don’t get it. In SchedViz, such antagonisms are listed in each thread’s summary, as well as being immediately visible as breaks in the victim thread's running bar. Zooming in reveals exactly what work is interfering.

Several antagonisms affect a thread that wants its CPU exclusively.
Root-causing an antagonism via zooming in.

Round-robin queueing, in which two or more threads, each wanting to run most or all of the time, occupy a single CPU for a period of time, also yields antagonisms. In this case, the scheduler attempts to avert starvation by giving multiple threads short time-slices to run in a round-robin manner. This reduces the throughput of affected threads while introducing often-significant, repeating, latencies. It is a sign that some portion of the system is overloaded.

In SchedViz, round-robin scheduling appears as a sequence of fixed-size intervals in which the running thread, and the set of waiting threads, changes with each interval. The SchedViz UI makes it easy to better understand what caused this phenomenon.

An overloaded CPU with two threads engaged in round-robin queueing. Running intervals are shown as ovals at top; waiting intervals as rectangles at bottom.
Zooming out and viewing more CPUs reveals that round-robin queueing started when a thread migrated into the overtaxed CPU.

Using SchedViz to Identify NUMA Issues: Not all cores are equal

Larger servers often have several NUMA nodes; a CPU can access a subset of memory (the DRAM local to its NUMA node) more quickly than other memory (other nodes' DRAMs). This non-uniformity is a practical consequence of growing core count, but it brings challenges.

On the one hand, a thread migrated away from the DRAM that holds most of its state will suffer, since it will then have to pay an extra tax for each DRAM access. SchedViz can help identify cases like this, making it clear when a thread has had to migrate across NUMA boundaries.

On the other hand, it is important to ensure that all NUMA nodes in a system are well-balanced, lest part of the machine is overloaded while another part of the machine sits idle.

A thread (in yellow) risks higher-latency memory accesses as it migrates across NUMA nodes.
A system risks both under-utilization and increased latency due to NUMA imbalance.

Beyond Scheduling

Many issues can be identified and explored using only sched tracepoints. But, there are many tracepoints, reflecting a wide variety of phenomena. Many of these tracepoints go well with scheduling data. For example, irq events can reveal when thread running time is spent handling interrupts; sys events can help reveal when execution moves into the kernel, and what it’s doing there; and workqueue events can show when kernel work is underway, and what work is being done. SchedViz presently offers limited support for visualizing these non-sched tracepoint families, but improving that support is an active area of development for us.

Android Protected Confirmation: Taking transaction security to the next level

Posted by Janis Danisevskis, Information Security Engineer, Android Security

In Android Pie, we introduced Android Protected Confirmation, the first major mobile OS API that leverages a hardware protected user interface (Trusted UI) to perform critical transactions completely outside the main mobile operating system. This Trusted UI protects the choices you make from fraudulent apps or a compromised operating system. When an app invokes Protected Confirmation, control is passed to the Trusted UI, where transaction data is displayed and user confirmation of that data's correctness is obtained.

Once confirmed, your intention is cryptographically authenticated and unforgeable when conveyed to the relying party, for example, your bank. Protected Confirmation increases the bank's confidence that it acts on your behalf, providing a higher level of protection for the transaction.

Protected Confirmation also adds additional security relative to other forms of secondary authentication, such as a One Time Password or Transaction Authentication Number. These mechanisms can be frustrating for mobile users and also fail to protect against a compromised device that can corrupt transaction data or intercept one-time confirmation text messages.

Once the user approves a transaction, Protected Confirmation digitally signs the confirmation message. Because the signing key never leaves the Trusted UI's hardware sandbox, neither app malware nor a compromised operating system can fool the user into authorizing anything. Protected Confirmation signing keys are created using Android's standard AndroidKeyStore API. Before it can start using Android Protected Confirmation for end-to-end secure transactions, the app must enroll the public KeyStore key and its Keystore Attestation certificate with the remote relying party. The attestation certificate certifies that the key can only be used to sign Protected Confirmations.

There are many possible use cases for Android Protected Confirmation. At Google I/O 2018, the What's new in Android security session showcased partners planning to leverage Android Protected Confirmation in a variety of ways, including Royal Bank of Canada person to person money transfers; Duo Security, Nok Nok Labs, and ProxToMe for user authentication; and Insulet Corporation and Bigfoot Biomedical, for medical device control.

Insulet, a global leading manufacturer of tubeless patch insulin pumps, has demonstrated how they can modify their FDA cleared Omnipod DASH TM Insulin management system in a test environment to leverage Protected Confirmation to confirm the amount of insulin to be injected. This technology holds the promise for improved quality of life and reduced cost by enabling a person with diabetes to leverage their convenient, familiar, and secure smartphone for control rather than having to rely on a secondary, obtrusive, and expensive remote control device. (Note: The Omnipod DASH™ System is not cleared for use with Pixel 3 mobile device or Protected Confirmation).

This work is fulfilling an important need in the industry. Since smartphones do not fit the mold of an FDA approved medical device, we've been working with FDA as part of DTMoSt, an industry-wide consortium, to define a standard for phones to safely control medical devices, such as insulin pumps. A technology like Protected Confirmation plays an important role in gaining higher assurance of user intent and medical safety.

To integrate Protected Confirmation into your app, check out the Android Protected Confirmation training article. Android Protected Confirmation is an optional feature in Android Pie. Because it has low-level hardware dependencies, Protected Confirmation may not be supported by all devices running Android Pie. Google Pixel 3 and 3XL devices are the first to support Protected Confirmation, and we are working closely with other manufacturers to adopt this market-leading security innovation on more devices.

Control Flow Integrity in the Android kernel

Posted by Sami Tolvanen, Staff Software Engineer, Android Security

Android's security model is enforced by the Linux kernel, which makes it a tempting target for attackers. We have put a lot of effort into hardening the kernel in previous Android releases and in Android 9, we continued this work by focusing on compiler-based security mitigations against code reuse attacks.

Google's Pixel 3 will be the first Android device to ship with LLVM's forward-edge Control Flow Integrity (CFI) enforcement in the kernel, and we have made CFI support available in Android kernel versions 4.9 and 4.14. This post describes how kernel CFI works and provides solutions to the most common issues developers might run into when enabling the feature.

Protecting against code reuse attacks

A common method of exploiting the kernel is using a bug to overwrite a function pointer stored in memory, such as a stored callback pointer or a return address that had been pushed to the stack. This allows an attacker to execute arbitrary parts of the kernel code to complete their exploit, even if they cannot inject executable code of their own. This method of gaining code execution is particularly popular with the kernel because of the huge number of function pointers it uses, and the existing memory protections that make code injection more challenging.

CFI attempts to mitigate these attacks by adding additional checks to confirm that the kernel's control flow stays within a precomputed graph. This doesn't prevent an attacker from changing a function pointer if a bug provides write access to one, but it significantly restricts the valid call targets, which makes exploiting such a bug more difficult in practice.

Figure 1. In an Android device kernel, LLVM's CFI limits 55% of indirect calls to at most 5 possible targets and 80% to at most 20 targets.

Gaining full program visibility with Link Time Optimization (LTO)

In order to determine all valid call targets for each indirect branch, the compiler needs to see all of the kernel code at once. Traditionally, compilers work on a single compilation unit (source file) at a time and leave merging the object files to the linker. LLVM's solution to CFI is to require the use of LTO, where the compiler produces LLVM-specific bitcode for all C compilation units, and an LTO-aware linker uses the LLVM back-end to combine the bitcode and compile it into native code.

Figure 2. A simplified overview of how LTO works in the kernel. All LLVM bitcode is combined, optimized, and generated into native code at link time.

Linux has used the GNU toolchain for assembling, compiling, and linking the kernel for decades. While we continue to use the GNU assembler for stand-alone assembly code, LTO requires us to switch to LLVM's integrated assembler for inline assembly, and either GNU gold or LLVM's own lld as the linker. Switching to a relatively untested toolchain on a huge software project will lead to compatibility issues, which we have addressed in our arm64 LTO patch sets for kernel versions 4.9 and 4.14.

In addition to making CFI possible, LTO also produces faster code due to global optimizations. However, additional optimizations often result in a larger binary size, which may be undesirable on devices with very limited resources. Disabling LTO-specific optimizations, such as global inlining and loop unrolling, can reduce binary size by sacrificing some of the performance gains. When using GNU gold, the aforementioned optimizations can be disabled with the following additions to LDFLAGS:

LDFLAGS += -plugin-opt=-inline-threshold=0 \
           -plugin-opt=-unroll-threshold=0

Note that flags to disable individual optimizations are not part of the stable LLVM interface and may change in future compiler versions.

Implementing CFI in the Linux kernel

LLVM's CFI implementation adds a check before each indirect branch to confirm that the target address points to a valid function with a correct signature. This prevents an indirect branch from jumping to an arbitrary code location and even limits the functions that can be called. As C compilers do not enforce similar restrictions on indirect branches, there were several CFI violations due to function type declaration mismatches even in the core kernel that we have addressed in our CFI patch sets for kernels 4.9 and 4.14.

Kernel modules add another complication to CFI, as they are loaded at runtime and can be compiled independently from the rest of the kernel. In order to support loadable modules, we have implemented LLVM's cross-DSO CFI support in the kernel, including a CFI shadow that speeds up cross-module look-ups. When compiled with cross-DSO support, each kernel module contains information about valid local branch targets, and the kernel looks up information from the correct module based on the target address and the modules' memory layout.

Figure 3. An example of a cross-DSO CFI check injected into an arm64 kernel. Type information is passed in X0 and the target address to validate in X1.

CFI checks naturally add some overhead to indirect branches, but due to more aggressive optimizations, our tests show that the impact is minimal, and overall system performance even improved 1-2% in many cases.

Enabling kernel CFI for an Android device

CFI for arm64 requires clang version >= 5.0 and binutils >= 2.27. The kernel build system also assumes that the LLVMgold.so plug-in is available in LD_LIBRARY_PATH. Pre-built toolchain binaries for clang and binutils are available in AOSP, but upstream binaries can also be used.

The following kernel configuration options are needed to enable kernel CFI:

CONFIG_LTO_CLANG=y
CONFIG_CFI_CLANG=y

Using CONFIG_CFI_PERMISSIVE=y may also prove helpful when debugging a CFI violation or during device bring-up. This option turns a violation into a warning instead of a kernel panic.

As mentioned in the previous section, the most common issue we ran into when enabling CFI on Pixel 3 were benign violations caused by function pointer type mismatches. When the kernel runs into such a violation, it prints out a runtime warning that contains the call stack at the time of the failure, and the call target that failed the CFI check. Changing the code to use a correct function pointer type fixes the issue. While we have fixed all known indirect branch type mismatches in the Android kernel, similar problems may be still found in device specific drivers, for example.

CFI failure (target: [<fffffff3e83d4d80>] my_target_function+0x0/0xd80):
------------[ cut here ]------------
kernel BUG at kernel/cfi.c:32!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
…
Call trace:
…
[<ffffff8752d00084>] handle_cfi_failure+0x20/0x28
[<ffffff8752d00268>] my_buggy_function+0x0/0x10
…

Figure 4. An example of a kernel panic caused by a CFI failure.

Another potential pitfall are address space conflicts, but this should be less common in driver code. LLVM's CFI checks only understand kernel virtual addresses and any code that runs at another exception level or makes an indirect call to a physical address will result in a CFI violation. These types of failures can be addressed by disabling CFI for a single function using the __nocfi attribute, or even disabling CFI for entire code files using the $(DISABLE_CFI) compiler flag in the Makefile.

static int __nocfi address_space_conflict()
{
      void (*fn)(void);
 …
/* branching to a physical address trips CFI w/o __nocfi */
 fn = (void *)__pa_symbol(function_name);
      cpu_install_idmap();
      fn();
      cpu_uninstall_idmap();
 …
}

Figure 5. An example of fixing a CFI failure caused by an address space conflict.

Finally, like many hardening features, CFI can also be tripped by memory corruption errors that might otherwise result in random kernel crashes at a later time. These may be more difficult to debug, but memory debugging tools such as KASAN can help here.

Conclusion

We have implemented support for LLVM's CFI in Android kernels 4.9 and 4.14. Google's Pixel 3 will be the first Android device to ship with these protections, and we have made the feature available to all device vendors through the Android common kernel. If you are shipping a new arm64 device running Android 9, we strongly recommend enabling kernel CFI to help protect against kernel vulnerabilities.

LLVM's CFI protects indirect branches against attackers who manage to gain access to a function pointer stored in kernel memory. This makes a common method of exploiting the kernel more difficult. Our future work involves also protecting function return addresses from similar attacks using LLVM's Shadow Call Stack, which will be available in an upcoming compiler release.

Hardening the Kernel in Android Oreo

Posted by Sami Tolvanen, Senior Software Engineer, Android Security

The hardening of Android's userspace has increasingly made the underlying Linux kernel a more attractive target to attackers. As a result, more than a third of Android security bugs were found in the kernel last year. In Android 8.0 (Oreo), significant effort has gone into hardening the kernel to reduce the number and impact of security bugs.

Android Nougat worked to protect the kernel by isolating it from userspace processes with the addition of SELinux ioctl filtering and requiring seccomp-bpf support, which allows apps to filter access to available system calls when processing untrusted input. Android 8.0 focuses on kernel self-protection with four security-hardening features backported from upstream Linux to all Android kernels supported in devices that first ship with this release.

Hardened usercopy

Usercopy functions are used by the kernel to transfer data from user space to kernel space memory and back again. Since 2014, missing or invalid bounds checking has caused about 45% of Android's kernel vulnerabilities. Hardened usercopy adds bounds checking to usercopy functions, which helps developers spot misuse and fix bugs in their code. Also, if obscure driver bugs slip through, hardening these functions prevents the exploitation of such bugs.

This feature was introduced in the upstream kernel version 4.8, and we have backported it to Android kernels 3.18 and above.

int buggy_driver_function(void __user *src, size_t size)
{
    /* potential size_t overflow (don’t do this) */
    u8 *buf = kmalloc(size * N, GPF_KERNEL);
    …
    /* results in buf smaller than size, and a heap overflow */
    if (copy_from_user(buf, src, size))
    return -EFAULT;

    /* never reached with CONFIG_HARDENED_USERCOPY=y */
}

An example of a security issue that hardened usercopy prevents.

Privileged Access Never (PAN) emulation

While hardened usercopy functions help find and mitigate security issues, they can only help if developers actually use them. Currently, all kernel code, including drivers, can access user space memory directly, which can lead to various security issues.

To mitigate this, CPU vendors have introduced features such as Supervisor Mode Access Prevention (SMAP) in x86 and Privileged Access Never (PAN) in ARM v8.1. These features prevent the kernel from accessing user space directly and ensure developers go through usercopy functions. Unfortunately, these hardware features are not yet widely available in devices that most Android users have today.

Upstream Linux introduced software emulation for PAN in kernel version 4.3 for ARM and 4.10 in ARM64. We have backported both features to Android kernels starting from 3.18.

Together with hardened usercopy, PAN emulation has helped find and fix bugs in four kernel drivers in Pixel devices.

int buggy_driver_copy_data(struct mydata *src, void __user *ptr)
{
    /* failure to keep track of user space pointers */
    struct mydata *dst = (struct mydata *)ptr;
    …
    /* read/write from/to an arbitrary user space memory location */
    dst->field = … ;    /* use copy_(from|to)_user instead! */
    …
    /* never reached with PAN (emulation) or SMAP */
}

An example of a security issue that PAN emulation mitigates.

Kernel Address Space Layout Randomization (KASLR)

Android has included support for Address Space Layout Randomization (ASLR) for years. Randomizing memory layout makes code reuse attacks probabilistic and therefore more difficult for an attacker to exploit, especially remotely. Android 8.0 brings this feature to the kernel. While Linux has supported KASLR on x86 since version 3.14, KASLR for ARM64 has only been available upstream since Linux 4.6. Android 8.0 makes KASLR available in Android kernels 4.4 and newer.

KASLR helps mitigate kernel vulnerabilities by randomizing the location where kernel code is loaded on each boot. On ARM64, for example, it adds 13–25 bits of entropy depending on the memory configuration of the device, which makes code reuse attacks more difficult.

Post-init read-only memory

The final hardening feature extends existing memory protections in the kernel by creating a memory region that's marked read-only after the kernel has been initialized. This makes it possible for developers to improve protection on data that needs to be writable during initialization, but shouldn't be modified after that. Having less writable memory reduces the internal attack surface of the kernel, making exploitation harder.

Post-init read-only memory was introduced in upstream kernel version 4.6 and we have backported it to Android kernels 3.18 and newer. While we have applied these protections to some data structures in the core kernel, this feature is extremely useful for developers working on kernel drivers.

Conclusion

Android Oreo includes mitigations for the most common source of security bugs in the kernel. This is especially relevant because 85% of kernel security bugs in Android have been in vendor drivers that tend to get much less scrutiny. These updates make it easier for driver developers to discover common bugs during development, stopping them before they can reach end user devices.