Kubernetes Security is constantly evolving - keeping pace with enhanced functionality, usability and flexibility while also balancing the security needs of a wide and diverse set of use-cases.
Recently, the GKE Security team discovered a high severity vulnerability that allowed workloads to have access to parts of the host filesystem outside the mounted volumes boundaries. Although the vulnerability was patched back in September we thought it would be beneficial to write up a more in-depth analysis of the issue to share with the community.
We assessed the impact of the vulnerability as described in vulnerability management in open-source Kubernetes and worked closely with the GKE Storage team and the Kubernetes Security Response Committee to find a fix. In this post we’ll give some background on how the subpath storage system works, an overview of the vulnerability, the steps to find the root cause and the fix, and finally some recommendations for GKE and Anthos users.
Kubernetes Filesystems: Intro to Volume Subpath
The vulnerability, CVE-2021-25741, was caused by a race condition during the creation of a subpath bind mount inside a container, and allowed an attacker to gain unauthorized access to the underlying node filesystem and its sensitive files. We’ll describe how that system is supposed to work, and then talk about the vulnerability.
The volume subpath feature in Kubernetes enables sharing a volume in multiple containers inside a pod. For example, we could create a Pod with an InitContainer that creates directories with pre-populated data in a mounted filesystem volume. These directories can then be used by containers in the same Pod by mounting the same volume and optionally specifying a subpath field to limit what's visible inside the container.
While there are some great use cases for this feature, it’s an area that has had vulnerabilities discovered in the past. The kubelet must be extra cautious when handling user-owned subpaths because it operates with privileges in the host. One vulnerability that has been previously discovered involved the creation of a malicious workload where an InitContainer would create a symlink pointing to any location in the host. For example, the InitContainer could mount a volume in /mnt and create a symlink /mnt/attack inside the container pointing to /etc. Later in the Pod lifecycle, another container would attempt to mount the same volume with subpath attack. While preparing the volumes for the container, the kubelet would end up following the symlink to the host’s /etc instead of the container’s /etc, unknowingly exposing the host filesystem to the container. A previous fix made sure that the subpath mount location is resolved and validated to point to a location inside the base volume and that it's not changeable by the user in between the time the path was validated and when the container runtime bind mounts it. This race condition is known as time of check to time of use (TOCTOU) where the subject being validated changes after it has been validated.
These validations and others are summarized in the following container lifecycle sequence diagram.
Volume subpath validations before the container startup
A New TOCTOU Vulnerability: CVE-2021-25741
The latest vulnerability was discovered by performing a symlink attack similar to the one explained above, with the difference being that it constantly swapped the symlink with a directory in a tight loop, using the RENAME_EXCHANGE option with renameat(2). If the timing is just right, the kubelet will see the path as a directory and pass the validation check. Then the mount utility may find that the path is a symlink pointing to the host and follow it, exposing the host filesystem to the container. This is visualized in the following diagram:
The GKE Security and Storage teams worked closely to revise the fix done previously to find a solution. The previous fix takes several steps to ensure that the directory being mounted is safely opened and validated. After the file is opened and validated, the kubelet uses the magic-link path under /proc/[pid]/fd directory for all subsequent operations to ensure the file remains unchanged. However, we found out that all of the efforts were undone by the mount(8) linux utility which was dereferencing the procfs magic-link by default. Once the problem was understood, the fix involved making sure that the mount utility doesn't dereference the magic-links by using the --no-canonicalize flag in the mount command.
The fix is in
GKE released a Google Kubernetes Engine security bulletin on this vulnerability, which detailed what customers can do to immediately remediate this issue across GKE and Anthos. We also provided guidance to customers who manually manage their node versions, ensuring that fixed releases were available in every region for our Static and Release Channels.
Google continues to invest heavily in the security of GKE and Kubernetes. We encourage users interested in finding vulnerabilities to participate in the Kubernetes bug bounty program and in the Google Vulnerability Rewards Program (VRP) which was recently expanded to cover GKE vulnerabilities. For the latest guidance on security issues, please follow our GKE Security Bulletins.