
Kubernetes 1.29 is available in the Regular channel of GKE

Kubernetes 1.29 has been available in the GKE Regular Channel since January 26th, and in the Rapid Channel since January 11th, less than 30 days after the OSS release! For more information about the content of Kubernetes 1.29, read the Kubernetes 1.29 Release Notes.

New Features

Using CEL for Validating Admission Policy

Validating admission policies offer a declarative, in-process alternative to validating admission webhooks.

Validating admission policies use the Common Expression Language (CEL) to declare the validation rules of a policy. Validating admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators. [source]

Validating Admission Policy graduates to beta in 1.29. We are especially excited about the work that Googlers Cici Huang, Joe Betz, and Jiahui Feng have led in this release to get to the beta milestone. As we move toward v1, we are actively working to ensure scalability and would appreciate any end-user feedback. [public doc here for those interested]

You can opt into the beta of the ValidatingAdmissionPolicy feature by enabling the beta APIs.
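To make the shape of the API concrete, here is a minimal sketch of a policy and its binding using the v1beta1 API; the resource names and the replica limit are purely illustrative:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: demo-replica-limit            # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 5"
    message: "replica count must not exceed 5"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: demo-replica-limit-binding    # illustrative name
spec:
  policyName: demo-replica-limit
  validationActions: [Deny]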

InitContainers as a Sidecar

InitContainers can now be configured as sidecar containers and kept running alongside the normal containers in a Pod. This is only supported by nodes running version 1.29 or later, so ensure all nodes in a cluster are at version 1.29 or later before using this feature in Pods. The feature was long awaited, as evidenced by the fact that Istio has already widely tested it, and the Istio community is working hard to make sure it can be enabled early with minimal disruption for clusters with older nodes. You can participate in the discussion here.
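Concretely, a sidecar is declared by setting restartPolicy: Always on an init container. The sketch below is a minimal illustration; the names and images are made up:

apiVersion: v1
kind: Pod
metadata:
  name: batch-job-with-sidecar        # illustrative name
spec:
  restartPolicy: Never
  initContainers:
  - name: log-forwarder               # keeps running alongside the main container
    image: example.com/log-forwarder:latest   # illustrative image
    restartPolicy: Always             # marks this init container as a sidecar
  containers:
  - name: job
    image: example.com/batch-job:latest       # illustrative image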

A big driver for delivering the feature is the growing number of AI/ML workloads, which are often represented by Pods that run to completion. Those Pods need infrastructure sidecars - Istio and GCSFuse are examples of this - and Google recognizes this trend.

Implementation of sidecar containers is, and continues to be, a community effort. We are proud to highlight that Googler Sergey Kanzhelev is driving it via the Sidecar working group, and it was a great effort by many other Googlers to make sure this KEP landed so fast. John Howard made sure the early versions of the implementation were tested with Istio, Wojciech Tyczyński ensured a safe rollout via production readiness review, Tim Hockin spent many hours on the API review of the feature, and Clayton Coleman gave advice and helped with code reviews.

New APIs

API Priority and Fairness/Flow Control

We are super excited to share that API Priority and Fairness graduated to Stable V1 / GA in 1.29! Controlling the behavior of the Kubernetes API server in an overload situation is a key task for cluster administrators, and this is what APF addresses. This ambitious project was initiated by Googler and founding API Machinery SIG lead Daniel Smith, and expanded to become a community-wide effort. Special thanks to Googler Wojciech Tyczyński and API Machinery members Mike Spreitzer from IBM and Abu Kashem from RedHat for landing this critical feature in Kubernetes 1.29 (more details in the Kubernetes publication). In GKE we tested and adopted it early: in fact, any version above 1.26.4 sets higher kubelet QPS values, trusting the API server to handle the load gracefully.
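With the v1 graduation, priority levels and flow schemas are declared through the stable flowcontrol.apiserver.k8s.io/v1 group. The following is a minimal sketch only; the object names, shares, and queue sizes are illustrative, not recommendations:

apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: demo-priority                 # illustrative name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20      # relative share of the API server's concurrency budget
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: demo-controller-traffic       # illustrative name
spec:
  priorityLevelConfiguration:
    name: demo-priority
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: demo-controller         # illustrative ServiceAccount
        namespace: demo
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]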

Deprecations and Removals

  • The previously deprecated v1beta2 Priority and Fairness APIs are no longer served in 1.29, so update usage to v1beta3 before upgrading to 1.29.
  • With the API Priority and Fairness graduation to v1, the v1beta3 Priority and Fairness APIs are newly deprecated in 1.29, and will no longer be served in 1.32.
  • In the Node API, take a look at the changes to the status.kubeProxyVersion field, which will not be populated starting in v1.33. The field is currently populated with the kubelet version, not the kube-proxy version, and might not accurately reflect the kube-proxy version in use. For more information, see KEP-4004.
  • 1.29 removed support for the insecure SHA1 algorithm. To prevent impact on your clusters, you must replace incompatible certificates of webhook servers and extension API servers before upgrading your clusters to version 1.29.
    • GKE will not auto-upgrade clusters with webhook backends using incompatible certificates to 1.29 until you replace the certificates or until version 1.28 reaches end of life. For more information refer to the official GKE documentation.
  • The Ceph CephFS (kubernetes.io/cephfs) and RBD (kubernetes.io/rbd) volume plugins have been deprecated since 1.28 and will be removed in a future release.

Shoutout to the Production Readiness Review (PRR) team

For each new Kubernetes release, a dedicated subgroup of SIG Architecture, composed of very senior contributors in the Kubernetes community, conducts Production Readiness Reviews, going through each feature.

  • OSS Production Readiness Reviews (PRR) reduce toil for all the different Cloud Providers, by shifting the effort onto OSS developers.
  • OSS Production Readiness Reviews surface production safety, observability, and scalability issues with OSS features at design time, when it is still possible to affect the outcomes.
  • By ensuring feature gates, solid enable → disable → enable testing, and attention to upgrade and rollout considerations, OSS Production Readiness Reviews enable rapid mitigation of failures in new features.

As part of this group, we want to thank Googlers John Belamaric and Wojciech Tyczyński for doing this remarkable heavy lifting on non-shiny and often invisible work. Additionally, we’d like to congratulate Googler Joe Betz, who recently graduated as a new PRR reviewer after shadowing the process throughout 2023.

By Jordan Liggitt, Jago Macleod, Sergey Kanzhelev, and Federico Bongiovanni – Google Kubernetes Kernel team

Kubernetes CRD Validation Using CEL

Motivation

CRDs have built-in support for two major categories of validation:

  • CRD structural schemas: Provide type checking of custom resources against schemas.
  • OpenAPIv3 validation rules: Provide regex ('pattern' property), range limits ('minimum' and 'maximum' properties) on individual fields and size limits on maps and lists ('minItems', 'maxItems').

For use cases that cannot be covered by the built-in validation support, the options were:

  • Admission webhooks: run a validating admission webhook for further validation
  • Custom validators: write custom checks in other languages, such as Rego

While admission webhooks do support CRD validation, they significantly complicate the development and operability of CRDs.

To provide self-contained, in-process validation, an inline expression language - the Common Expression Language (CEL) - was introduced into CRDs so that a much larger portion of validation use cases can be solved without the use of webhooks.

CEL is sufficiently lightweight and safe to run directly in the kube-apiserver, has a straightforward and unsurprising grammar, and supports pre-parsing and type checking of expressions, allowing syntax and type errors to be caught at CRD registration time.


CRD validation rule

CRD validation rules, which validate custom resources using CEL expressions, were promoted to GA in Kubernetes 1.29.

Validation rules use the Common Expression Language (CEL) to validate custom resource values. Validation rules are included in CustomResourceDefinition schemas using the x-kubernetes-validations extension.

Each rule is scoped to the location of the x-kubernetes-validations extension in the schema, and the self variable in the CEL expression is bound to the scoped value.

All validation rules are scoped to the current object: no cross-object or stateful validation rules are supported.

For example:

...
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        x-kubernetes-validations:
          - rule: "self.minReplicas <= self.replicas"
            message: "replicas should be greater than or equal to minReplicas."
          - rule: "self.replicas <= self.maxReplicas"
            message: "replicas should be smaller than or equal to maxReplicas."
        properties:
          ...
          minReplicas:
            type: integer
          replicas:
            type: integer
          maxReplicas:
            type: integer
        required:
          - minReplicas
          - replicas
          - maxReplicas

will reject a request to create this custom resource:

apiVersion: "stable.example.com/v1"


kind: CronTab


metadata:


 name: my-new-cron-object


spec:


 minReplicas: 0


 replicas: 20


 maxReplicas: 10

with the response:

The CronTab "my-new-cron-object" is invalid:
* spec: Invalid value: map[string]interface {}{"maxReplicas":10, "minReplicas":0, "replicas":20}: replicas should be smaller than or equal to maxReplicas.

x-kubernetes-validations can contain multiple rules. The rule field holds the expression that CEL will evaluate, and the message field holds the message displayed when validation fails.

Note: You can quickly test CEL expressions in CEL Playground.

Validation rules are compiled when CRDs are created or updated. The CRD create or update request fails if compilation of the validation rules fails. The compilation process includes type checking as well.

Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:

  • self.minReplicas <= self.replicas – Validate that an integer field is less than or equal to another integer field.
  • 'Available' in self.stateCounts – Validate that an entry with the 'Available' key exists in a map.
  • self.set1.all(e, !(e in self.set2)) – Validate that the elements of two sets are disjoint.
  • self == oldSelf – Validate that a required field is immutable once it is set.
  • self.created + self.ttl < self.expired – Validate that the 'expired' date is after the 'created' date plus the 'ttl' duration.

Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.


CRD transition rules

Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.

Transition rule examples:

  • self == oldSelf – For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset.
  • has(self.field) == has(oldSelf.field) on the parent of the field, combined with self == oldSelf on the field itself – Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler).
  • self.all(x, x in oldSelf) – Only allow adding items to a field that represents a set (prevent removals).
  • self >= oldSelf – Validate that a number is monotonically increasing.
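For example, making a single field immutable once it is set looks roughly like this inside a CRD schema (a minimal sketch; the field name is purely illustrative):

    properties:
      databaseName:                   # illustrative field name
        type: string
        x-kubernetes-validations:
        - rule: "self == oldSelf"
          message: "databaseName is immutable once set"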

Using the Functions Libraries

Validation rules have access to several function libraries, including the CEL standard library, the CEL string extension library, and Kubernetes-specific libraries for working with lists, regular expressions, and URLs.

Examples of function libraries in use:

  • !(self.getDayOfWeek() in [0, 6]) – Validate that a date is not a Sunday or Saturday.
  • isURL(self) && url(self).getHostname() in ['a.example.com', 'b.example.com'] – Validate that a URL has an allowed hostname.
  • self.map(x, x.weight).sum() == 1 – Validate that the weights of a list of objects sum to 1.
  • int(self.find('^[0-9]*')) < 100 – Validate that a string starts with a number less than 100.
  • self.isSorted() – Validate that a list is sorted.

Resource use and limits

To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.


Estimated cost limit

CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n²) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.

Good practice:

Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.
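For example, bounding a list of strings in a CRD schema might look like this (a minimal sketch; the field name and limits are illustrative):

      tags:
        type: array
        maxItems: 50        # bounds the worst-case cost of any rule that iterates over tags
        items:
          type: string
          maxLength: 63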


Runtime cost limits for CRD validation rules

In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.

With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.

CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.


Adoption and Related work

This feature has been turned on by default since Kubernetes 1.25 and finally graduated to GA in Kubernetes 1.29. It has raised a lot of interest and is now widely used in the Kubernetes ecosystem. We are excited to share that the Gateway API was able to replace all the validating webhooks it previously used with this feature.

Now that CEL has been introduced into Kubernetes, we are excited to expand its use to multiple areas, including the admission chain and authorization configuration. We will introduce those in a separate blog post.

We look forward to working with the community on the adoption of CRD validation rules, and to seeing how the ecosystem builds on them now that they are generally available.


Acknowledgements

Special thanks to Joe Betz, Kermit Alexander, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to CRD Validation Rules!

By Cici Huang – Software Engineer

The Story of Gateway API

Earlier this week, Gateway API v1.0 was released, marking the significant milestone of General Availability. This Kubernetes API represents the future of load balancing, routing, and service mesh configuration. It already has more than 20 implementations, including GKE and Istio. In this post, we’ll take a look back at some of the key moments that led to this point, starting with the proposal that started it all.

Initial Proposal

The core ideas for this new API were initially proposed by Bowei Du (Software Engineer, Google) at KubeCon San Diego as “Ingress v2”, the next generation of the Ingress API for Kubernetes. This proposal came as the shortcomings of the original Ingress API were becoming apparent. The community had started to develop alternative APIs, notably including Istio’s VirtualService API and Contour’s IngressRoute API. We had reached an inflection point where the Kubernetes ecosystem was diverging, and Bowei believed it was important to develop a new standard that would expose all these advanced features in a portable way.

The initial proposal for this API provided a great foundation to build on. Specifically, this proposal focused on a role-oriented model that split capabilities into resources that were aligned with 3 different personas. It emphasized both expressiveness and extensibility as core design principles. This early sketch from that proposal closely resembles the API today:

Sketch of early proposal for Ingress API

One of the key limitations of the Ingress API was that it was designed with the lowest common denominator in mind: every feature included in the API needed to be implemented by everyone. This meant that the API surface was very small, and implementations that wanted to support more advanced features either relied on long lists of implementation-specific annotations or developing new custom APIs.

Bowei proposed that this new API could introduce a concept of “support levels.” This would allow us to add features to the API even if not every implementation could support them, for example “Extended” features would be fully portable if they were supported.

Diagram showing proposed Custom API support levels

Evolution of the API

Since that initial proposal, the API has evolved significantly, benefiting from the expertise of many in the community. Gateway has been referred to as the “most collaborative API in Kubernetes history” due to the hundreds of contributors representing dozens of companies that have helped refine the API over the years.

One of the things that makes this API unique is that it is built on top of Custom Resource Definitions (CRDs). This has meant that Gateway API is developed and released outside of the main Kubernetes project, enabling broader collaboration and shorter feedback loops. For example, each new release of this API supports the 5 most recent versions of Kubernetes, covering the vast majority of clusters in use today. So, instead of waiting until you can upgrade to the latest version of Kubernetes, most will be able to try out these APIs close to the time they’re released.

As the first official Kubernetes API to take this approach, it has developed several unique concepts along the way:

GEPs

Similar to Kubernetes Enhancement Proposals (KEPs), Gateway Enhancement Proposals provide a streamlined approach for proposing significant new enhancements to Gateway API. As the API grew and attracted more contributors, it became critical to have a better way to document key design decisions. The concept of GEPs was initially proposed by Bowei in 2021.

More than 30 of these have already merged, with many more in progress right now. This pattern has been invaluable in keeping track of when and why key design decisions were made. All key parts of the API now have GEPs documenting when and why they were proposed, along with alternatives considered.

Release Channels

In 2021 we proposed a simplified approach to versioning that would introduce the concept of release channels, our own version of Kubernetes’ “feature gates”, which denote the stability of individual fields and features.

All new resources, fields, and features start in the “Experimental” release channel. As the name implies, this channel provides no stability guarantees and can include breaking changes to enable us to iterate more quickly on APIs.

As these experimental APIs stabilize, individual resources, fields, and features can graduate to the “Standard” release channel when they meet predefined graduation criteria. These two release channels enable us to both provide a stable and predictable API with the “Standard” release channel while still iterating on experimental concepts with the “Experimental” release channel.

Conformance Tests

We added the first conformance tests in 2022, before this API reached beta, and since then these tests have become a key part of every new feature in Gateway API, ensuring that implementations were truly providing a portable experience. Before a feature can graduate to the “Standard” release channel, thorough conformance tests need to be developed, and multiple implementations need to pass them.

Service Mesh Support

Earlier this year, mesh support launched its “Experimental” version, marking the first time a Kubernetes API has ever officially underpinned the concept of Service Mesh. In 2022, key Service Mesh projects came together to form the GAMMA initiative (Gateway API for Mesh Management and Administration). The core idea was that the Gateway API was sufficiently modular that the Routing and Policy layers could be used for both mesh and ingress use cases.

Trying it Out

Gateway API enables great new features on GKE, such as advanced multi-cluster routing. Yesterday GKE announced GA support for multi-cluster Gateways. In the coming weeks, GKE will also be rolling out the v1.0 CRDs for all customers that have enabled Gateway API in their clusters. In the meantime, you can access all of the same features with the v1beta1 CRDs already supported by GKE. For more information on how to get started with the Gateway API on GKE, refer to the GKE Gateway documentation.

If you’re interested in Gateway API’s support for Service Mesh, you can try it out with Anthos Service Mesh.

Alternatively, if you’d like to use this API with another implementation, refer to the open source project’s Getting Started documentation.

By Rob Scott – GKE Networking

Optimizing gVisor filesystems with Directfs

gVisor is a sandboxing technology that provides a secure environment for running untrusted code. In our previous blog post, we discussed how gVisor performance improves with a root filesystem overlay. In this post, we'll dive into another filesystem optimization that was recently launched: directfs. It gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer.

Origins of the Gofer

gVisor is used internally at Google to run a variety of services and workloads. One of the challenges we faced while building gVisor was providing remote filesystem access securely to the sandbox. gVisor’s strict security model and defense in depth approach assumes that the sandbox may get compromised because it shares the same execution context as the untrusted application. Hence the sandbox cannot be given sensitive keys and credentials to access Google-internal remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a "gofer". The gofer runs outside the sandbox and provides a secure interface for untrusted containers to access such remote filesystems. For architectural simplicity, gofers were used to serve local filesystems as well as remote ones.

Gofer process intermediates filesystem operations

Isolating the Container Filesystem in runsc

When gVisor was open sourced as runsc, the same gofer model was copied over to maintain the same security guarantees. runsc was configured to start one gofer process per container which serves the container filesystem to the sandbox over a predetermined protocol (now LISAFS). However, a gofer adds a layer of indirection with significant overhead.

This gofer model (built for remote filesystems) brings very few advantages for the runsc use-case, where all the filesystems served by the gofer (like rootfs and bind mounts) are mounted locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local filesystems. These include mount namespaces, pivot_root, and detached bind mounts1. Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner. The sandbox’s view of the filesystem tree is limited to just the container filesystem. The sandbox process is not given access to anything mounted on the broader host filesystem. Even if the sandbox gets compromised, these mechanisms provide additional barriers to prevent broader system compromise.

Directfs

In directfs mode, the gofer still exists as a cooperative process outside the sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate bind mounts to create the container filesystem in a new directory and then pivot_root(2)s into that directory. Similarly, the sandbox process enters new user and mount namespaces and then pivot_root(2)s into an empty directory to ensure it cannot access anything via path traversal. But instead of making RPCs to the gofer to access the container filesystem, the sandbox requests the gofer to provide file descriptors to all the mount points via SCM_RIGHTS messages. The sandbox then directly makes file-descriptor-relative syscalls (e.g. fstatat(2), openat(2), mkdirat(2), etc) to perform filesystem operations.

Sandbox directly accesses container filesystem with directfs

Earlier when the gofer performed all filesystem operations, we could deny all these syscalls in the sandbox process using seccomp. But with directfs enabled, the sandbox process's seccomp filters need to allow the usage of these syscalls. Most notably, the sandbox can now make openat(2) syscalls (which allow path traversal), but with certain restrictions: O_NOFOLLOW is required, no access to procfs and no directory FDs from the host. We also had to give the sandbox the same privileges as the gofer (for example CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH), so it can perform the same filesystem operations.

It is noteworthy that only the trusted gofer provides FDs (of the container filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or follow a malicious symlink to escape out of the container filesystem. In effect, we've decreased our dependence on the syscall filters to catch bad behavior, but correspondingly increased our dependence on Linux's filesystem isolation protections.

Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead to runsc. Hence, avoiding gofer round trips significantly improves performance. Let's find out what this means for some of our benchmarks. We will run the benchmarks using our newly released systrap platform on bind mounts (as opposed to rootfs). This simulates more realistic use cases, because bind mounts are extensively used when configuring filesystems in containers. Bind mounts also do not have an overlay (unlike the rootfs mount), so all operations go through the goferfs / directfs mount.

Let's first look at our stat micro-benchmark, which repeatedly calls stat(2) on a file.

Stat benchmark improvement with directfs
The stat(2) syscall is more than 2x faster! However, since this is not representative of real-world applications, we should not extrapolate these results. So let's look at some real-world benchmarks.
Real-world benchmark improvements with directfs
We see a 12% reduction in the absolute time to run these workloads and 17% reduction in Ruby load time!

Conclusion

The gofer model in runsc was overly restrictive for accessing host files. We were able to leverage existing filesystem isolation mechanisms in Linux to bypass the gofer without compromising security. Directfs significantly improves performance for certain workloads. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!
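If you want to try this on GKE, running a Pod under gVisor is mostly a matter of selecting the gvisor RuntimeClass on a cluster with GKE Sandbox enabled. A minimal sketch (the Pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app                # illustrative name
spec:
  runtimeClassName: gvisor           # runs this Pod in a gVisor sandbox on GKE Sandbox node pools
  containers:
  - name: app
    image: example.com/app:latest    # illustrative image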


1Detached bind mounts can be created by first creating a bind mount using mount(MS_BIND) and then detaching it from the filesystem tree using umount(MNT_DETACH).


By Ayush Ranjan, Software Engineer – Google

gVisor improves performance with root filesystem overlay

Overview

Container technology is an integral part of modern application ecosystems, making container security an increasingly important topic. Since containers are often used to run untrusted, potentially malicious code it is imperative to secure the host machine from the container.

A container's security depends on its security boundaries, such as user namespaces (which isolate security-related identifiers and attributes), seccomp rules (which restrict the syscalls available), and Linux Security Module configuration. Popular container management products like Docker and Kubernetes relax these and other security boundaries to increase usability, which means that users need additional container security tools to provide a much stronger isolation boundary between the container and the host.

The gVisor open source project, developed by Google, provides an OCI compatible container runtime called runsc. It is used in production at Google to run untrusted workloads securely. Runsc (run sandbox container) is compatible with Docker and Kubernetes and runs containers in a gVisor sandbox. The gVisor sandbox has an application kernel, written in Golang, that implements a substantial portion of the Linux system call interface. All application syscalls are intercepted by the sandbox and handled in the user space kernel.

Although gVisor does not introduce large fixed overheads, sandboxing does add some performance overhead to certain workloads. gVisor has made several improvements recently that help containerized applications run faster inside the sandbox, including an improvement to the container root filesystem, which we will dive deeper into.

Costly Filesystem Access in gVisor

gVisor uses a trusted filesystem proxy process (“gofer”) to access the filesystem on behalf of the sandbox. The sandbox process is considered untrusted in gVisor’s security model. As a result, it is not given direct access to the container filesystem and its seccomp filters do not allow filesystem syscalls.

In gVisor, the container rootfs and bind mounts are configured to be served by a gofer.

Gofer mounts configuration in gVisor

When the container needs to perform a filesystem operation, it makes an RPC to the gofer which makes host system calls and services the RPC. This is quite expensive due to:

  1. RPC cost: This is the cost of communicating with the gofer process, including process scheduling, message serialization and IPC system calls.
    • To ameliorate this, gVisor recently developed a purpose-built protocol called LISAFS which is much more efficient than its predecessor.
    • gVisor is also experimenting with giving the sandbox direct access to the container filesystem in a secure manner. This would essentially nullify RPC costs as it avoids the gofer being in the critical path of filesystem operations.
  2. Syscall cost: This is the cost of making the host syscall which actually accesses/modifies the container filesystem. Syscalls are expensive, because they perform context switches into the kernel and back into userspace.
    • To help with this, gVisor heavily caches the filesystem tree in memory. So operations like stat(2) on cached files are serviced quickly. But other operations like mkdir(2) or rename(2) still need to make host syscalls.

Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on the filesystem packaged with the image. The image’s filesystem is immutable. Any change a container makes to the rootfs is stored separately and is destroyed with the container. This way, the image’s filesystem can be shared efficiently with all containers running the same image. This is different from bind mounts, which allow containers to access the bound host filesystem tree. Changes to bind mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the overlay filesystem by default to configure container rootfs. Overlayfs mounts are composed of one upper layer and multiple lower layers. The overlay filesystem presents a merged view of all these filesystem layers at its mount location and ensures that lower layers are read-only while all changes are held in the upper layer. The lower layer(s) constitute the “image layer” and the upper layer is the “container layer”. When the container is destroyed, the upper layer mount is destroyed as well, discarding the root filesystem changes the container may have made. Docker’s overlayfs driver documentation has a good explanation.

Rootfs Configuration Before

Let’s consider an example where the image has files foo and baz. The container overwrites foo and creates a new file bar. The diagram below shows how the root filesystem used to be configured in gVisor earlier. We used to go through the gofer and access/mutate the overlaid directory on the host. It also shows the state of the host overlay filesystem.

Rootfs configuration in gVisor earlier

Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is expensive to access/mutate a host filesystem from the sandbox, why keep the upper layer on the host at all? Instead we can move the upper layer into the sandbox.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can use a tmpfs upper (container) layer and a read-only lower layer served by the gofer client. Any changes to rootfs would be held in tmpfs (in-memory). Accessing/mutating the upper layer would not require any gofer RPCs or syscalls to the host. This really speeds up filesystem operations on the upper layer, which contains newly created or copied-up files and directories.

Using the same example as above, the following diagram shows what the rootfs configuration would look like using a sandbox-internal overlay.

Rootfs configuration in gVisor with internal overlay

Host-Backed Overlay

The tmpfs mount by default will use the sandbox process’s memory to back all the file data in the mount. This can cause sandbox memory usage to blow up and exhaust the container’s memory limits, so it’s important to store all file data from the tmpfs upper layer on disk. To do this, we need a tmpfs-backing “filestore” on the host filesystem. Using the example from above, this filestore on the host will store file data for foo and bar.

This would essentially flatten all regular files in tmpfs into one host file. The sandbox can mmap(2) the filestore into its address space. This allows it to access and mutate the filestore very efficiently, without incurring gofer RPCs or syscalls overheads.

Self-Backed Overlay

In Kubernetes, you can set local ephemeral storage limits. The upper layer of the rootfs overlay (writeable container layer) on the host contributes towards this limit. The kubelet enforces this limit by traversing the entire upper layer, stat(2)-ing all files and summing up their stat.st_blocks*block_size. If we move the upper layer into the sandbox, then the host upper layer is empty and the kubelet will not be able to enforce these limits.

To address this issue, we introduced “self-backed” overlays, which create the filestore in the host upper layer. This way, when the kubelet scans the host upper layer, the filestore will be detected and its stat.st_blocks should be representative of the total file usage in the sandbox-internal upper layer. It is also important to hide this filestore from the containerized application to avoid confusing it. We do so by creating a whiteout in the sandbox-internal upper layer, which blocks this file from appearing in the merged directory.

The following diagram shows what rootfs configuration would finally look like today in gVisor.

Rootfs configuration in gVisor with self-backed internal overlay

Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay impacts performance. These benchmarks were run on a gLinux desktop with KVM platform.

Micro Benchmark

Linux Test Project provides a fsstress binary. This program performs a large number of filesystem operations concurrently, creating and modifying a large filesystem tree of all sorts of files. We ran this program on the container's root filesystem. The exact usage was:

sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"

You can use the -v flag (verbose mode) to see what filesystem operations are being performed.

The results were astounding! Rootfs overlay reduced the time to run this fsstress program from 262.79 seconds to 3.18 seconds! However, note that such microbenchmarks are not representative of real-world applications and we should not extrapolate these results to real-world performance.

Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source files, compile, and write out binaries and object files. Let’s consider building the abseil-cpp project with bazel. Bazel performs a lot of filesystem operations in the rootfs, in bazel’s cache located at ~/.cache/bazel/.

This is representative of the real-world because many other applications also use the container root filesystem as scratch space due to the handy property that it disappears on container exit. To make this more realistic, the abseil-cpp repo was attached to the container using a bind mount, which does not have an overlay.

When measuring performance, we care about reducing the sandboxing overhead and bringing gVisor performance as close as possible to unsandboxed performance. Sandboxing overhead can be calculated using the formula overhead = (s-n)/n where ‘s’ is the amount of time taken to run a workload inside gVisor sandbox and ‘n’ is the time taken to run the same workload natively (unsandboxed). The following graph shows that rootfs overlay halved the sandboxing overhead for abseil build!

The impact of rootfs overlay on sandboxing overhead for abseil build
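To make the overhead formula concrete with purely illustrative numbers: a build that takes 130 seconds inside the sandbox and 100 seconds natively has an overhead of (130-100)/100 = 30%; an optimization that brings the sandboxed time down to 115 seconds halves that overhead to 15%.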

Conclusion

Rootfs overlay in gVisor substantially improves performance for many filesystem-intensive workloads, so that developers no longer have to make large tradeoffs between performance and security. We recently made this optimization the default in runsc. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!

Explore the new Learn Kubernetes with Google website!

As Kubernetes has become a mainstream global technology, with 96% of organizations surveyed by the CNCF1 using or evaluating Kubernetes for production use, it is now estimated that 31%2 of backend developers worldwide are Kubernetes developers. To add to the growing popularity, the 2021 annual report1 also listed close to 60 enhancements by special interest and working groups to the Kubernetes project. With so much information in the ecosystem, how can Kubernetes developers stay on top of the latest developments and learn what to prioritize to best support their infrastructure?

The new website Learn Kubernetes with Google brings together under one roof the guidance of Kubernetes experts—both from Google and across the industry—to communicate the latest trends in building your Kubernetes infrastructure. You can access knowledge in two formats.

One option is to participate in scheduled live events, which consist of virtual panels that allow you to ask questions to experts via a Q&A forum. Virtual panels last for an hour, and happen once quarterly. So far, we’ve hosted panels on building a multi-cluster infrastructure, the Dockershim deprecation, bringing High Performance Computing (HPC) to Kubernetes, and securing your services with Istio on Kubernetes. The other option is to pick one of the multiple on-demand series available. Series are made up of several 5-10 minute episodes and you can go through them at your leisure. They cover different topics, including the Kubernetes Gateway API, the MCS API, Batch workloads, and Getting started with Kubernetes. You can use the search bar on the top right side of the website to look up specific topics.
As the cloud native ecosystem becomes increasingly complex, this website will continue to offer evergreen content for Kubernetes developers and users. We recently launched a new content category for ecosystem projects, which started by covering how to run Istio on Kubernetes. Soon, we will also launch a content category for developer tools, starting with Minikube.

Join hundreds of developers that are already part of the Learn Kubernetes with Google community! Bookmark the website, sign up for an event today, and be sure to check back regularly for new content.

By María Cruz, Program Manager – Google Open Source Programs Office

ko applies to become a CNCF sandbox project

Back in 2018, the team at Google working on Knative needed a faster way to iterate on Kubernetes controllers. They created a new tool dedicated to deploying Go applications to Kubernetes without having to worry about container images. That tool has proven to be indispensable to the Knative community, so in March 2019, Google released it as a stand-alone open source project named ko.

Since then, ko has gained in popularity as a simple, fast, and secure container image builder for Go applications. More recently, the ko community has added, amongst many other features, multi-platform support and automatic SBOM generation. Today, like the original team at Google, many open source and enterprise development teams depend on ko to improve their developer productivity. The ko project is also increasingly used as a solution for a number of build use-cases, and is being integrated into a variety of third party CI/CD tools.

At Google, we believe that using open source comes with a responsibility to contribute, sustain, and improve the projects that make our ecosystem better. To support the next phase of community-driven innovation, enable net-new adoption patterns, and to further raise the bar in the container tool industry, we are excited to announce today that we have submitted ko as a sandbox project to the Cloud Native Computing Foundation (CNCF).

This step begins the process of transferring the ko trademark, IP, and code to CNCF. We are excited to see how the broader open source community will continue innovating with ko.

By Mark Chmarny – Google Open Source Programs Office

Google Cloud & Kotlin GDE Kevin Davin helps others learn in the face of challenges

Posted by Kevin Hernandez, Developer Relations Community Manager

Kevin Davin speaking at the SnowCamp Conference in 2019

Kevin Davin has always had a passion for learning and helping others learn, no matter their background or unique challenges they may face. He explains, “I want to learn something new every day, I want to help others learn, and I’m addicted to learning.” This mantra is evident in everything he does from giving talks at numerous conferences to helping people from underrepresented groups overcome imposter syndrome and even helping them become GDEs. In addition to learning, Kevin is also passionate about diversity and inclusion efforts, partly inspired by navigating the world with partial blindness.

Kevin has been a professional programmer for 10 years now and has been in the field of Computer Science for about 20 years. Through the years, he has emphasized the importance of learning how and where to learn. For example, while he learned a lot while he was studying at a university, he was able to learn just as much through his colleagues. In fact, it was through his colleagues that he picked up lessons in teamwork and the ability to learn from people with different points of view and experience. Since he was able to learn so much from those around him, Kevin also wanted to pay it forward and started volunteering at a school for people with disabilities. Guided by the Departmental Centers for People with Disabilities, the aim of the program is to teach coding languages and reintegrate students into a technical profession. During his time at this center, Kevin helped students practice what they learned and ultimately successfully transition into a new career.

During these experiences, Kevin was always involved in the developer community through open-source projects. It was through these projects that he learned about the GDE program and was connected to Google Developer advocates. Kevin was drawn to the GDE program because he wanted to share his knowledge with others and have direct access to Google in order to become an advocate on behalf of developers. In 2016, he discovered Kubernetes and helped his company at the time move to Google Cloud. He always felt like this model was the right solution and invested a lot of time to learn it and practice it. “Google Cloud is made for developers. It’s like a Lego set because you can take the parts you want and put it together,” he remarked.

The GDE program has given him access to the things he values most: being a part of a developer community, being an advocate for developers, helping people from all backgrounds feel included, and above all, an opportunity to learn something new every day. Kevin’s parting advice for hopeful GDEs is: “Even if you can’t reach the goal of being a GDE now, you can always get accepted in the future. Don’t be afraid to fail because without failure, you won’t learn anything.” With his involvement in the program, Kevin hopes to continue connecting with the developer community and learning while supporting diversity efforts.

Learn more about Kevin on Twitter & LinkedIn.

The Google Developer Experts (GDE) program is a global network of highly experienced technology experts, influencers, and thought leaders who actively support developers, companies, and tech communities by speaking at events and publishing content.

Gateway API Graduates to Beta

For many years, Kubernetes users have wanted more advanced routing capabilities to be configurable in a Kubernetes API. With Google leadership, Gateway API has been developed to dramatically increase the number of features available. This API enables many new features in Kubernetes, including traffic splitting, header modification, and forwarding traffic to backends in different namespaces, just to name a few.
Since the project was originally proposed, Googlers have helped lead the open source efforts. Two of the top contributors to the project are from Google, while more than 10 engineers from Google have contributed to the API.

This week, the API has graduated from alpha to beta. This marks a significant milestone in the API and reflects new-found stability. There are now over a dozen implementations of the API and many are passing a comprehensive set of conformance tests. This ensures that users will have a consistent experience when using this API, regardless of environment or underlying implementation.

A Simple Example

To highlight some of the new features this API enables, it may help to walk through an example. We’ll start with a Gateway:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: store-xlb
spec:
  gatewayClassName: gke-l7-gxlb
  listeners:
  - name: http
    protocol: HTTP
    port: 80

This Gateway uses the gke-l7-gxlb Gateway Class, which means a new external load balancer will be provisioned to serve this Gateway. Of course, we still need to tell the load balancer where to send traffic. We’ll use an HTTPRoute for this:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: store
spec:
  parentRefs:
  - name: store-xlb
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: store-svc
      port: 3080
      weight: 9
    - name: store-canary-svc
      port: 3080
      weight: 1

This simple HTTPRoute tells the load balancer to route traffic to either the “store-svc” or “store-canary-svc” Service on port 3080. We’re using weights to do some basic traffic splitting here. That means that approximately 10% of requests will be routed to our canary Service.

Now, imagine that you want to provide a way for users to opt in or out of the canary service. To do that, we’ll add an additional HTTPRoute with some header matching configuration:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: store-canary-option
spec:
  parentRefs:
  - name: store-xlb
  rules:
  - matches:
    - headers:
      - name: env
        value: stable
    backendRefs:
    - name: store-svc
      port: 3080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-canary-svc
      port: 3080

This HTTPRoute works in conjunction with the first route we created. If any requests set the env header to “stable” or “canary” they’ll be routed directly to their preferred backend.

Getting Started

Unlike previous Kubernetes APIs, you don’t need to have the latest version of Kubernetes to use this API. Instead, this API is built with Custom Resource Definitions (CRDs) that can be installed in any Kubernetes cluster, as long as it is version 1.16 or newer (released almost 3 years ago).

To try this API on GKE, refer to the GKE specific installation instructions. Alternatively, if you’d like to use this API with another implementation, refer to the OSS getting started page.

What’s next for Gateway API?

As the core capabilities of Gateway API are stabilizing, new features and concepts are actively being explored. Ideas such as Route Delegation and a new GRPCRoute are deep in the design process. A new service mesh workstream has been established specifically to build consensus among mesh implementations for how this API can be used for service-to-service traffic. As with many open source projects, we’re trying to find the right balance between enabling new use cases and achieving API stability. This API has already accomplished a lot, but we’re most excited about what’s ahead.


By Rob Scott – GKE Networking