Author Archives: Open Source Programs Office

Bounds Checking Flexible Array Members

Buffer overflows are the cause of many security issues and a persistent thorn in programmers' sides. C is particularly susceptible to them. Sanitizers mitigate some of these issues by automatically inserting bounds checks, but they can't do so in all situations, in particular for flexible array members, because their size is known only at runtime.

The size of a flexible array member is typically opaque to the compiler. The alloc_size attribute on malloc() may be used for bounds checking flexible array members within the same function as the allocation. But the attribute's information isn't carried with the allocated object, making it impossible to perform bounds checking elsewhere.
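
For reference, here is a minimal sketch of that attribute (the function name is hypothetical, not from the original post): alloc_size tells the compiler which parameter carries the allocation size, so bounds information is visible at the call site, but it is not attached to the returned pointer itself.

/* Illustrative only: alloc_size(1) says the first argument is the number
 * of bytes allocated, so bounds checks can work in the caller's scope,
 * but not once the pointer is passed elsewhere. */
void *my_buf_alloc(size_t size) __attribute__((alloc_size(1)));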

To mitigate this drawback, Clang and GCC are introducing[1] the counted_by attribute for flexible array members.


Specifying a flexible array member's element count

The number of elements allocated for a flexible array member is frequently stored in another field within the same structure. The counted_by attribute, applied to the flexible array member, explicitly references the field that stores that element count. This creates an implicit relationship between the flexible array member and the count field, which the array bounds sanitizer (enabled by -fsanitize=array-bounds) uses to verify operations on the flexible array.

There are some rules to follow when using this feature. For this structure:

struct foo {
	/* ... */
	size_t count; /* Number of elements in array */
	int array[] __attribute__((counted_by(count)));
};
  • The count field must be within the same non-anonymous, enclosing struct as the flexible array member.
  • The count field must be set before any array access.
  • The array field must have at least count elements available at all times.
  • The count field may change, but must never be larger than the number of elements originally allocated.

An example allocation of the above structure:

struct foo *foo_alloc(size_t count) {
  struct foo *ptr = NULL;
  size_t size = MAX(sizeof(struct foo),
                    offsetof(struct foo, array[0]) +
                        count * sizeof(ptr->array[0]));

  ptr = calloc(1, size);
  if (ptr)
    ptr->count = count;
  return ptr;
}
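
As an illustrative sketch (foo_set is a hypothetical helper, not part of the original example), an out-of-bounds store through the flexible array member becomes a detectable runtime error once the structure above is compiled with -fsanitize=array-bounds:

void foo_set(struct foo *ptr, size_t idx, int value) {
  /* With counted_by(count) and -fsanitize=array-bounds, an access with
   * idx >= ptr->count is reported at runtime instead of silently
   * corrupting adjacent memory. */
  ptr->array[idx] = value;
}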

Uses for fortification

Fortification (enabled by the _FORTIFY_SOURCE macro) is an ongoing project to make the Linux kernel more secure. Its main focus is preventing buffer overflows on memory and string operations.

Fortification uses the __builtin_object_size() and __builtin_dynamic_object_size() builtins to try to determine if input passed into a function is valid (i.e. "safe"). A call to __builtin_dynamic_object_size() generally isn't able to take the size of a flexible array member into account. But with the counted_by attribute, we're able to calculate the size and improve safety.
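
As a rough sketch of how this helps (assuming a compiler with counted_by support, such as the versions noted in the footnote; the helper name is made up for illustration), __builtin_dynamic_object_size() can now report the remaining size of the flexible array, which fortified routines such as memcpy() can check against:

size_t foo_array_bytes(struct foo *ptr) {
  /* With counted_by(count), this should evaluate to
   * ptr->count * sizeof(int); without the attribute the compiler
   * returns (size_t)-1, meaning "unknown". */
  return __builtin_dynamic_object_size(ptr->array, 1);
}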


Uses in the Linux kernel

The counted_by attribute is already in use in the Linux kernel, and will be instrumental in catching issues like integer overflows, which led to a heap buffer overflow. We want to expand its use to more flexible array members, and enforce its use in the future.


Conclusion

The counted_by attribute helps address a long-standing fortification roadblock, where the memory bounds of a flexible array member couldn't be determined by the compiler, making Linux and other hardened applications less exploitable.

[1] In Clang v18.0.0 and GCC v15.0.0.

By Bill Wendling, Staff Software Engineer

Google Open Source Peer Bonus program recognizes first group of 2024 recipients

We are excited to announce the first group of winners for the 2024 Google Open Source Peer Bonus Program! This program is designed to recognize open source contributors who have been nominated by Google employees for their exceptional contributions to open source projects.

It is worth reiterating the importance of open source in solving real-world problems for everyone. Open collaboration via open source allows for increased innovation and sustainability and provides infrastructure that countless projects and companies rely on. It is also important to note that the progress driven by the work of open source contributors and maintainers is often overlooked. To the recipients of this award: whether this work is part of your salaried role or a free-time passion project, we hope that this small token of appreciation for your commitment spurs you to continue this work.

This cycle’s Open Source Peer Bonus Program received 159 nominations, and the 130 winners reside in 25 different countries, with work spanning 78 projects. As always, this reflects the program’s global reach and represents just a fraction of the enormous impact of open source software.

We would like to extend our congratulations to the winners! Included below are those who have agreed to be named publicly.

Winner – Open Source Project

Brandon Roberts – Analogjs
Matthieu Riegler – Angular
Thomas Laforge – Angular Challenges OSS project
Sergio Sastre – AOSP and Android-screenshot-testing-playground
Dongjoon Hyun – Apache Spark
Ellie Huxtable – Atuin
Fabian Meumertzheim – Bazel
Mark Friedman – Blockly
Kevin O'Reilly – CAPEv2
Ed Page – cargo
Daniel Olabemiwo – chakraUI - Vue
Sylvain Hellegouarch – chaostoolkit
Himanshu Singh – Charty
Daniel Kolesa – Chimera Linux
Helmut Januschka – Chromium
Botond Ballo – clangd
Best Ibitoye-Rotimi – ClassroomIO.com
Vamsi Avula – CLI Mate
Dinko Osrecki – Cloud SQL Node.js Connector
Kevin Petit – clvk
Max Heller – Comprehensive Rust
Derek McGowan – containerd
Toru Komatsu – containerd/runwasi
Pierre Rousselin – Coq
Pierre Roux – Coq
Chris O'Haver – CoreDNS
Daniel Stenberg – curl
Pokey Rule – Cursorless
Martin Prpic – cvelint
Jerry Gamblin – cvelint-action
Simon Binder – Dart
Niraj Kumar – Data Ingestion Framework (DIF)
David Vallejo – Data Layer Helper
Hajime Hoshi – Ebitengine
Dario Nieuwenhuis – Embassy
Hans de Goede – Fedora IPU6 camera support
Kate Hsuan – Fedora IPU6 camera support
Lukas Klingsbo – Flame
Alex Li – Flutter
Bartek Pacia – Flutter
Dario Klepoch – Flutter TensorFlow Lite Plugin
Richard Rausch – flutter/flutter - clean up memory leaks
Kostiantyn Sokolovskyi – flutter/flutter - clean up memory leaks
Jaydip Gabani – Gatekeeper
Qiusheng Wu – geemap
Sam James – Gentoo Hardened
Sean Liao – Go
Quim Muntal – Go
Rhys Hiltner – Go
Nick Ripley – Go
Felix Geisendörfer – Go
Cuong Manh Le – Go
Dominik Honnef – Go
Tomislav Turek – Go Linux nftables package
Mauri de Souza Meneguzzo – golang/go
KJ Tsanaktsidis – Google Protobuf
Viktor Blomqvist – Gopls, the Go language server
Antoine Tollenaere – gRPC
Jay Lee – gRPC-rs
Ridwan Hoq – GUAC
Naveen Srinivasan – GUAC
Marco Rizzi – GUAC
Jonathan Behrens – image-rs/image-png
Kilian Schulte – Jaspr
Kevin Hannon – JobSet
Yuya Nishihara – Jujutsu VCS (jj)
Diego Calleja – kernelnewbies.org
Kai-Hsun Chen – kuberay
Filip Křepinský – Kubernetes
David Eads – Kubernetes
Marko Mudrinić – Kubernetes
Yuki Iwai – Kueue
Dmytro Liubarskyi – LangChain4j
Iryna Liubarska – LangChain4j
Volodymyr Agafonkin – Leaflet
Yuan Tong – libavif
Laurent Pinchart – libcamera
Eli Schwartz – liblc3
Jason Gunthorpe – Linux Kernel
Christian Brauner – Linux Kernel
Jan Kara – Linux Kernel
Kuniyuki Iwashima – Linux Kernel
Sebastien Anceau – linuxloops and brunch
Yifan Zhu – llvm-libc
Anton Korobeynikov – llvm/llvm-project
Pierre-Yves David – Mercurial
Marcus Ilgner – MockK
Palak Sharma – moja-global/ui-library
Pony Cui – MPFlutter
Jim Crist-Harif – msgspec
Amadeusz Winogrodzki – next-firebase-auth-edge
Amarachi Johnson-Ubah – Open Terms Archive
Mehdi Saligane – OpenFASOC
Franco Fichtner – OPNsense core
Justin Nguyen – Oppia
Remko Popma – PicoCLI
Andres Freund – postgres
Caitlin Hudon – R-Ladies Austin, Data Mishaps Night (data + ML community online organizing)
Thulio Ferraz Assis – rules_python
Ignas Anikevicius – rules_python
Fridtjof Stoldt – Rust / Clippy
Caleb Zulawski – rust-lang/portable-simd
Dmitry Dygalo – Schemathesis
Pratim Bhosale – SurrealDB
Geoff Rich – SvelteKit
Lonami (unknown) – Telethon
Mayank Raunak – TensorFlow
Svetlana Novikova – The Good Docs Project
Saul Pwanson – VisiData
Etienne Dilocker – Weaviate
Mitsunari Shigeo – xbyak
Tomáš Janoušek – XMonad
Lasse Collin – xz
Zeeshan Ali Khan – zbus


We are incredibly proud of all of the nominees for their outstanding contributions to open source, and we look forward to seeing even more amazing contributions in the remainder of 2024 and years to come!

By Mike Bufano, Google Open Source Peer Bonus Program Lead

Common Expressions For Portable Policy and Beyond


I am thrilled to introduce the Common Expression Language (CEL), a simple expression language that's great for writing small snippets of logic that can run anywhere with the same semantics, with evaluation times in the nanosecond-to-microsecond range. The launch of cel.dev marks a major milestone in the growth and stability of the language.

It powers several well-known products, such as Kubernetes, where it's used to protect against costly production misconfigurations:

    object.spec.replicas <= 5

Cloud IAM uses CEL to enable fine-grained authorization:

    request.path.startsWith("/finance")

Versatility

CEL is both open source and openly governed, making it well adopted both inside and outside Google. CEL is used by a number of large tech companies, either internally or as part of their public product offering. As a reflection of this open governance, cel.dev has been launched to share more about the language and the community around it.

So, what makes CEL a good choice for these applications? Why is CEL unique or different from Amazon's Cedar Policy or Open Policy Agent's Rego? These are great questions, and common ones (no pun intended):

  • Highly optimized evaluation O(ns) - O(μs)
  • Portable with stacks supported in C++, Java, and Go
  • Thousands of conformance tests ensure consistent behavior across stacks
  • Supports extension and subsetting

Subsetting is crucial for preserving predictable compute and memory impacts, and it only exists in CEL. As any latency-critical service maintainer will tell you, it's vital to have a clear understanding of the compute and memory implications of any new feature. Imagine you've chosen an expression language and validated that its functionality meets your security, serving, and scaling goals, but after you launch, an update to the library introduces new functionality that can't be disabled and leaves your product vulnerable to attack. Your alternatives are to fork the library and accept the maintenance costs, to introduce custom validation logic that is likely to be insufficient to prevent abuse, or to redesign your service. CEL's support for subsetting lets you ensure that what was true at the initial product launch will remain true until you decide when to expose more of its functionality to customers.

Cedar Policy language was developed by Amazon. It is open source, implemented in Rust, and offers formal verification. Formal verification allows Cedar to validate policy correctness at authoring time. CEL is not just a policy language, but a broader expression language. It is used within larger policy projects in a way that allows users to augment their existing systems rather than adopt an entirely new one.

Formal verification often has challenges scaling to large amounts of data and is based on the belief that a formal proof of behavior is sufficient evidence of compliance. However, regulations and policies in natural language are often ambiguous. This means that logical translations of these policies are inherently ambiguous as well. CEL's approach to this challenge of compliance and reasoning about potentially ambiguous behaviors is to support evaluation over partial data at a very high speed and with minimal resources.

CEL is fast enough to be used in networking policies and expressive enough to be used in detailed application policies. Having a greater coverage of use cases improves your ability to reason about behavior holistically. But, what if you don't have all the data yet to know exactly how the policy will behave? CEL supports partial evaluation using four-valued logic which makes it possible to determine cases which are definitely allowed, denied, or where policy behavior is conditional on additional inputs. This allows for what-if analysis against historical data as well as against speculative data from new use cases or proposed policy changes.

Open Policy Agent's Rego is also open source, implemented in Go, and based on Datalog, which makes it possible to offer proofs similar to Cedar's. However, the language is much more powerful than Cedar, and more powerful than CEL. This expressiveness means that OPA Rego is fantastic for non-latency-critical, single-tenant solutions, but difficult to safely embed in existing offerings.


Four-valued Logic

CEL uses commutative logical operators that can render a true, false, error, or unknown status. This is a scalable alternative to formal verification and the expressiveness of Datalog. Four-valued logic allows CEL to evaluate over a partial set of inputs to deliver either a definitive result or communicate that more information is necessary to complete the evaluation.

What is four-valued logic?

True, false, and error outcomes are considered definitive: no additional information will change the outcome of the expression. Consider the following case:

    1/0 != 0 && false

In traditional programming languages, this expression would be an error; however, in CEL the outcome is false.

Now consider the following case where an input variable, called unknown_var is marked as unknown:

    unknown_var && true

The outcome of this expression is UNKNOWN{unknown_var} indicating that once the variable is known, the evaluation can be safely completed. An unknown indicates what was missing, and alerts the user to fix the outcome with more information. This technique both borrows from and extends SQL three-valued predicate logic which uses TRUE, FALSE, and NULL with commutative logical operators. From a CEL perspective, the error state is akin to SQL NULL that arises when there is an absence of information.


CEL compatibility with SQL

CEL leverages SQL semantics to ensure that it can be seamlessly translated to SQL. SQL optimizers perform significantly better over large data sets, making it possible to evaluate over data at rest. Imagine trying to scale a single expression evaluation over tens of millions of records. Even if the expression evaluates within a single microsecond, the evaluation would still take tens of seconds. The more complex the expression, the greater the latency. SQL excels at this use case, so translation from CEL to SQL was an important factor in the design in order to unlock the possibility of performant policy checks both online with CEL and offline with SQL.


Thank you CEL Community!

We’re proud to announce cel.dev as a major milestone in the maturity and adoption of the language, and we look forward to working with you to make CEL the best building block for writing latency-critical, portable logic. Feel free to contact us at [email protected]

By Tristan Swadell – Senior Staff Software Engineer

Driving etcd Stability and Kubernetes Success


Introduction: The Critical Role of etcd in Cloud-Native Infrastructure

Imagine a cloud-native world without Kubernetes. It's hard, right? But have you ever considered the unsung hero that makes Kubernetes tick? Enter etcd, the distributed key-value store that serves as the central nervous system for Kubernetes. Etcd's ability to consistently store and replicate critical cluster state data is essential for maintaining the health and harmony of distributed systems.


etcd: The Backbone of Kubernetes

Think of Kubernetes as a magnificent vertebrate animal, capable of complex movements and adaptations. In this analogy, etcd is the animal's backbone – a strong, flexible structure that supports the entire system. Just as a backbone protects the spinal cord (which carries vital information), etcd safeguards the critical data that defines the Kubernetes environment. And just as a backbone connects to every other part of the body, etcd facilitates communication and coordination between all the components of Kubernetes, allowing it to move, adapt, and thrive in the dynamic world of distributed systems.

Credit: Original image xkcd.com/2347, alterations by Josh Berkus.

Google's Deep-Rooted Commitment to Open Source

Google has a long history of contributing to open source projects, and our commitment to etcd is no exception. As the initiator of Kubernetes, Google understands the critical role that etcd plays in its success. Google engineers consistently invest in etcd to enhance its functionality and reliability, driven by their extensive use of etcd for their own internal systems.


Google's Collaborative Impact on etcd Reliability

Google engineers have actively contributed to the stability and resilience of etcd, working alongside the wider community to address challenges and improve the project. Here are some key areas where their impact has been felt:

Post-Release Support: Following the release of etcd v3.5.0, Google engineers quickly identified and addressed several critical issues, demonstrating their commitment to maintaining a stable and production-ready etcd for Kubernetes and other systems.

Data Consistency: Early Detection and Swift Action: Google engineers led efforts to identify and resolve data inconsistency issues in etcd, advocating for public awareness and mitigation strategies. Drawing from their Site Reliability Engineering (SRE) expertise, they fostered a culture of "blameless postmortems" within the etcd community—a practice where the focus is on learning from incidents rather than assigning blame. Their detailed postmortem of the v3.5 data inconsistency issue and a co-presented KubeCon talk served to share these valuable lessons with the broader cloud-native community.

Refocusing on Stability and Testing: The v3.5 incident highlighted the need for more comprehensive testing and documentation. Google engineers took action on multiple fronts:

  • Improving Documentation: They contributed to creating "The Implicit Kubernetes-ETCD Contract," which formalizes the interactions between the two systems, guiding development and troubleshooting.
  • Prioritizing Stability and Testing: They developed the "etcd Robustness Tests," a rigorous framework simulating extreme scenarios to proactively identify inconsistency and correctness issues.

These contributions have fostered a collaborative environment where the entire community can learn from incidents and work together to improve etcd's stability and resilience. The etcd Robustness Tests have been particularly impactful, not only reproducing all the data inconsistencies found in v3.5 but also uncovering other bugs introduced in that version. Furthermore, they've found previously unnoticed bugs that existed in earlier etcd versions, some dating back to the original v3 implementation. These results demonstrate the effectiveness of the robustness tests and highlight how they've made etcd the most reliable it has ever been in the history of the project.


etcd Robustness Tests: Making etcd the Most Reliable It's Ever Been

The "etcd Robustness Tests," inspired by the Jepsen methodology, subject etcd to rigorous simulations of network partitions, node failures, and other real-world disruptions. This ensures etcd's data consistency and correctness even under extreme conditions. These tests have proven remarkably effective, identifying and addressing a variety of issues:

For deeper insights into ensuring etcd's data consistency, Marek Siarkowicz's talk, "On the Hunt for Etcd Data Inconsistencies," offers valuable information about distributed systems testing and the innovative approaches used to build these tests. To foster transparency and collaboration, the etcd community holds bi-weekly meetings to discuss test results, open to engineers, researchers, and other interested parties.


The Kubernetes-etcd Contract: A Partnership Forged in Rigorous Testing

To solidify the Kubernetes-etcd partnership, Google engineers formally defined the implicit contract between the two systems. This shared understanding guided development and troubleshooting, leading to improved testing strategies and ensuring etcd meets Kubernetes' demanding requirements.

When subtle issues were discovered in how Kubernetes utilized etcd watch, the value of this formal contract became clear. These issues could lead to missed events under specific conditions, potentially impacting Kubernetes' operation. In response, Google engineers are actively working to integrate the contract directly into the etcd Robustness Tests to proactively identify and prevent such compatibility issues.


Conclusion: Google's Continued Commitment to etcd and the Cloud-Native Community

Google's ongoing investment in etcd underscores their commitment to the stability and success of the cloud-native ecosystem. Their contributions, along with the wider community's efforts, have made etcd a more reliable and performant foundation for Kubernetes and other critical systems. As the ecosystem evolves, etcd remains a critical linchpin, empowering organizations to build and deploy distributed applications with confidence. We encourage all etcd and Kubernetes contributors to continue their active participation and contribute to the project's ongoing success.

By Marek Siarkowicz – GKE etcd

The Kubernetes ecosystem is a candy store


For the 10th anniversary of Kubernetes, I wanted to look at the ecosystem we created together.

I recently wrote about the pervasiveness and magnitude of the Kubernetes and CNCF ecosystem. This was the result of a deliberate flywheel. This is a diagram I used several years ago:

Flywheel diagram of Kubernetes and CNCF ecosystem

Because Kubernetes runs on public clouds, private clouds, on the edge, etc., it is attractive to developers and vendors to build solutions targeting its users. Most tools built for Kubernetes or integrated with Kubernetes can work across all those environments, whereas integrating directly with cloud providers entails individual work for each one. Thus, Kubernetes created a large addressable market with a comparatively lower cost to build.

We also deliberately encouraged open source contribution, to Kubernetes and to other projects. Many tools in the ecosystem, not just those in CNCF, are open source. This includes many tools built by Kubernetes users, tools built by vendors that were too small to be products, and tools intended to be the cores of products. Developers built and/or wrote about solutions to problems they experienced or saw, and shared them with the community. This made Kubernetes more usable and more visible, which likely attracted more users.

Today, the result is that if you need a tool, extension, or off-the-shelf component for pretty much anything, you can probably find one compatible with Kubernetes rather than having to build it yourself, and it’s more likely that you can find one that works out of the box with Kubernetes than for your cloud provider. And often there are several options to choose from. I’ll just mention a few. Also, I want to give a shout out to Kubetools, which has a great list of Kubernetes tools that helped me discover a few new ones.

For example, if you’re an application developer whose application runs on Kubernetes, you can build and deploy with Skaffold, test it on Kubernetes locally with Minikube, or connect to Kubernetes remotely with Telepresence, or sync to a preview environment with Gitpod or Okteto. When you need to debug multiple instances, you can use kubetail to view the logs in real time.

To deploy to production, you can use GitOps tools like FluxCD, ArgoCD, or Google Cloud’s Config Sync. You can perform database migrations with Schemahero. To aggregate logs from your production deployments, you can use fluentbit. To monitor them, you have your pick of observability tools, including Prometheus, which was inspired by Google’s Borgmon tool similar to how Kubernetes was inspired by Borg, and which was the 2nd project accepted into the CNCF.

If your application needs to receive traffic from the Internet, you can use one of the many Ingress controllers or Gateway implementations to configure HTTPS routing, and cert-manager to obtain and renew the certificates. For mutual TLS and advanced routing, you can use a service mesh like Istio, and take advantage of it for progressive delivery using tools like Flagger.

If you have a more specialized type of workload to run, you can run event-driven workloads using Knative, batch workloads using Kueue, ML workflows using Kubeflow, and Kafka using Strimzi.

If you’re responsible for operating Kubernetes workloads, to monitor costs, there’s kubecost. To enforce policy constraints, there’s OPA Gatekeeper and Kyverno. For disaster recovery, you can use Velero. To debug permissions issues, there are RBAC tools. And, of course, there are AI-powered assistants.

You can manage infrastructure using Kubernetes, such as using Config Connector or Crossplane, so you don’t need to learn a different syntax and toolchain to do that.

There are tools with a retro experience like K9s and Ktop, fun tools like xlskubectl, and tools that are both retro and fun like Kubeinvaders.

If this makes you interested in migrating to Kubernetes, you can use a tool like move2kube or kompose.

This just scratched the surface of the great tools available for Kubernetes. I view the ecosystem as more of a candy store than as a hellscape. It can take time to discover, learn, and test these tools, but overall I believe they make the Kubernetes ecosystem more productive. To develop any one of these tools yourself would require a significant time investment.

I expect new tools to continue to emerge as the use cases for Kubernetes evolve and expand. I can’t wait to see what people come up with.

By Brian Grant, Distinguished Engineer, Google Cloud Developer Experience

Anomaly detection with few labeled samples under distribution mismatch


SPADE: Semi-Supervised Anomaly Detection under Distribution Mismatch


What is SPADE?

Recently, we have open-sourced SPADE (Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling), a semi-supervised framework for anomaly detection that overcomes some of the drawbacks of alternative anomaly detection methods.


What Problem does SPADE Solve?

Anomaly detection is the process of identifying samples in a dataset that diverge from some expected pattern. This process has wide applications in several industries such as API security, financial fraud and manufacturing defect detection. SPADE is especially designed for semi-supervised settings where we have a handful of labeled data and a large number of unlabeled data.


When is SPADE better for your Use Case?

Creating a large labeled set of anomalous and non-anomalous samples for supervised learning can be time-consuming, expensive and error-prone. So unsupervised and semi-supervised methods have become an active area of research.

Most of these semi-supervised methods make the assumption that the labeled and unlabeled data come from the same distribution, that is, they are generated by the same underlying process—physical, financial, manufacturing or other process. This assumption is often violated in different ways—the labeled data could contain one type of anomaly while the unlabeled data contains other types of anomalies; or the labeled data could only contain samples that were easy to label. In these and potentially other cases, SPADE has been shown to have better performance than alternatives.


How does it Work?

SPADE constructs an ensemble of One-Class Classifiers (OCCs); each OCC is a Gaussian Mixture Model trained in a self-supervised manner on a disjoint subset of the unlabeled samples and non-anomalous samples.

Figure 1. SPADE first trains an ensemble of OCC to provide pseudo-labels to the unlabeled samples. Then, both labeled and pseudo-labeled samples are used to train a supervised model for the anomaly detection.

The ensemble is used to obtain pseudo-labels for the unlabeled data. A pseudo-label of is-anomalous or not-anomalous is assigned only if all the members of the ensemble agree. The pseudo-labels and any original labels are used together to train a supervised anomaly detector model. In the version of SPADE that we are open-sourcing, this model is a TensorFlow Random Forest that is trained with a binary cross-entropy loss. Once trained on the labels and pseudo-labels, the detector model can be used for online or batch prediction.


Example Use Cases

The benefits of SPADE described above are highlighted in our experiments, as detailed in the published paper (in TMLR, with a Featured Certification). Here we present some results on a selection of datasets that demonstrate SPADE's performance when (a) there are new types of anomalies in the unlabeled dataset, (b) the labeled anomalies are easy to label, and (c) the dataset contains only positively labeled and unlabeled samples.

Figure 2. SPADE performance compared against other supervised, semi-supervised and unsupervised methods. Details about the datasets and the methods can be found in our paper.

As shown in Figure 2, SPADE consistently outperforms alternative methods. The CoverType and Thyroid datasets have Creative Commons Attribution 4.0 International (CC BY 4.0) licenses and are present in the SPADE repository.


How to use SPADE

We have just open-sourced SPADE. The repository contains scripts that build a Docker container and push the container, then run the container as a Vertex Custom Job on Google Cloud Platform. The dataset is read from BigQuery. Metrics such as AUC, Precision and Recall can currently be tracked in the job logs. The job launch script is configured with a default set of hyperparameters as described in the documentation. Users may need to adjust the hyperparameters to obtain optimal performance. The final trained anomaly detection model artifact is written to Google Cloud Storage (GCS). This artifact can be deployed as a Vertex Endpoint to serve predictions (not demonstrated in this repository).


Ways to Help

By open sourcing SPADE, we hope to foster more usage of this innovative anomaly detection method in the community, as well as invite contributions to improve the method. The SPADE model and code are freely available on GitHub under the Apache-2.0 license. SPADE is currently set up to run in a Docker container as a Vertex Custom Job on Google Cloud Platform. It can also be run by installing from PyPI using pip install spade-anomaly-detection. Users can upload their dataset to BigQuery and run the training job on Vertex, or on a local machine from the PyPI installation.

More detailed usage instructions are available in the documentation.

By Raj Sinha and Jinsung Yoon, Cloud AI Research Team

Kubernetes 1.30 is now available in GKE in record time

Kubernetes 1.30 is now available in the Google Kubernetes Engine (GKE) Rapid Channel less than 20 days after the OSS release! For more information about the content of Kubernetes 1.30, read the Kubernetes 1.30 Release Notes and the specific GKE 1.30 Release Notes.


Control Plane Improvements

We're excited to announce that ValidatingAdmissionPolicy graduates to GA in 1.30. This is an exciting feature that enables many admission webhooks to be replaced with policies defined using the Common Expression Language (CEL) and evaluated directly in the kube-apiserver. This feature benefits both extension authors and cluster administrators by dramatically simplifying the development and operation of admission extensions. Many existing webhooks may be migrated to validating admission policies. For webhooks not ready or able to migrate, Match Conditions may be added to webhook configurations using CEL rules to pre-filter requests and reduce webhook invocations.

Validation Ratcheting makes CustomResourceDefinitions even safer and easier to manage. Prior to Kubernetes 1.30, when updating a custom resource, validation was required to pass for all fields, even fields not changed by the update. Now, with this feature, only fields changed in the custom resource by an update request must pass validation. This limits validation failures on update to the changed portion of the object, and reduces the risk of controllers getting stuck when a CustomResourceDefinition schema is changed, either accidentally or as part of an effort to increase the strictness of validation.

Aggregated Discovery graduates to GA in 1.30, dramatically improving the performance of clients, particularly kubectl, when fetching the API information needed for many common operations. Aggregated discovery reduces the fetch to a single request and allows caches to be kept up-to-date by offering ETags that clients can use to efficiently poll the server for changes.


Data Plane Improvements

Dynamic Resource Allocation (DRA) is an alpha Kubernetes feature added in 1.26 that enables flexibility in configuring, selecting, and allocating specialized devices for pods. Feedback from SIG Scheduling and SIG Autoscaling revealed that the design needed revisions to reduce scheduling latency and fragility, and to support cluster autoscaling. In 1.30, the community introduced a new alpha design, DRA Structured Parameters, which takes the first step towards these goals. This is still an alpha feature with a lot of changes expected in upcoming releases. The newly formed WG Device Management has a charter to improve device support in Kubernetes - with a focus on GPUs and similar hardware - and DRA is a key component of that support. Expect further enhancements to the design in another alpha in 1.31. The working group has a goal of releasing some aspects to beta in 1.32.


Kubernetes continues the effort of eliminating perma-beta features: functionality that has long been used in production but still wasn’t marked as generally available. With this release, AppArmor support got some attention and moved closer to finally being marked as GA.

There are also quality-of-life improvements in the Kubernetes data plane. Many of them will only be noticeable to system administrators and are not particularly relevant for GKE users. In this release, however, the notable Sleep Action KEP entered beta and is available on GKE. It will now be easier to use slim images while still allowing graceful connection draining, specifically for some flavors of nginx images.

Acknowledgements

We want to thank all the Googlers that provide their time, passion, talent and leadership to keep making Kubernetes the best container orchestration platform. From the features mentioned in this blog, we would like to mention especially: Googlers Cici Huang, Joe Betz, Jiahui Feng, Alex Zielenski, Jeffrey Ying, John Belamaric, Tim Hockin, Aldo Culquicondor, Jordan Liggitt, Kuba Tużnik, Sergey Kanzhelev, and Tim Allclair.

Posted by Federico Bongiovanni – Google Kubernetes Engine

OpenXLA Dev Lab 2024: Building Groundbreaking ML Systems Together


AMD, Arm, AWS, Google, NVIDIA, Intel, Tesla, SambaNova, and more come together to crack the code for colossal AI workloads

As AI models grow increasingly complex and compute-intensive, the need for efficient, scalable, and hardware-agnostic infrastructure has never been greater. OpenXLA is a deep learning compiler framework that makes it easy to speed up and massively scale AI models on a wide range of hardware types—from GPUs and CPUs to specialized chips like Google TPUs and AWS Trainium. It is compatible with popular modeling frameworks—JAX, PyTorch, and TensorFlow—and delivers leading performance. OpenXLA is the acceleration infrastructure of choice for global-scale AI-powered products like Amazon.com Search, Google Gemini, Waymo self-driving vehicles, and x.AI's Grok.


The OpenXLA Dev Lab

On April 25th, the OpenXLA Dev Lab played host to over 100 expert ML practitioners from 10 countries, representing industry leaders like AMD, Arm, AWS, ByteDance, Cerebras, Cruise, Google, NVIDIA, Intel, Tesla, SambaNova, and more. The full-day event, tailored to AI hardware vendors and infrastructure engineers, broke the mold of previous OpenXLA Summits by focusing purely on “Lab Sessions”, akin to office hours for developers, and hands-on Tutorials. The energy of the event was palpable as developers worked side-by-side, learning and collaborating on both practical challenges and exciting possibilities for AI infrastructure.

Figure 1: Developers from around the world congregated at the OpenXLA Dev Lab.

The Dev Lab was all about three key things:

  • Educate and Empower: Teach developers how to implement OpenXLA's essential workflows and advanced features through hands-on tutorials.
  • Offer Expert Guidance: Provide personalized office hours led by OpenXLA experts to help developers refine their ideas and contributions.
  • Foster Community: Encourage collaboration, knowledge-sharing, and lasting connections among the brilliant minds in the OpenXLA community.

Tutorials

The Tutorials included:

Integrating an AI Compiler & Runtime into PJRT

  • Learn how PJRT connects ML frameworks to AI accelerators, standardizing their interaction for easy model deployment on diverse hardware.
  • Explore the PJRT C API for framework-hardware communication.
  • Implement a PJRT Plugin, a Python package that implements the C API.
  • Discover plugin examples for Apple Metal, CUDA, Intel GPU, and TPU.

Led by Jieying Luo and Skye Wanderman-Milne


Extracting StableHLO Graphs + Intro to StableHLO Quantizer

  • Learn to export StableHLO from JAX, PyTorch, and TensorFlow using static/dynamic shapes and SavedModel format.
  • Hack along with the tutorial using the JAX, PyTorch, and TensorFlow Colab notebooks provided on OpenXLA.org.
  • Simplify quantization with StableHLO Quantizer; a framework and device-agnostic tool.
  • Explore streamlined parameter selection and model rewriting for lower precision.

Led by Kevin Gleason, Jen Ha, and Xing Liu


Optimizing PyTorch/XLA Auto-sharding for Your Hardware

  • Discover this experimental feature that automates distributing large-scale PyTorch models across XLA devices.
  • Learn how it partitions and distributes for out-of-the-box performance without manual intervention.
  • Explore future directions such as customizable cost models for different hardware.

Led by Yeounoh Chung and Pratik Fegade


Optimizing Compute and Communication Scheduling with XLA

  • Scale ML models on multi-GPUs with SPMD partitioning, collective communication, HLO optimizations.
  • Explore tensor parallelism, latency hiding scheduler, pipeline parallelism.
  • Learn collective optimizations, pipeline parallelism for efficient large-scale training.

Led by Frederik Gossen, TJ Xu, and Abhinav Goel


Lab Sessions

Lab Sessions featured use case-specific office hours for AMD, Arm, AWS, ByteDance, Intel, NVIDIA, SambaNova, Tesla, and more. OpenXLA engineers were on hand to provide development teams with dedicated support and walkthrough specific pain points and designs. In addition, Informational Roundtables that covered broader topics like GPU ML Performance Optimization, JAX, and PyTorch-XLA GPU were available for those without specific use cases. This approach led to productive exchanges and fine-grained exploration of critical contribution areas for ML hardware vendors.


Don’t just take our word for it – here’s some of the feedback we received from developers:

"PJRT is awesome, we're looking forward to building with it. We are very grateful for the support we are getting." 
      — Mark Gottscho, Senior Manager and Technical Lead at SambaNova
"Today I learned a lot about Shardonnay and about some of the bugs I found in the GSPMD partitioner, and I got to learn a lot of cool stuff." 
      — Patrick Toulme, Machine Learning Engineer at AWS
“I learned a lot, a lot about how XLA is making tremendous progress in building their community.” 
      — Tejash Shah, Product Manager at NVIDIA
“Loved the format this year - please continue … lots of learning, lots of interactive sessions. It was great!” 
      — Om Thakkar, AI Software Engineer at Intel

Technical Innovations and The Bold Road Ahead

The event kicked off with a keynote by Robert Hundt, Distinguished Engineer at Google, who outlined OpenXLA's ambitious plans for 2024, particularly three major areas of focus:

  • Large-scale training
  • GPU and PyTorch compute performance
  • Modularity and extensibility

Empowering Large-Scale Training

OpenXLA is introducing powerful features to enable model training at record-breaking scales. One of the most notable additions is Shardonnay, a tool coming soon to OpenXLA that automates and optimizes how large AI workloads are divided across multiple processing units, ensuring efficient use of resources and faster time to solution. Building on the success of its predecessor, SPMD, Shardonnay empowers developers with even more fine-grained control over partitioning decisions, all while maintaining the productivity benefits that SPMD is known for.

Figure 2: Sharding representation example with a simple rank 2 tensor and 4 devices.

In addition to Shardonnay, developers can expect a suite of features designed to optimize computation and communication overlap, including:

  • Automatic profile-guided latency estimation
  • Collective pipelining
  • Heuristics-based collective combiners

These innovations will enable developers to push the boundaries of large-scale training and achieve unprecedented performance and efficiency.


OpenXLA Delivers on TorchBench Performance

OpenXLA has also made significant strides in enhancing performance, particularly on GPUs with key PyTorch-based generative AI models. PyTorch-XLA GPU is now neck and neck with TorchInductor for TorchBench Full Graph Models and has a TorchBench pass rate within 5% of TorchInductor.

Figure 3: Performance comparison of TorchInductor vs. PyTorch-XLA GPU on Google Cloud NVIDIA H100 GPUs. “Full graph models” represent all TorchBench models that can be fully represented by StableHLO

Behind these impressive gains lies XLA GPU's global cost model, a game-changer for developers. In essence, this cost model acts as a sophisticated decision-making system, intelligently determining how to best optimize computations for specific hardware. The cost model delivers state-of-the-art performance through a priority-based queue for fusion decisions and is highly extensible, allowing third-party developers to seamlessly integrate their backend infrastructure for both general-purpose and specialized accelerators. The cost model's adaptability ensures that computation optimizations are tailored to specific accelerator architectures, while less suitable computations can be offloaded to the host or other accelerators.

OpenXLA is also breaking new ground with novel kernel programming languages, Pallas and Mosaic, which empower developers to write highly optimized code for specialized hardware. Mosaic demonstrates remarkable efficiency in programming key AI accelerators, surpassing widely used libraries in GPU code generation efficiency for models with 64, 128, and 256 Q head sizes, as evidenced by its enhanced utilization of TensorCores.

Figure 4: Performance comparison of Flash Attention vs. Mosaic GPU on NVIDIA H100 GPUs.

Modular and Extensible AI Development

In addition to performance enhancements, OpenXLA is committed to making the entire stack more modular and extensible. Several initiatives planned for 2024 include:

  • Strengthening module interface contracts
  • Enhancing code sharing between platforms
  • Enabling a shared high-level compiler flow through runtime configuration and component registries

Figure 5: Modules and subcomponents of the OpenXLA stack.

These improvements will make it easier for developers to build upon and extend OpenXLA.

Alibaba's success with PyTorch XLA FSDP within their TorchAcc framework is a prime example of the benefits of OpenXLA's modularity and extensibility. By leveraging these features, Alibaba achieved state-of-the-art performance for the LLaMa 2 13B model, surpassing the previous benchmark set by Megatron. This demonstrates the power of the developer community in extending OpenXLA to push the boundaries of AI development.

Figure 6: Performance comparison of TorchAcc and Megatron for LLaMa 2 13B at different numbers of GPUs.

Join the OpenXLA Community

If you missed the Dev Lab, don't worry! You can still access StableHLO walkthroughs on openxla.org, as well as the GitHub Gist for the PJRT session. Additionally, the recorded keynote and tutorials are available on our YouTube channel. Explore these resources and join our global community – whether you're an AI systems expert, model developer, student, or just starting out, there's a place for you in our innovative ecosystem.


Acknowledgements

Adam Paske, Allen Hutchison, Amin Vahdat, Andrew Leaver, Andy Davis, Artem Belevich, Abhinav Goel, Benjamin Kramer, Berkin Illbeyi, Bill Jia, Eugene Zhulenev, Florian Reichl, Frederik Gossen, George Karpenkov, Gunhyun Park, Han Qi, Jack Cao, Jaesung Chung, Jen Ha, Jianting Cao, Jieying Luo, Jiewin Tan, Jini Khetan, Kevin Gleason, Kyle Lucke, Kuy Mainwaring, Lauren Clemens, Manfei Bai, Marisa Miranda, Michael Levesque-Dion, Milad Mohammadi, Nisha Miriam Johnson, Penporn Koanantakool, Robert Hundt, Sandeep Dasgupta, Sayce Falk, Shauheen Zahirazami, Skye Wanderman-milne, Yeounoh Chung, Pratik Fegade, Peter Hawkins, Vaibhav Singh, Tamás Danyluk, Thomas Jeorg, Adam Paszke and TJ Xu.

By James Rubin, Aditi Joshi, and Elliot English on behalf of the OpenXLA Project

Google Summer of Code 2024 accepted contributors announced!


We are celebrating our 20th anniversary of Google Summer of Code (GSoC) and we are thrilled to announce the new 1,220 Contributors that will be writing code for 195 open source mentoring organizations starting May 27. Over the last few weeks, our mentoring organizations have read through applications, had discussions with applicants, and made the difficult decision of selecting the GSoC Contributors they will be mentoring this summer.

Highlighting significant results from this year’s application period:

    • 43,984 applicants from 172 countries
    • 9,107 proposals submitted by 6,518 applicants
    • 1,220 GSoC contributors accepted from 73 countries
    • Over 2,800 mentors and organization administrators
    • 34 mentoring organizations are participating in their 16th-20th GSoC!

Starting today, our GSoC 2024 Contributors will actively engage with their new open source community and become familiar with their organizations during the Community Bonding period. Mentors will guide the GSoC Contributors through documentation and introduce them to community norms and processes while helping plan their milestones and projects for the summer. Coding begins May 27th and while most folks will wrap up September 2nd, GSoC Contributors have the opportunity to request a longer coding period and wrap up their projects between early September and early November based on their schedules and availability.

We’d like to express our gratitude to the thousands of applicants who took the time and effort to reach out to our mentoring organizations and submit proposals this year. The experience of researching, asking questions and becoming more familiar with open source communities has hopefully helped you feel excited about open source and maybe you found a great community that you want to contribute to outside of Google Summer of Code! Communication is key to GSoC and open source, and staying connected with the community or reaching out to other organizations is an exceptional way to set the stage for future opportunities. Open source communities are always looking for new and eager collaborators to join their projects.

A huge thank you to all of our mentors and organization administrators who make this program so special and impactful for thousands of developers each year. Google Summer of Code continues because of the dedication of mentors to keep the open source ecosystem thriving by supporting new developers and their exciting perspectives and ideas. Google is honored to support the open source ecosystem (and 1,000+ open source projects and 20,000+ developers) over these past 20 years.

GSoC Contributors — have fun this summer, keep learning and enjoy becoming part of the open source community! Your mentors and community members have dozens, and in some cases hundreds, of years of combined experience so let them share their knowledge with you to help you become phenomenal open source contributors.

By Stephanie Taylor – Program Lead and Lucila Ortiz – Program Administrator

Get ready for Google I/O: Program lineup revealed


Developers, get ready! Google I/O is just around the corner, kicking off live from Mountain View with the Google keynote on Tuesday, May 14 at 10 am PT, followed by the Developer keynote at 1:30 pm PT.

But the learning doesn’t stop there. Mark your calendars for May 16 at 8 am PT when we’ll be releasing over 150 technical deep dives, demos, codelabs, and more on-demand. If you register online, you can start building your 'My I/O' agenda today.

Here's a sneak peek at some of the exciting highlights from the I/O program preview:

Unlocking the power of AI: The Gemini era unlocks a new frontier for developers. We'll showcase the newest features in the Gemini API, Google AI Studio, and Gemma. Discover cutting-edge pre-trained models from Kaggle, and delve into Google's open-source libraries like Keras and JAX.

Android: A developer's playground: Get the latest updates on everything Android! We'll cover groundbreaking advancements in generative AI, the highly anticipated Android 15, innovative form factors, and the latest tools and libraries in the Jetpack and Compose ecosystem. Plus, discover how to optimize performance and streamline your development workflow.

Building beautiful and functional web experiences: We’ll cover Baseline updates, a revolutionary tool that empowers developers with a clear understanding of web features and API interoperability. With Baseline, you'll have access to real-time information on popular developer resource sites like MDN, Can I Use, and web.dev.

The future of ChromeOS: Get a glimpse into the exciting future of ChromeOS. We'll discuss the developer-centric investments we're making in distribution, app capabilities, and operating system integrations. Discover how our partners are shaping the future of Chromebooks and delivering world-class user experiences.

This is just a taste of what's in store at Google I/O. Stay tuned for more updates, and get ready to be a part of the future.

Don't forget to mark your calendars and register for Google I/O today!

Posted by Timothy Jordan – Director, Developer Relations and Open Source