Author Archives: Open Source Programs Office

Optimizing gVisor filesystems with Directfs

gVisor is a sandboxing technology that provides a secure environment for running untrusted code. In our previous blog post, we discussed how gVisor performance improves with a root filesystem overlay. In this post, we'll dive into another filesystem optimization that was recently launched: directfs. It gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer.

Origins of the Gofer

gVisor is used internally at Google to run a variety of services and workloads. One of the challenges we faced while building gVisor was providing remote filesystem access securely to the sandbox. gVisor’s strict security model and defense in depth approach assumes that the sandbox may get compromised because it shares the same execution context as the untrusted application. Hence the sandbox cannot be given sensitive keys and credentials to access Google-internal remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a "gofer". The gofer runs outside the sandbox, and provides a secure interface for untrusted containers to access such remote filesystems. For architectural simplicity, gofers were used to serve local filesystems as well as remote ones.

Gofer process intermediates filesystem operations

Isolating the Container Filesystem in runsc

When gVisor was open sourced as runsc, the same gofer model was copied over to maintain the same security guarantees. runsc was configured to start one gofer process per container which serves the container filesystem to the sandbox over a predetermined protocol (now LISAFS). However, a gofer adds a layer of indirection with significant overhead.

This gofer model (built for remote filesystems) brings very few advantages for the runsc use-case, where all the filesystems served by the gofer (like rootfs and bind mounts) are mounted locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local filesystems. These include mount namespaces, pivot_root, and detached bind mounts¹. Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner. The sandbox’s view of the filesystem tree is limited to just the container filesystem. The sandbox process is not given access to anything mounted on the broader host filesystem. Even if the sandbox gets compromised, these mechanisms provide additional barriers to prevent broader system compromise.

Directfs

In directfs mode, the gofer still exists as a cooperative process outside the sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate bind mounts to create the container filesystem in a new directory and then pivot_root(2)s into that directory. Similarly, the sandbox process enters new user and mount namespaces and then pivot_root(2)s into an empty directory to ensure it cannot access anything via path traversal. But instead of making RPCs to the gofer to access the container filesystem, the sandbox requests the gofer to provide file descriptors to all the mount points via SCM_RIGHTS messages. The sandbox then directly makes file-descriptor-relative syscalls (e.g. fstatat(2), openat(2), mkdirat(2), etc) to perform filesystem operations.

Sandbox directly accesses container filesystem with directfs

Earlier, when the gofer performed all filesystem operations, we could deny all these syscalls in the sandbox process using seccomp. But with directfs enabled, the sandbox process's seccomp filters must allow these syscalls. Most notably, the sandbox can now make openat(2) syscalls (which allow path traversal), but with certain restrictions: O_NOFOLLOW is required, no access to procfs, and no directory FDs from the host. We also had to give the sandbox the same privileges as the gofer (for example, CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH) so it can perform the same filesystem operations.
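
The Sentry itself is written in Go and issues these syscalls directly; purely to make the fd-relative pattern concrete, here is a hypothetical Python sketch (illustrative only, not gVisor code) in which every operation is performed relative to a directory file descriptor for a mount point, with O_NOFOLLOW on the final path component.

import os
import tempfile

# Stand-in for a mount-point directory FD that the gofer would hand over
# via SCM_RIGHTS; here we simply open a scratch directory to play that role.
mount_dir = tempfile.mkdtemp()
mount_fd = os.open(mount_dir, os.O_RDONLY | os.O_DIRECTORY)

# openat(2)-style create/open relative to the mount FD. O_NOFOLLOW mirrors
# the directfs restriction that the last path component must not be a symlink.
fd = os.open("data.txt", os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW, 0o644,
             dir_fd=mount_fd)
os.write(fd, b"hello from an fd-relative open\n")
os.close(fd)

# fstatat(2)- and mkdirat(2)-style operations, again relative to the mount FD
# rather than to any absolute host path.
print(os.stat("data.txt", dir_fd=mount_fd, follow_symlinks=False).st_size)
os.mkdir("subdir", 0o755, dir_fd=mount_fd)

os.close(mount_fd)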

It is noteworthy that only the trusted gofer provides FDs (of the container filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or follow a malicious symlink to escape out of the container filesystem. In effect, we've decreased our dependence on the syscall filters to catch bad behavior, but correspondingly increased our dependence on Linux's filesystem isolation protections.

Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead to runsc. Hence, avoiding gofer round trips significantly improves performance. Let's find out what this means for some of our benchmarks. We will run the benchmarks using our newly released systrap platform on bind mounts (as opposed to rootfs). This simulates more realistic use cases, because bind mounts are extensively used when configuring filesystems in containers. Bind mounts also do not have an overlay (unlike the rootfs mount), so all operations go through the goferfs / directfs mount.

Let's first look at our stat micro-benchmark, which repeatedly calls stat(2) on a file.
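
The actual gVisor benchmarks are maintained in the gVisor repository; purely to make the measured operation concrete, a micro-benchmark of this shape boils down to a loop like the following hypothetical Python sketch.

import os
import time

# Hypothetical sketch of a stat micro-benchmark: repeatedly stat one file
# and report the average latency. This is not the gVisor benchmark code.
PATH = "/tmp/stat-bench-target"
ITERATIONS = 100_000

with open(PATH, "w") as f:
    f.write("x")

start = time.perf_counter()
for _ in range(ITERATIONS):
    os.stat(PATH)  # issues a stat(2)/statx(2) syscall under the hood
elapsed = time.perf_counter() - start

print(f"{ITERATIONS} stat calls in {elapsed:.3f}s "
      f"({elapsed / ITERATIONS * 1e6:.2f} µs per call)")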

Stat benchmark improvement with directfs

The stat(2) syscall is more than 2x faster! However, since this is not representative of real-world applications, we should not extrapolate these results. So let's look at some real-world benchmarks.

Real-world benchmark improvements with directfs

We see a 12% reduction in the absolute time to run these workloads and a 17% reduction in Ruby load time!

Conclusion

The gofer model in runsc was overly restrictive for accessing host files. We were able to leverage existing filesystem isolation mechanisms in Linux to bypass the gofer without compromising security. Directfs significantly improves performance for certain workloads. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!


¹ Detached bind mounts can be created by first creating a bind mount using mount(MS_BIND) and then detaching it from the filesystem tree using umount(MNT_DETACH).


By Ayush Ranjan, Software Engineer – Google

OpenTitan RTL freeze

We are excited to announce that the OpenTitan® coalition has successfully reached a key milestone—RTL freeze of its first engineering sample release candidate! A snapshot of our high quality, open source silicon root of trust hardware implementation has been released for synthesis, layout and fabrication. We expect engineering sample chips to be available for lab testing and evaluation by the end of 2023.

This is a major achievement that represents the culmination of a multi-year investment and long-term, coordinated effort by the project’s active community of commercial and academic partners—including Google, G+D Mobile Security, ETH Zurich, Nuvoton, Winbond, Seagate, Western Digital, Rivos, and zeroRISC, plus a number of independent contributors. The OpenTitan project and its community are actively supported and maintained by lowRISC C.I.C., an independent non-profit.

Hitting this milestone demonstrates that large-scale engineering efforts can be successful when many organizations with aligned interests collaborate on an open source project. It also matters because traditionally, computing ecosystems have had to depend heavily on proprietary hardware (silicon) and software solutions to provide foundational, or “root,” trust assurances to their users. OpenTitan fundamentally changes that paradigm for the better, delivering secure root of trust silicon technology which is open source, high quality, and publicly verifiable.

Our belief is that core security features like the authenticity of the root of trust and the firmware it executes should be safely commoditized rights guaranteed to the end user—not areas for differentiation. To that end, we have made available a high-quality, industrial strength, reusable ecosystem of OpenTitan blocks, top-levels, infrastructure, and software IP adaptable for many use cases, delivered under a permissive, no-cost license and with known-good provenance. OpenTitan's now-complete, standalone “Earl Grey” chip implementation, design verification, full-chip testing, and continuous integration (CI) infrastructure are all available on GitHub today.

Flowchart illustrating the silicon process and OpenTitan

The silicon process and OpenTitan

This release means the OpenTitan chip digital design is complete and has been verified to be of sufficiently high quality that a tapeout is expected to succeed. In other words, the logical design is judged to be of sufficient maturity to translate into a physical layout and create a physical chip. The initial manufacturing will be performed in a smaller batch, delivering engineering samples which allow post-silicon verification of the physical silicon, prior to creating production devices in large volume.


Earl Grey: Discrete implementation of OpenTitan

Design Verification

Industrial quality implementation has been a core tenet of the OpenTitan project from the outset, both to ensure the design meets its goals—including security—and to ensure the first physical chips are successful. OpenTitan’s hardware development stages ensure all hardware blocks go through several gating design and verification reviews before final integration signoff. This verification has required development of comprehensive testbenches and test infrastructure, all part of the open source project. Both individual blocks and the top-level Earl Grey design have functional and code coverage above 90%—at or above the standards of typical closed-source designs—with 40k+ tests running nightly and reported publicly via the OpenTitan Design Verification Dashboard. Regressions are caught and resolved quickly, ensuring design quality is maintained over the long term.

Software tooling

OpenTitan has led the way in making open source silicon a reality, and doing so requires much more than just open source silicon RTL and Design Verification collateral. Successful chips require real software support to have broad industry impact and adoption. OpenTitan has created generalizable infrastructure for silicon projects (test frameworks, continuous integration infrastructure, per-block DIFs), host tools like opentitantool to support interactions with all OpenTitan instances, and formal releases (e.g. the ROM to guarantee important security functionality such as firmware verification and ownership transfer).

Documentation

A good design isn’t worth much if it’s hard to use. With this in mind, thorough and accurate documentation is a major component of the OpenTitan project too. This includes a Getting Started Guide, which is a ‘from scratch’ walkthrough on a Linux workstation, covering software and tooling installation, and hardware setup. It includes a playbook to run local simulations or even emulate the entire OpenTitan chip on an FPGA.

Furthermore, OpenTitan actively maintains live dashboards of quality metrics for its entire IP ecosystem (e.g. regression testing and coverage reports). If you’re new to open source silicon development, there are comprehensive resources describing project standards for technical contribution that have been honed to effectively facilitate inter-organizational collaboration.

Thriving open source community

OpenTitan’s broad community has been critical to its success. As the following metrics show (baselined from the project’s public launch in 2019), the OpenTitan community is rapidly growing:

  • More than eight times the number of commits at launch: from 2,500 to over 20,000.
  • 140 contributors to the code base
  • 13k+ merged pull requests
  • 1.5M+ LoC, including 500k LoC of HDL
  • 1.8k GitHub stars

Participating in OpenTitan

Reaching this key RTL freeze milestone is a major step towards transparency at the very foundation of the security stack: the silicon root of trust. The coordinated contributions of the OpenTitan project’s partners—enabled by lowRISC’s Silicon Commons™ approach to open source silicon development—are what has made it possible to get here today.

This is a watershed moment for the trustworthiness of systems we all rely on. The future of free and open, high quality silicon implementations is bright, and we expect to see many more devices including OpenTitan top-levels and ecosystem IP in the future!

If you are interested in contributing to OpenTitan, visit the open source GitHub repository or reach out to the OpenTitan team.

By Cyrus Stoller, Miguel Osorio, and Will Drewry, OpenTitan – Google

Controlling Stable Diffusion with JAX, diffusers, and Cloud TPUs

Diffusion models are state-of-the-art in generating photorealistic images from text. These models are hard to control through only text and generation parameters. To overcome this, the open source community developed ControlNet (GitHub), a neural network structure to control diffusion models by adding more conditions on top of the text prompts. These conditions include canny edge filters, segmentation maps, and pose keypoints. Thanks to the 🧨diffusers library, it is very easy to train, fine-tune or control diffusion models written in various frameworks, including JAX!

At Hugging Face, we were particularly excited to see the open source machine learning (ML) community leverage these tools to explore fun and creative diffusion models. We joined forces with Google Cloud to host a community sprint where participants explored the capabilities of controlling Stable Diffusion by building various open source applications with JAX and Diffusers, using Google Cloud TPU v4 accelerators. In this three week sprint, participants teamed up, came up with various project ideas, trained ControlNet models, and built applications based on them. The sprint resulted in 26 projects, accessible via a leaderboard here. These demos use Stable Diffusion (v1.5 checkpoint) initialized with ControlNet models. We worked with Google Cloud to provide access to TPU v4-8 hardware with 3TB storage, as well as NVIDIA A10G GPUs to speed up the inference in these applications.

Below, we showcase a few projects that stood out from the sprint, each of which anyone can try as a demo. When picking projects to highlight, we considered several factors:

  • How well-described are the models produced?
  • Are the models, datasets, and other artifacts fully open sourced?
  • Are the applications easy to use? Are they well described?

The projects were voted on by a panel of experts and the top ten projects on the leaderboard won prizes.

Control with SAM

One team used the state-of-the-art Segment Anything Model (SAM) output as an additional condition to control the generated images. SAM produces zero-shot segmentation maps with fine details, which helps extract semantic information from images for control. You can see an example below and try the demo here.

Screencap of the 'Control with SAM' project

Fusing MediaPipe and ControlNet

Another team used MediaPipe to extract hand landmarks to control Stable Diffusion. This application allows you to generate images based on your hand pose and prompt. You can also use a webcam to input an image. See an example below, and try it yourself here.

Screencap of a project fusing MediaPipe and ControlNet

Make-a-Video

At the top of the leaderboard is Make-a-Video, which generates video from a text prompt and a hint image. It is based on latent diffusion with temporal convolutions for video and attention. You can try the demo here.

Screencap of the 'Make-a-Video' project

Bootstrapping interior designs

The project that won the sprint is ControlNet for interior design. The application can generate an interior design based on a room image and a prompt. It can also perform segmentation and generation guided by image inpainting. See the application in inpainting mode below.

Screencap of a project using ControlNet for interior design

In addition to the projects above, many applications were built to enhance images, like this application to colorize grayscale images. You can check out the leaderboard to try all the projects.

Learning more about diffusion models

To kick off the sprint, we organized a three-day series of talks by leading scientists and engineers from Google, Hugging Face, and the open source diffusion community. We'd recommend that anyone interested in learning more about diffusion models and generative AI take a look at the recorded sessions below!

Tim Salimans (Google Research) speaking on Discrete Diffusion Models
You can watch all the talks from the links below.

You can check out the sprint homepage to learn more about the sprint.

Acknowledgements

We would like to thank Google Cloud for providing TPUs and storage to help make this great sprint happen, in particular Bertrand Rondepierre and Jonathan Caton for the hard work behind the scenes to get all of the Cloud TPUs allocated so participants had cutting-edge hardware to build on and an overall great experience. And also Andreas Steiner and Cristian Garcia for helping to answer questions in our Discord forum and for helping us make the training script example better. Their help is deeply appreciated.

By Merve Noyan and Sayak Paul – Hugging Face

Accelerate JAX models on Intel GPUs via PJRT

We are excited to announce the first PJRT plugin implementation in Intel Extension for TensorFlow, which seamlessly runs JAX models on Intel® GPU. The PJRT API simplified the integration, which allowed the Intel GPU plugin to be developed separately and quickly integrated into JAX. This same PJRT implementation also enables initial Intel GPU support for TensorFlow and PyTorch models with XLA acceleration.

Figure 1. Intel Data Center GPU Max Series

With the shared vision that modular interfaces make integration easier and enable faster, independent development, Intel and Google collaborated in developing the TensorFlow PluggableDevice mechanism. This is the supported way to extend TensorFlow to new devices and allows hardware vendors to release separate plugin binaries. Intel has continued to work with Google to build modular interfaces for the XLA compiler and to develop the PJRT plugin to run JAX workloads on Intel GPUs.

JAX

JAX is an open source Python library designed for complex numerical computations on high-performance computing devices like GPUs and TPUs. It supports NumPy functions and provides automatic differentiation as well as a composable function transformation system to build and train neural networks.
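
For readers new to JAX, the following minimal example (standard JAX, nothing Intel-specific) shows these pieces together: a NumPy-style function, automatic differentiation with grad, and compilation with jit.

import jax
import jax.numpy as jnp

# A NumPy-style function: mean squared error of a linear model.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# Composable transformations: grad gives the derivative with respect to w,
# and jit compiles the result via XLA for the available backend.
grad_fn = jax.jit(jax.grad(loss))

w = jnp.zeros(3)
x = jnp.ones((8, 3))
y = jnp.ones(8)
print(grad_fn(w, x, y))  # gradient of the loss with respect to w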

JAX uses XLA as its compilation and execution backend to optimize and parallelize computations, particularly on AI hardware accelerators. When a JAX program is executed, the Python code is transformed into OpenXLA’s StableHLO operations, which are then passed to PJRT for compilation and execution. Underneath, the StableHLO operations are compiled into machine code by the XLA compiler, which can then be executed on the target hardware accelerator.
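
You can observe this pipeline from the JAX side. The sketch below uses jax.jit(...).lower(...) to inspect the StableHLO module before PJRT compiles and executes it; the exact inspection APIs can vary slightly between JAX releases.

import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.dot(x, y) + 1.0

x = jnp.ones((2, 2))
y = jnp.ones((2, 2))

# Lower the Python function to StableHLO; PJRT then compiles and runs
# the result on the target backend (CPU, GPU, TPU, or a plugin device).
lowered = jax.jit(f).lower(x, y)
print(lowered.as_text())  # StableHLO / MLIR module for f

compiled = lowered.compile()
print(compiled(x, y))     # executed through the PJRT runtime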

PJRT

PJRT (used in conjunction with OpenXLA’s StableHLO) provides a hardware- and framework-independent interface for compilers and runtimes (recent announcement). The PJRT interface supports plugins for new device backends. It provides a means for straightforward integration of JAX into Intel's systems and enables JAX workloads on Intel GPUs. Through PJRT integration with various AI frameworks, Intel’s GPU plugin can deliver hardware acceleration and oneAPI optimizations to a wider range of developers using Intel GPUs.

The PJRT API is a framework-independent API that allows upper-level AI frameworks to compile and execute numeric computation represented in StableHLO on an AI hardware accelerator. It has been integrated with popular AI frameworks including JAX, TensorFlow (via TF-XLA), and PyTorch (via PyTorch/XLA), which enables hardware vendors to provide one plugin for their new AI hardware that all of these popular AI frameworks will support. It also provides low-level primitives for efficient interaction with upper-level AI frameworks, including zero-copy buffer donation and light-weight dependency management, which enables AI frameworks to best utilize hardware resources and achieve high-performance execution.

Figure 2. PJRT simplifies the integration of oneAPI on Intel GPU into AI Frameworks

PJRT Plugin for Intel GPU

The Intel GPU plugin implements the PJRT API by compiling StableHLO and dispatching the executable to Intel GPUs. The compilation is based on the XLA implementation, adding target-specific passes for Intel GPUs and leveraging oneAPI performance libraries for acceleration. Device execution is supported using the SYCL runtime. The Intel GPU plugin also implements device registration, enumeration, and SPMD execution mode.

PJRT’s high-level runtime abstraction allows the plugin to develop its own low-level device management modules and use the advanced runtime features provided by the new device. For example, the Intel GPU plugin takes advantage of the out-of-order queue feature provided by the SYCL runtime. Compared to fitting the plugin implementation to a low-level runtime interface, such as the stream executor C API used in PluggableDevice, implementing the PJRT runtime interface is straightforward and efficient.

It’s simple to get started using the Intel GPU plugin to run a JAX program, including JAX-based frameworks like Flax and T5X. Just build the plugin (example documentation) then set the environment variable and dependent library paths. JAX automatically looks for the plugin library and loads it into the current process.

Below are example code snippets of running JAX on an Intel GPU.

$ export PJRT_NAMES_AND_LIBRARY_PATHS='xpu:Your_itex_library/libitex_xla_extension.so'
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:Your_Python_site-packages/jaxlib
$ python
>>> import numpy as np
>>> import jax
>>> jax.local_devices()  # PJRT Intel GPU plugin loaded
[IntelXpuDevice(id=0, process_index=0), IntelXpuDevice(id=1, process_index=0)]
>>> x = np.random.rand(2,2).astype(np.float32)
>>> y = np.random.rand(2,2).astype(np.float32)
>>> z = jax.numpy.add(x, y)  # Runs on Intel XPU

This is the latest example of Intel AI tools and frameworks leveraging oneAPI software libraries to provide high performance on Intel GPU.

Future Work

This PJRT plugin for Intel GPUs has also been integrated into TensorFlow to run XLA supported ops in TensorFlow models. However, XLA has a smaller op set than TensorFlow. For many TensorFlow models in production, some parts of the model graph are executed with PJRT (XLA compatible) while other parts are executed with the classic TensorFlow runtime using TensorFlow OpKernel. This mixed execution model requires PJRT and TensorFlow OpKernel to work seamlessly with each other. The TensorFlow team has introduced the NextPluggableDevice API to enable this.

When using NextPluggableDevice API, PJRT manages all critical hardware states (e.g. allocator, stream manager, driver, etc) and NextPluggableDevice API allows hardware vendors to build new TensorFlow OpKernels that can access those hardware states via PJRT. PJRT and NextPluggableDevice API enable interoperability between classic TensorFlow runtime and XLA, allowing the XLA subgraph to produce a PJRT buffer and feed to TensorFlow and vice versa.

As a next step, Intel will continue working with Google to adopt the NextPluggableDevice API to implement non-XLA ops on Intel GPUs supporting all TensorFlow models.

Written in collaboration with Jianhui Li, Zhoulong Jiang, and Yiqiang Li from Intel.

By Jieying Luo, Chuanhao Zhuge, and Xiao Yu – Google

Google Open Source Peer Bonus program announces first group of winners for 2023



We are excited to announce the first group of winners for the 2023 Google Open Source Peer Bonus Program! This program recognizes external open source contributors who have been nominated by Googlers for their exceptional contributions to open source projects.

The Google Open Source Peer Bonus Program is a key part of Google's ongoing commitment to open source software. By supporting the development and growth of open source projects, Google is fostering a more collaborative and innovative software ecosystem that benefits everyone.

This cycle's Open Source Peer Bonus Program received a record-breaking 255 nominations, marking a 49% increase from the previous cycle. This growth speaks to the popularity of the program both within Google and the wider open source community. It's truly inspiring to see so many individuals dedicated to contributing their time and expertise to open source projects. We are proud to support and recognize their efforts through the Open Source Peer Bonus Program.

The winners of this year's Open Source Peer Bonus Program come from 35 different countries around the world, reflecting the program's global reach and the immense impact of open source software. Community collaboration is a key driver of innovation and progress, and we are honored to be able to support and celebrate the contributions of these talented individuals from around the world through this program.

In total, 203 winners were selected based on the impact of their contributions to the open source project, the quality of their work, and their dedication to open source. These winners represent around 150 unique open source projects, demonstrating a diverse range of domains, technologies, and communities. There are open source projects related to web development such as Angular, PostCSS, and the 2022 Web Almanac, and tools and libraries for programming languages such as Rust, Python, and Dart. Other notable projects include cloud computing frameworks like Apache Beam and Kubernetes, and graphics libraries like Mesa 3D and HarfBuzz. The projects also cover various topics such as security (CSP), testing (AFLPlusplus), and documentation (The Good Docs Project). Overall, it's an impressive list of open source projects from different areas of software development.

We would like to extend our congratulations to the winners! Included below are those who have agreed to be named publicly.

Winner – Open Source Project

Bram Stein – 2022 Web Almanac
Saptak Sengupta – 2022 Web Almanac
Thibaud Colas – 2022 Web Almanac
Andrea Fioraldi – AFLPlusplus
Marc Heuse – AFLplusplus
Joel Ostblom – Altair
Chris Dalton – ANGLE
Matthieu Riegler – Angular
Ryan Carniato – Angular
Johanna Öjeling – Apache Beam
Rickard Zwahlen – Apache Beam
Seunghwan Hong – Apache Beam
Claire McGinty – Apache Beam & Scio
Kellen Dye – Apache Beam & Scio
Michel Davit – Apache Beam & Scio
Stamatis Zampetakis – Apache Hive
Matt Casters – Apache Hop
Kevin Mihelich – Arch Linux ARM
Sergio Castaño Arteaga – Artifact Hub
Vincent Mayers – Atlanta Java Users Group
Xavier Bonaventura – Bazel
Jelle Zijlstra – Black
Clément Contet – Blockly
Yutaka Yamada – Blockly
Luiz Von Dentz – Bluez
Kate Gregory – Carbon Language
Ruth Ikegah – Chaoss
Dax Fohl – Cirq
Chad Killingsworth – closure-compiler
Yuan Li – Cloud Dataproc Initialization Actions
Manu Garg – Cloudprober
Kévin Petit – CLVK
Dimitris Koutsogiorgas – CocoaPods
Axel Obermeier – Conda Forge
Roman Dodin – Containerlab
Denis Pushkarev – core-js
Chris O'Haver – CoreDNS
Justine Tunney – cosmopolitan
Jakob Kogler – cp-algorithms
Joshua Hemphill – CSP (Content-Security-Policy)
Romain Menke – CSSTools’ PostCSS Plugins and Packages
Michael Sweet – CUPS
Daniel Stenberg – curl
Pokey Rule – Cursorless
Ahmed Ashour – Dart
Zhiguang Chen – Dart Markdown Package
Dmitry Zhifarsky – DCM
Mark Pearson – Debian
Salvatore Bonaccorso – Debian
Felix Palmer – deck.gl
Xiaoji Chen – deck.gl
Andreas Deininger – Docsy
Simon Binder – Drift
Hajime Hoshi – Ebitengine
Protesilaos Stavrou – Emacs modus-themes
Raven Black – envoy
Péter Szilágyi – ethereum
Sudheer Hebbale – evlua
Morten Bek Ditlevsen – Firebase SDK for Apple App Development
Victor Zigdon – Flashing Detection
Ashita Prasad – Flutter
Callum Moffat – Flutter
Greg Price – Flutter
Jami Couch – Flutter
Reuben Turner – Flutter
Heather Turner – FORWARDS
Donatas Abraitis – FRRouting/frr
Guillaume Melquiond – Gappa
Sam James – Gentoo
James Blair – Gerrit Code Review
Martin Paljak – GlobalPlatformPro
Jeremy Bicha – GNOME
Austen Novis – Goblet
Ka-Hing Cheung – goofys
Nicholas Junge – Google Benchmark
Robert Teller – Google Cloud VPC Firewall Rules
Nora Söderlund – Google Maps Platform Discord community and GitHub repositories
Aiden Grossman – google/ml-compiler-opt
Giles Knap – gphotos-sync
Behdad Esfahbod – HarfBuzz
Juan Font Alonso – headscale
Blaž Hrastnik – Helix
Paulus Schoutsen – home-assistant
Pietro Albini – Infrastructure team - Rust Programming Language
Eric Van Norman – Istio
Zhonghu Xu – Istio
Pierre Lalet – Ivre Rocks
Ayaka Mikazuki – JAX
Kyle Zhao – JGit | The Eclipse Foundation
Yuya Nishihara – jj (Jujutsu VCS)
Oscar Dowson – JuMP-dev
Mikhail Yakshin – Kaitai Struct
Daniel Seemaier – KaMinPar
Abheesht Sharma – KerasNLP
Jason Hall – ko
Jonas Mende – Kubeflow Pipelines Operator
Paolo Ambrosio – Kubeflow Pipelines Operator
Arnaud Meukam – Kubernetes
Patrick Ohly – Kubernetes
Ricardo Katz – Kubernetes
Akihiro Suda – Lima
Jan Dubois – Lima
Dongliang Mu – Linux Kernel
Johannes Berg – Linux Kernel
Mauricio Faria de Oliveira – Linux Kernel
Nathan Chancellor – Linux Kernel
Ondřej Jirman – Linux Kernel
Pavel Begunkov – Linux Kernel
Pavel Skripkin – Linux Kernel
Tetsuo Handa – Linux Kernel
Vincent Mailhol – Linux Kernel
Hajime Tazaki – Linux Kernel Library
Jonatan Kłosko – Livebook
Jonas Bernoulli – Magit
Henry Lim – Malaysia Vaccine Tracker Twitter Bot
Thomas Caswell – matplotlib
Matt Godbolt – mattgodbolt
Matthew Holt – mholt
Ralf Jung – Miri and Stacked Borrows
Markus Böck – mlir
Matt DeVillier – MrChromebox.tech
Levi Burner – MuJoCo
Hamel Husain – nbdev
Justin Keyes – Neovim
Wim Henderickx – Nephio
Janne Heß – nixpkgs
Martin Weinelt – nixpkgs
Brian Carlson – node-postgres
Erik Doernenburg – OCMock
Aaron Brethorst – OneBusAway for iOS, written in Swift.
Onur Mutlu – Onur Mutlu Lectures - YouTube
Alexander Alekhin – OpenCV
Alexander Smorkalov – OpenCV
Stafford Horne – OpenRISC
Peter Gadfort – OpenROAD
Christopher "CRob" Robinson – OpenSSF Best Practices WG
Arnaud Le Hors – OpenSSF Scorecard
Nate Wert – OpenSSF Scorecard
Kevin Thomas Abraham – Oppia
Praneeth Gangavarapu – Oppia
Mohit Gupta – Oppia Android
Jaewoong Eum – Orbital
Carsten Dominik – Org mode
Guido Vranken – oss-fuzz
Daniel Anderson – parlaylib
Richard Davey – Phaser
Juliette Reinders Folmer – PHP_CodeSniffer
Hassan Kibirige – plotnine
Andrey Sitnik – PostCSS · GitHub
Dominik Czarnota – pwndbg
Ee Durbin – PyPI
Adam Turner – Python PEPs
Peter Odding – python-rotate-backups
Christopher Courtney – QMK
Jay Berkenbilt – qpdf
Tim Everett – RTKLIB
James Higgins – Rust
Tony Arcieri – rustsec
Natsuki Natsume – Sass
Mohab Mohie – SHAFT
Cory LaViska – Shoelace
Catherine 'whitequark' – smoltcp
Kumar Shivendu – Software Heritage
Eriol Fox – SustainOSS
Richard Littauer – SustainOSS
Billy Lynch – Tekton
Trevor Morris – TensorFlow
Jiajia Qin – TensorFlow.js
Patty O'Callaghan – TensorFlow.js Training and Ecosystem
Luiz Carvalho – TEP-0084: End-to-end provenance in Tekton Chains
Hannes Hapke – TFX-Addons
Sakae Kotaro – The 2021 Web Almanac
Frédéric Wang – The Chromium Projects
Raphael Kubo da Costa – The Chromium Projects
Mengmeng Tang – The Good Docs Project
Ophy Boamah Ampoh – The Good Docs Project
Gábor Horváth – The LLVM Project
Shafik Yaghmour – The LLVM Project
Dave Airlie – The Mesa 3D Graphics Library
Faith Ekstrand – The Mesa 3D Graphics Library
Aivar Annamaa – Thonny
Lawrence Kesteloot – trs80
Josh Goldberg – TypeScript
Linus Seelinger – UM-Bridge
Joseph Kato – Vale.sh
Abdelrahman Awad – vee-validate
Maxi Ferreira – View Transitions API
Tim Pope – vim-fugitive
Michelle O'Connor – Web Almanac
Jan Grulich – WebRTC
Wez Furlong – WezTerm
Yao Wang – World Federation of Advertisers - Virtual People Core Serving
Duncan Ogilvie – x64dbg

We are incredibly proud of all of the nominees for their outstanding contributions to open source, and we look forward to seeing even more amazing contributions in the years to come.

By Maria Tabak, Google Open Source Peer Bonus Program Lead

Open sourcing our Rust crate audits

Many open-source projects at Google use Rust, a modern systems language designed for building reliable and efficient software. Google has been investing in the Rust community for a long time; we helped found the Rust Foundation, many Googlers work on upstream Rust as part of their job, and we financially support key Rust projects. Today, we're continuing our commitment to the open-source Rust community by aggregating and publishing audits for Rust crates that we use in open-source Google projects.

Rust makes it easy to encapsulate and share code in crates, which are reusable software components that are like packages in other languages. We embrace the broad ecosystem of open-source Rust crates, both by leveraging crates written outside of Google and by publishing several of our own.

All third-party code carries an element of risk. Before a project starts using a new crate, members usually perform a thorough audit to measure it against their standards for security, correctness, testing, and more. We end up using many of the same dependencies across our open-source projects, which can result in duplicated effort when several different projects audit the same crate. To de-duplicate that work, we've started sharing our audits across our projects. Now, we're excited to join other organizations in sharing them with the broader open-source community.

Our crate audits are continually aggregated and published on GitHub under our supply-chain repository. They work with cargo vet to mechanically verify that:

  • a human has audited all of our dependencies and recorded their relevant properties, and
  • those properties satisfy the requirements for the current project

You can easily import audits done by Googlers into your own projects that attest to the properties of many open-source Rust crates. Then, equipped with this data, you can decide whether crates meet the security, correctness, and testing requirements for your projects. Cargo vet has strong support for incrementally vetting your dependencies, so it's easy to introduce to existing projects.

Different use cases have different requirements, and cargo vet allows you to independently configure the requirements for each of your dependencies. It may be suitable to only check a local development tool for actively malicious code – making sure it doesn't violate privacy, exfiltrate data, or install malware. But code deployed to users usually needs to meet a much stricter set of requirements – making sure it won't introduce memory safety issues, uses up-to-date cryptography, and conforms to its standards and specifications. When consuming and sharing audits, it’s important to consider how your project’s requirements relate to the facts recorded during an audit.

We hope that by sharing our work with the open-source community, we can make the Rust ecosystem even safer and more secure for everyone. ChromeOS and Fuchsia have already begun performing and publishing their audits in the above-mentioned supply-chain repository, and other Google open-source projects are set to join them soon. As more projects participate and we work through our collective audit backlog, our audits will grow to provide even more value and coverage. We're still early on in performing and sharing our audits through cargo vet, and the tool is still under active development. The details are likely to change over time, and we're excited to evolve and improve our processes and tooling as they do. We hope you'll find value in the work Googlers have done, and join us in building a safer and more secure Rust ecosystem.

By David Koloski, Fuchsia and George Burgess, Chrome OS

Accelerate AI development for Digital Pathology using EZ WSI DICOMWeb Python library

Overview

Digital pathology is changing the way pathology is practiced by making it easier to share images, collaborate with colleagues, and develop new AI algorithms that can improve the quality and cost of medical care. One of the biggest challenges of digital pathology is storing and managing the large volume of data generated. The Google Cloud Healthcare API provides a solution for this with a managed DICOM store, which is a secure, scalable, and performant way to store digital pathology images in a manner that is both standardized and interoperable.

However, performing image retrieval of specific patches (i.e. regions of interest) of a whole slide image (WSI) from the managed DICOM store using DICOMweb can be complex and requires DICOM format expertise. To address this, we are open sourcing EZ WSI (Whole Slide Image) DICOMWeb, a Python library that makes fetching these patches both efficient and easy-to-use.

How EZ WSI DICOMWeb works

EZ WSI DICOMweb facilitates the retrieval of arbitrary and sequential patches of a DICOM WSI from a DICOMWeb compliant Google Cloud Healthcare API DICOM store. Unlike downloading the entire DICOM series WSI and extracting patches locally from that file, which can increase network traffic, latency and storage space usage, EZ WSI DICOMweb retrieves only the necessary tiles for the desired patch directly through the DICOMweb APIs. This is simpler to use and abstracts away the following:

  • The need to fetch many tiles, which requires an understanding of DICOM data structure (e.g. offset & data hierarchy).
  • The need for a detailed understanding of the DICOMWeb APIs, REST payloads, and authentication, as well as addressing the possibility of redundant requests if several patches are fetched and there are overlapping tiles.
  • The need to decode images on the server if client side decoding is not supported, which increases the time it takes to transfer data and the size of the data being transferred.

EZ WSI DICOMWeb allows researchers and developers to focus on their ML tasks rather than the intricacies of DICOM. Developers do not need to have an in-depth understanding of DICOM data structuring or the DICOM API. The library provides a simple and intuitive functionality that allows developers to efficiently fetch DICOM images using only the Google Cloud Platform (GCP) Resource Name and DICOM Series path without any pixel recompression.

Case Study: Generating Patches for AI Workflows

A typical pathology WSI could be on the order of 40,000 pixels in length or width. However, an AI model that is trained to assess that WSI may only analyze a patch that is 512 x 512 pixels at a time. The way the model can operate over the entire WSI is by using a sliding window approach. We demonstrate how that can be done using EZ WSI DICOMWeb.

First, we create a DicomSlide object using the DICOMweb client and interface. This can be done with just a few lines of code.

dicom_web_client = dicom_web.DicomWebClientImpl()
dwi = dicom_web_interface.DicomWebInterface(dicom_web_client)
ds = dicom_slide.DicomSlide(
    dwi=dwi,
    path=gcp_resource_name + dicom_series_path,
    enable_client_slide_frame_decompression=True,
)
ds.get_image(desired_magnification)  # e.g. '0.625X'

This DicomSlide represents the entire WSI, as illustrated below.

Image of a WSI at a magnification of 0.625X, rendered by matplotlib

The above image leverages EZ WSI’s DicomSlide module to fetch an entire WSI at the requested magnification of 0.625X and uses matplotlib to render it; see the sample code for more details.

By providing coordinates, DicomSlide’s get_patch() method allows us to manually extract just the two sections of tissue at a supported magnification, using the coordinates pictured below.

tissue_patch = ds.get_patch(
    desired_magnification,
    x=x_origin,
    y=y_origin,
    width=patch_width,
    height=patch_height,
)

Left tissue sample and right tissue sample at 0.625X magnification, rendered by matplotlib

We can effectively zoom in on patches programmatically by reducing the window size and increasing the magnification, using the same get_patch() method from above.
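
As a rough sketch, reusing the ds object and the get_patch() call shown above (the coordinate and size values below are placeholders, not taken from the sample code):

# Hypothetical values for a small region of interest; in practice these
# would come from the coordinates of a structure you want to inspect.
zoom_x, zoom_y = x_origin, y_origin
zoom_width, zoom_height = 256, 256

zoomed_patch = ds.get_patch(
    '40X',               # a higher magnification than the overview above
    x=zoom_x,
    y=zoom_y,
    width=zoom_width,    # a smaller window than the full-tissue patch
    height=zoom_height,
)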

Image of three panels showing the same patch of interest at 0.625X, 2.5X, and 40X magnification, rendered by matplotlib

Our ultimate goal is to generate a set of patches that can be used in a downstream AI application from this WSI.

image showing patch generation at 10X with 0.625X mask, rendered by matplotlib

To do this, we call PatchGenerator. It works by sliding a window of a specified size with a specified stride size across the image, heuristically ignoring tissue-less regions at a specified magnification level.

patch_gen = patch_generator.PatchGenerator(
    slide=ds,
    stride_size=stride_size,            # the number of pixels between patches
    patch_size=patch_size,              # the length and width of the patch in pixels
    magnification=patch_magnification,  # magnification to generate patches at
    max_luminance=0.8,                  # defaults to .8, heuristic to evaluate where tissue is
    tissue_mask_magnification=mask_magnification,
)

The result is a list of patches that can be used as input into a machine learning algorithm.

image showing patch generation at 40X with 0.625X mask, rendered by matplotlib

Conclusion

We have built this library to make it easy to directly interact with DICOM WSIs that are stored in Google's DICOMWeb compliant Healthcare API DICOM store and extract image patches for AI workflows. Our hope is that by making this available, we can help accelerate the development of cutting edge AI for digital pathology in Google Cloud and beyond.

Links: GitHub, GCP-DICOMWeb

By Google HealthAI and Google Cloud Healthcare teams

PJRT: Simplifying ML Hardware and Framework Integration

Infrastructure fragmentation in Machine Learning (ML) across frameworks, compilers, and runtimes makes developing new hardware and toolchains challenging. This inhibits the industry’s ability to quickly productionize ML-driven advancements. To simplify the growing complexity of ML workload execution across hardware and frameworks, we are excited to introduce PJRT and open source it as part of the recently available OpenXLA Project.

PJRT (used in conjunction with OpenXLA’s StableHLO) provides a hardware- and framework-independent interface for compilers and runtimes. It simplifies the integration of hardware with frameworks, accelerating framework coverage for the hardware, and thus hardware targetability for workload execution.

PJRT is the primary interface for TensorFlow and JAX and fully supported for PyTorch, and is well integrated with the OpenXLA ecosystem to execute workloads on TPU, GPU, and CPU. It is also the default runtime execution path for most of Google’s internal production workloads. The toolchain-independent architecture of PJRT allows it to be leveraged by any hardware, framework, or compiler, with extensibility for unique features. With this open-source release, we're excited to allow anyone to begin leveraging PJRT for their own devices.

If you’re developing an ML hardware accelerator or developing your own compiler and runtime, check out the PJRT source code on GitHub and sign up for the OpenXLA mailing list to quickly bootstrap your work.

Vision: Simplifying ML Hardware and Framework Integration

We are entering a world of ambient experiences where intelligent apps and devices surround us, from edge to the cloud, in a range of environments and scales. ML workload execution currently supports a combinatorial matrix of hardware, frameworks, and workflows, mostly through tight vertical integrations. Examples of such vertical integrations include specific kernels for TPU versus GPU, specific toolchains to train and serve in TensorFlow versus PyTorch. These bespoke 1:1 integrations are perfectly valid solutions but promote lock-in, inhibit innovation, and are expensive to maintain. This problem of a fragmented software stack is compounded over time as different computing hardware needs to be supported.

A variety of ML hardware exists today and hardware diversity is expected to increase in the future. ML users have options and they want to exercise them seamlessly: users want to train a large language model (LLM) on TPU in the Cloud, batch infer on GPU or even CPU, distill, quantize, and finally serve them on mobile processors. Our goal is to solve the challenge of making ML workloads portable across hardware by making it easy to integrate the hardware into the ML infrastructure (framework, compiler, runtime).

Portability: Seamless Execution

The workflow to enable this vision with PJRT is as follows (shown in Figure 1):

  1. The hardware-specific compiler and runtime provider implement the PJRT API, package it as a plugin containing the compiler and runtime hooks, and register it with the frameworks. The implementation can be opaque to the frameworks.
  2. The frameworks discover and load one or multiple PJRT plugins as dynamic libraries targeting the hardware on which to execute the workload.
  3. That’s it! Execute the workload from the framework onto the target hardware.

The PJRT API will be backward compatible. The plugin would not need to change often and would be able to do version-checking for features.

Diagram of PJRT architecture
Figure 1: To target specific hardware, provide an implementation of the PJRT API to package a compiler and runtime plugin that can be called by the framework.

Cohesive Ecosystem

As a foundational pillar of the OpenXLA Project, PJRT is well-integrated with projects within the OpenXLA Project including StableHLO and the OpenXLA compilers (XLA, IREE). It is the primary interface for TensorFlow and JAX and fully supported for PyTorch through PyTorch/XLA. It provides the hardware interface layer in solving the combinatorial framework x hardware ML infrastructure fragmentation (see Figure 2).

Diagram of PJRT hardware interface layer
Figure 2: PJRT provides the hardware interface layer in solving the combinatorial framework x hardware ML infrastructure fragmentation, well-integrated with OpenXLA.

Toolchain Independent

PJRT is hardware and framework independent. With framework integration through the self-contained IR StableHLO, PJRT is not coupled with a specific compiler, and can be used outside of the OpenXLA ecosystem, including with other proprietary compilers. The public availability and toolchain-independent architecture allows it to be used by any hardware, framework or compiler, with extensibility for unique features. If you are developing an ML hardware accelerator, compiler, or runtime targeting any hardware, or converging siloed toolchains to solve infrastructure fragmentation, PJRT can minimize bespoke hardware and framework integration, providing greater coverage and improving time-to-market at lower development cost.

Driving Impact with Collaboration

Industry partners such as Intel and others have already adopted PJRT.

Intel

Intel is leveraging PJRT in Intel® Extension for TensorFlow to provide the Intel GPU backend for TensorFlow and JAX. This implementation is based on the PJRT plugin mechanism (see RFC). Check out how this greatly simplifies the framework and hardware integration with this example of executing a JAX program on Intel GPU.

"At Intel, we share Google's vision of modular interfaces to make integration easier and enable faster, framework-independent development. Similar in design to the PluggableDevice mechanism, PJRT is a pluggable interface that allows us to easily compile and execute XLA's High Level Operations on Intel devices. Its simple design allowed us to quickly integrate it into our systems and start running JAX workloads on Intel® GPUs within just a few months. PJRT enables us to more efficiently deliver hardware acceleration and oneAPI-powered AI software optimizations to developers using a wide range of AI Frameworks." - Wei Li, VP and GM, Artificial Intelligence and Analytics, Intel.

Technology Leader

We’re also working with a technology leader to leverage PJRT to provide the backend targeting their proprietary processor for JAX. More details on this to follow soon.

Get Involved

PJRT is available on GitHub: source code for the API and a reference openxla-pjrt-plugin, and integration guides. If you develop ML frameworks, compilers, or runtimes, or are interested in improving portability of workloads across hardware, we want your feedback. We encourage you to contribute code, design ideas, and feature suggestions. We also invite you to join the OpenXLA mailing list to stay updated with the latest product and community announcements and to help shape the future of an interoperable ML infrastructure.

Acknowledgements

Allen Hutchison, Andrew Leaver, Chuanhao Zhuge, Jack Cao, Jacques Pienaar, Jieying Luo, Penporn Koanantakool, Peter Hawkins, Robert Hundt, Russell Power, Sagarika Chalasani, Skye Wanderman-Milne, Stella Laurenzo, Will Cromar, Xiao Yu.

By Aman Verma, Product Manager, Machine Learning Infrastructure

Google Summer of Code 2023 accepted contributors announced!

We are pleased to announce the Google Summer of Code (GSoC) Contributors for 2023. Over the last few weeks, our 171 mentoring organizations have read through applications, had discussions with applicants, and made the difficult decision of selecting the GSoC Contributors they will be mentoring this summer.

Some notable results from this year’s application period:
  • 43,765 applicants from 160 countries
  • 7,723 proposals submitted
  • 967 GSoC contributors accepted from 65 countries
  • Over 2,400 mentors and organization administrators

Over the next few weeks, our GSoC 2023 Contributors will be actively engaging with their new open source community and getting acclimated with their organization. Mentors will guide the GSoC Contributors through the documentation and processes used by the community, as well as help with planning their milestones and projects for the summer. This Community Bonding period helps familiarize the GSoC Contributors with the languages and tools they will need to successfully complete their projects. Coding begins May 29th and most folks will wrap up September 5th; however, for the second year in a row, GSoC Contributors can request a longer coding period and wrap up their projects by mid-November instead.

We’d like to express our appreciation to the thousands of applicants who took the time to reach out to our mentoring organizations and submit proposals. Through the experience of asking questions, researching, and writing your proposals we hope you all learned more about open source and maybe even found a community you want to contribute to outside of Google Summer of Code! We always say that communication is key, and staying connected with the community or reaching out to other organizations is a great way to set the stage for future opportunities. Open source communities are always looking for new and eager collaborators to bring fresh ideas to the table. We hope you connect with an open source community or even start your own open source project!

There are a handful of program changes to this 19th year of GSoC and we are excited to see how our GSoC Contributors and mentoring organizations take advantage of these adjustments. A big thank you to all of our mentors and organization administrators who make this program so special.

GSoC Contributors—have fun this summer and keep learning! Your mentors and community members have dozens, and in some cases hundreds, of years of combined experience. Let them share their knowledge with you and help you become awesome open source contributors!

By Perry Burnham, Associate Program Manager for the Google Open Source Programs Office