Category Archives: Open Source Blog

News about Google’s open source projects and programs

Introducing the Open Source Insights Project

Open Source Insights

Google has been working on software supply-chain security for many years, and transitive dependencies remain one of the most complex and least understood aspects. While we will be integrating this data into our Cloud and internal products in a variety of ways, we believe there is an immediate value in helping developers understand and visualize dependencies. Today, we are excited to share an exploratory visualization site: Open Source Insights, which provides an interactive view of the dependencies of open source projects.

Software development practices have evolved significantly over the last few years. Collaborative development with distributed feature development, consumption of open source and third-party packages, and publicly maintained software libraries have become commonplace, partly as a result of the widespread use of open source software. The advantages of open source are so clear that people and companies that would once have rejected OSS are now adopting it as a critical element of their environment.

But there are challenges brought by OSS too. The pace of change is electric, and it can be hard to keep up. The software packages that a large project depends on might update too frequently to keep a clear picture of what is happening. And those packages, in turn, can change their dependencies to provide new features or fix bugs. Security problems and other issues can arise unexpectedly in your project as a result, and the scale of the problem can make it all difficult to manage. Even a modest OSS project might depend on hundreds of packages.

There are tools to help, of course: vulnerability scanners and dependency audits that can help identify when a package is exposed to a vulnerability. But it can still be difficult to visualize the big picture, to understand what you depend on, and what that implies.

Open Source Insights provides a visualization of a project’s dependencies and their properties. Our exploratory website can be used to get an overview of how a particular software package is put together. Among other features, it provides interactive tools to visualize and analyze full, transitive dependency graphs. It also has a comparison tool to highlight how different versions of a package might affect your dependencies, perhaps by changing their own dependencies, adding licensing requirements, or fixing security problems.

Dependency graph for express 4.17.1

Open Source Insights shows you all this information about a package without asking you to install it first. You can see instantly what installing a package—or an updated version—might mean for your project, see how popular it is, find links to its source code and other information, and then decide whether it should be installed. Insights also helps you see the importance of your own project by showing the projects that depend on it: its dependents. Even a small project is important if a large number of other projects depend on it, either directly or through transitive dependencies.

Open Source Insights continuously scans millions of projects in the open source software ecosystem, gathering information about packages, including licensing, ownership, security issues, and other metadata such as download counts, popularity signals, and OpenSSF Scorecards. It then constructs a full dependency graph—transitively tracking dependencies, dependencies' dependencies, and so on—and incorporates the metadata, then publishes it so you can see how it all might affect your software. And the information it provides is continually updated.
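
To make the idea of "transitively tracking dependencies" concrete, here is a small, purely illustrative Go sketch (not the deps.dev implementation, and with made-up package names) that walks a dependency map breadth-first to collect the full transitive set:

package main

import "fmt"

// deps maps a package to its direct dependencies. The package names here
// are made up for illustration; they are not real registry data.
var deps = map[string][]string{
	"app@1.0.0":    {"web@4.17.1", "logger@2.0.0"},
	"web@4.17.1":   {"parser@1.2.0", "cookie@0.4.0"},
	"logger@2.0.0": {"colors@1.4.0"},
}

// transitive returns every package reachable from root, i.e. the full
// transitive dependency set, using a breadth-first traversal.
func transitive(root string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	var out []string
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		for _, dep := range deps[current] {
			if !seen[dep] {
				seen[dep] = true
				out = append(out, dep)
				queue = append(queue, dep)
			}
		}
	}
	return out
}

func main() {
	fmt.Println(transitive("app@1.0.0"))
	// Prints [web@4.17.1 logger@2.0.0 parser@1.2.0 cookie@0.4.0 colors@1.4.0]
}

Open Source Insights performs this kind of traversal across entire package ecosystems, resolving versions and attaching the metadata described above.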

Filtered dependency graph showing how eslint 7.27.0 depends on chalk 2.4.2 and 4.1.1

This information can help you visualize how software is put together, decide whether an update is worth doing, and figure out how to fix a problem.

Today, Open Source Insights supports npm, Maven, Go modules, and Cargo. While we work on adding additional packaging systems, we want to hear from you: How could this data fit into your development workflow? What would make it more useful? You can reach the team at deps.dev to share your thoughts; we'll be collecting feedback over the coming months and look forward to hearing your ideas on how best to improve supply-chain security.

Visit our website at deps.dev to try it out.

From the Google Cloud team: What this means for GCP’s open cloud

For users of open source software, this may be the first time you’re seeing dependency and vulnerability information in an organized and accessible way. If you’re using a managed service based on open source, it’s important to remember that you may not be affected by all vulnerabilities listed. Your provider may have taken steps to harden the products you use, and when a new vulnerability is disclosed, your provider may take responsibility for patching this on your behalf.

Google Cloud follows both of these practices to help users get the benefits of open cloud while prioritizing security. Multiple layers of hardening create defense in depth, which helps protect services like Google Kubernetes Engine (GKE), Cloud Run, and Cloud Functions from a container escape vulnerability. For components that are the user's responsibility, we're constantly rolling out new services—like GKE Autopilot—that automate these responsibilities.

We’re committed to protecting our customers, both through our patch rewards program and the recently launched cyber insurance partnership, the Risk Protection Program, which moves from shared responsibility to shared fate. We look forward to bringing our customers new information on their open source dependencies.

By Andrew Gerrand, Michael Goddard, Rob Pike and Nicky Ringland of the Open Source Insights Team

Google Summer of Code 2021 students are announced!

Thank you to all the students and recent graduates who applied for Google Summer of Code (GSoC) 2021 by submitting final project proposals! We are excited to announce that the 199 mentoring organizations have selected their students. Here are some notable results from this year’s application period:
  • 6,991 applications submitted by 4,795 students
  • Students from 103 countries applied
  • 1,292 students were selected from 69 countries
Starting today, participating students will be paired with a mentor to begin planning their projects and milestones. For the next few weeks (May 17–June 7), students will get acquainted with their mentor and start to engage with the project’s open source community. This Community Bonding period also allows students to familiarize themselves with the languages or tools needed to complete their projects. Coding then begins on June 7th, continuing through the summer until August 16th.

Though applications for GSoC have closed for 2021, there are other ways to pursue your interests in open source projects. Staying connected with the community or reaching out to other organizations is a good first step. Making contact with potential mentors and a software community sets the stage for future opportunities. A great resource is the student guide, which has suggestions on what to do if you weren’t selected for this year’s program. It also has a section on ‘Choosing an Organization’ which is helpful whether you would like to connect now with organizations on your own, or decide to apply to GSoC in the future—which we hope you do!

Here’s to the 17th year of Google Summer of Code!

By Romina Vicente, Project Coordinator for the Google Open Source Programs Office

Modernizing Oracle operations with Kubernetes and El Carro

Google Cloud is releasing El Carro, an open source tool to help you transform and modernize your Oracle database operations. El Carro implements the Kubernetes operator pattern to deliver automation for provisioning and ongoing operations like backups, patching, and high availability for databases running in hybrid and multi-cloud environments. And it does so using the same declarative syntax that DevOps teams are using to manage applications. With El Carro, users can choose to modernize and transform their database operations in place and benefit from a consistent management experience and hybrid and multi-cloud portability. Released under the Apache License 2.0, you are free to use El Carro in any Kubernetes environment—you are in control.

Containers and Kubernetes deliver portability on standardized infrastructure, and today Oracle supports databases running in containers; they've also released container build files, images, and Helm charts to simplify provisioning. What is missing for the next level of integration is support for lifecycle operations and an extension of the Kubernetes API to the primitives needed for database management.

In addition, fully managed or autonomous services for Oracle may not make available all the required features, such as Active Data Guard, Multitenant, and In-Memory, parameters/flags, versions, and patch levels. DBAs also find themselves locked out of many roles, including sysadmin and root. These restrictions make many cloud architects fall back to lift and shift Oracle databases onto infrastructure as a service offerings and miss out on opportunities to modernize and transform database operations. And with transactional databases growing in number and criticality, organizations are struggling to deliver innovation and modernization. Engineers are already busy keeping up with sprawl and mundane operational tasks while adhering to strict change management processes.

How do we solve this database operations gap?

El Carro solves this. It is built with scalability in mind, using the same container orchestration infrastructure, Kubernetes, that powers many businesses and is a top choice for modern architectures. Its open API allows you to manage your database configurations as declarative code, enabling CI/CD or Gitops workflows for auditability and control mechanisms. El Carro automates many database lifecycle operations, like backups, replication, and patching. And, when it distributes databases on the nodes of a cluster, it is aware of the priority and resource requirements of each database to optimize tight packing while respecting quality of service. Lastly, it helps DBAs by delivering automation without restrictions and leaving DBAs in full control over their systems. You can choose to let the operator drive for you, but you can also take over the steering wheel yourself at any time.

Because Kubernetes is now the standard for portable infrastructure automation and orchestration, engineers appreciate how Kubernetes abstracts complex problems into manageable infrastructure as code. Kubernetes can scale from small projects to large projects that support the infrastructure that powers Google products and services for billions of users around the world. Moreover, Google is pioneering the next generation of infrastructure as code, which we refer to as Configuration as Data, to declaratively establish a contract between developer intent and the runtime operation. According to the Cloud Native Survey 2020, two-thirds of respondents were either already running stateful workloads in production or were considering doing so within the next 12 months. We expect that datastores are going to drive the next wave of enterprise Kubernetes adoption.

A number of open source operators for databases, such as PostgreSQL, MySQL, and many others, have been released, are actively maintained by the community, and are popular among developers and architects looking for a hands-off approach to manage databases with their applications. El Carro extends the list of database operators to include Oracle.

What are we building with El Carro for Oracle databases?

The operator pattern emerged in late 2016 as an extension of the Kubernetes API and control loop aimed at automating more complicated and application-specific tasks that go beyond what the native Kubernetes objects provide.

El Carro implements a custom resource definition (CRD) that is tailored to database management. Users set and change attributes of the custom resource using the Kubernetes API the same way they do for built-in objects such as pods, deployments, or services. The El Carro controller observes changes to the CRs and compares the declared state with the current reality in the cluster, then makes the necessary changes. Those changes could affect the Kubernetes resources used by the database, such as persistent volumes or the pod itself, or result in calls issued via SQL or command-line tools to create and modify users or other database objects.
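
As a rough illustration of that reconcile loop, here is a minimal Go sketch; the types, fields, and actions are hypothetical stand-ins rather than El Carro's actual API:

package main

import "fmt"

// DatabaseSpec stands in for the declared (desired) state an admin writes
// into the custom resource; the fields are hypothetical.
type DatabaseSpec struct {
	Version string
	DiskGB  int
}

// DatabaseStatus stands in for the state the controller observes in the cluster.
type DatabaseStatus struct {
	Version string
	DiskGB  int
}

// reconcile compares declared state with observed state and returns the
// actions a controller would take to converge them. A real operator would
// create or patch pods and persistent volumes, or issue SQL, instead of
// returning strings.
func reconcile(spec DatabaseSpec, status DatabaseStatus) []string {
	var actions []string
	if status.Version != spec.Version {
		actions = append(actions, "patch database from "+status.Version+" to "+spec.Version)
	}
	if status.DiskGB < spec.DiskGB {
		actions = append(actions, fmt.Sprintf("resize persistent volume to %d GB", spec.DiskGB))
	}
	return actions
}

func main() {
	spec := DatabaseSpec{Version: "18c", DiskGB: 200}
	status := DatabaseStatus{Version: "12c", DiskGB: 100}
	for _, action := range reconcile(spec, status) {
		fmt.Println(action)
	}
}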

Here’s a look at how this works:
El Carro Architecture

The diagram above shows how the major components of a database managed by the El Carro Operator interact with each other. The controller monitors the CRD for any changes made by admins. It creates and manages the cluster resources that make up the actual database deployment: persistent volumes for filesystems and data, a pod to run containers with the actual database, and a daemon that allows the controller to securely run SQL commands on the database. And lastly, a service makes sqlnet connections available to applications and end users that can either run in the same Kubernetes cluster or outside of it.

At release time, the El Carro Operator can provision Oracle 12c Enterprise Edition and 18c Express Edition databases. It manages instance parameters, pluggable databases, and users. You can take and restore backups using either rman or storage snapshots, and we are working to add additional features.

How to get involved with El Carro?

In the development process, we collaborated with users and partners in the Oracle community to help us validate the approach. "Pythian has helped Oracle users to automate and optimize the operations of their mission-critical systems for over 20 years,” says Simon Pane, principal consultant at Pythian. “We are excited about the possibilities that El Carro brings to users on their cloud modernization journeys. We are proud to work with the community on a vision for the future of database management."

Sean Scott covers Docker for databases on his blog oraclesean.com, and says: "There are many benefits to running Oracle databases in containers. Adding Kubernetes orchestration introduces new opportunities to bring the DevOps and Oracle communities together."

You can try out El Carro today. Follow the quick start guide and try provisioning instances, databases, and users. Import data via Data Pump, manage instance parameters, choose between different methods for backups, and try out a restore. Have a look at how we integrate with external logging and monitoring solutions. Reach out via our Google group and leave feedback on which features you would like to see next, or even create your own patch and pull request on GitHub.

By Bjoern Rost - Product Manager and Boris Dali - Team Lead, Engineering

Season of Docs announces participating organizations for 2021

Season of Docs has announced the participating open source organizations for 2021! You can view the list of participating organizations on the website.

During the documentation development phase, which runs from now until November 16, 2021, each accepted organization will work with the technical writer they hire to complete their documentation project.

For more information about the documentation development phase, visit the organization administrator guide on the website.

What is Season of Docs?

Season of Docs supports documentation in open source by:
  • Providing funds to open source organizations to use for documentation projects
  • Providing guides and support for open source organizations to help them understand their documentation needs
  • Collecting data from open source organizations to better understand documentation impact
  • Publishing case studies from open source organizations to share best practices
Season of Docs seeks to empower open source organizations to understand their documentation needs, to create documentation to fill those needs, to measure the effect and impact of their documentation, and, in the spirit of open source, share what they've learned to help guide other projects. Season of Docs also seeks to bring more technical writers into open source through funding their work with open source projects and organizations.

How do I take part in Season of Docs as a technical writer?

Technical writers interested in working with accepted open source organizations can share their contact information via the Season of Docs GitHub repository; or they may submit a statement of interest directly to the organizations. Technical writers do not need to submit a formal application through Season of Docs. We recommend technical writers reach out to organizations before submitting a statement of interest to discuss the project they’ll be working on and gain a better understanding of the organization.

Organizations must hire technical writers by May 17, 2021 at 18:00 UTC, so technical writers should begin reaching out as soon as possible.

Will technical writers be paid while working with organizations accepted into Season of Docs?

Yes. Participating organizations will transfer funds directly to the technical writer. Technical writers should review the organization's proposed project budgets and discuss their compensation and payment schedule with the organization prior to hiring. Check out our technical writer payment process guide for more details.

If you have any questions about the program, please email us at [email protected].

General timeline

  • May 17: Technical writer hiring deadline
  • June 16: Organization administrators begin reporting on their project status via monthly evaluations.
  • November 30: Organization administrators submit their case study and final project evaluation.
  • December 14: Google publishes the 2021 case studies and aggregate project data.
  • May 2, 2022: Organizations begin to participate in post-program followup surveys.

See the full timeline for details.

Care to join us?

Explore the Season of Docs website at g.co/seasonofdocs to learn more about the program. Use our logo and other promotional resources to spread the word. Examine the timeline, check out the FAQ, and reach out to organizations now!

By Kassandra Dhillon and Erin McKean, Google Open Source Programs Office

Actuating Google Production: How Google’s Site Reliability Engineering Team Uses Go

Google runs a small number of very large services. Those services are powered by a global infrastructure covering everything a developer needs: storage systems, load balancers, network, logging, monitoring, and much more. Nevertheless, it is not a static system—it cannot be. Architecture evolves, new products and ideas are created, new versions must be rolled out, configs pushed, database schemas updated, and more. We end up deploying changes to our systems dozens of times per second.

Because of this scale and critical need for reliability, Google pioneered Site Reliability Engineering (SRE), a role that many other companies have since adopted. “SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity.” - Site Reliability Engineering (SRE).

Go Gopher logo
Credit to Renee French for the Go Gopher

In 2013-2014, Google’s SRE team realized that our approach to production management was not cutting it anymore in many ways. We had advanced far beyond shell scripts, but our scale had so many moving pieces and complexities that a new approach was needed. We determined that we needed to move toward a declarative model of our production, called "Prodspec," driving a dedicated control plane, called "Annealing."

When we started those projects, Go was just becoming a viable option for critical services at Google. Most engineers were more familiar with Python and C++, either of which would have been valid choices. Nevertheless, Go captured our interest. The appeal of novelty was certainly a factor, of course. But, more importantly, Go promised a sweet spot between performance and readability that neither of the other languages was able to offer. We started a small experiment with Go for some initial parts of Annealing and Prodspec. As the projects progressed, those initial parts written in Go found themselves at the core. We were happy with Go—its simplicity grew on us, the performance was there, and concurrency primitives would have been hard to replace.

At no point was there ever a mandate or requirement to use Go, but we had no desire to return to Python or C++. Go grew organically in Annealing and Prodspec. It was the right choice, and thus is now our language of choice. Now the majority of Google production is managed and maintained by our systems written in Go.

The power of having a simple language in those projects is hard to overstate. There have been cases where some feature was indeed missing, such as the ability to enforce in the code that some complex structure should not be mutated. But for each one of those cases, there have undoubtedly been tens or hundreds of cases where the simplicity helped.

For example, Annealing impacts a wide variety of teams and services, meaning that we relied heavily on contributions from across the company. The simplicity of Go made it possible for people outside our team to see why some part or another was not working for them, and often to provide fixes or features themselves. This allowed us to grow quickly.

Prodspec and Annealing are in charge of some quite critical components. Go's simplicity means that the code is easy to follow, whether spotting bugs during review or trying to determine exactly what happened during a service disruption.

Go's performance and concurrency support have also been key for our work. As our model of production is declarative, we tend to manipulate a lot of structured data, which describes what production is and what it should be. We have large services, so the data can grow large, often making purely sequential processing not efficient enough.

We manipulate this data in many ways and in many places. It is not a matter of having a smart person come up with a parallel version of our algorithm; it is a matter of casual parallelism: finding the next bottleneck and parallelizing that code section. And Go enables exactly that.
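
As an illustration of that kind of casual parallelism, here is a small, self-contained Go sketch (not Prodspec or Annealing code) that takes a sequential loop over items and fans the work out across a bounded number of goroutines:

package main

import (
	"fmt"
	"strings"
	"sync"
)

// process stands in for whatever per-item work turned out to be the
// bottleneck (validating a config, expanding a template, and so on).
func process(item string) string {
	return strings.ToUpper(item)
}

func main() {
	items := []string{"job-a", "job-b", "job-c", "job-d"}
	results := make([]string, len(items))

	var wg sync.WaitGroup
	sem := make(chan struct{}, 2) // allow at most 2 items in flight
	for i, item := range items {
		wg.Add(1)
		go func(i int, item string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = process(item)
		}(i, item)
	}
	wg.Wait()

	fmt.Println(results) // each result lands at its original index
}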

As a result of our success with Go, we now use Go for every new development for Prodspec and Annealing. In addition to the SRE team, engineering teams across Google have adopted Go in their development process. Read about how the Core Data Solutions, Firebase Hosting, and Chrome teams use Go to build fast, reliable, and efficient software at scale.

By Pierre Palatin, Site Reliability Engineer

Logica: organizing your data queries, making them universally reusable and fun

We present Logica, a novel open source logic programming language. A successor to Yedalog (a language developed earlier at Google), it is a Datalog-like logic programming language. Logica code compiles to SQL and runs on Google BigQuery (with experimental support for PostgreSQL and SQLite), but it is much more concise and supports the clean and reusable abstraction mechanisms that SQL lacks. It supports modules and imports, can be used from an interactive Python notebook, and even makes testing your queries natural and easy.

“Data is the new oil”, they say, and SQL is so far the lingua franca for working with data. When SQL (or “Structured English Query Language”, as it was first named) was invented in the 1970s, its authors might not have imagined the popularity that it would reach half a century later. Today, systems ranging from tiny smart watch applications to enterprise IT solutions, read and write their data using SQL. Even the browser that you are using to read this post now might have a working built-in SQL database in it.

Despite the widespread adoption, SQL is not flawless. Constructing statements from long chains of English words (which are often capitalized to keep the old-fashioned COBOL spirit of the 70s alive!) can be very verbose—a single query spanning hundreds of lines is a routine occurrence. The main flaw of SQL, however, lies in its very limited support for abstraction.

Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.

This inherent resistance to decomposition of logic into bite-sized pieces is what leads to contrived, lengthy queries, copy-pasted chunks of code, and, eventually, unmaintainable, unstructured (note the irony) SQL codebases. To make things worse, SQL code is rarely tested, because “testing SQL queries” sounds rather esoteric to most engineers, at best. Because of that, a number of alternative query languages and libraries have been developed. Of those, systems based on logic programming perhaps come the closest to addressing SQL’s limitations.

Logic programming languages solve problems of SQL by using syntax of mathematical propositional logic rather than natural English language. The language of formal logic was designed by mathematicians specifically to make expression of complex statements easier and suits this purpose much better than natural language. Logica extends classical Logic programming syntax further, most notably with aggregation, hence the name, which stands for

Logica = Logic + Aggregation.

Let us see how it all works. SQL operates with relations, which are sets of rows. In logic programming the analog of a relation is a predicate. While a predicate is a set of rows, we think of it as a logical condition, which describes the rows of a relation. Here is, for example, the definition of a simple predicate:

MagicNumber(x: 2);

MagicNumber(x: 3);

MagicNumber(x: 5);

The definition claims that the condition MagicNumber(x) must hold when X is precisely either 2, 3, or 5. That means, if we were to query this predicate (i.e. request all values of X that satisfy it), the output should be a “relation” with a single column X and rows 2, 3, and 5. The SQL equivalent would be:

SELECT 2 AS x

UNION ALL

SELECT 3 AS x

UNION ALL

SELECT 5 AS x;

Rather than listing the individual values, we could have defined the predicate by encoding a logical condition upon X as follows:

MagicNumber(x:) :-

  x in [2, 3, 5];

Now, here is where the magic starts. Firstly, any table in your database is itself already a predicate, so the following definition:

MagicComment(comment_text:) :-

 `comments`(user_id:, comment_text:),

 user_id == 5;

defines a predicate MagicComment, which includes precisely those comment_text values that are present in the comments table where user_id == 5. In SQL this would read:

SELECT comment_text FROM comments WHERE user_id = 5;

Observe what happens if we replace the condition “user_id == 5” in our predicate with MagicNumber(x: user_id):

MagicComment(comment_text:) :-

 `comments`(user_id:, comment_text:),

 MagicNumber(x: user_id);

Here, we are querying for comments of users whose ID is one of the “magic numbers” we just defined above. Note how easily we could reuse a previously defined piece of code without having to copy anything around. We could now even extract MagicNumber to a common module and import it wherever it is needed:

import my_project.magic.MagicNumber;

As a final example, let us mock the comments table in a unit test of a query.

import my_project.magic.MagicComment;


MockComments(user_id: 1, comment_text: "Hello");

MockComments(user_id: 2, comment_text: "Logic");

MockComments(user_id: 3, comment_text: "Programming");


MagicCommentTest := MagicComment(`comments`: MockComments);

If we query the MagicComment predicate here, it will not try to read the comments table in the database. Instead, it will use the predicate we just defined, thus letting us verify its correctness by testing the output (it must include two rows “Logic” and “Programming”). Observe how natural and frictionless many of the good programming practices become with Logica, and compare that to what you would have to do to achieve the same using bare SQL.

There is much more to Logica, so make sure you give it a try—chances are, you will love it! Start with this tutorial to learn Logica. Even if you do not end up using it in your next project, learning a new powerful language may open your mind to new ideas and perspectives on data processing and computing in general.

The simple examples above are only a small sample of how much more concise Logica code can be than SQL for complex queries. In particular, we did not even touch the topic of aggregations in this article. For all of this, see the examples section of the Logica open source repository.

We also hope that some of the readers consider contributing to Logica development. That’s what open source is all about!

By Konstantin Tretyakov and Evgeny Skvortsov – Logica Open Source Project

Announcing the First Group of Google Open Source Peer Bonus winners in 2021!

 

Google Open Source Peer Bonus logo


The Google Open Source Peer Bonus program is designed to reward external open source contributors nominated by Googlers for their exceptional contributions to open source. We are very excited to announce our first group of winners in 2021!

Our current winners have contributed to a wide range of projects including Apache Beam, Kubernetes, Tekton and many others. We reward open source enthusiasts not only for their code contributions, but also community work, documentation, mentorships and other types of engagement.

We have award recipients from 25 countries all over the world: Austria, Canada, China, Cyprus, Denmark, Finland, France, Germany, India, Isle of Man, Italy, Japan, Korea, Netherlands, Norway, Russia, Singapore, Spain, Sweden, Switzerland, Uganda, Taiwan, Ukraine, United Kingdom, and the United States.

Open source encourages innovation through collaboration, and our modern world and the technology we rely on wouldn’t be the same without you, the contributors, many of whom are volunteers. We would like to thank you for your hard work and congratulate you on receiving this award!

Below is the list of current winners who gave us permission to thank them publicly:

Winner – Project
Kashyap Jois – Android FHIR SDK
David Allison – AnkiDroid
Chad Dombrova – Apache Beam
Jeff Klukas – Apache Beam
Steve Niemitz – Apache Beam
Yoshiki Obata – Apache Beam
Jaskirat Singh – CHAOSS - Community Health Analytics Open Source Software
Eric Amorde – CocoaPods
Subrata Banik – coreboot
Ned Batchelder – Coverage.py & related CPython internals
Matthew Bryant – CursedChrome
Simon Legner – devdocs.io
Dmitry Gutov – Emacs/company-mode
Brian Jost – Firebase
Joe Hinkle – Firebase iOS SDK
Lorenzo Fiamigo – Firebase iOS SDK
Mike Gerasymenko – Firebase iOS SDK
Morten Bek Ditlevsen – Firebase iOS SDK
Angel Pons – Flashrom
Ole André Vadla Ravnås – Frida
Junegunn Choi – fzf
Alex Saveau – Gradle Play Publisher
Nate Graham – KDE
Amit Sagtani – KDE Community
Niklas Hansson – Kubeflow Pipelines
William Teo – Kubeflow Pipelines
Antonio Ojea – Kubernetes
Dan Mangum – Kubernetes
Jian Zeng – Kubernetes
Darrell Commander – libjpeg-turbo
James (purpleidea) – mgmt
Kareem Ergawy – MLIR
Lily Ballard – Nix / Fish
Eelco Dolstra – Nix, NixOS, Nixpkgs
Samuel Dionne-Riel – NixOS
Dmitry Demensky – Open source TypeScript definitions for Google Maps Platform
Kay Williams – OpenSSF
Hassan Kibirige – plotnine
Henry Schreiner – pybind11
Paul Moore – Python 'pip' project
Tzu-ping Chung – Python 'pip' project
Alex Grönholm – Python 'wheel' project
Ramon Santamaria – raylib
Alexander Weiss – restic
Michael Eischer – restic
Ben Lesh – rxjs
Takeshi Nakatani – s3fs
Daniel Wee Soong Lim – SymbiFlow
Unai Martinez-Corral – SymbiFlow, Surelog, Verible, more
Andrea Frittoli – Tekton
Priti Desai – Tekton
Vincent Demeester – Tekton
Chengyu Zhang – testsmt & testsmt/yinyang
Dominik Winterer – testsmt & testsmt/yinyang
Tom Rini – U-Boot

Thank you for your contributions to open source!

By Maria Tabak — Google Open Source Programs Office


Analyzing genomic data in families with deep learning

The Genomics team at Google Health is excited to share our latest expansion to DeepVariant - DeepTrio.

First released in 2017, DeepVariant is an open source tool that enables researchers and clinicians to analyze an individual’s genome sequencing data and identify genetic variants, such as those that may cause disease. Our continued work on DeepVariant has been recognized for its top-of-class accuracy. With DeepTrio, we have expanded DeepVariant to be able to consider the genetic variants in the sequence data of a mother-father-child trio.

Humans are diploid organisms, carrying two copies of the human genome. Every individual inherits one copy of the genome from their mother, and the other from their father. Parental inheritance informs analysis of traits and diseases that follow Mendelian inheritance. DeepTrio learns to use the properties of Mendelian inheritance directly from sequencing data in order to more accurately identify genetic variants in cases where parent and child samples can be co-analyzed.

Modifying DeepVariant to analyze trio samples

DeepVariant learns to classify positions in a genome as reference or variant using representations of data similar to the “genome browser” which experts use in analysis. “Improving the Accuracy of Genomic Analysis with DeepVariant 1.0” provides a good overview.

DeepVariant receives data as a window of the genome centered on a candidate variant which it is asked to classify as either reference (no variant), heterozygous (one copy of a variant) or homozygous (both copies are variant). DeepVariant sees the sequence evidence as channels representing features of the data (see: “Looking through DeepVariant’s eyes” for a deeper explanation).

For DeepTrio, we represent the sequence data from a trio in a single image, with a fixed height for each sample and the child in the middle. Using gold standard samples from NIST Genome in a Bottle for truth labels, we train one model to call variants in the child and another to call variants in the top parent. To call both parents, we flip the position of the parent samples.

Figure 1. (top) An image of 4 of the channels that DeepTrio uses in classification (these and 4 other channels are shown in a stack). (bottom) Conceptual schematic of how trio files are used to create examples, which are then called by DeepTrio.

Measuring DeepTrio’s improved accuracy

We show that DeepTrio is more accurate than DeepVariant for both parent and child variant detection, with an especially pronounced advantage at lower coverages. This enables researchers to either analyze samples at higher accuracy, or to maintain comparable accuracy at a substantially reduced expense.

To assess the accuracy of DeepTrio, we compare its accuracy to DeepVariant using extensively characterized gold standards made available by NIST Genome in a Bottle. In order to have an evaluation dataset which is never seen in training, we exclude chromosome 20 from training and perform evaluations on chromosome 20.

We train DeepVariant and DeepTrio on sequencing data from two different instruments, Illumina and Pacific Biosciences (PacBio); for more information on the differences between these technologies, please see our previous blog post. These sequencers both randomly sample the genome in an error-prone manner. To accurately analyze a genome, the same region needs to be sampled repeatedly. The depth of sampling at a position is called coverage. Sequencing to greater coverage is more expensive in an approximately linear manner, which often forces trade-offs between cost, accuracy, and the number of samples sequenced. As a result, parents in trios are often sequenced at lower depth.
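
To make the notion of coverage concrete, here is a small, purely illustrative Go sketch that counts how many simulated reads overlap each position of a short reference region:

package main

import "fmt"

// read is a simulated sequencing read aligned to the reference,
// given as a half-open interval [start, end).
type read struct{ start, end int }

// coverage returns, for each position in a region of the given length,
// how many reads cover that position.
func coverage(length int, reads []read) []int {
	cov := make([]int, length)
	for _, r := range reads {
		for p := r.start; p < r.end && p < length; p++ {
			if p >= 0 {
				cov[p]++
			}
		}
	}
	return cov
}

func main() {
	reads := []read{{0, 6}, {2, 8}, {4, 10}}
	fmt.Println(coverage(10, reads))
	// Prints [1 1 2 2 3 3 2 2 1 1]: an average of 1.8x across this region.
}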

In the charts below, we plot the accuracy of DeepTrio and DeepVariant across a range of coverages.

DeepTrio child accuracy

DeepTrio parent accuracy

Figure 2. F1-score for DeepTrio (solid line) and DeepVariant (dashed line) on a child sample (top) and a parent sample (bottom), sequenced with an Illumina (blue) and PacBio (black) instrument. F1 is measured for all types of small variants on chromosome 20, across samples with a range of sequencing coverage (x-axis).

DeepTrio’s performance on de novo variants

Each individual has roughly 5 million variants relative to the human reference genome. The overwhelming majority of these are inherited from their parents. A small number, around 100, are new (referred to as de novo), due to copying errors during DNA replication. We demonstrate that DeepTrio substantially reduces false positives for de novo variants. For Illumina data, this comes with a smaller decrease in recovery of true positives, while for PacBio data, this trade-off does not occur.

To assess accuracy we analyzed sites where both parents are called as non-variant, but the child is called as heterozygous variant. We observe that DeepTrio is more reluctant to call a variant as de novo, which is similar to how a human would require a higher level of evidence for sites violating Mendelian inheritance. This results in a much lower false positive rate for these de novo variants, but a slightly lower recall rate in DeepTrio Illumina. Usually when this occurs, the child is still called as a variant, but the parents are given “no-call” (the classifier is not confident enough to make a call).
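
The de novo logic described above can be illustrated with a tiny, hypothetical filter (a simplification, not DeepTrio's classifier): a site is a de novo candidate only when the child call is heterozygous and both parents are confidently called as non-variant:

package main

import "fmt"

// Genotype is a simplified call at a single site.
type Genotype int

const (
	NoCall Genotype = iota // classifier not confident enough to make a call
	HomRef                 // reference on both copies
	Het                    // one variant copy
	HomAlt                 // variant on both copies
)

// isDeNovoCandidate flags sites where the child carries a variant that
// neither parent was confidently called with. Sites where a parent is
// NoCall are excluded, mirroring the cautious behavior described above.
func isDeNovoCandidate(child, mother, father Genotype) bool {
	return child == Het && mother == HomRef && father == HomRef
}

func main() {
	fmt.Println(isDeNovoCandidate(Het, HomRef, HomRef)) // true: candidate de novo variant
	fmt.Println(isDeNovoCandidate(Het, NoCall, HomRef)) // false: one parent was not confidently called
	fmt.Println(isDeNovoCandidate(Het, Het, HomRef))    // false: likely inherited from the mother
}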


Figure 3. Accuracy on de novo calls (child heterozygous variant, parents reference call) for recall of true de novo events (top) and false positive de novo events (bottom) for DeepTrio (solid line) and DeepVariant (dashed line) on Illumina (blue) and PacBio (black). Accuracy is measured on chromosome 20, across samples with a range of sequencing coverage (x-axis).

Contributing to rare disease research

By releasing DeepTrio as open source software, we hope to improve analysis of genomic data by allowing scientists to more accurately analyze samples. We hope this will enable research and clinical pipelines, leading to better resolution of rare disease cases, and improve the development of therapeutics.

In addition to the release of DeepTrio’s code as open source, we have also released the sequencing data that we generated in order to train these models. That data is described in our pre-print “An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development”. By releasing both this production model, and the data required to train models of similar complexity, we hope to contribute to methods development by the genomics community.

By Andrew Carroll, Product Lead Genomics and Howard Yang, Program Manager Genomics — Google Health

Lyra – enabling voice calls for the next billion users

 

Lyra Logo

The past year has shown just how vital online communication is to our lives. Never before has it been more important to clearly understand one another online, regardless of where you are and whatever network conditions are available. That’s why in February we introduced Lyra: a revolutionary new audio codec using machine learning to produce high-quality voice calls.

As part of our efforts to make the best codecs universally available, we are open sourcing Lyra, allowing other developers to power their communications apps and take Lyra in powerful new directions. This release provides the tools needed for developers to encode and decode audio with Lyra, optimized for the 64-bit ARM Android platform, with development on Linux. We hope to expand this codebase and develop improvements and support for additional platforms in tandem with the community.

The Lyra Architecture

Lyra’s architecture is separated into two pieces, the encoder and decoder. When someone talks into their phone, the encoder captures distinctive attributes from their speech. These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network. It is the decoder’s job to convert the features back into an audio waveform that can be played out over the listener’s phone speaker. The features are decoded back into a waveform via a generative model. Generative models are a particular type of machine learning model well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal.

Lyra Architecture Chart

The Impact

While mobile connectivity has steadily increased over the past decade, the explosive growth of on-device compute power has outstripped access to reliable high-speed wireless infrastructure. For regions where this contrast exists—in particular developing countries where the next billion internet users are coming online—the promise that technology will enable people to be more connected has remained elusive. Even in areas with highly reliable connections, the emergence of work-from-anywhere and telecommuting has further strained mobile data limits. While Lyra compresses raw audio down to 3 kbps with quality that compares favorably to other codecs, such as Opus, it does not aim to be a complete replacement for them, but it can save meaningful bandwidth in these kinds of scenarios.
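
As a back-of-the-envelope illustration of what 3 kbps means per packet, given the 40 ms feature chunks described above, the following Go snippet works out the arithmetic (illustrative only, not the Lyra API):

package main

import "fmt"

func main() {
	const bitrateBitsPerSecond = 3000 // Lyra's roughly 3 kbps operating point
	const frameMilliseconds = 40      // features are extracted in 40 ms chunks

	bitsPerFrame := bitrateBitsPerSecond * frameMilliseconds / 1000 // 120 bits
	bytesPerFrame := bitsPerFrame / 8                               // 15 bytes
	framesPerSecond := 1000 / frameMilliseconds                     // 25 frames per second

	fmt.Printf("%d bits (%d bytes) per %d ms frame, %d frames per second\n",
		bitsPerFrame, bytesPerFrame, frameMilliseconds, framesPerSecond)
}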

These trends provided motivation for Lyra and are the reason our open source library focuses on its potential for real time voice communication. There are also other applications we recognize Lyra may be uniquely well suited for, from archiving large amounts of speech, and saving battery by leveraging the computationally cheap Lyra encoder, to alleviating network congestion in emergency situations where many people are trying to make calls at once. We are excited to see the creativity the open source community is known for applied to Lyra in order to come up with even more unique and impactful applications.

The Open Source Release

The Lyra code is written in C++ for speed, efficiency, and interoperability, using the Bazel build framework with Abseil and the GoogleTest framework for thorough unit testing. The core API provides an interface for encoding and decoding at the file and packet levels. The complete signal processing toolchain is also provided, which includes various filters and transforms. Our example app integrates with the Android NDK to show how to integrate the native Lyra code into a Java-based Android app. We also provide the weights and vector quantizers that are necessary to run Lyra.

We are releasing Lyra as a beta version today because we want to get it into developers' hands and gather feedback as soon as possible. As a result, we expect the API and bitstream to change as development continues. All of the code for running Lyra is open sourced under the Apache license, except for a math kernel, for which a shared library is provided until we can implement a fully open solution across more platforms. We look forward to seeing what people do with Lyra now that it is open sourced. Check out the code and demo on GitHub, let us know what you think, and tell us how you plan to use it!

By Andrew Storus and Michael Chinen – Chrome

Acknowledgements

The following people helped make the open source release possible:
Yero Yeh, Alejandro Luebs, Jamieson Brettle, Tom Denton, Felicia Lim, Bastiaan Kleijn, Jan Skoglund, Yaowu Xu, Jim Bankoski (Chrome), Chenjie Gu, Zach Gleicher, Tom Walters, Norman Casagrande, Luis Cobo, Erich Elsen (DeepMind).