Author Archives: Open Source Programs Office

Open sourcing ClusterFuzz

Fuzzing is an automated method for detecting bugs in software that works by feeding unexpected inputs to a target program. It is effective at finding memory corruption bugs, which often have serious security implications. Manually finding these issues is both difficult and time consuming, and bugs often slip through despite rigorous code review practices. For software projects written in an unsafe language such as C or C++, fuzzing is a crucial part of ensuring their security and stability.
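
To make the idea concrete, here is a minimal sketch of mutation-based fuzzing in Python. It is not ClusterFuzz code: `parse_record` is a hypothetical toy target with a planted bug, and the Python `IndexError` merely stands in for the kind of out-of-bounds read a C or C++ parser might perform.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical toy target: a header byte, a length byte, then a payload."""
    if len(data) < 2 or data[0] != 0x7F:
        raise ValueError("bad header")  # clean rejection, not a bug
    length = data[1]
    payload = data[2:]
    if length > len(payload):
        # Planted bug: stands in for an out-of-bounds read in unsafe code.
        raise IndexError("length field exceeds payload")
    return payload[:length]


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one to three random bytes of the seed input."""
    buf = bytearray(seed)
    for _ in range(rng.randint(1, 3)):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(target, seed: bytes, iterations: int, rng_seed: int = 0):
    """Feed mutated inputs to the target and collect unexpected failures."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except ValueError:
            continue  # expected rejection of malformed input
        except Exception as exc:  # anything else counts as a "crash"
            crashes.append((candidate, type(exc).__name__))
    return crashes


seed_input = bytes([0x7F, 3, 1, 2, 3])
found = fuzz(parse_record, seed_input, iterations=2000)
```

Real fuzzing engines add coverage feedback, corpus management, and sanitizers on top of this loop; the sketch only shows the core mutate-and-run cycle.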

In order for fuzzing to be truly effective, it must be continuous, done at scale, and integrated into the development process of a software project. To provide these features for Chrome, we wrote ClusterFuzz, a fuzzing infrastructure running on over 25,000 cores. Two years ago, we began offering ClusterFuzz as a free service to open source projects through OSS-Fuzz.

Today, we’re announcing that ClusterFuzz is now open source and available for anyone to use.



We developed ClusterFuzz over eight years to fit seamlessly into developer workflows, and to make it dead simple to find bugs and get them fixed. ClusterFuzz provides end-to-end automation, from bug detection, to triage (accurate deduplication, bisection), to bug reporting, and finally to automatic closure of bug reports.
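
The bisection step in triage can be pictured as a binary search over revisions. The sketch below is only an illustration under simple assumptions, not ClusterFuzz's implementation: `crashes_at` is a hypothetical callback that builds a revision and replays the crashing input, and the crash is assumed to reproduce at every revision from the regression onward.

```python
def bisect_regression(revisions, crashes_at):
    """Return the first revision at which crashes_at(revision) is True.

    Assumes the first revision is good, the last is bad, and the crash
    reproduces at every revision after it was introduced.
    """
    lo, hi = 0, len(revisions) - 1
    if crashes_at(revisions[lo]) or not crashes_at(revisions[hi]):
        raise ValueError("need a good first revision and a bad last revision")
    # Invariant: revisions[lo] is good, revisions[hi] is bad.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes_at(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


# Hypothetical history: revisions 100..199, bug introduced at 157.
calls = []

def crashes_at(revision):
    calls.append(revision)
    return revision >= 157

first_bad = bisect_regression(list(range(100, 200)), crashes_at)
```

With 100 revisions the search needs only a handful of builds, which is why a regression range can be narrowed down quickly even on a busy repository.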

ClusterFuzz has found more than 16,000 bugs in Chrome and more than 11,000 bugs in over 160 open source projects integrated with OSS-Fuzz. It is an integral part of the development process of Chrome and many other open source projects. ClusterFuzz is often able to detect bugs hours after they are introduced and verify the fix within a day.

Check out our GitHub repository. You can try ClusterFuzz locally by following these instructions. In production, ClusterFuzz depends on some key Google Cloud Platform services, but you can use your own compute cluster. We welcome your contributions and look forward to any suggestions to help improve and extend this infrastructure. Through open sourcing ClusterFuzz, we hope to encourage all software developers to integrate fuzzing into their workflows.

By Abhishek Arya, Oliver Chang, Max Moroz, Martin Barbella and Jonathan Metzman, ClusterFuzz team

Dopamine 2.0: providing more flexibility in reinforcement learning research

Reinforcement learning (RL) has become one of the most popular fields of machine learning, and has seen a number of great advances over the last few years. As a result, there is a growing need from both researchers and educators to have access to a clear and reliable framework for RL research and education.

Last August, we announced Dopamine, our framework for flexible reinforcement learning. For the initial version, we decided to focus on a specific type of RL research: value-based agents evaluated on the Atari 2600 framework supported by the Arcade Learning Environment. We were thrilled to see how well it was received by the community, including a live coding session, inclusion in a recently announced benchmark for RL, recognition as the top “Cool new open source project of 2018” by the Octoverse, and over 7K GitHub stars on our repository.

One of the most common requests we have received is support for more environments. This confirms what we have seen internally, where simpler environments, such as those supported by OpenAI’s Gym, are incredibly useful when testing out new algorithms. We are happy to announce Dopamine 2.0, which includes support for discrete-domain Gym environments (i.e., those with discrete states and actions). The core of the framework remains unchanged; we have simply generalized the interface with the environment. For backwards compatibility, users will still be able to download version 1.0.

We include default configurations for two classic control environments: CartPole and Acrobot; on these environments one can train a Dopamine agent in minutes. When compared with the training time for a standard Atari 2600 game (around 5 days on a standard GPU), these environments allow researchers to iterate much faster on research ideas before testing them out on larger Atari games. We also include a Colaboratory that illustrates how to train an agent on CartPole and Acrobot. Finally, our GymPreprocessing class serves as an example of how to use Dopamine with other custom environments.

We are excited by the new opportunities enabled by Dopamine 2.0, and look forward to seeing what the research community creates with it!

By Pablo Samuel Castro and Marc G. Bellemare, Dopamine Team

Seeking open source projects for Google Summer of Code 2019

Do you lead or represent a free or open source software organization? Are you seeking new contributors? (Who isn’t?) Do you enjoy the challenge and reward of mentoring new developers? Apply to be a mentor organization for Google Summer of Code 2019!

We are searching for open source projects and organizations to participate in the 15th annual Google Summer of Code (GSoC). GSoC is a global program that draws university student developers from around the world to contribute to open source. Each student spends three months working on a coding project, with the support of volunteer mentors, for participating open source organizations from late May to August.

Last year, 1,264 students worked with 206 open source organizations. Organizations include individual small and medium-sized open source projects as well as a number of umbrella organizations with many sub-projects under them (Python Software Foundation, CERN, Apache Software Foundation).

You can apply to be a mentoring organization for GSoC starting today. The deadline to apply is February 6 at 20:00 UTC. Organizations chosen for GSoC 2019 will be publicly announced on February 26.

Please visit the program site for more information on how to apply and review the detailed timeline of important deadlines. We also encourage you to check out the Mentor Guide and our short video on why open source projects choose to apply to be a part of the program.

Best of luck to all of the project applicants!

By Stephanie Taylor, Google Open Source

A new chapter for OSS-Fuzz

Cross-posted on the Google Security Blog.

Open source software (OSS) is extremely important to Google, and we rely on OSS in a variety of customer-facing and internal projects. We also understand the difficulty and importance of securing the open source ecosystem, and are continuously looking for ways to simplify it.

For the OSS community, we currently provide OSS-Fuzz, a free continuous fuzzing infrastructure hosted on the Google Cloud Platform. OSS-Fuzz uncovers security vulnerabilities and stability issues, and reports them directly to developers. Since launching in December 2016, OSS-Fuzz has reported over 9,000 bugs directly to open source developers.

In addition to OSS-Fuzz, Google's security team maintains several internal tools for identifying bugs in both Google internal and open source code. Until recently, these issues were manually reported to various public bug trackers by our security team and then monitored until they were resolved. Unresolved bugs were eligible for the Patch Rewards Program. While this reporting process had some success, it was overly complex. Now, by unifying and automating our fuzzing tools, we have been able to consolidate our processes into a single workflow, based on OSS-Fuzz. Projects integrated with OSS-Fuzz will benefit from being reviewed by both our internal and external fuzzing tools, thereby increasing code coverage and discovering bugs faster.

We are committed to helping open source projects benefit from integrating with our OSS-Fuzz fuzzing infrastructure. In the coming weeks, we will reach out via email to critical projects that we believe would be a good fit and support the community at large. Projects that integrate are eligible for rewards ranging from $1,000 (initial integration) up to $20,000 (ideal integration); more details are available here. These rewards are intended to help offset the cost and effort required to properly configure fuzzing for OSS projects. If you would like to integrate your project with OSS-Fuzz, please submit your project for review. Our goal is to admit as many OSS projects as possible and ensure that they are continuously fuzzed.

Once we have contacted you, we might provide a sample fuzz target for easy integration. Many of these fuzz targets are generated with new technology that understands how library APIs are used appropriately. Watch this space for more details on how Google plans to further automate fuzz target creation, so that even more open source projects can benefit from continuous fuzzing.

Thank you for your continued contributions to the open source community. Let’s work together on a more secure and stable future for open source software.

By Matt Ruhstaller, TPM and Oliver Chang, Software Engineer, Google Security Team

The big reveal: Google Code-in 2018 winners and finalists

Google Code-in (GCI) 2018, the 9th consecutive year of the contest, ended in mid-December. It was a very, very busy seven weeks for everyone – we had 3,124 students from 77 countries completing 15,323 tasks with a record 27 open source organizations!

Today, we are pleased to announce the Google Code-in 2018 Grand Prize Winners and Finalists with each organization. The 54 Grand Prize Winners from 19 countries completed an impressive 1,668 tasks between them while also helping other students during the contest.

Each of the Grand Prize Winners is invited to a four-day trip to Google’s main campus and San Francisco offices in Northern California, where they’ll meet Google engineers, meet one of the mentors they worked with during the contest, and enjoy some fun in California with the other winners. We look forward to seeing everyone later this year!

Country              # of Winners    Country              # of Winners
Cameroon             1               Romania              1
Canada               1               Russian Federation   1
Czech Republic       1               Singapore            1
Georgia              1               South Africa         1
India                18              Spain                2
Indonesia            1               Sri Lanka            1
Macedonia            1               Ukraine              2
Netherlands          1               United Kingdom       6
Philippines          1               United States        9
Poland               4

Finalists

And a big congratulations to our 108 Finalists from 26 countries who completed over 2,350 tasks during the contest. The Finalists will all receive a special hoodie to commemorate their achievements in the contest. This year, one student was named a finalist with two different organizations!

A breakdown of the countries represented by our finalists can be found below. 

Country              # of Finalists  Country              # of Finalists
Canada               6               Philippines          1
China                2               Poland               15
Czech Republic       1               Russian Federation   2
Germany              1               Serbia               1
India                48              Singapore            2
Indonesia            2               South Korea          1
Israel               1               Spain                1
Kazakhstan           1               Sri Lanka            2
Luxembourg           1               Taiwan               1
Mauritius            2               Thailand             1
Mexico               1               United Kingdom       3
Nepal                1               United States        8
Pakistan             2               Uruguay              1

Mentors

This year we had 790 mentors dedicate their time and invaluable expertise to helping thousands of teenage students learn about open source by welcoming them into their communities. These mentors are the heart of GCI and the reason the contest continues to thrive. Mentors spend hundreds of hours answering questions, reviewing submitted tasks, and teaching students the basics and, in many cases, more advanced aspects of contributing to open source. GCI would not be possible without their enthusiasm and commitment.

We will post more statistics and fun stories that came from GCI 2018 here on the Google Open Source Blog over the next few months, so please stay tuned.

Congratulations to our Grand Prize Winners, Finalists, and all of the students who spent the last couple of months learning about, and contributing to, open source. We hope they will continue their journey in open source!

By Stephanie Taylor, Google Open Source

Wrapping up Google Code-in 2018

We are excited to announce the conclusion of the 9th annual Google Code-in (GCI), our global online contest introducing teenagers to the world of open source development. Over the years the contest has not only grown bigger, but also helped find and support talented young people around the world.

Here are some initial statistics about this year’s program:
  • Total number of students completing tasks: 3,123*
  • Total number of countries represented by students: 77
  • Percentage of girls among students: 17.9% 

Below you can see the total number of tasks completed by students year over year:

*These numbers will increase as mentors finish reviewing the final work submitted by students this morning.

Mentors from each of the 27 open source organizations are now busy reviewing the last work submitted by participants. We look forward to sharing more statistics about the program, including countries and schools with the most student participants, in an upcoming blog post.

The mentors for each organization will spend the next couple of weeks selecting four Finalists (who will receive a hoodie too!) and their two Grand Prize Winners. Grand Prize Winners will be flown to Northern California to visit Google’s headquarters, enjoy a day of adventure in San Francisco, meet their mentors and hear talks from Google engineers.

Hearty congratulations to all the student participants for challenging themselves and making contributions to open source in the process!

Further, we’d like to thank the mentors and the organization administrators for GCI 2018. They are the heart of this program, volunteering countless hours creating tasks, reviewing student work, and helping bring students into the world of open source. Mentors teach young students about the many facets of open source development, from community standards and communicating across time zones to version control and testing. We couldn’t run this program without you! Thank you!

Stay tuned: we’ll be announcing the Grand Prize Winners and Finalists on January 7, 2019!

By Saranya Sampat, Google Open Source

Knative momentum continues, hits another adoption milestone

Released just four months ago by Google Cloud in collaboration with several vendors, Knative, an open source platform based on Kubernetes which provides the building blocks for serverless workloads, has already gained broad support.

The number of contributors has doubled, more than a dozen companies have contributed each month, and community contributions have increased over 45% since the 0.1 release. It’s an encouraging signal that validates the need for such a project, and suggests that ongoing development will be driven by healthy discussions among users and contributors.

Knative 0.2 Release 

In the recent 0.2 release, the first major release since the project’s launch in July, we incorporated 323 pull requests from eight different companies. Knative 0.2 added a new Eventing resource model to complement the Serving and Build components. There were also lots of improvements under the hood, such as the implementation of pluggable routing and better support for autoscaling.

KubeCon North America

Continuing the theme, there are 10 sessions about Knative by speakers from seven different companies this week at KubeCon in Seattle. The sessions cover a variety of topics spanning from introductory overview sessions to advanced autoscaler customization. The number of companies represented by speakers illustrates the breadth of the growing Knative community.

Growing Ecosystem 

Another sign of Knative’s momentum is the growing ecosystem. A number of enterprise platform developers have begun using Knative to create serverless solutions on Kubernetes for their own hybrid cloud use cases. Their use of the Knative API makes for a consistent developer experience and enables workload portability. For example, Pivotal, a top contributor to the Knative project, has adopted Knative alongside Kubernetes, which helps them dedicate more resources higher in the stack:

"Since the release of Knative, we've been collaborating on an open functions platform to help companies run their new workloads on every cloud. That’s why we’re excited to launch the alpha of Pivotal Function Service." – Onsi Fakhouri, SVP of Engineering at Pivotal

Similarly, TriggerMesh has launched a hosted serverless management platform that runs on top of Knative, enabling developers to deploy and manage their functions from a central console.

"Knative provides us with the critical building blocks we need to create our serverless management platform." – Sebastien Goasguen, Co-founder, TriggerMesh

We’re excited by the speed with which Knative is being adopted and the broad cross section of the industry that is already contributing to the project. If you haven’t already jumped in, we invite you to get involved! Come visit github.com/knative and join the growing Knative community.

By Mark Chmarny, Knative Team

Google joins the OpenChain Project for license compliance

Google is thrilled to announce that we are joining the OpenChain Project as Platinum Members. OpenChain is an effort to make open source license compliance simpler and more consistent. We will also join the OpenChain board and are excited that Facebook and Uber will be fellow board members.

Over the last 14 years, the Open Source Programs Office (OSPO) at Google has developed rigorous policies and processes so that we can do open source license compliance correctly, and at scale. This helps us use free and open source software extensively across the company and makes it easier to upstream our work. For us, it’s a matter of legal compliance as well as showing respect for the amazing communities that create and maintain the software.

Until now, there’s been no commonly accepted standard for open source compliance within an organization. Most organizations, like Google, have had to invent and cobble together policies and processes, occasionally comparing notes and hoping we haven’t forgotten anything.

The OpenChain Project is changing that by defining the core requirements of a quality compliance program and developing curriculum to help with training and management. It’s hard to overstate the importance of this work now that open source is a critical input at every step in the supply chain, both in hardware and software.

Google believes in this mission and is excited for the opportunity to use what we’ve learned to pave the way for the rest of the industry. We can help guide the development of standards that are rigorous, clear, and easy to follow for companies both large and small.

By Max Sills and Josh Simmons, Google Open Source

TF-Ranking: a scalable TensorFlow library for learning-to-rank

Cross-posted from the Google AI Blog.

Ranking, the process of ordering a list of items in a way that maximizes the utility of the entire list, is applicable in a wide range of domains, from search engines and recommender systems to machine translation, dialogue systems and even computational biology. In applications like these (and many others), researchers often utilize a set of supervised machine learning techniques called learning-to-rank. In many cases, these learning-to-rank techniques are applied to datasets that are prohibitively large — scenarios where the scalability of TensorFlow could be an advantage. However, there is currently no out-of-the-box support for applying learning-to-rank techniques in TensorFlow. To the best of our knowledge, there are also no other open source libraries that specialize in applying learning-to-rank techniques at scale.

Today, we are excited to share TF-Ranking, a scalable TensorFlow-based library for learning-to-rank. As described in our recent paper, TF-Ranking provides a unified framework that includes a suite of state-of-the-art learning-to-rank algorithms, and supports pairwise or listwise loss functions, multi-item scoring, ranking metric optimization, and unbiased learning-to-rank.

TF-Ranking is fast and easy to use, and creates high-quality ranking models. The unified framework gives ML researchers, practitioners and enthusiasts the ability to evaluate and choose among an array of different ranking models within a single library. Moreover, we strongly believe that a key to a useful open source library is not only providing sensible defaults, but also empowering our users to develop their own custom models. Therefore, we provide flexible APIs, within which the users can define and plug in their own customized loss functions, scoring functions and metrics.

Existing Algorithms and Metrics Support

The objective of learning-to-rank algorithms is to minimize a loss function defined over a list of items in order to optimize the utility of the list ordering for any given application. TF-Ranking supports a wide range of standard pointwise, pairwise and listwise loss functions as described in prior work. This ensures that researchers using the TF-Ranking library are able to reproduce and extend previously published baselines, and practitioners can make the most informed choices for their applications. Furthermore, TF-Ranking can handle sparse features (like raw text) through embeddings and scales to hundreds of millions of training instances. Thus, anyone who is interested in building real-world, data-intensive ranking systems, such as web search or news recommendation, can use TF-Ranking as a robust, scalable solution.
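
To illustrate the three families of losses, the following plain-Python sketch shows one textbook example of each, computed over a single list of items. These are generic formulations chosen for illustration; the function names are hypothetical and this is not TF-Ranking's implementation or API.

```python
import math


def pointwise_squared_loss(scores, labels):
    """Pointwise: each item is compared with its own label independently."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_logistic_loss(scores, labels):
    """Pairwise: penalize pairs where the more relevant item scores lower."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def listwise_softmax_loss(scores, labels):
    """Listwise: cross entropy between labels and softmax over the whole list."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return -sum(y * (s - log_z) for s, y in zip(scores, labels))


labels = [2.0, 1.0, 0.0]          # graded relevance of three items
good_scores = [3.0, 2.0, 1.0]     # ordering agrees with the labels
bad_scores = [1.0, 2.0, 3.0]      # ordering is reversed
```

In this toy list, both the pairwise and the listwise loss are lower for `good_scores` than for `bad_scores`, which is the behavior a ranking loss is meant to reward.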

Empirical evaluation is an important part of any machine learning or information retrieval research. To ensure compatibility with prior work, we support many of the commonly used ranking metrics, including Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We also make it easy to visualize these metrics at training time on TensorBoard, an open source TensorFlow visualization dashboard.

An example of the NDCG metric (Y-axis) along the training steps (X-axis) displayed in the TensorBoard. It shows the overall progress of the metrics during training. Different methods can be compared directly on the dashboard. Best models can be selected based on the metric.
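
For readers unfamiliar with these metrics, here is a small plain-Python sketch of how MRR and NDCG are commonly defined. It uses the widely used exponential-gain form of DCG; the function names are illustrative and are not taken from TF-Ranking.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def mrr(ranked_lists):
    """Mean over queries of 1 / rank of the first relevant (label > 0) item."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


perfect = ndcg([3, 2, 1, 0])      # items already in ideal order
reversed_ = ndcg([0, 1, 2, 3])    # worst possible order
mean_rr = mrr([[0, 1, 0], [1, 0, 0]])
```

Both metrics depend on the scores only through the resulting ordering, so small changes to a score either leave the metric unchanged or make it jump.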

Multi-Item Scoring

TF-Ranking supports a novel scoring mechanism wherein multiple items (e.g., web pages) can be scored jointly, an extension of the traditional scoring paradigm in which single items are scored independently. One challenge in multi-item scoring is inference: items have to be grouped and scored in subgroups, and the scores are then accumulated per item and used for sorting. To hide these complexities from the user, TF-Ranking provides a List-In-List-Out (LILO) API that wraps all this logic in the exported TF models.

The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.

As we demonstrate in recent work, multi-item scoring is competitive in its performance with state-of-the-art learning-to-rank models such as RankNet, MART, and LambdaMART on a public LETOR benchmark.
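
The group-score-accumulate-sort flow described above can be sketched roughly as follows. This is an illustration under our own assumptions, not the LILO API: the subgroups are simply all combinations of a fixed size, and `relative_scorer` is a hypothetical joint scoring function that returns one score per item in a group.

```python
from itertools import combinations


def multi_item_rank(items, group_scorer, group_size=2):
    """Score items jointly in subgroups, average per item, then sort."""
    totals = [0.0] * len(items)
    counts = [0] * len(items)
    for indices in combinations(range(len(items)), group_size):
        group_scores = group_scorer([items[i] for i in indices])
        for i, score in zip(indices, group_scores):
            totals[i] += score
            counts[i] += 1
    averages = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    order = sorted(range(len(items)), key=lambda i: averages[i], reverse=True)
    return [items[i] for i in order], averages


def relative_scorer(group):
    """Hypothetical joint scorer: each item's score depends on its group."""
    mean = sum(group) / len(group)
    return [value - mean for value in group]


# Each item is represented by a single made-up feature value.
ranked, per_item = multi_item_rank([0.2, 0.9, 0.5], relative_scorer)
```

The point of the sketch is that an item's final score comes from several joint evaluations rather than one independent one, which is what makes inference more involved than in single-item scoring.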

Ranking Metric Optimization

An important research challenge in learning-to-rank is direct optimization of ranking metrics (such as the previously mentioned NDCG and MRR). These metrics, while being able to measure the performance of ranking systems better than the standard classification metrics like Area Under the Curve (AUC), have the unfortunate property of being either discontinuous or flat. Therefore, standard stochastic gradient descent optimization of these metrics is problematic.

In recent work, we proposed a novel method, LambdaLoss, which provides a principled probabilistic framework for ranking metric optimization. In this framework, metric-driven loss functions can be designed and optimized by an expectation-maximization procedure. The TF-Ranking library integrates the recent advances in direct metric optimization and provides an implementation of LambdaLoss. We are hopeful that this will encourage and facilitate further research advances in the important area of ranking metric optimization.

Unbiased Learning-to-Rank

Prior research has shown that, given a ranked list of items, users are much more likely to interact with the first few results, regardless of their relevance. This observation has inspired research interest in unbiased learning-to-rank, and led to the development of unbiased evaluation and several unbiased learning algorithms based on re-weighting training instances. In the TF-Ranking library, metrics are implemented to support unbiased evaluation, and losses are implemented for unbiased learning by natively supporting re-weighting to overcome the inherent biases in user interaction datasets.
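
One common form of re-weighting is inverse propensity weighting, where a click at a position that users rarely examine counts for more than a click at the top. The sketch below illustrates the idea in plain Python; the propensity values are made up for the example, the function names are hypothetical, and none of this is TF-Ranking code.

```python
import math


def inverse_propensity_weights(positions, examination_propensity):
    """Weight each instance by 1 / P(user examined that position)."""
    return [1.0 / examination_propensity[pos] for pos in positions]


def reweighted_click_loss(scores, clicks, positions, examination_propensity):
    """Logistic loss on clicked items, re-weighted to offset position bias."""
    weights = inverse_propensity_weights(positions, examination_propensity)
    total = 0.0
    for score, clicked, weight in zip(scores, clicks, weights):
        if clicked:
            total += weight * math.log1p(math.exp(-score))  # -log(sigmoid)
    return total / len(scores)


# Made-up propensities: position 1 is always examined, position 3 rarely.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}
positions = [1, 2, 3]
scores = [0.0, 0.0, 0.0]

loss_click_top = reweighted_click_loss(scores, [1, 0, 0], positions, propensity)
loss_click_low = reweighted_click_loss(scores, [0, 0, 1], positions, propensity)
```

Here a click at position 3 contributes four times as much to the loss as an identical click at position 1, compensating for the fact that it was four times less likely to be seen. In practice the propensities have to be estimated from data.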

Getting Started with TF-Ranking

TF-Ranking implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. TF-Ranking is well integrated with the rich TensorFlow ecosystem. As described above, you can use TensorBoard to visualize ranking metrics like NDCG and MRR, as well as to pick the best model checkpoints using these metrics. Once your model is ready, it is easy to deploy it in production using TensorFlow Serving.

If you’re interested in trying TF-Ranking for yourself, please check out our GitHub repo, and walk through the tutorial examples. TF-Ranking is an active research project, and we welcome your feedback and contributions. We are excited to see how TF-Ranking can help the information retrieval and machine learning research communities.

By Xuanhui Wang and Michael Bendersky, Software Engineers, Google AI

Acknowledgements

This project was only possible thanks to the members of the core TF-Ranking team: Rama Pasumarthi, Cheng Li, Sebastian Bruch, Nadav Golbandi, Stephan Wolf, Jan Pfeifer, Rohan Anil, Marc Najork, Patrick McGregor and Clemens Mewald. We thank the members of the TensorFlow team for their advice and support: Alexandre Passos, Mustafa Ispir, Karmel Allison, Martin Wicke, and others. Finally, we extend our special thanks to our collaborators, interns and early adopters: Suming Chen, Zhen Qin, Chirag Sethi, Maryam Karimzadehgan, Makoto Uchida, Yan Zhu, Qingyao Ai, Brandon Tran, Donald Metzler, Mike Colagrosso, and many others at Google who helped in evaluating and testing the early versions of TF-Ranking.

Introducing a Web Component and Data API for Quick, Draw!


Over the past couple of years, the Creative Lab, in collaboration with the Handwriting Recognition team, has released a few experiments in the realm of “doodle” recognition. First, in 2016, there was Quick, Draw!, which uses a neural network to guess what you’re drawing. Since Quick, Draw! launched, we have collected over 1 billion drawings across 345 categories. In the wake of that popularity, we open sourced a collection of 50 million drawings, giving developers around the world access to the data set and the ability to conduct research with it.

"The different ways in which people draw are like different notes in some universally human scale" - Ian Johnson, UX Engineer @ Google

Since the initial dataset was released, it has been incredible to see how graphs, t-SNE clusters, and simply overlapping millions of these doodles have given us the ability to infer interesting human behaviors across different cultures. One example, from the Quartz study, is that 86% of Americans (from a sample of 50,000) draw their circles counterclockwise, while 80% of Japanese (from a sample of 800) draw them clockwise. Part of this pattern in behavior can be attributed to the strict stroke order in Japanese writing, from the top left to the bottom right.
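
As a rough idea of how the drawing direction of a circle could be measured, the sketch below applies the shoelace formula to a single stroke. It assumes a stroke is stored as a pair of lists, x coordinates and y coordinates, with the y axis pointing down as on a screen; the function names are ours, and this is not necessarily how the Quartz analysis was done.

```python
def signed_area(xs, ys):
    """Shoelace formula: signed area of the closed polygon through the points."""
    total = 0.0
    n = len(xs)
    for i in range(n):
        j = (i + 1) % n  # wrap around to close the shape
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return total / 2.0


def stroke_direction(stroke):
    """Classify a roughly closed stroke, assuming screen coordinates (y down)."""
    xs, ys = stroke[0], stroke[1]
    area = signed_area(xs, ys)
    if area > 0:
        return "clockwise"
    if area < 0:
        return "counterclockwise"
    return "undetermined"


# A square traced right along the top, down, left along the bottom, then up.
clockwise_stroke = [[0, 1, 1, 0], [0, 0, 1, 1]]
# The same square traced in the opposite order.
counter_stroke = [[0, 1, 1, 0], [1, 1, 0, 0]]
```

Because the y axis points down, a positive signed area corresponds to a stroke that looks clockwise on screen, which is the opposite of the usual mathematical convention.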


It’s also interesting to see how the data looks when it’s overlaid by country, as Kyle McDonald did, when he discovered that some countries draw their chairs in perspective while others draw them straight on.


On the more fun, artistic spectrum, there are some simple but clever uses of the data like Neil Mendoza’s face tracking experiment and Deborah Schmidt’s letter collages.
See the video here of Neil Mendoza mapping Quick, Draw! facial features to your own face in front of a webcam


See the process video here of Deborah Schmidt packing QuickDraw data into letters using OpenFrameworks

Some handy tools have also been released by the community since the release of all this data, and one of those that we’re releasing now is a Polymer component that allows you to display a doodle in your web-based project with one line of markup:

The Polymer component is coupled with a Data API that layers a massive file directory (50 million files) and returns a JSON object or an HTML canvas rendering for each drawing. Without downloading all the data, you can start prototyping your ideas right away. We’ve also provided instructions for how to host the data and API yourself on Google Cloud Platform (for more serious projects that demand a higher request limit).

One really handy tool when hosting an API on Google Cloud is Cloud Endpoints.  It allowed us to launch a demo API with a quota limit and authentication via an API key.  

By defining an OpenAPI specification (here is the Quick, Draw! Data API spec) and adding these three lines to our app.yaml file, an Extensible Service Proxy (ESP) gets deployed with our API backend code (more instructions here):

    endpoints_api_service:
      name: quickdrawfiles.appspot.com
      rollout_strategy: managed

Based on the OpenAPI spec, documentation is also automatically generated for you:


We used a public Google Group as an access control list, so anyone who joins can then have the API available in their API library.

The Google Group used as an Access Control List

This component and Data API will make it easier for creatives out there to manipulate the data for their own research. Looking to the future, a potential next step for the project could be to store everything in a single database for more complex queries (e.g. “give me a recognized drawing from China in March 2017”). Feedback is always welcome, and we hope this inspires even more types of projects using the data! More details on the project and the incredible research projects done using it can be found on our GitHub repo.

By Nick Jonas, Creative Technologist, Creative Lab

Editor's Note: Some may notice that this isn’t the only dataset we’ve open sourced recently! You can find many more datasets in our open source project directory.