Category Archives: Open Source Blog

News about Google’s open source projects and programs

Enabling Developers and Organizations to Use Differential Privacy

Originally posted on the Google Developers Blog
By: Miguel Guevara, Product Manager, Privacy and Data Protection Office


Whether you're a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users.

Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.
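To make the hospital example concrete, here is a minimal sketch of one textbook way to compute a differentially private average: clamp each value to known bounds, then add Laplace noise to both the sum and the count. This is an illustration only, not the API or the implementation of the library announced here. The names `sampleLaplace` and `privateMean` are hypothetical, the sketch assumes each patient contributes exactly one value, and `Math.random()` is not a suitable noise source for production use.

```typescript
// Illustrative sketch only; not the API of Google's differential privacy library.
// Assumption: each individual contributes exactly one value.

type Rng = () => number; // returns a uniform number in [0, 1)

// Draw one sample from a zero-mean Laplace distribution via inverse-CDF sampling.
function sampleLaplace(scale: number, rng: Rng = Math.random): number {
  const u = rng() - 0.5; // uniform in [-0.5, 0.5)
  // Guard against log(0) when u is exactly -0.5.
  const tail = Math.max(Number.MIN_VALUE, 1 - 2 * Math.abs(u));
  return -scale * Math.sign(u) * Math.log(tail);
}

function clamp(x: number, lower: number, upper: number): number {
  return Math.min(upper, Math.max(lower, x));
}

// Noisy mean of values clamped to [lower, upper], spending epsilon in total.
function privateMean(
  values: number[],
  lower: number,
  upper: number,
  epsilon: number,
  rng: Rng = Math.random
): number {
  const midpoint = (lower + upper) / 2;
  const halfEpsilon = epsilon / 2; // half the budget for the sum, half for the count

  // Sum of distances from the midpoint: adding or removing one person
  // changes it by at most (upper - lower) / 2.
  const centeredSum = values.reduce(
    (acc, v) => acc + (clamp(v, lower, upper) - midpoint),
    0
  );
  const noisySum = centeredSum + sampleLaplace((upper - lower) / 2 / halfEpsilon, rng);

  // Adding or removing one person changes the count by exactly 1.
  const noisyCount = values.length + sampleLaplace(1 / halfEpsilon, rng);

  const mean = midpoint + noisySum / Math.max(1, noisyCount);
  return clamp(mean, lower, upper);
}

// Example: average length of stay in days, assuming stays are bounded to 0..30.
const averageStay = privateMean([2, 4, 6, 3, 9], 0, 30, 1.0);
```

Clamping is what makes the noise scale finite: without known bounds, a single extreme value could shift the sum arbitrarily, which is why choosing those bounds well matters.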

Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.

A deeper look at the technology

Our open source library was designed to meet the needs of developers. In addition to making it freely accessible, we wanted it to be easy to deploy and useful.

Here are some of the key features of the library:
  • Statistical functions: Most common data science operations are supported by this release. Developers can compute counts, sums, averages, medians, and percentiles using our library.
  • Rigorous testing: Getting differential privacy right is challenging. Besides an extensive test suite, we’ve included an extensible ‘Stochastic Differential Privacy Model Checker library’ to help prevent mistakes.
  • Ready to use: The real utility of an open-source release is in answering the question “Can I use this?” That’s why we’ve included a PostgreSQL extension along with common recipes to get you started. We’ve described the details of our approach in a technical paper that we’ve just released today.
  • Modular: We designed the library so that it can be extended to include other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.

Investing in new privacy technologies

We have driven the research and development of practical, differentially-private techniques since we released RAPPOR to help improve Chrome in 2014, and continue to spearhead their real-world application. 

We’ve used differentially private methods to create helpful features in our products, such as showing how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and to improve Google Fi.


This year, we’ve announced several open-source privacy technologies (TensorFlow Privacy, TensorFlow Federated, and Private Join and Compute), and today’s launch adds to this growing list. We're excited to make this library broadly available and hope developers will consider leveraging it as they build out their comprehensive data privacy strategies. From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone.

Acknowledgements
Software Engineers: Alain Forget, Bryant Gipson, Celia Zhang, Damien Desfontaines, Daniel Simmons-Marengo, Ian Pudney, Jin Fu, Michael Daub, Priyanka Sehgal, Royce Wilson, William Lam

That’s a Wrap for Google Summer of Code 2019

As the 15th year of Google Summer of Code (GSoC) comes to a close, we are pleased to announce that 1,134 students from 61 countries have successfully completed the 2019 program. Congratulations to all of our students and mentors who made this summer’s program so memorable!

Throughout the last 12 weeks, the GSoC students worked eagerly with 201 open source organizations and over 2,000 mentors from 72 countries—learning to work virtually on teams and developing complex pieces of code. The student projects are now public so feel free to take a look at the amazing efforts they put in over the summer.

Many open source communities rely on new perspectives and talent to keep their projects thriving; without student contributions like these, they wouldn’t be able to grow their communities. GSoC students help redesign and enhance these organizations’ codebases, sometimes as first-time contributors not only to the project but to open source itself! This is just the beginning for GSoC students: many go on to become mentors, even more become long-term committers, and some will start their own open source projects in the years to come.

And last but not least, we would like to thank the mentors and organization administrators who make GSoC possible. Their dedication to welcoming new student contributors into their communities is inspiring and vital to growing the open source community. Thank you all!

Bringing Live Transcribe’s Speech Engine to Everyone

Earlier this year, Google launched Live Transcribe, an Android application that provides real-time automated captions for people who are deaf or hard of hearing. Through many months of user testing, we've learned that robustly delivering good captions for long-form conversations isn't so easy, and we want to make it easier for developers to build upon what we've learned. Live Transcribe's speech recognition is provided by Google's state-of-the-art Cloud Speech API, which under most conditions delivers pretty impressive transcript accuracy. However, relying on the cloud introduces several complications—most notably robustness to ever-changing network connections, data costs, and latency. Today, we are sharing our transcription engine with the world so that developers everywhere can build applications with robust transcription.

Those who have worked with our Cloud Speech API know that sending infinitely long streams of audio is currently unsupported. To help solve this challenge, we take measures to close and restart streaming requests prior to hitting the timeout, including restarting the session during long periods of silence and closing whenever there is a detected pause in the speech. Closing at an arbitrary moment instead would result in a truncated sentence or word. In between sessions, we buffer audio locally and send it upon reconnection. This reduces the amount of text lost mid-conversation, whether due to restarting speech requests or switching between wireless networks.
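The buffering between sessions described above can be sketched roughly as follows. This is a simplified illustration and not the Live Transcribe engine's actual code: the class `BufferedAudioStream` and its methods are hypothetical names, and a real implementation would also need to cap the buffer size and decide when to close a session.

```typescript
// Hypothetical names throughout; this is not the Live Transcribe engine's API.
type SendChunk = (chunk: Uint8Array) => void;

class BufferedAudioStream {
  private pending: Uint8Array[] = [];
  private send: SendChunk | null = null;

  // Called for every captured chunk, whether or not a session is open.
  write(chunk: Uint8Array): void {
    if (this.send !== null) {
      this.send(chunk);
    } else {
      this.pending.push(chunk); // hold audio until the next session starts
    }
  }

  // A new streaming request is ready: flush the backlog in order, then go live.
  openSession(send: SendChunk): void {
    for (const chunk of this.pending) {
      send(chunk);
    }
    this.pending = [];
    this.send = send;
  }

  // The current request was closed (timeout approaching, pause, or network change).
  closeSession(): void {
    this.send = null;
  }

  // Number of chunks waiting for the next session.
  bufferedChunks(): number {
    return this.pending.length;
  }
}
```

Because audio captured while no request is open is replayed in order on reconnection, the recognizer sees a continuous stream even though the underlying requests are short-lived.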



Endlessly streaming audio comes with its own challenges. In many countries, network data is quite expensive and in spots with poor internet, bandwidth may be limited. After much experimentation with audio codecs (in particular, we evaluated the FLAC, AMR-WB, and Opus codecs), we were able to achieve a 10x reduction in data usage without compromising accuracy. FLAC, a lossless codec, preserves accuracy completely, but doesn't save much data. It also has noticeable codec latency. AMR-WB, on the other hand, saves a lot of data, but delivers much worse accuracy in noisy environments. Opus was a clear winner, allowing data rates many times lower than most music streaming services while still preserving the important details of the audio signal—even in noisy environments. Beyond relying on codecs to keep data usage to a minimum, we also support using speech detection to close the network connection during extended periods of silence. That means if you accidentally leave your phone on and running Live Transcribe when nobody is around, it stops using your data.
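To put a tenfold reduction in perspective, here is a rough back-of-the-envelope calculation. The baseline format (16 kHz, 16-bit, mono uncompressed PCM) is an assumption chosen purely for illustration; the post does not state the actual capture format or the bitrates used.

```typescript
// Assumed baseline capture format, for illustration only (not stated in the post).
const SAMPLE_RATE_HZ = 16000;
const BITS_PER_SAMPLE = 16;
const CHANNELS = 1;

// Bytes sent during one hour of continuous streaming at a given bitrate.
function bytesPerHour(bitsPerSecond: number): number {
  return (bitsPerSecond / 8) * 3600;
}

const uncompressedBitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS; // 256,000 bit/s
const reducedBitrate = uncompressedBitrate / 10; // a 10x reduction

const uncompressedBytes = bytesPerHour(uncompressedBitrate); // 115,200,000 bytes (about 115 MB)
const reducedBytes = bytesPerHour(reducedBitrate); // 11,520,000 bytes (about 11.5 MB)
```

Under these assumptions, an hour of continuous streaming drops from roughly 115 MB to roughly 11.5 MB, before counting any savings from closing the connection during silence.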

Finally, we know that if you are relying on captions, you want them immediately, so we've worked hard to keep latency to a minimum. Though most of the credit for speed goes to the Cloud Speech API, Live Transcribe's final trick lies in our custom Opus encoder. At the cost of only a minor increase in bitrate, we see latency that is visually indistinguishable from that of sending uncompressed audio.

Today, we are excited to make all of this available to developers everywhere. We hope you'll join us in trying to build a world that is more accessible for everyone.

By Chet Gnegy, Alex Huang, and Ausmus Chang from the Live Transcribe Team

OpenCensus Web: Unlocking Full End-to-End Observability for Your Entire Stack

OpenCensus Web is a tool to trace and monitor the user-perceived performance of your web pages. It can help determine whether or not your web pages are experiencing performance issues that you might otherwise not know how to diagnose.

Web application owners want to monitor the operational health of their applications so that they can better understand actual user performance; however, capturing relevant telemetry from your web applications is often very difficult. Today, we are introducing OpenCensus Web (OC Web) to make instrumenting and exporting metrics and distributed traces from web applications simple and automatic.

Background

The OpenCensus project provides a set of language-specific instrumentation libraries that collect traces and metrics from applications and export them to tracing and monitoring backends like Prometheus, Zipkin, Jaeger, Stackdriver, and others.

The OpenCensus Web library is an implementation of OpenCensus that focuses on frontend web application code that executes in the browser. OC Web instruments web pages and collects user-side performance data, including latency and distributed traces, which gives developers the information to diagnose frontend issues and monitor overall application health.

Overshadowing the work on OC Web, the wider OpenCensus family of projects is merging with OpenTracing into OpenTelemetry. OpenCensus Web’s functionality will be migrated into OpenTelemetry JS once this project is ready, although OC Web will continue working as an alpha release in the meantime.

Architecture

OC Web interacts with three application components:
  • Frontend web server: renders the initial HTML to the browser including the OC Web library code and configuration. This would typically be instrumented with an OpenCensus server-side library (Go, Java, etc.). We also suggest that you create an endpoint in the server that receives HTTP/JSON traces and proxies to the OpenCensus Agent.
  • Browser JS: the OC Web library code that runs in the browser. This measures user interactions and collects browser data and writes them to the OpenCensus Agent as spans via HTTP/JSON.
  • OpenCensus Agent: receives traces from the frontend web server proxy endpoint or directly from the browser JS, and exports them to a trace backend (e.g. Stackdriver, Zipkin).
OC Web requires the OpenCensus Agent, which will proxy and re-export telemetry to your backend of choice. For more details see the documentation.


Features

Initial page load tracing

You can use OC Web to capture traces of initial page loads, which will even capture events that take place before the OC Web library was loaded by the browser! Initial page load traces show you which resources may be causing poor website performance, and contain data that you can’t typically capture from a distributed tracing system.

To measure the time of the overall initial page load interaction, OC Web waits until after the document load event and generates spans from the initial load performance timings via the browser's Navigation Timing and Resource Timing APIs. Below is a sample trace from OC Web that has been exported to Zipkin and captured from the initial load example app. Notice that there is an overall ‘nav./’ span for the user navigation experience until the browser load event fires.

This example also includes ‘/’ spans for the client and server side measurements of the initial HTML load. These spans are connected by the server sending back a ‘window.traceparent’ variable in the W3C Trace Context format, which is necessary because the browser does not send a trace context header for the initial page load. The server side spans also indicate how much time was spent parsing and rendering the template:

Notice the long js task span in the previous image, which indicates a CPU-bound JavaScript event loop that took 80.095ms, as measured by the Long Tasks browser API.
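The ‘window.traceparent’ value mentioned above uses the W3C Trace Context format: four lowercase hex fields separated by hyphens (version, trace ID, parent span ID, and flags). As a rough sketch of how such a value can be validated and split, assuming a version 00 value, the hypothetical helper `parseTraceparent` below is illustrative and is not part of the OC Web API.

```typescript
// Hypothetical helper; not part of the OC Web API.
interface TraceContext {
  version: string;
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters (the parent span on the server)
  sampled: boolean; // lowest bit of the flags field
}

// Parse a version-00 style traceparent value, e.g.
// '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'.
function parseTraceparent(value: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    value.trim()
  );
  if (match === null) {
    return null;
  }
  const version = match[1];
  const traceId = match[2];
  const spanId = match[3];
  const flags = match[4];
  // Version 'ff' and all-zero IDs are invalid in the W3C format.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

With the trace ID and parent span ID recovered from the page, browser-side spans can be recorded under the same trace as the server-side spans for the initial HTML request.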

Span annotations for DOM and network events

Spans captured by OC Web also include detailed annotations for DOM events like domInteractive and first-paint, as well as network events like domainLookupStart and secureConnectionStart. Here is a similar trace exported to Stackdriver Trace with the annotations expanded:


User Interactions

For single page applications there are often subsequent interactions after the initial load (e.g. user clicks a button or navigates to a different section of the page). Measuring end-user interactions within a browser application adds useful data for your application:
  • Ability to relate an initial page render with subsequent on-page interactions
  • Visibility into slowness as perceived by the end user, for example, an unresponsive page after clicking
Currently, OC Web tracks clicks and route transitions by monkey-patching the Angular Zone.js library. OC Web tracks the subsequent synchronous and asynchronous tasks (e.g. setTimeouts, XHRs, etc.) caused by the interaction even if there are several concurrent interactions.

Automatic tracing for click events

All browser click events are traced as long as the click lands on a DOM element (e.g. button) and the clicked element is not disabled. When the user clicks the element, a new Zone is created to measure this interaction and determine the total time.

To name this root span, we give developers the option of adding the attribute data-ocweb-id to an element to give the interaction a custom name. In the next example, the resulting name will be ‘Save edit user info’:
<button type="submit" data-ocweb-id="Save edit user info"> Save changes </button>
This helps you identify the traces related to a specific element, and it can avoid ambiguity when there are similar interactions. If you don’t add this attribute, OC Web will build the name from the tag name, the DOM element ID, and the event involved in the interaction. For example, clicking this button:
<button id="save_changes"> Save changes </button>
will generate a span named “button#save_changes click”.
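The naming rule described above can be sketched as a small function. This is an approximation for illustration, not OC Web's source: `interactionName` and the `ClickedElement` shape are hypothetical, and the behavior for elements without an ID (omitting the ‘#id’ part) is an assumption.

```typescript
// Hypothetical shapes and names; an approximation of the rule, not OC Web's code.
interface ClickedElement {
  tagName: string; // e.g. 'BUTTON', as reported by the DOM
  id?: string; // the element's id attribute, if any
  dataOcwebId?: string; // value of the data-ocweb-id attribute, if any
}

function interactionName(element: ClickedElement, eventName: string): string {
  // A custom name set through data-ocweb-id always wins.
  if (element.dataOcwebId) {
    return element.dataOcwebId;
  }
  const tag = element.tagName.toLowerCase();
  // Assumption: when there is no id, the '#id' part is simply left out.
  return element.id ? `${tag}#${element.id} ${eventName}` : `${tag} ${eventName}`;
}
```

Giving frequently used controls an explicit data-ocweb-id keeps span names stable even if element IDs change during a refactor.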

Automatic tracing for route transitions

OC Web traces route transitions between the different sections of your page by monkey-patching the History API. OC Web will name these interactions with the pattern ‘Navigation /path/to/page’. The following screenshot of a trace exported to Stackdriver from the user interaction example shows a Navigation trace which includes several network calls before the route transition is complete:

Creating your own custom spans

OC Web allows you to instrument your web application with custom spans for tasks or code involved in a user interaction. Here is a code snippet that shows how to do this:

import { tracing } from '@opencensus/web-instrumentation-zone';

function handleClick() {
  // Start a child span of the current root span for the current interaction.
  // This must run in code that is triggered by the button click.
  const childSpan = tracing.tracer.startChildSpan({
    name: 'name of your child span'
  });
  // Do some operation...
  // Finish the child span at the end of its operation.
  childSpan.end();
}

See the OC Web documentation for more details.

Automatic spans for HTTP requests and Browser performance data

OC Web automatically intercepts and generates spans for HTTP requests generated by user interactions. Additionally, OC Web attaches Trace Context Headers to each intercepted HTTP request, using the W3C Trace Context format. This is only done for same-origin requests or requests that match a provided regex.
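The propagation rule described above (attach headers only to same-origin requests or to URLs matching a configured regex) can be sketched like this. The function names are hypothetical and this is not OC Web's implementation; URL handling is deliberately simplified, for example default ports are not normalized.

```typescript
// Hypothetical helpers; a simplified sketch, not OC Web's implementation.

// Resolve a request URL against the page origin (e.g. 'https://example.com').
function toAbsolute(url: string, pageOrigin: string): string {
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(url)) {
    return url; // already absolute
  }
  if (url.indexOf('//') === 0) {
    // Protocol-relative URL: reuse the page's scheme.
    return pageOrigin.slice(0, pageOrigin.indexOf('//')) + url;
  }
  // Path-relative URL (simplified: always resolved from the site root).
  return pageOrigin + (url.charAt(0) === '/' ? '' : '/') + url;
}

// Extract 'scheme://host[:port]' from an absolute URL.
function originOf(absoluteUrl: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/[^\/?#]+/i.exec(absoluteUrl);
  return match ? match[0].toLowerCase() : '';
}

// Decide whether trace context headers should be attached to a request.
function shouldPropagate(url: string, pageOrigin: string, allowList?: RegExp): boolean {
  const absolute = toAbsolute(url, pageOrigin);
  if (originOf(absolute) === pageOrigin.toLowerCase()) {
    return true; // same-origin requests are always traced
  }
  return allowList !== undefined && allowList.test(absolute);
}

// Build the W3C 'traceparent' header for an outgoing request.
function traceHeaders(traceId: string, spanId: string, sampled: boolean): Record<string, string> {
  return { traceparent: `00-${traceId}-${spanId}-${sampled ? '01' : '00'}` };
}
```

Restricting propagation this way avoids sending an extra header to third-party servers, which would otherwise turn simple cross-origin requests into ones that require a CORS preflight.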

If your servers are also instrumented with OpenCensus, these requests will continue to be traced throughout your backend services! This lets you know if the issues are related to either the front-end or the server-side.

OC Web also includes Performance API data to make annotations like domainLookupStart and responseEnd and generates spans for any CORS preflight requests.

The next screenshot shows a trace exported to Stackdriver as a result of the user interaction example. There, you can see the several network calls with the automatically generated spans (e.g. ‘Sent./sleep’) with annotations, the server-side spans (e.g. ‘/sleep’ and ‘ocweb.handlerequest’) and CORS Preflight related spans:

Relate user interactions back to the initial page load tracing

OC Web attaches the initial page load trace id to the user interactions as an attribute and a span link. This lets you find the initial load trace and its interaction traces with a single attribute query, and it helps you understand a user's whole navigation through the application for a given page load.

The next screenshot shows a search by initial_load_trace_id attribute containing all user interaction traces after the initial page loaded:


Making it Real

With OC Web and a few lines of instrumentation, you can now export distributed traces from your web application. Start exploring the initial load and user interaction examples. You're also welcome to poke around the source code and send us feedback, either via Gitter or by contributing Pull Requests!

By Cristian González – OpenCensus Team – Software Engineering intern at Google Summer 2019 and student of Computer and Systems Engineering at Universidad Nacional de Colombia.

Special thanks to Dave Raffensperger for being the initial creator of OC Web and for guiding me through the process of developing it.

Season of Docs Announces Technical Writing Projects

Season of Docs has announced the technical writers participating in the program and their technical writing projects! You can view a list of organizations and technical writing projects on the website.

The program received nearly 450 technical writer applications, and with them, over 700 technical writing project proposals. The enthusiasm from the technical writing and open source communities has been amazing!

What is next?

During the community bonding period from August 7 to September 1, mentors must work with the technical writers to prepare them for the doc development phase. By the end of community bonding, the technical writer should be familiar with the open source project and community, understand the product as a whole, establish communication channels with the mentoring organization, and set clear goals and expectations for the project. These are critical to the successful completion of the technical writing project.

Documentation development begins on September 2, 2019.

What is Season of Docs?

Documentation is essential to the adoption of open source projects as well as to the success of their communities. Season of Docs brings together technical writers and open source projects to foster collaboration and improve documentation in the open source space. You can find out more about the program on the introduction page of the website.

During the program, technical writers spend a few months working closely with an open source community. They bring their technical writing expertise to the project's documentation and, at the same time, learn about the open source project and new technologies.

The open source projects work with the technical writers to improve the project's documentation and processes. Together, they may choose to build a new documentation set, redesign the existing docs, or improve and document the project's contribution procedures and onboarding experience.

General Timeline

August 6: Google announces the accepted technical writer projects.
August 7 - September 1: Community bonding. Technical writers get to know mentors and the open source community, and refine their projects in collaboration with their mentors.
September 2 - November 29: Technical writers work with open source mentors on the accepted projects, and submit their work at the end of the period.
December 10: Google publishes the list of successfully completed projects.

See the full timeline for details, including the provision for projects that run longer than three months.

Find out more

Explore the Season of Docs website at g.co/seasonofdocs to learn more about the program. Use our logo and other promotional resources to spread the word. Check out the FAQ for further questions!

By Andrew Chen, Google Open Source and Sarah Maddox, Google Technical Writer

The Apache Beam Community in 2019

2019 has already been a busy time for Apache Beam. The ASF blog featured our way of community building and we've had more Beam meetups around the world. Apache Beam also received the Technology of the Year Award from InfoWorld.
As these events happened, we were building up to the 20th anniversary of the Apache Software Foundation. The contributions of the Beam community were a part of Maximilian Michels' blog post on the success of the ASF's open source development model.

As the founder of the first Beam meetup in London back in 2017, seeing the community flourish on a larger and worldwide scale is something that makes me happy. And we have come quite a long way since 2017, both in terms of geographical spread:



As well as in numbers:



All of this culminates in two Beam Summits this year—one we already had a few weeks ago in Berlin, and the other which will take place in a few weeks in Las Vegas, where we worked together with Apache and the ApacheCon team.

In that spirit, let's have a more detailed overview of the things that have happened, what the next few months look like, and how we can foster even more community growth.

Meetups

We've had a flurry of activity, with several meetups in the planning process and more popping up globally over time. As diversity of contributors is a core ASF value, this geographic spread is exciting for the community. Here's a picture from the latest Apache Beam meetup organized at Lyft in San Francisco:



We have more Bay Area meetups coming soon, and the community is looking into kicking off meetups in Toronto and New York! In Europe, London had its first meetup of 2019 at the start of April, as did Stockholm at the start of May.
Meetup groups are becoming active in Berlin and New York also, so stay tuned for events there and more meetups internationally! If you are interested in starting your own meetup, feel free to reach out! Good places to start include our Slack channel, the dev and user mailing lists, or the Apache Beam Twitter. Even if you can’t travel to these meetups, you can stay informed on the happenings of the community. The talks and sessions from previous conferences and meetups are archived on the Apache Beam YouTube channel. If you want your session added to the channel, don’t hesitate to get in touch!

Summits

The first summit of the year was held in Berlin this past June. You can read about the inaugural edition of the Beam Summit Europe here. At these summits, you have the opportunity to meet with other Apache Beam creators and users, get expert advice, learn from the speaker sessions, and participate in workshops. We are proud to say that the Summit doubled in size this year with attendees from 24 countries across 4 continents.

You can find resources from this year’s Summit here:
  • The recordings can be found on our YouTube channel.
  • Presentations from the Summit are available via the website and in this folder.
  • We strongly encourage you to get involved again this year! You can still sign up for the upcoming summit in North America.
  • If you want to secure your ticket to attend the Beam Summit North America 2019, check out the ApacheCon website.
  • In case you want to get involved in speaking at events, do not hesitate to contact us via email or Twitter.

Why community engagement matters

Why we need a strong Apache Beam community:
  • We’re gaining lots of code contributions and need committers to review them
  • We want people to feel a sense of ownership of the project. By fostering this level of engagement, the work becomes even more exciting.
  • A healthy community has a further reach and leads to more growth. More hours can be contributed to the project as we can spread the work and ownership.
Why we are organizing these summits:
  • We’d like to give folks a place to meet and share ideas.
  • We know that offline interactions often change the nature of the online ones in a positive manner.
  • Building an active and diverse community is part of the Apache Way. These summits provide an opportunity for us to engage people from different locations, companies, and backgrounds.
By Matthias Baetens, Google Developer Expert for Cloud, Apache Beam committer and community organiser

Announcing Docsy: A Website Theme for Technical Documentation

Have you ever struggled with the process of creating documentation for an open source project? Do you have an open source project that's outgrown its README? Open source projects need great docs to succeed, but great open source doc sites aren't always easy to produce and share.

Google supports over 2000 open source projects, and there has been growing demand from these projects for tooling and guidance to help them write and publish their documentation. To meet this need we created Docsy: a documentation website theme with templates and guidance for documentation, which we’re open sourcing so that anyone can use it and help improve it.

Docsy builds on existing open source tools, like Hugo, and our experience with open source docs, providing a fast and easy way to stand up an OSS documentation website with features specifically designed to support technical documentation. Special features include everything from site navigation to multi-language support – with easy site deployment options provided by Hugo. We also created guidance on how to add additional pages, structure your documentation, and accept community contributions, all with the goal of letting you focus on creating great content.

Who’s using it?

The Kubeflow, Knative, and Agones websites were built using the Docsy theme, with more projects in the pipeline. We’ve also created an example site that uses lots of Docsy features for you to explore and copy.

Ready to get started?

Visit the Docsy site to find out how to create your first site with Docsy! You can either use Docsy like a regular Hugo theme, or clone our example site. Docsy is an open source project—of course—and we welcome your issues, suggestions, and contributions!

Come Meet the Google Open Source Team at OSCON!

Google Cloud is proud to be a Diamond Sponsor at OSCON, and we’re excited for another year of connecting, learning, and sharing with the open source community! Google is deeply grateful for all of your amazing open source efforts, so to celebrate, our booth will have an Open Gratitude wall where we will acknowledge your contributions, and where we encourage you to express your gratitude for those who have helped you in open source!

Once you’ve recognized your open source heroes on the Open Gratitude wall, stick around at the Google Open Source booth to learn about topics such as open source governance, documentation, open source in ML and gaming, encouraging non-code contributions, and about Google’s open source outreach programs in general. At our booth sessions you can also explore open source projects such as Kubernetes, Istio, Go, and Beam (as well as other Apache projects). Booth office hours run from 10:15am to 7pm Wednesday, July 17, and from 10:15am to 4:10pm on Thursday, July 18. The full schedule will be posted at the booth—please come by and check it out!

In addition to the events at the booth, the Google open source team has two workshops on Tuesday, July 16:
This half-day workshop kicks off with an overview of research-backed documentation best practices. Andrew Chen, Erin McKean, and Aizhamal Nurmamat kyzy lead you through a hands-on exercise in which you'll create the skeleton of a ready-to-deploy documentation website for your open source project.
Paris Pittman takes you through the ins and outs of the Kubernetes contributor community so you can land your first PR. You'll learn about SIGs, the GitHub workflow, its automation and continuous integration (CI), setting up your dev environment, and much more. Stick around until the end, and you'll have time to work on your first PR with the help of current contributors.
We also hope you attend the main conference sessions presented by Googlers on Wednesday and Thursday, especially the keynotes on Wednesday (Built to last: What Google and Microsoft have learned growing open source communities) and Thursday (Be a Docs Star).
As part of our commitment to creating a diverse and inclusive community, we’ve redirected our conference swag budget into diversity scholarships. (We believe you’d prefer to have more interesting conversations with a wider range of people over another pair of socks!) But if you are looking for a souvenir of your time in Portland there will be a special Portland-themed sticker featuring Pancakes, the (extremely adorable) gRPC mascot, and we encourage projects to take and leave stickers in our sticker-swap space!

OSCON is one of the highlights of the year for those of us who love open source—we’re thrilled to be able to share what we’ve learned with you, and to learn what you’re interested in and excited about (and also what you think could improve). See you in Portland!

Truth 1.0: Fluent Assertions for Java and Android Tests

Software testing is important—and sometimes frustrating. The frustration can come from working on innately hard domains, like concurrency, but too often it comes from a thousand small cuts:
assertEquals("Message has been sent", getString(notification, EXTRA_BIG_TEXT));
assertTrue(
    getString(notification, EXTRA_TEXT)
        .contains("Kurt Kluever <[email protected]>"));
The two assertions above test almost the same thing, but they are structured differently. The difference in structure makes it hard to identify the difference in what's being tested.
A better way to structure these assertions is to use a fluent API:
assertThat(getString(notification, EXTRA_BIG_TEXT))
    .isEqualTo("Message has been sent");
assertThat(getString(notification, EXTRA_TEXT))
    .contains("Kurt Kluever <[email protected]>");
A fluent API naturally leads to other advantages:
  • IDE autocompletion can suggest assertions that fit the value under test, including rich operations like containsExactly(permission.SEND_SMS, permission.READ_SMS).
  • Failure messages can include the value under test and the expected result. Contrast this with the assertTrue call above, which lacks a failure message entirely.
Google's fluent assertion library for Java and Android is Truth. We're happy to announce that we've released Truth 1.0, which stabilizes our API after years of fine-tuning.



Truth started in 2011 as a Googler's personal open source project. Later, it was donated back to Google and cultivated by the Java Core Libraries team, the people who bring you Guava.
You might already be familiar with assertion libraries like Hamcrest and AssertJ, which provide similar features. We've designed Truth to have a simpler API and more readable failure messages. For example, here's a failure message from AssertJ:
java.lang.AssertionError:
Expecting:
  <[year: 2019
month: 7
day: 15
]>
to contain exactly in any order:
  <[year: 2019
month: 6
day: 30
]>
elements not found:
  <[year: 2019
month: 6
day: 30
]>
and elements not expected:
  <[year: 2019
month: 7
day: 15
]>
And here's the equivalent message from Truth:
value of:
    iterable.onlyElement()
expected:
    year: 2019
    month: 6
    day: 30

but was:
    year: 2019
    month: 7
    day: 15
For more details, read our comparison of the libraries, and try Truth for yourself.

Also, if you're developing for Android, try AndroidX Test. It includes Truth extensions that make assertions even easier to write and failure messages even clearer:
assertThat(notification).extras().string(EXTRA_BIG_TEXT)
    .isEqualTo("Message has been sent");
assertThat(notification).extras().string(EXTRA_TEXT)
    .contains("Kurt Kluever <[email protected]>");
Coming soon: Kotlin users of Truth can look forward to Kotlin-specific enhancements.
By Chris Povirk, Java Core Libraries

Google’s robots.txt Parser is Now Open Source

Originally posted on the Google Webmaster Central Blog

For 25 years, the Robots Exclusion Protocol (REP) was only a de facto standard. This sometimes had frustrating implications. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes in size?

Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.

We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 1990s. Since then, the library has evolved; we learned a lot about how webmasters write robots.txt files and about the corner cases we had to cover, and we added what we learned over the years to the internet draft when it made sense.

We also included a testing tool in the open source package to help you test a few rules. Once built, the usage is very straightforward:

robots_main <robots.txt content> <user_agent> <url>
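To give a feel for what parsing and matching rules involves, here is a deliberately simplified sketch. It is not a port of the open-sourced C++ library, and every name in it is hypothetical. It assumes plain path-prefix rules only (no ‘*’ or ‘$’ wildcards), matches user agents by exact case-insensitive name, and resolves conflicts by letting the longest matching rule win, with Allow winning ties.

```typescript
// Simplified illustration only; not a port of Google's C++ robots.txt library.
interface Rule {
  allow: boolean;
  path: string;
}

// Strip a leading byte order mark, one of the corner cases mentioned above.
function stripBom(text: string): string {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Collect the rules that apply to userAgent, falling back to the '*' group.
function rulesFor(robotsTxt: string, userAgent: string): Rule[] {
  const agent = userAgent.toLowerCase();
  const specific: Rule[] = [];
  const wildcard: Rule[] = [];
  let groupAgents: string[] = [];
  let readingAgents = false;
  for (const rawLine of stripBom(robotsTxt).split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon < 0) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === 'user-agent') {
      if (!readingAgents) groupAgents = []; // a new group starts here
      groupAgents.push(value.toLowerCase());
      readingAgents = true;
    } else if (key === 'allow' || key === 'disallow') {
      readingAgents = false;
      if (value === '') continue; // an empty rule matches nothing
      const rule: Rule = { allow: key === 'allow', path: value };
      if (groupAgents.indexOf(agent) >= 0) specific.push(rule);
      if (groupAgents.indexOf('*') >= 0) wildcard.push(rule);
    }
  }
  return specific.length > 0 ? specific : wildcard;
}

// Longest matching prefix wins; on equal length, Allow wins.
function isAllowed(robotsTxt: string, userAgent: string, urlPath: string): boolean {
  let best: Rule | null = null;
  for (const rule of rulesFor(robotsTxt, userAgent)) {
    if (urlPath.indexOf(rule.path) !== 0) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best === null || best.allow;
}
```

Even this small sketch has to make decisions about byte order marks, comments, empty rules, and conflicting rules, which is the kind of ambiguity a formal standard and a reference parser are meant to remove.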

If you want to check out the library, head over to our GitHub repository for the robots.txt parser. We'd love to see what you can build using it! If you build something using the library, drop us a comment on Twitter, and if you have comments or questions about the library, find us on GitHub.

Posted by Edu Pereda, Lode Vandevenne, and Gary, Search Open Sourcing team