Tag Archives: open source release

TF-Ranking: a scalable TensorFlow library for learning-to-rank

Cross-posted from the Google AI Blog.

Ranking, the process of ordering a list of items in a way that maximizes the utility of the entire list, is applicable in a wide range of domains, from search engines and recommender systems to machine translation, dialogue systems and even computational biology. In applications like these (and many others), researchers often utilize a set of supervised machine learning techniques called learning-to-rank. In many cases, these learning-to-rank techniques are applied to datasets that are prohibitively large — scenarios where the scalability of TensorFlow could be an advantage. However, there is currently no out-of-the-box support for applying learning-to-rank techniques in TensorFlow. To the best of our knowledge, there are also no other open source libraries that specialize in applying learning-to-rank techniques at scale.

Today, we are excited to share TF-Ranking, a scalable TensorFlow-based library for learning-to-rank. As described in our recent paper, TF-Ranking provides a unified framework that includes a suite of state-of-the-art learning-to-rank algorithms, and supports pairwise or listwise loss functions, multi-item scoring, ranking metric optimization, and unbiased learning-to-rank.

TF-Ranking is fast and easy to use, and creates high-quality ranking models. The unified framework gives ML researchers, practitioners and enthusiasts the ability to evaluate and choose among an array of different ranking models within a single library. Moreover, we strongly believe that a key to a useful open source library is not only providing sensible defaults, but also empowering our users to develop their own custom models. Therefore, we provide flexible API's, within which the users can define and plug in their own customized loss functions, scoring functions and metrics.

Existing Algorithms and Metrics Support

The objective of learning-to-rank algorithms is minimizing a loss function defined over a list of items to optimize the utility of the list ordering for any given application. TF-Ranking supports a wide range of standard pointwise, pairwise and listwise loss functions as described in prior work. This ensures that researchers using the TF-Ranking library are able to reproduce and extend previously published baselines, and practitioners can make the most informed choices for their applications. Furthermore, TF-Ranking can handle sparse features (like raw text) through embeddings and scales to hundreds of millions of training instances. Thus, anyone who is interested in building real-world data intensive ranking systems such as web search or news recommendation, can use TF-Ranking as a robust, scalable solution.

Empirical evaluation is an important part of any machine learning or information retrieval research. To ensure compatibility with prior work,  we support many of the commonly used ranking metrics, including Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We also make it easy to visualize these metrics at training time on TensorBoard, an open source TensorFlow visualization dashboard.
An example of the NDCG metric (Y-axis) along the training steps (X-axis) displayed in the TensorBoard. It shows the overall progress of the metrics during training. Different methods can be compared directly on the dashboard. Best models can be selected based on the metric.

Multi-Item Scoring

TF-Ranking supports a novel scoring mechanism wherein multiple items (e.g., web pages) can be scored jointly, an extension of the traditional scoring paradigm in which single items are scored independently. One challenge in multi-item scoring is the difficulty for inference where items have to be grouped and scored in subgroups. Then, scores are accumulated per-item and used for sorting. To make these complexities transparent to the user, TF-Ranking provides a List-In-List-Out (LILO) API to wrap all this logic in the exported TF models.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
As we demonstrate in recent work, multi-item scoring is competitive in its performance to the state-of-the-art learning-to-rank models such as RankNet, MART, and LambdaMART on a public LETOR benchmark.

Ranking Metric Optimization

An important research challenge in learning-to-rank is direct optimization of ranking metrics (such as the previously mentioned NDCG and MRR).  These metrics, while being able to measure the performance of ranking systems better than the standard classification metrics like Area Under the Curve (AUC), have the unfortunate property of being either discontinuous or flat. Therefore standard stochastic gradient descent optimization of these metrics is problematic.

In recent work, we proposed a novel method, LambdaLoss, which provides a principled probabilistic framework for ranking metric optimization. In this framework, metric-driven loss functions can be designed and optimized by an expectation-maximization procedure. The TF-Ranking library integrates the recent advances in direct metric optimization and provides an implementation of LambdaLoss. We are hopeful that this will encourage and facilitate further research advances in the important area of ranking metric optimization.

Unbiased Learning-to-Rank

Prior research has shown that given a ranked list of items, users are much more likely to interact with the first few results, regardless of their relevance. This observation has inspired research interest in unbiased learning-to-rank, and led to the development of unbiased evaluation and several unbiased learning algorithms, based on training instances re-weighting. In the TF-Ranking library, metrics are implemented to support unbiased evaluation and losses are implemented for unbiased learning by natively supporting re-weighting to overcome the inherent biases in user interactions datasets.

Getting Started with TF-Ranking

TF-Ranking implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. TF-Ranking is well integrated with the rich TensorFlow ecosystem. As described above, you can use TensorBoard to visualize ranking metrics like NDCG and MRR, as well as to pick the best model checkpoints using these metrics. Once your model is ready, it is easy to deploy it in production using TensorFlow Serving.

If you’re interested in trying TF-Ranking for yourself, please check out our GitHub repo, and walk through the tutorial examples. TF-Ranking is an active research project, and we welcome your feedback and contributions. We are excited to see how TF-Ranking can help the information retrieval and machine learning research communities.

By Xuanhui Wang and Michael Bendersky, Software Engineers, Google AI

Acknowledgements

This project was only possible thanks to the members of the core TF-Ranking team: Rama Pasumarthi, Cheng Li, Sebastian Bruch, Nadav Golbandi, Stephan Wolf, Jan Pfeifer, Rohan Anil, Marc Najork, Patrick McGregor and Clemens Mewald‎. We thank the members of the TensorFlow team for their advice and support: Alexandre Passos, Mustafa Ispir, Karmel Allison, Martin Wicke, and others. Finally, we extend our special thanks to our collaborators, interns and early adopters: Suming Chen, Zhen Qin, Chirag Sethi, Maryam Karimzadehgan, Makoto Uchida, Yan Zhu, Qingyao Ai, Brandon Tran, Donald Metzler, Mike Colagrosso, and many others at Google who helped in evaluating and testing the early versions of TF-Ranking.

Introducing a Web Component and Data API for Quick, Draw!


Over the past couple years, the Creative Lab in collaboration with the Handwriting Recognition team have released a few experiments in the realm of “doodle” recognition.  First, in 2016, there was Quick, Draw!, which uses a neural network to guess what you’re drawing.  Since Quick, Draw! launched we have collected over 1 billion drawings across 345 categories.  In the wake of that popularity, we open sourced a collection of 50 million drawings giving developers around the world access to the data set and the ability to conduct research with it.

"The different ways in which people draw are like different notes in some universally human scale" - Ian Johnson, UX Engineer @ Google

Since the initial dataset was released, it has been incredible to see how graphs, t-sne clusters, and simply overlapping millions of these doodles have given us the ability to infer interesting human behaviors, across different cultures.  One example, from the Quartz study, is that 86% of Americans (from a sample of 50,000) draw their circles counterclockwise, while 80% of Japanese (from a sample of 800) draw them clockwise. Part of this pattern in behavior can be attributed to the strict stroke order in Japanese writing, from the top left to the bottom right.


It’s also interesting to see how the data looks when it’s overlaid by country, as Kyle McDonald did, when he discovered that some countries draw their chairs in perspective while others draw them straight on.


On the more fun, artistic spectrum, there are some simple but clever uses of the data like Neil Mendoza’s face tracking experiment and Deborah Schmidt’s letter collages.
See the video here of Neil Mendoza mapping Quick, Draw! facial features to your own face in front of a webcam


See the process video here of Deborah Schmidt packing QuickDraw data into letters using OpenFrameworks
Some handy tools have also been released from the community since the release of all this data, and one of those that we’re releasing now is a Polymer component that allows you to display a doodle in your web-based project with one line of markup:

The Polymer component is coupled with a Data API that layers a massive file directory (50 million files) and returns a JSON object or an HTML canvas rendering for each drawing.  Without downloading all the data, you can start creating right away in prototyping your ideas.  We’ve also provided instructions for how to host the data and API yourself on Google Cloud Platform (for more serious projects that demand a higher request limit).  

One really handy tool when hosting an API on Google Cloud is Cloud Endpoints.  It allowed us to launch a demo API with a quota limit and authentication via an API key.  

By defining an OpenAPI specification (here is the Quick, Draw! Data API spec) and adding these three lines to our app.yaml file, an Extensible Service Proxy (ESP) gets deployed with our API backend code (more instructions here):
endpoints_api_service:
name: quickdrawfiles.appspot.com
rollout_strategy: managed
Based on the OpenAPI spec, documentation is also automatically generated for you:


We used a public Google Group as an access control list, so anyone who joins can then have the API available in their API library.
The Google Group used as an Access Control List
This component and Data API will make it easier for  creatives out there to manipulate the data for their own research.  Looking to the future, a potential next step for the project could be to store everything in a single database for more complex queries (i.e. “give me an recognized drawing from China in March 2017”).  Feedback is always welcome, and we hope this inspires even more types of projects using the data! More details on the project and the incredible research projects done using it can be found on our GitHub repo

By Nick Jonas, Creative Technologist, Creative Lab

Editor's Note: Some may notice that this isn’t the only dataset we’ve open sourced recently! You can find many more datasets in our open source project directory.

Outline: secure access to the open web

Censorship and surveillance are challenges that many journalists around the world face on a daily basis. Some of them use a virtual private network (VPN) to provide safer access to the open internet, but not all VPNs are equally reliable and trustworthy, and even fewer are open source.

That’s why Jigsaw created Outline, a new open source, independently audited platform that lets any organization easily create and operate their own VPN.

Outline’s most striking feature is arguably how easy it is to use. An organization starts by downloading the Outline Manager app, which lets them sign in to DigitalOcean, where they can host their own VPN, and set it up with just a few clicks. They can also easily use other cloud providers, provided they have shell access to run the installation script. Once an Outline server is set up, the server administrator can create access credentials and share with their network of contacts, who can then use the Outline clients to connect to it.


A core element to any VPN’s security is the protocol that the server and clients use to communicate. When we looked at the existing protocols, we realized that many of them were easily identifiable by network adversaries looking to spot and block VPN traffic. To make Outline more resilient against this threat, we chose Shadowsocks, a secure, handshake-less, and open source protocol that is known for its strength and performance, and enjoys the support of many developers worldwide. Shadowsocks is a combination of a simplified SOCKS5-like routing protocol, running on top of an encrypted channel. We chose the AEAD_CHACHA20_POLY1305 cipher, which is an IETF standard and provides the security and performance users need.

Another important component to security is running up-to-date software. We package the server code as a Docker image, enabling us to run on multiple platforms, and allowing for automatic updates using Watchtower. On DigitalOcean installations, we also enable automatic security updates on the host machine.

If security is one of the most critical parts of creating a better VPN, usability is the other. We wanted Outline to offer a consistent, simple user experience across platforms, and for it to be easy for developers around the world to contribute to it. With that in mind, we use the cross-platform development framework Apache Cordova for Android, iOS, macOS and ChromeOS, and Electron for Windows. The application logic is a web application written in TypeScript, while the networking code had to be written in native code for each platform. This setup allows us to reutilize most of code, and create consistent user experiences across diverse platforms.

In order to encourage a robust developer community we wanted to strike a balance between simplicity, reproducibility, and automation of future contributions. To that end, we use Travis for continuous builds and to generate the binaries that are ultimately uploaded to the app stores. Thanks to its cross-platform support, any team member can produce a macOS or Windows binary with a single click. We also use Docker to package the build tools for client platforms, and thanks to Electron, developers familiar with the server's Node.js code base can also contribute to the Outline Manager application.

You can find our code in the Outline GitHub repositories and more information on the Outline website. We hope that more developers join the project to build technology that helps people connect to the open web and stay more safe online.

By Vinicius Fortuna, Jigsaw

Introducing the Tink cryptographic software library

Cross-posted on the Google Security Blog

At Google, many product teams use cryptographic techniques to protect user data. In cryptography, subtle mistakes can have serious consequences, and understanding how to implement cryptography correctly requires digesting decades' worth of academic literature. Needless to say, many developers don’t have time for that.

To help our developers ship secure cryptographic code we’ve developed Tink—a multi-language, cross-platform cryptographic library. We believe in open source and want Tink to become a community project—thus Tink has been available on GitHub since the early days of the project, and it has already attracted several external contributors. At Google, Tink is already being used to secure data of many products such as AdMob, Google Pay, Google Assistant, Firebase, the Android Search App, etc. After nearly two years of development, today we’re excited to announce Tink 1.2.0, the first version that supports cloud, Android, iOS, and more!

Tink aims to provide cryptographic APIs that are secure, easy to use correctly, and hard(er) to misuse. Tink is built on top of existing libraries such as BoringSSL and Java Cryptography Architecture, but includes countermeasures to many weaknesses in these libraries, which were discovered by Project Wycheproof, another project from our team.

With Tink, many common cryptographic operations such as data encryption, digital signatures, etc. can be done with only a few lines of code. Here is an example of encrypting and decrypting with our AEAD interface in Java:
 import com.google.crypto.tink.Aead;
import com.google.crypto.tink.KeysetHandle;
import com.google.crypto.tink.aead.AeadFactory;
import com.google.crypto.tink.aead.AeadKeyTemplates;
// 1. Generate the key material.
KeysetHandle keysetHandle = KeysetHandle.generateNew(
AeadKeyTemplates.AES256_EAX);
// 2. Get the primitive.
Aead aead = AeadFactory.getPrimitive(keysetHandle);
// 3. Use the primitive.
byte[] plaintext = ...;
byte[] additionalData = ...;
byte[] ciphertext = aead.encrypt(plaintext, additionalData);
Tink aims to eliminate as many potential misuses as possible. For example, if the underlying encryption mode requires nonces and nonce reuse makes it insecure, then Tink does not allow the user to pass nonces. Interfaces have security guarantees that must be satisfied by each primitive implementing the interface. This may exclude some encryption modes. Rather than adding them to existing interfaces and weakening the guarantees of the interface, it is possible to add new interfaces and describe the security guarantees appropriately.

We’re cryptographers and security engineers working to improve Google’s product security, so we built Tink to make our job easier. Tink shows the claimed security properties (e.g., safe against chosen-ciphertext attacks) right in the interfaces, allowing security auditors and automated tools to quickly discover usages where the security guarantees don’t match the security requirements. Tink also isolates APIs for potentially dangerous operations (e.g., loading cleartext keys from disk), which allows discovering, restricting, monitoring and logging their usage.

Tink provides support for key management, including key rotation and phasing out deprecated ciphers. For example, if a cryptographic primitive is found to be broken, you can switch to a different primitive by rotating keys, without changing or recompiling code.

Tink is also extensible by design: it is easy to add a custom cryptographic scheme or an in-house key management system so that it works seamlessly with other parts of Tink. No part of Tink is hard to replace or remove. All components are composable, and can be selected and assembled in various combinations. For example, if you need only digital signatures, you can exclude symmetric key encryption components to minimize code size in your application.

To get started, please check out our HOW-TO for Java, C++ and Obj-C. If you'd like to talk to the developers or get notified about project updates, you may want to subscribe to our mailing list. To join, simply send an empty email to tink-users+subscribe@googlegroups.com. You can also post your questions to StackOverflow, just remember to tag them with tink.

We’re excited to share this with the community, and welcome your feedback!

By Thai Duong, Information Security Engineer, on behalf of Tink team

How we brought the latest version of Python to App Engine and Cloud Functions

At Cloud Next 2018, we added Python 3.7 support to Cloud Functions and now we’ve announced Python 3.7 support for the App Engine standard environment. These new runtimes allow you to write Python functions and apps using the latest version of Python and the rich ecosystem of packages available on Python Packaging Index (PyPI).

This new runtime marks a significant update to App Engine and was enabled by new open source software that we recently released: gVisor and FTL.

Python, straight from the source

Running Python 3.7 on App Engine and Cloud Functions required us to fundamentally rethink our infrastructure. Traditionally, meeting Google Cloud’s security requirements meant that we had to run a modified version of the Python interpreter. However, using a modified interpreter constrained some language features and only allowed us to support a limited set of whitelisted Python libraries.

Thanks to gVisor, a container sandbox that provides improved security and process isolation, we can now run the unmodified Python 3.7.0 interpreter. We’ve done extensive testing to make sure Python 3.7 is compatible with gVisor. As part of our compatibility testing, we run Python’s full suite of language tests, and tests for Python packages that are popular on PyPI. We’re committed to ensuring that everything you’ve come to know and love about Python is supported on our platform.

Seamless deployments

Most importantly, this change in our infrastructure makes it easier to take advantage of Python’s vast ecosystem. As a developer, you just add project dependencies to a requirements.txt file and deploy.

During deployment, FTL, a tool for building containers, fetches dependencies listed in your requirements.txt file and installs them alongside your app or function. FTL also includes a short-lived dependency cache, which speeds up repeated deployments if no changes are detected in your requirements.txt file. This is particularly useful if you find just need to re-deploy because you found a typo.

Keeping up with the Pythonistas

In making these changes, we also decided to expand the list of system packages that are included with each runtime’s Ubuntu 18.04 distribution. We think that will make life just a little bit easier for developers working with the latest release of Python.

Looking forward, we’re excited about how these changes will allow us to keep up with the Python community’s progress as they release new versions and libraries. Please let us know what you think and if you run into any challenges.

You can learn more about how to get started with it on App Engine and Cloud Functions in our documentation. We can’t wait to see what you build with Python 3.7.

By Stewart Reichling, Product Manager

Introducing the new lead for Android Open Source Project

This week began with the announcement of Android 9 Pie and, as usual, the subsequent upstreaming of code to the Android Open Source Project (AOSP). But the release of Android 9 isn’t the only important Android news!

Tucked away in the announcement to the Android Building mailing list was this note:

“I also wanted to take a moment to introduce myself as the new Tech Lead / Manager for AOSP. My name is Jeff Bailey, and I’ve been involved in the Open Source community for more than two decades. Since I joined the Android team a few months ago, I’ve been learning how we do things and getting an understanding of how we could work better with the community. I’d love to hear from you: @JeffBaileyAOSP on Twitter or jeffbailey+aosp@google.com. Be well!”

As Jeff notes in his introduction, he has a history in free and open source software (FOSS). He’s been an avid user, contributor, and maintainer since before the Open Source Definition was inked!

Jeff co-founded Savannah, where GNU software is developed and distributed, spent 15 years working on Debian, and has been an Ubuntu core developer. Further, he spent some time on the Google Open Source team and was involved in open sourcing Android back in 2008.

Open source projects, even those which originate inside of companies, are powered by the community of users and contributors that surround them. And those communities thrive when they have stewards who are steeped in the traditions of free and open source software. We’re excited for AOSP as Jeff takes the reins. He brings both technical and cultural skills to the table, and he’s been involved with the project since the beginning!

Suffice it to say, AOSP is in good hands. We welcome Jeff to his new role and, as he said in his introduction, he’d love to hear from the community: you can reach Jeff on Twitter and via email.

By Josh Simmons, Google Open Source

Announcing Cirq: an open source framework for NISQ algorithms

Cross-posted from the Google AI Blog

Over the past few years, quantum computing has experienced a growth not only in the construction of quantum hardware, but also in the development of quantum algorithms. With the availability of Noisy Intermediate Scale Quantum (NISQ) computers (devices with ~50 - 100 qubits and high fidelity quantum gates), the development of algorithms to understand the power of these machines is of increasing importance. However, a common problem when designing a quantum algorithm on a NISQ processor is how to take full advantage of these limited quantum devices—using resources to solve the hardest part of the problem rather than on overheads from poor mappings between the algorithm and hardware. Furthermore some quantum processors have complex geometric constraints and other nuances, and ignoring these will either result in faulty quantum computation, or a computation that is modified and sub-optimal.*

Today at the First International Workshop on Quantum Software and Quantum Machine Learning (QSML), the Google AI Quantum team announced the public alpha of Cirq, an open source framework for NISQ computers. Cirq is focused on near-term questions and helping researchers understand whether NISQ quantum computers are capable of solving computational problems of practical importance. Cirq is licensed under Apache 2, and is free to be modified or embedded in any commercial or open source package.

Once installed, Cirq enables researchers to write quantum algorithms for specific quantum processors. Cirq gives users fine tuned control over quantum circuits, specifying gate behavior using native gates, placing these gates appropriately on the device, and scheduling the timing of these gates within the constraints of the quantum hardware. Data structures are optimized for writing and compiling these quantum circuits to allow users to get the most out of NISQ architectures. Cirq supports running these algorithms locally on a simulator, and is designed to easily integrate with future quantum hardware or larger simulators via the cloud.


We are also announcing the release of OpenFermion-Cirq, an example of a Cirq based application enabling near-term algorithms. OpenFermion is a platform for developing quantum algorithms for chemistry problems, and OpenFermion-Cirq is an open source library which compiles quantum simulation algorithms to Cirq. The new library uses the latest advances in building low depth quantum algorithms for quantum chemistry problems to enable users to go from the details of a chemical problem to highly optimized quantum circuits customized to run on particular hardware. For example, this library can be used to easily build quantum variational algorithms for simulating properties of molecules and complex materials.

Quantum computing will require strong cross-industry and academic collaborations if it is going to realize its full potential. In building Cirq, we worked with early testers to gain feedback and insight into algorithm design for NISQ computers. Below are some examples of Cirq work resulting from these early adopters:
To learn more about how Cirq is helping enable NISQ algorithms, please visit the links above where many of the adopters have provided example source code for their implementations.

Today, the Google AI Quantum team is using Cirq to create circuits that run on Google’s Bristlecone processor. In the future, we plan to make this processor available in the cloud, and Cirq will be the interface in which users write programs for this processor. In the meantime, we hope Cirq will improve the productivity of NISQ algorithm developers and researchers everywhere. Please check out the GitHub repositories for Cirq and OpenFermion-Cirq — pull requests welcome!

By Alan Ho, Product Lead and Dave Bacon, Software Lead, Google AI Quantum Team

Acknowledgements
We would like to thank Craig Gidney for leading the development of Cirq, Ryan Babbush and Kevin Sung for building OpenFermion-Cirq and a whole host of code contributors to both frameworks.



* An analogous situation is how early classical programmers needed to run complex programs in very small memory spaces by paying careful attention to the lowest level details of the hardware.

Introducing Data Transfer Project: an open source platform promoting universal data portability

In 2007, a small group of engineers in our Chicago office formed the Data Liberation Front, a team that believed consumers should have better tools to put their data where they want, when they want, and even move it to a different service. This idea, called “data portability,” gives people greater control of their information, and pushes us to develop great products because we know they can pack up and leave at any time.

In 2011, we launched Takeout, a new way for Google users to download or transfer a copy of the data they store or create in a variety of industry-standard formats. Since then, we've continued to invest in Takeout—we now call it Download Your Data—and today, our users can download a machine-readable copy of the data they have stored in 50+ Google products, with more on the way.

Now, we’re taking our commitment to portability a step further. In tandem with Microsoft, Twitter, and Facebook we’re announcing the Data Transfer Project, an open source initiative dedicated to developing tools that will enable consumers to transfer their data directly from one service to another, without needing to download and re-upload it. Download Your Data users can already do this; they can transfer their information directly to their Dropbox, Box, MS OneDrive, and Google Drive accounts today. With this project, the development of which we mentioned in our blog post about preparations for the GDPR, we’re looking forward to working with companies across the industry to bring this type of functionality to individuals across the web.

Our approach

The organizations involved with this project are developing tools that can convert any service's proprietary APIs to and from a small set of standardized data formats that can be used by anyone. This makes it possible to transfer data between any two providers using existing industry-standard infrastructure and authorization mechanisms, such as OAuth. So far, we have developed adapters for seven different service providers across five different types of consumer data; we think this demonstrates the viability of this approach to scale to a large number of use cases.

Consumers will benefit from improved flexibility and control over their data. They will be able to import their information into any participating service that offers compelling features—even brand new ones that could rely on powerful, cloud-based infrastructure rather than the consumers’ potentially limited bandwidth and capability to transfer files. Services will benefit as well, as they will be able to compete for users that can move their data more easily.

Protecting users’ data and keeping them in control

Data security and privacy are foundational to the design of the Data Transfer Project. Services must first agree to allow data transfer between them, and then they will require that individuals authenticate each account independently. All credentials and user data will be encrypted both in transit and at rest. The protocol uses a form of perfect forward secrecy where a new unique key is generated for each transfer. Additionally, the framework allows partners to support any authorization mechanism they choose. This enables partners to leverage their existing security infrastructure when authorizing accounts.

As it is an open source product, anyone can inspect the code to verify that data isn't being collected or used for profiling purposes. Tech savvy consumers are also free to download and run an instance of the framework themselves. Interested parties can learn more at the Data Transfer Project website, which explains the technical foundations behind the project and goes into greater detail on how it works.

How to get involved

It is very early days for the Data Transfer Project and we encourage the developer community to join us and help extend the platform to support many more data types, service providers, and hosting solutions.

The Data Transfer Project’s open source code can be found at datatransferproject.dev and you can learn more about Google’s approach to portability in our paper, where we describe our history with this topic and the values and principles that motivated us to invest in the Data Transfer Project. Our prototype already supports data transfer for several product verticals including: photos, mail, contacts, calendar, and tasks. These are enabled by existing, publicly available APIs from Google, Microsoft, Twitter, Flickr, Instagram, Remember the Milk, and Smugmug.

Data portability makes it easy for consumers to try new services and use the ones that they like best. We’re thrilled to help drive an initiative that incentivizes companies large and small to continue innovating across the internet. We’re just getting started and we’re looking forward to what comes next.

By Brian Willard, Software Engineer and Greg Fair, Product Manager

31st anniversary of the GIF: give your terminal some personality with Tenor GIF for CLI

Creating ASCII art for your terminal isn’t new. Displaying animating ASCII GIFs in the CLI (command line interface), however, hasn’t been possible, or at least, not easy to do -- until now.

Shortly after Tenor was acquired by Google, I had an idea.

Many developers configure static ASCII art to appear when opening a terminal, but I imagined that ASCII art could animate like a GIF, and could easily be created from any GIF on Tenor.

After some tinkering, GIF for CLI was born.

Just in time for the 31st anniversary of the GIF, GIF for CLI is available today on GitHub. GIF for CLI takes in a GIF, short video, or a query to the Tenor GIF API and converts it to animated ASCII art. This means each time you log on to your programming workstation, your GIF is there to greet you in ASCII form. Animation and color support are performed using ANSI escape sequences.

Rob Delaney as “Peter” from Deadpool 2, in ASCII GIF form. See the original GIF on Tenor here.
For the more technically-minded, here are the details:

When the command line program is run, it takes the chosen .gif file (file, url, or Tenor query) and uses ffmpeg to split the animated gif or short video into static jpg frames. Those jpg frames are then converted to ASCII art (these ASCII art frames are cached in $HOME/.cache/gif-for-cli). The program then prints one frame at a time to the console, clearing the console using ANSI escape sequences between each frame.

You could also use GIF for CLI to run gif-for-cli in your .bashrc or .profile to get an animated ASCII art image as your MOTD, or with Git hooks.

GIF for CLI integrates with the Tenor GIF API to source the GIFs. The Tenor API powers GIF search within many of today’s most popular messaging apps and social platforms on iOS and Android. 

Share screen captures of your ASCII GIFs with us on Twitter with #GIFforCLI. 

By Sean Hayes, Tenor

Introducing Git protocol version 2


Today we announce Git protocol version 2, a major update of Git's wire protocol (how clones, fetches and pushes are communicated between clients and servers). This update removes one of the most inefficient parts of the Git protocol and fixes an extensibility bottleneck, unblocking the path to more wire protocol improvements in the future.

The protocol version 2 spec can be found here. The main improvements are:
    The main motivation for the new protocol was to enable server side filtering of references (branches and tags). Prior to protocol v2, servers responded to all fetch commands with an initial reference advertisement, listing all references in the repository. This complete listing is sent even when a client only cares about updating a single branch, e.g.: `git fetch origin master`. For repositories that contain 100s of thousands of references (the Chromium repository has over 500k branches and tags) the server could end up sending 10s of megabytes of data that get ignored. This typically dominates both time and bandwidth during a fetch, especially when you are updating a branch that's only a few commits behind the remote, or even when you are only checking if you are up-to-date, resulting in a no-op fetch.

    We recently rolled out support for protocol version 2 at Google and have seen a performance improvement of 3x for no-op fetches of a single branch on repositories containing 500k references. Protocol v2 has also enabled a reduction of 8x of the overhead bytes (non-packfile) sent from googlesource.com servers. A majority of this improvement is due to filtering references advertised by the server to the refs the client has expressed interest in.

    Getting over the hurdles

    The Git project has tried on a number of occasions over the years to either limit the initial ref advertisement or move to a new protocol altogether but continued to run into two problems: (1) the initial request is rigid and does not include a field that could be used to request that new servers modify their response without breaking compatibility with existing servers and (2) error handling is not well enough defined to allow safely using a new protocol that existing servers do not understand with a quick fallback to the old protocol. To migrate to a new protocol version, we needed to find a side channel which existing servers would ignore but could be used to safely communicate with newer servers.

    There are three main transports that are used to speak Git’s wire-protocol (git://, ssh://, and https://), and the side channel that we use to request v2 needs to communicate in such a way that an older server would ignore any additional data sent and not crash. The http transport was the easiest as we can simply include an additional http header in the request (“Git-Protocol: version=2”). The ssh transport is a bit more difficult as it requires sending an environment variable (“GIT_PROTOCOL=version=2”) to be set on the remote end. This is more challenging because it requires server administrators to configure sshd to accept the new environment variable on their server. The most difficult transport is the anonymous Git transport (git://).

    Initial requests made to a server using the anonymous Git transport are made in the form of a single packet-line which includes the requested service (git-upload-pack for fetches and git-receive-pack for pushes), and the repository followed by a NUL byte. Later virtualization support was added and a hostname parameter could be tacked on and  terminated by a NUL byte: `0033git-upload-pack /project.git\0host=myserver.com\0`. Ideally we’d be able to add a new parameter to be used to request v2 by adding it in the same manner as the hostname was added: `003dgit-upload-pack /project.git\0host=myserver.com\0version=2\0`. Unfortunately due to a bug introduced in 2006 we aren't able to place any extra arguments (separated by NULs) other than the host because otherwise the parsing of those arguments would enter an infinite loop. When this bug was fixed in 2009, a check was put in place to disallow extra arguments so that new clients wouldn't trigger this bug in older servers.

    Fortunately, that check doesn't notice if we send additional request arguments hidden behind a second NUL byte, which was pointed out back in 2009.  This allows requests structured like: `003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0`. By placing version information behind a second NUL byte we can skirt around both the infinite loop bug and the explicit disallowal of extra arguments besides hostname. Only newer servers will know to look for additional information hidden behind two NUL bytes and older servers won’t croak.

    Now, in every case, a client can issue a request to use v2, using a transport-specific side channel, and v2 servers can respond using the new protocol while older servers will ignore the side channel and just respond with a ref advertisement.

    Try it for yourself

    To try out protocol version 2 for yourself you'll need an up to date version of Git (support for v2 was recently merged to Git's master branch and is expected to be part of Git 2.18) and a v2 enabled server (repositories on googlesource.com and Cloud Source Repositories are v2 enabled). If you enable tracing and run the `ls-remote` command querying for a single branch, you can see the server sends a much smaller set of references when using protocol version 2:

    ```
    # Using the original wire protocol
    GIT_TRACE_PACKET=1 git -c protocol.version=0 ls-remote https://chromium.googlesource.com/chromium/src.git master

    # Using protocol version 2
    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://chromium.googlesource.com/chromium/src.git master
    ```

    By Brandon Williams, Git-core Team