Tag Archives: releases

Announcing the S2 Library: Geometry on the Sphere

Google has always embraced new approaches to organizing all the world's information, and this includes all the world's geography. Today we are announcing the open source release of Google's S2 library, the core geometric library on which Google's global geographic database is built.

A unique feature of the S2 library is that unlike traditional geographic information systems, which represent data as flat two-dimensional projections (similar to an atlas), the S2 library represents all data on a three-dimensional sphere (similar to a globe). This makes it possible to build a worldwide geographic database with no seams or singularities, using a single coordinate system, and with low distortion everywhere compared to the true shape of the Earth. While the Earth is not quite spherical, it is much closer to being a sphere than it is to being flat!

Notable features of the library include:
  • Flexible support for spatial indexing, including the ability to approximate arbitrary regions as collections of discrete S2 cells. This feature makes it easy to build large distributed spatial indexes. (The image above illustrates the S2 space-filling curve, an important tool used for spatial indexing.)
  • Fast in-memory spatial indexing of collections of points, polylines, and polygons.
  • Robust constructive operations (such as intersection, union, and simplification) and boolean predicates (such as testing for containment).
  • Efficient query operations for finding nearby objects, measuring distances, computing centroids, etc.
  • A flexible and robust implementation of snap rounding (a geometric technique that allows operations to be implemented 100% robustly while using small and fast coordinate representations).
  • A collection of efficient yet exact mathematical predicates for testing relationships among geometric primitives.
  • Extensive testing on Google's vast collection of geographic data.
  • Flexible Apache 2.0 license.
The reference implementation of the S2 library is written in C++, and subsets have been ported to Go, Java, and Python. An early version of the code was released in 2011, but today's announcement represents a major update along with a commitment to maintain the library going forward. The code is under active development and new features will be released regularly. (The Java port is based on the 2011 code and does not have the same robustness, performance, or features as the current C++ version.)

Our C++ code repository is here: https://github.com/google/s2geometry
And check out our documentation here: https://s2geometry.io

To learn more, start by reading the overview and quick start documents, then explore the documentation site. The library also has extensive documentation in the header files, which is where the most authoritative information can be found. More introductions and tutorials will be added over time - contributions are welcome!

The S2 library was written primarily by Eric Veach. Other significant contributors include Jesse Rosenstock, Eric Engle (Java port lead), Robert Snedegar (Go port lead), Julien Basch, and Tom Manshreck.

By Eric Veach, Software Engineer

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.


DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.


DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.


DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

Introducing container-diff, a tool for quickly comparing container images

Originally posted by Nick Kubala, Colette Torres, and Abby Tisdale from the Container Tools team, on the Google Open Source Blog

The Google Container Tools team originally built container-diff, a new project to help uncover differences between container images, to aid our own development with containers. We think it can be useful for anyone building containerized software, so we're excited to release it as open source to the development community.

Containers and the Dockerfile format help make customization of an application's runtime environment more approachable and easier to understand. While this is a great advantage of using containers in software development, a major drawback is that it can be hard to visualize what changes in a container image will result from a change in the respective Dockerfile. This can lead to bloated images and make tracking down issues difficult.

Imagine a scenario where a developer is working on an application, built on a runtime image maintained by a third-party. During development someone releases a new version of that base image with updated system packages. The developer rebuilds their application and picks up the latest version of the base image, and suddenly their application stops working; it depended on a previous version of one of the installed system packages, but which one? What version was it on before? With no currently existing tool to easily determine what changed between the two base image versions, this totally stalls development until the developer can track down the package version incompatibility.

Introducing container-diff

container-diff helps users investigate image changes by computing semantic diffs between images. What this means is that container-diff figures out on a low-level what data changed, and then combines this with an understanding of package manager information to output this information in a format that's actually readable to users. The tool can find differences in system packages, language-level packages, and files in a container image.

Users can specify images in several formats - from local Docker daemon (using the prefix `daemon://` on the image path), a remote registry (using the prefix `remote://`), or a file in the .tar in the format exported by "docker save" command. You can also combine these formats to compute the diff between a local version of an image and a remote version. This can be useful when experimenting with new builds of an image that you might not be quite ready to push yet. container-diff supports image tarballs and the registry protocol natively, enabling it to run in environments without a Docker daemon.

Examples and Use Cases

Here is a basic Dockerfile that installs Python inside our Debian base image. Running container-diff on the base image and the new one with Python, users can see all the apt packages that were installed as dependencies of Python.

➜  debian_with_python cat Dockerfile
FROM gcr.io/google-appengine/debian8
RUN apt-get update && apt-get install -qq --force-yes python
➜ debian_with_python docker build -q -t debian_with_python .
sha256:be2cd1ae6695635c7041be252589b73d1539a858c33b2814a66fe8fa4b048655
➜ debian_with_python container-diff diff gcr.io/google-appengine/debian8:latest daemon://debian_with_python:latest

-----Apt-----
Packages found only in gcr.io/google-appengine/debian8:latest: None

Packages found only in debian_with_python:latest:
NAME VERSION SIZE
-file 1:5.22 15-2+deb8u3 76K
-libexpat1 2.1.0-6 deb8u4 386K
-libffi6 3.1-2 deb8u1 43K
-libmagic1 1:5.22 15-2+deb8u3 3.1M
-libpython-stdlib 2.7.9-1 54K
-libpython2.7-minimal 2.7.9-2 deb8u1 2.6M
-libpython2.7-stdlib 2.7.9-2 deb8u1 8.2M
-libsqlite3-0 3.8.7.1-1 deb8u2 877K
-mime-support 3.58 146K
-python 2.7.9-1 680K
-python-minimal 2.7.9-1 163K
-python2.7 2.7.9-2 deb8u1 360K
-python2.7-minimal 2.7.9-2 deb8u1 3.7M

Version differences: None

And below is a Dockerfile that inherits from our Python base runtime image, and then installs the mock and six packages inside of it. Running container-diff with the pip differ, users can see all the Python packages that have either been installed or changed as a result of this:

➜  python_upgrade cat Dockerfile
FROM gcr.io/google-appengine/python
RUN pip install -U six
➜ python_upgrade docker build -q -t python_upgrade .
sha256:7631573c1bf43727d7505709493151d3df8f98c843542ed7b299f159aec6f91f
➜ python_upgrade container-diff diff gcr.io/google-appengine/python:latest daemon://python_upgrade:latest --types=pip

-----Pip-----

Packages found only in gcr.io/google-appengine/python:latest: None

Packages found only in python_upgrade:latest:
NAME VERSION SIZE
-funcsigs 1.0.2 51.4K
-mock 2.0.0 531.2K
-pbr 3.1.1 471.1K

Version differences:
PACKAGE IMAGE1 (gcr.io/google-appengine/python:latest) IMAGE2 (python_upgrade:latest)
-six 1.8.0, 26.7K

This can be especially useful when it's unclear which packages might have been installed or changed incidentally as a result of dependency management of Python modules.

These are just a few examples. The tool currently has support for Python and Node.js packages installed via pip and npm, respectively, as well as comparison of image filesystems and Docker history. In the future, we'd like to see support added for additional runtime and language differs, including Java, Go, and Ruby. External contributions are welcome! For more information on contributing to container-diff, see this how-to guide.

Now that we've seen container-diff compare two images in action, it's easy to imagine how the tool may be integrated into larger workflows to aid in development:

  • Changelog generation: Given container-diff's capacity to facilitate investigation of filesystem and package modifications, it can do most of the heavy lifting in discerning changes for automatic changelog generation for new releases of an image.
  • Continuous integration: As part of a CI system, users can leverage container-diff to catch potentially breaking filesystem changes resulting from a Dockerfile change in their builds.

container-diff's default output mode is "human-readable," but also supports output to JSON, allowing for easy automated parsing and processing by users.

Single Image Analysis

In addition to comparing two images, container-diff has the ability to analyze a single image on its own. This can enable users to get a quick glance at information about an image, such as its system and language-level package installations and filesystem contents.

Let's take a look at our Debian base image again. We can use the tool to easily view a list of all packages installed in the image, along with each one's installed version and size:

➜  Development container-diff analyze gcr.io/google-appengine/debian8:latest

-----Apt-----
Packages found in gcr.io/google-appengine/debian8:latest:
NAME VERSION SIZE
-acl 2.2.52-2 258K
-adduser 3.113 nmu3 1M
-apt 1.0.9.8.4 3.1M
-base-files 8 deb8u9 413K
-base-passwd 3.5.37 185K
-bash 4.3-11 deb8u1 4.9M
-bsdutils 1:2.25.2-6 181K
-ca-certificates 20141019 deb8u3 367K
-coreutils 8.23-4 13.9M
-dash 0.5.7-4 b1 191K
-debconf 1.5.56 deb8u1 614K
-debconf-i18n 1.5.56 deb8u1 1.1M
-debian-archive-keyring 2017.5~deb8u1 137K

We could use this to verify compatibility with an application we're building, or maybe sort the packages by size in another one of our images and see which ones are taking up the most space.

For more information about this tool as well as a breakdown with examples, uses, and inner workings of the tool, please take a look at documentation on our GitHub page. Happy diffing!

Special thanks to Colette Torres and Abby Tisdale, our software engineering interns who helped build the tool from the ground up.

Tangent: Source-to-Source Debuggable Derivatives

Crossposted on the Google Research Blog

Tangent is a new, free, and open source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models.
This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.

Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.
As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code

How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as  subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.

Tangent has a one-function API:
import tangent
df = tangent.grad(f)
Here’s an animated graphic of what happens when we call tangent.grad on a Python function:
If you want to print out your derivatives, you can run
import tangent
df = tangent.grad(f, verbose=1)
Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end of the derivative function. This reverse-order processing gives the technique its name: reverse-mode automatic differentiation.

The function df above only works for scalar (non-array) inputs. Tangent also supports
Although we started with TensorFlow Eager support, Tangent isn’t tied to one numeric library or another—we would gladly welcome pull requests adding PyTorch or MXNet derivative recipes.

Next Steps

Tangent is open source now at github.com/google/tangent. Go check it out for download and installation instructions. Tangent is still an experiment, so expect some bugs. If you report them to us on GitHub, we will do our best to fix them quickly.

We are working to add support in Tangent for more aspects of the Python language (e.g., closures, inline function definitions, classes, more NumPy and TensorFlow functions). We also hope to add more advanced automatic differentiation and compiler functionality in the future, such as automatic trade-off between memory and compute (Griewank and Walther 2000; Gruslys et al., 2016), more aggressive optimizations, and lambda lifting.

We intend to develop Tangent together as a community. We welcome pull requests with fixes and features. Happy deriving!

By Alex Wiltschko, Research Scientist, Google Brain Team

Acknowledgments

Bart van Merriënboer contributed immensely to all aspects of Tangent during his internship, and Dan Moldovan led TF Eager integration, infrastructure and benchmarking. Also, thanks to the Google Brain team for their support of this post and special thanks to Sanders Kleinfeld and Aleks Haecky for their valuable contribution for the technical aspects of the post.

Announcing OpenFermion: The Open Source Chemistry Package for Quantum Computers

Crossposted on the Google Research Blog

“The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.”
-Paul Dirac, Quantum Mechanics of Many-Electron Systems (1929)

In this passage, physicist Paul Dirac laments that while quantum mechanics accurately models all of chemistry, exactly simulating the associated equations appears intractably complicated. Not until 1982 would Richard Feynman suggest that instead of surrendering to the complexity of quantum mechanics, we might harness it as a computational resource. Hence, the original motivation for quantum computing: by operating a computer according to the laws of quantum mechanics, one could efficiently unravel exact simulations of nature. Such simulations could lead to breakthroughs in areas such as photovoltaics, batteries, new materials, pharmaceuticals and superconductivity. And while we do not yet have a quantum computer large enough to solve classically intractable problems in these areas, rapid progress is being made. Last year, Google published this paper detailing the first quantum computation of a molecule using a superconducting qubit quantum computer. Building on that work, the quantum computing group at IBM scaled the experiment to larger molecules, which made the cover of Nature last month.

Today, we announce the release of OpenFermion, the first open source platform for translating problems in chemistry and materials science into quantum circuits that can be executed on existing platforms. OpenFermion is a library for simulating the systems of interacting electrons (fermions) which give rise to the properties of matter. Prior to OpenFermion, quantum algorithm developers would need to learn a significant amount of chemistry and write a large amount of code hacking apart other codes to put together even the most basic quantum simulations. While the project began at Google, collaborators at ETH Zurich, Lawrence Berkeley National Labs, University of Michigan, Harvard University, Oxford University, Dartmouth College, Rigetti Computing and NASA all contributed to alpha releases. You can learn more details about this release in our paper, OpenFermion: The Electronic Structure Package for Quantum Computers.

One way to think of OpenFermion is as a tool for generating and compiling physics equations which describe chemical and material systems into representations which can be interpreted by a quantum computer1. The most effective quantum algorithms for these problems build upon and extend the power of classical quantum chemistry packages used and developed by research chemists across government, industry and academia. Accordingly, we are also releasing OpenFermion-Psi4 and OpenFermion-PySCF which are plugins for using OpenFermion in conjunction with the classical electronic structure packages Psi4 and PySCF.

The core OpenFermion library is designed in a quantum programming framework agnostic way to ensure compatibility with various platforms being developed by the community. This allows OpenFermion to support external packages which compile quantum assembly language specifications for diverse hardware platforms. We hope this decision will help establish OpenFermion as a community standard for putting quantum chemistry on quantum computers. To see how OpenFermion is used with diverse quantum programming frameworks, take a look at OpenFermion-ProjectQ and Forest-OpenFermion - plugins which link OpenFermion to the externally developed circuit simulation and compilation platforms known as ProjectQ and Forest.

The following workflow describes how a quantum chemist might use OpenFermion in order to simulate the energy surface of a molecule (for instance, by preparing the sort of quantum computation we described in our past blog post):
  1. The researcher initializes an OpenFermion calculation with specification of:
    • An input file specifying the coordinates of the nuclei in the molecule.
    • The basis set (e.g. cc-pVTZ) that should be used to discretize the molecule.
    • The charge and spin multiplicity (if known) of the system.
  1. The researcher uses the OpenFermion-Psi4 plugin or the OpenFermion-PySCF plugin to perform scalable classical computations which are used to optimally stage the quantum computation. For instance, one might perform a classical Hartree-Fock calculation to choose a good initial state for the quantum simulation.
  2. The researcher then specifies which electrons are most interesting to study on a quantum computer (known as an active space) and asks OpenFermion to map the equations for those electrons to a representation suitable for quantum bits, using one of the available procedures in OpenFermion, e.g. the Bravyi-Kitaev transformation.
  3. The researcher selects a quantum algorithm to solve for the properties of interest and uses a quantum compilation framework such as OpenFermion-ProjectQ to output the quantum circuit in assembly language which can be run on a quantum computer. If the researcher has access to a quantum computer, they then execute the experiment.
A few examples of what one might do with OpenFermion are demonstrated in ipython notebooks here, here and here. While quantum simulation is widely recognized as one of the most important applications of quantum computing in the near term, very few quantum computer scientists know quantum chemistry and even fewer chemists know quantum computing. Our hope is that OpenFermion will help to close the gap between these communities and bring the power of quantum computing to chemists and material scientists. If you’re interested, please checkout our GitHub repository - pull requests welcome! By Ryan Babbush and Jarrod McClean, Quantum Software Engineers, Quantum AI Team

1 If we may be allowed one sentence for the experts: the primary function of OpenFermion is to encode the electronic structure problem in second quantization defined by various basis sets and active spaces and then to transform those operators into spin Hamiltonians using various isomorphisms between qubit and fermion algebras.

TensorFlow Lattice: Flexibility Empowered by Prior Knowledge

Crossposted on the Google Research Blog

Machine learning has made huge advances in many applications including natural language processing, computer vision and recommendation systems by capturing complex input/output relationships using highly flexible models. However, a remaining challenge is problems with semantically meaningful inputs that obey known global relationships, like “the estimated time to drive a road goes up if traffic is heavier, and all else is the same.” Flexible models like DNNs and random forests may not learn these relationships, and then may fail to generalize well to examples drawn from a different sampling distribution than the examples the model was trained on.

Today we present TensorFlow Lattice, a set of prebuilt TensorFlow Estimators that are easy to use, and TensorFlow operators to build your own lattice models. Lattices are multi-dimensional interpolated look-up tables (for more details, see [1--5]), similar to the look-up tables in the back of a geometry textbook that approximate a sine function.  We take advantage of the look-up table’s structure, which can be keyed by multiple inputs to approximate an arbitrarily flexible relationship, to satisfy monotonic relationships that you specify in order to generalize better. That is, the look-up table values are trained to minimize the loss on the training examples, but in addition, adjacent values in the look-up table are constrained to increase along given directions of the input space, which makes the model outputs increase in those directions. Importantly, because they interpolate between the look-up table values, the lattice models are smooth and the predictions are bounded, which helps to avoid spurious large or small predictions in the testing time.

How Lattice Models Help You

Suppose you are designing a system to recommend nearby coffee shops to a user. You would like the model to learn, “if two cafes are the same, prefer the closer one.”  Below we show a flexible model (pink) that accurately fits some training data for users in Tokyo (purple), where there are many coffee shops nearby.  The pink flexible model overfits the noisy training examples, and misses the overall trend that a closer cafe is better. If you used this pink model to rank test examples from Texas (blue), where businesses are spread farther out, you would find it acted strangely, sometimes preferring farther cafes!
Slice through a model’s feature space where all the other inputs stay the same and only distance changes. A flexible function (pink) that is accurate on training examples from Tokyo (purple) predicts that a cafe 10km-away is better than the same cafe if it was 5km-away. This problem becomes more evident at test-time if the data distribution has shifted, as shown here with blue examples from Texas where cafes are spread out more.
A monotonic flexible function (green) is both accurate on training examples and can generalize for Texas examples compared to non-monotonic flexible function (pink) from the previous figure.

In contrast, a lattice model, trained over the same example from Tokyo, can be constrained to satisfy such a monotonic relationship and result in a monotonic flexible function (green). The green line also accurately fits the Tokyo training examples, but also generalizes well to Texas, never preferring farther cafes.

In general, you might have many inputs about each cafe, e.g., coffee quality, price, etc. Flexible models have a hard time capturing global relationships of the form, “if all other inputs are equal, nearer is better, ” especially in parts of the feature space where your training data is sparse and noisy. Machine learning models that capture prior knowledge (e.g.  how inputs should impact the prediction) work better in practice, and are easier to debug and more interpretable.

Pre-built Estimators

We provide a range of lattice model architectures as TensorFlow Estimators. The simplest estimator we provide is the calibrated linear model, which learns the best 1-d transformation of each feature (using 1-d lattices), and then combines all the calibrated features linearly. This works well if the training dataset is very small, or there are no complex nonlinear input interactions. Another estimator is a calibrated lattice model. This model combines the calibrated features nonlinearly using a two-layer single lattice model, which can represent complex nonlinear interactions in your dataset. The calibrated lattice model is usually a good choice if you have 2-10 features, but for 10 or more features, we expect you will get the best results with an ensemble of calibrated lattices, which you can train using the pre-built ensemble architectures. Monotonic lattice ensembles can achieve 0.3% -- 0.5% accuracy gain compared to Random Forests [4], and these new TensorFlow lattice estimators can achieve 0.1 -- 0.4% accuracy gain compared to the prior state-of-the-art in learning models with monotonicity [5].

Build Your Own

You may want to experiment with deeper lattice networks or research using partial monotonic functions as part of a deep neural network or other TensorFlow architecture. We provide the building blocks: TensorFlow operators for calibrators, lattice interpolation, and monotonicity projections. For example, the figure below shows a 9-layer deep lattice network [5].


Example of a 9-layer deep lattice network architecture [5], alternating layers of linear embeddings and ensembles of lattices with calibrators layers (which act like a sum of ReLU’s in Neural Networks). The blue lines correspond to monotonic inputs, which is preserved layer-by-layer, and hence for the entire model. This and other arbitrary architectures can be constructed with TensorFlow Lattice because each layer is differentiable.

In addition to the choice of model flexibility and standard L1 and L2 regularization, we offer new regularizers with TensorFlow Lattice:
  • Monotonicity constraints [3] on your choice of inputs as described above.
  • Laplacian regularization [3] on the lattices to make the learned function flatter.
  • Torsion regularization [3] to suppress un-necessary nonlinear feature interactions.
We hope TensorFlow Lattice will be useful to the larger community working with meaningful semantic inputs. This is part of a larger research effort on interpretability and controlling machine learning models to satisfy policy goals, and enable practitioners to take advantage of their prior knowledge. We’re excited to share this with all of you. To get started, please check out our GitHub repository and our tutorials, and let us know what you think!

By Maya Gupta, Research Scientist, Jan Pfeifer, Software Engineer and Seungil You, Software Engineer

Acknowledgements

Developing and open sourcing TensorFlow Lattice was a huge team effort. We’d like to thank all the people involved: Andrew Cotter, Kevin Canini, David Ding, Mahdi Milani Fard, Yifei Feng, Josh Gordon, Kiril Gorovoy, Clemens Mewald, Taman Narayan, Alexandre Passos, Christine Robson, Serena Wang, Martin Wicke, Jarek Wilkiewicz, Sen Zhao, Tao Zhu

References

[1] Lattice Regression, Eric Garcia, Maya Gupta, Advances in Neural Information Processing Systems (NIPS), 2009
[2] Optimized Regression for Efficient Function Evaluation, Eric Garcia, Raman Arora, Maya R. Gupta, IEEE Transactions on Image Processing, 2012
[3] Monotonic Calibrated Interpolated Look-Up Tables, Maya Gupta, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, Alexander van Esbroeck, Journal of Machine Learning Research (JMLR), 2016
[4] Fast and Flexible Monotonic Functions with Ensembles of Lattices, Mahdi Milani Fard, Kevin Canini, Andrew Cotter, Jan Pfeifer, Maya Gupta, Advances in Neural Information Processing Systems (NIPS), 2016
[5] Deep Lattice Networks and Partial Monotonic Functions, Seungil You, David Ding, Kevin Canini, Jan Pfeifer, Maya R. Gupta, Advances in Neural Information Processing Systems (NIPS), 2017

Introducing Abseil, a new common libraries project

Today we are open sourcing Abseil, a collection of libraries drawn from the most fundamental pieces of Google’s internal codebase. These libraries are the nuts-and-bolts that underpin almost everything that Google runs. Bits and pieces of these APIs are embedded in most of our open source projects, and now we have brought them together into one comprehensive project. Abseil encompasses the most basic building blocks of Google’s codebase: code that is production tested and will be fully maintained for years to come.

Our C++ code repository is available at: https://github.com/abseil/abseil-cpp

By adopting these new Apache-licensed libraries, you can reap the benefit of years (over a decade in many cases) of our design and optimization work in this space. Our past experience is baked in.

Just as interesting, we’ve also prepared for the future: several types in Abseil’s C++ libraries are “pre-adoption” versions of C++17 types like string_view and optional - implemented in C++11 to the greatest extent possible. We look forward to moving more and more of our code to match the current standard, and using these new vocabulary types helps us make that transition. Importantly, in C++17 mode these types are merely aliases to the standard, ensuring that you only ever have one type for optional or string_view in a project at a time. Put another way: Abseil is focused on the engineering task of providing APIs that remain stable over time.

Consisting of the foundational C++ and Python code at Google, Abseil includes libraries that will grow to underpin other Google-backed open source projects like gRPC, Protobuf and TensorFlow. We love those projects, and we love the users of those projects - we want to ensure smooth usage for these things over time. In the next few months we’ll introduce new distribution methods to incorporate these projects as a collection into your project.

Continuing with the “over time” theme, Abseil aims for compatibility with major compilers, platforms and standard libraries for approximately 5 years. Our 5-year target also applies to language version: we assume everyone builds with C++11 at this point. (In 2019 we’ll start talking about requiring C++14 as our base language version.) This 5-year horizon is part of our balance between “support everything” and “provide modern implementations and APIs.”

Highlights of the initial release include:
  • Zero configuration: most platforms (OS, compiler, architecture) should just work.
  • Pre-adoption for C++17 types: string_view, optional, any. We’ll follow up with variant soon.
  • Our primary synchronization type, absl::Mutex, has an elegant interface and has been extensively optimized.
  • Efficient support for handling time: absl::Time and absl::Duration are conceptually similar to std::chrono types, but are concrete (not class templates) and have defined behavior in all cases. Additionally, our clock-sampling API absl::Now() is more heavily optimized than most standard library calls for std::chrono::system_clock::now().
  • String handling routines: among internal users, we’ve been told that releasing absl::StrCat(), absl::StrJoin(), and absl::StrSplit() would itself be a big improvement for the open source C++ world.
The project has support for C++ and some Python. Over time we’ll tie those two projects together more closely with shared logging and command-line flag infrastructure. To start contributing, please see our contribution guidelines and fork us on GitHub. Check out our documentation and community page for information on how to contact us, ask questions or contribute to Abseil.

By Titus Winters, Abseil Lead

Making the Google Developers Documentation Style Guide Public

Cross-posted on the Google Developers Blog

You can now use our developer-documentation style guide for open source documentation projects.

For some years now, our technical writers at Google have used an internal-only editorial style guide for most of our developer documentation. In order to better support external contributors to our open source projects, such as Kubernetes, AMP, or Dart, and to allow for more consistency across developer documentation, we're now making that style guide public.

If you contribute documentation to projects like those, you now have direct access to useful guidance about voice, tone, word choice, and other style considerations. It can be useful for general issues, like reminders to use second person, present tense, active voice, and the serial comma; it can also be great for checking very specific issues, like whether to write "app" or "application" when you want to be consistent with the Google Developers style.

The style guide is a reference document, so instead of reading through it in linear order, you can use it to look things up as needed. For matters of punctuation, grammar, and formatting, you can do a search-in-page to find items like "Commas," "Lists," and "Link text" in the left nav. For specific terms and phrases, you can look at the word list.

Keep an eye on the guide's release notes page for updates and developments, and send us your comments and suggestions via the Send Feedback link on each page of the guide—we want to hear from you as we continue to evolve the style guide.

Posted by Jed Hartman, Technical Writer