Tag Archives: open source release

Open Sourcing the Hunt for Exoplanets



Recently, we discovered two exoplanets by training a neural network to analyze data from NASA’s Kepler space telescope and accurately identify the most promising planet signals. And while this was only an initial analysis of ~700 stars, we consider this a successful proof-of-concept for using machine learning to discover exoplanets, and more generally another example of using machine learning to make meaningful gains in a variety of scientific disciplines (e.g. healthcare, quantum chemistry, and fusion research).

Today, we’re excited to release our code for processing the Kepler data, training our neural network model, and making predictions about new candidate signals. We hope this release will prove a useful starting point for developing similar models for other NASA missions, like K2 (Kepler’s second mission) and the upcoming Transiting Exoplanet Survey Satellite mission. As well as announcing the release of our code, we’d also like take this opportunity to dig a bit deeper into how our model works.

A Planet Hunting Primer

First, let’s consider how data collected by the Kepler telescope is used to detect the presence of a planet. The plot below is called a light curve, and it shows the brightness of the star (as measured by Kepler’s photometer) over time. When a planet passes in front of the star, it temporarily blocks some of the light, which causes the measured brightness to decrease and then increase again shortly thereafter, causing a “U-shaped” dip in the light curve.
A light curve from the Kepler space telescope with a “U-shaped” dip that indicates a transiting exoplanet.
However, other astronomical and instrumental phenomena can also cause the measured brightness of a star to decrease, including binary star systems, starspots, cosmic ray hits on Kepler’s photometer, and instrumental noise.
The first light curve has a “V-shaped” pattern that tells us that a very large object (i.e. another star) passed in front of the star that Kepler was observing. The second light curve contains two places where the brightness decreases, which indicates a binary system with one bright and one dim star: the larger dip is caused by the dimmer star passing in front of the brighter star, and vice versa. The third light curve is one example of the many other non-planet signals where the measured brightness of a star appears to decrease.
To search for planets in Kepler data, scientists use automated software (e.g. the Kepler data processing pipeline) to detect signals that might be caused by planets, and then manually follow up to decide whether each signal is a planet or a false positive. To avoid being overwhelmed with more signals than they can manage, the scientists apply a cutoff to the automated detections: those with signal-to-noise ratios above a fixed threshold are deemed worthy of follow-up analysis, while all detections below the threshold are discarded. Even with this cutoff, the number of detections is still formidable: to date, over 30,000 detected Kepler signals have been manually examined, and about 2,500 of those have been validated as actual planets!

Perhaps you’re wondering: does the signal-to-noise cutoff cause some real planet signals to be missed? The answer is, yes! However, if astronomers need to manually follow up on every detection, it’s not really worthwhile to lower the threshold, because as the threshold decreases the rate of false positive detections increases rapidly and actual planet detections become increasingly rare. However, there’s a tantalizing incentive: it’s possible that some potentially habitable planets like Earth, which are relatively small and orbit around relatively dim stars, might be hiding just below the traditional detection threshold — there might be hidden gems still undiscovered in the Kepler data!

A Machine Learning Approach

The Google Brain team applies machine learning to a diverse variety of data, from human genomes to sketches to formal mathematical logic. Considering the massive amount of data collected by the Kepler telescope, we wondered what we might find if we used machine learning to analyze some of the previously unexplored Kepler data. To find out, we teamed up with Andrew Vanderburg at UT Austin and developed a neural network to help search the low signal-to-noise detections for planets.
We trained a convolutional neural network (CNN) to predict the probability that a given Kepler signal is caused by a planet. We chose a CNN because they have been very successful in other problems with spatial and/or temporal structure, like audio generation and image classification.
Luckily, we had 30,000 Kepler signals that had already been manually examined and classified by humans. We used a subset of around 15,000 of these signals, of which around 3,500 were verified planets or strong planet candidates, to train our neural network to distinguish planets from false positives. The inputs to our network are two separate views of the same light curve: a wide view that allows the model to examine signals elsewhere on the light curve (e.g., a secondary signal caused by a binary star), and a zoomed-in view that enables the model to closely examine the shape of the detected signal (e.g., to distinguish “U-shaped” signals from “V-shaped” signals).

Once we had trained our model, we investigated the features it learned about light curves to see if they matched with our expectations. One technique we used (originally suggested in this paper) was to systematically occlude small regions of the input light curves to see whether the model’s output changed. Regions that are particularly important to the model’s decision will change the output prediction if they are occluded, but occluding unimportant regions will not have a significant effect. Below is a light curve from a binary star that our model correctly predicts is not a planet. The points highlighted in green are the points that most change the model’s output prediction when occluded, and they correspond exactly to the secondary “dip” indicative of a binary system. When those points are occluded, the model’s output prediction changes from ~0% probability of being a planet to ~40% probability of being a planet. So, those points are part of the reason the model rejects this light curve, but the model uses other evidence as well - for example, zooming in on the centred primary dip shows that it's actually “V-shaped”, which is also indicative of a binary system.

Searching for New Planets

Once we were confident with our model’s predictions, we tested its effectiveness by searching for new planets in a small set 670 stars. We chose these stars because they were already known to have multiple orbiting planets, and we believed that some of these stars might host additional planets that had not yet been detected. Importantly, we allowed our search to include signals that were below the signal-to-noise threshold that astronomers had previously considered. As expected, our neural network rejected most of these signals as spurious detections, but a handful of promising candidates rose to the top, including our two newly discovered planets: Kepler-90 i and Kepler-80 g.

Find your own Planet(s)!

Let’s take a look at how the code released today can help (re-)discover the planet Kepler-90 i. The first step is to train a model by following the instructions on the code’s home page. It takes a while to download and process the data from the Kepler telescope, but once that’s done, it’s relatively fast to train a model and make predictions about new signals. One way to find new signals to show the model is to use an algorithm called Box Least Squares (BLS), which searches for periodic “box shaped” dips in brightness (see below). The BLS algorithm will detect “U-shaped” planet signals, “V-shaped” binary star signals and many other types of false positive signals to show the model. There are various freely available software implementations of the BLS algorithm, including VARTOOLS and LcTools. Alternatively, you can even look for candidate planet transits by eye, like the Planet Hunters.
A low signal-to-noise detection in the light curve of the Kepler 90 star detected by the BLS algorithm. The detection has period 14.44912 days, duration 2.70408 hours (0.11267 days) beginning 2.2 days after 12:00 on 1/1/2009 (the year the Kepler telescope launched).
To run this detected signal though our trained model, we simply execute the following command:
python predict.py  --kepler_id=11442793 --period=14.44912 --t0=2.2
--duration=0.11267 --kepler_data_dir=$HOME/astronet/kepler
--output_image_file=$HOME/astronet/kepler-90i.png
--model_dir=$HOME/astronet/model
The output of the command is prediction = 0.94, which means the model is 94% certain that this signal is a real planet. Of course, this is only a small step in the overall process of discovering and validating an exoplanet: the model’s prediction is not proof one way or the other. The process of validating this signal as a real exoplanet requires significant follow-up work by an expert astronomer — see Sections 6.3 and 6.4 of our paper for the full details. In this particular case, our follow-up analysis validated this signal as a bona fide exoplanet, and it’s now called Kepler-90 i!
Our work here is far from done. We’ve only searched 670 stars out of 200,000 observed by Kepler — who knows what we might find when we turn our technique to the entire dataset. Before we do that, though, we have a few improvements we want to make to our model. As we discussed in our paper, our model is not yet as good at rejecting binary stars and instrumental false positives as some more mature computer heuristics. We’re hard at work improving our model, and now that it’s open sourced, we hope others will do the same!


By Chris Shallue, Senior Software Engineer, Google Brain Team

If you’d like to learn more, Chris is featured on the latest episode of This Week In Machine Learning discussing his work.

How Google uses Census internally

This post is the first in a series about OpenCensus, a set of open source instrumentation libraries based on what we use inside Google. This series will cover the benefits of OpenCensus for developers and vendors, Google’s interest in open sourcing instrumentation tools, how to get started with OpenCensus, and our long-term vision.

If you’re new to distributed tracing and metrics, we recommend Adrian Cole’s excellent talk on the subject: Observability Three Ways.

Gaining Observability into Planet-Scale Computing

Google adopted or invented new technologies, including distributed tracing (Dapper) and metrics processing, in order to operate some of the world’s largest web services. However, building analysis systems didn’t solve the difficult problem of instrumenting and extracting data from production services. This is what Census was created to do.

The Census project provides uniform instrumentation across most Google services, capturing trace spans, app-level metrics, and other metadata like log correlations from production applications. One of the biggest benefits of uniform instrumentation to developers inside of Google is that it’s almost entirely automatic: any service that uses gRPC automatically collects and exports basic traces and metrics.

OpenCensus offers these capabilities to developers everywhere. Today we’re sharing how we use distributed tracing and metrics inside of Google.

Incident Management

When latency problems or new errors crop up in a highly distributed environment, visibility into what’s happening is critical. For example, when the latency of a service crosses expected boundaries, we can view distributed traces in Dapper to find where things are slowing down. Or when a request is returning an error, we can look at the chain of calls that led to the error and examine the metadata captured during a trace (typically logs or trace annotations). This is effectively a bigger stack trace. In rare cases, we enable custom trigger-based sampling which allows us to focus on specific kinds of requests.

Once we know there’s a production issue, we can use Census data to determine the regions, services, and scope (one customer vs many) of a given problem. You can use service-specific diagnostics pages, called “z-pages,” to monitor problems and the results of solutions you deploy. These pages are hosted locally on each service and provide a firehose view of recent requests, stats, and other performance-related information.

Performance Optimization

At Google’s scale, we need to be able to instrument and attribute costs for services. We use Census to help us answer questions like:
  • How much CPU time does my query consume?
  • Does my feature consume more storage resources than before?
  • What is the cost of a particular user operation at a particular layer of the stack?
  • What is the total cost of a particular user operation across all layers of the stack?
We’re obsessed with reducing the tail latency of all services, so we’ve built sophisticated analysis systems that process traces and metrics captured by Census to identify regressions and other anomalies.

Quality of Service

Google also improves performance dynamically depending on the source and type of traffic. Using Census tags, traffic can be directed to more appropriate shards, or we can do things like load shedding and rate limiting.

Next week we’ll discuss Google’s motivations for open sourcing Census, then we’ll shift the focus back onto the open source project itself.

By Pritam Shah and Morgan McLean, Census team

The Building Blocks of Interpretability

Cross-posted on the Google Research Blog.

In 2015, our early attempts to visualize how neural networks understand images led to psychedelic images. Soon after, we open sourced our code as DeepDream and it grew into a small art movement producing all sorts of amazing things. But we also continued the original line of research behind DeepDream, trying to address one of the most exciting questions in Deep Learning: how do neural networks do what they do?

Last year in the online journal Distill, we demonstrated how those same techniques could show what individual neurons in a network do, rather than just what is “interesting to the network” as in DeepDream. This allowed us to see how neurons in the middle of the network are detectors for all sorts of things — buttons, patches of cloth, buildings — and see how those build up to be more and more sophisticated over the networks layers.
Visualizations of neurons in GoogLeNet. Neurons in higher layers represent higher level ideas.
While visualizing neurons is exciting, our work last year was missing something important: how do these neurons actually connect to what the network does in practice?

Today, we’re excited to publish “The Building Blocks of Interpretability,” a new Distill article exploring how feature visualization can combine together with other interpretability techniques to understand aspects of how networks make decisions. We show that these combinations can allow us to sort of “stand in the middle of a neural network” and see some of the decisions being made at that point, and how they influence the final output. For example, we can see things like how a network detects a floppy ear, and then that increases the probability it gives to the image being a “Labrador retriever” or “beagle”.

We explore techniques for understanding which neurons fire in the network. Normally, if we ask which neurons fire, we get something meaningless like “neuron 538 fired a little bit,” which isn’t very helpful even to experts. Our techniques make things more meaningful to humans by attaching visualizations to each neuron, so we can see things like “the floppy ear detector fired”. It’s almost a kind of MRI for neural networks.
We can also zoom out and show how the entire image was “perceived” at different layers. This allows us to really see the transition from the network detecting very simple combinations of edges, to rich textures and 3d structure, to high-level structures like ears, snouts, heads and legs.
These insights are exciting by themselves, but they become even more exciting when we can relate them to the final decision the network makes. So not only can we see that the network detected a floppy ear, but we can also see how that increases the probability of the image being a labrador retriever.
In addition to our paper, we’re also releasing Lucid, a neural network visualization library building off our work on DeepDream. It allows you to make the sort lucid feature visualizations we see above, in addition to more artistic DeepDream images.

We’re also releasing colab notebooks. These notebooks make it extremely easy to use Lucid to reproduce visualizations in our article! Just open the notebook, click a button to run code — no setup required!
In colab notebooks you can click a button to run code, and see the result below.
This work only scratches the surface of the kind of interfaces that we think it’s possible to build for understanding neural networks. We’re excited to see what the community will do — and we’re excited to work together towards deeper human understanding of neural networks.

By Chris Olah, Research Scientist and Arvind Satyanarayan, Visiting Researcher, Google Brain Team

The cpu_features library

"Write Once, Run Anywhere." That was the promise of Java back in the 1990s. You could write your Java code on one platform, and it would run on any CPU implementing a Java Virtual Machine.

But for developers who need to squeeze every bit of performance out of their applications, that's not enough. Since the dawn of computing, performance-minded programmers have used insights about hardware to fine tune their code.

Let's say you're working on code for which speed is paramount, perhaps a new video codec or a library to process tensors. There are individual instructions that will dramatically improve performance, like fused multiply-add, as well as entire instruction sets like SSE2 and AVX, that can give the critical portions of your code a speed boost.
Photo by Andrew Dunn, licensed CC-BY-SA-2.0.

Here's the problem: there's no way to know a priori which instructions your CPU supports. Identifying the CPU manufacturer isn't sufficient. For instance, Intel’s Haswell architecture supports the AVX2 instruction set, while Sandy Bridge doesn't. Some developers resort to desperate measures like reading /proc/cpuinfo to identify the CPU and then consulting hardcoded mappings of CPU IDs to instructions.

Enter cpu_features, a small, fast, and simple open source library to report CPU features at runtime. Written in C89 for maximum portability, it allocates no memory and is suitable for implementing fundamental functions and running in sandboxed environments.

The library currently supports x86, ARM/AArch64, and MIPS processors, and we'll be adding to it as the need arises. We also welcome contributions from others interested in making programs “write once, run fast everywhere.”

By Guillaume Chatelet, Google Compiler Research Team

OpenCensus: A Stats Collection and Distributed Tracing Framework

Today we’re pleased to announce the release of OpenCensus, a vendor-neutral open source library for metric collection and tracing. OpenCensus is built to add minimal overhead and be deployed fleet wide, especially for microservice-based architectures.

The Need for Instrumentation & Observability 

As a startup, often the focus is to get an initial version of the product out the door, rapidly prototype and iterate with customers. Most startups start out with monolithic applications as a simple model-view-controller (MVC) web application. As the customer base, code, and number of engineers increase, they migrate from monolithic architecture to a microservices architecture. A microservices architecture has its advantages, but often makes debugging more challenging as traditional debugging and monitoring tools don’t always work in these environments or are designed for monolithic use cases. When operating multiple microservices with strict service level objectives (SLOs), you need insights into the root cause of reliability and performance problems.

Not having proper instrumentation and observability can result in lost engineering hours, violated SLOs and frustrated customers. Instead, diagnostic data should be collected from across the stack. This data can be used for incident management to identify and debug potential bottlenecks or for system tuning and performance improvement.

OpenCensus

At Google scale, an instrumentation layer with minimal overhead is a requirement. As Google grew, we realized the importance of having a highly efficient tracing and stats instrumentation library that could be deployed fleet wide.

OpenCensus is the open source version of Google’s Census library, written based on years of optimization experience. It aims to make the collection and submission of app metrics and traces easier for developers. It is a vendor neutral, single distribution of libraries that automatically collects traces and metrics from your app, displays them locally, and sends them to analysis tools. OpenCensus currently supports Prometheus, SignalFX, Stackdriver and Zipkin.

Developers can use this powerful, out-of-the box library to instrument microservices and send data to any supported backend. For an Application Performance Management (APM) vendor, OpenCensus provides free instrumentation coverage with minimal work, and affords customers a simple setup experience.

Below are Stackdriver Trace and Monitor screenshots showing traces generated from a demo app, which calls Google’s Cloud Bigtable API and uses OpenCensus.



We’d love to hear your feedback on OpenCensus. Try using it in your app, tell us about your success story, and help by contributing to our existing language-specific libraries, or by creating one for an not-yet-supported language. You can also help us integrate OpenCensus with new APM tools!

We hope you find this as useful as we have. Visit opencensus.io for more information.

By Pritam Shah, Census team

Container Structure Tests: Unit Tests for Docker Images

Usage of containers in software applications is on the rise, and with their increasing usage in production comes a need for robust testing and validation. Containers provide great testing environments, but actually validating the structure of the containers themselves can be tricky. The Docker toolchain provides us with easy ways to interact with the container images themselves, but no real way of verifying their contents. What if we want to ensure a set of commands runs successfully inside of our container, or check that certain files are in the correct place with the correct contents, before shipping?

The Container Tools team at Google is happy to announce the release of the Container Structure Test framework. This framework provides a convenient and powerful way to verify the contents and structure of your containers. We’ve been using this framework at Google to test all of our team’s released containers for over a year now, and we’re excited to finally share it with the public.

The framework supports four types of tests:
  • Command Tests - to run a command inside your container image and verify the output or error it produces
  • File Existence Tests - to check the existence of a file in a specific location in the image’s filesystem
  • File Content Tests - to check the contents and metadata of a file in the filesystem
  • A unique Metadata Test - to verify configuration and metadata of the container itself
Users can specify test configurations through YAML or JSON. This provides a way to abstract away the test configuration from the implementation of the tests, eliminating the need for hacky bash scripting or other solutions to test container images.

Command Tests

The Command Tests give the user a way to specify a set of commands to run inside of a container, and verify that the output, error, and exit code were as expected. An example configuration looks like this:
globalEnvVars:
- key: "VIRTUAL_ENV"
value: "/env"
- key: "PATH"
value: "/env/bin:$PATH"

commandTests:

# check that the python binary is in the correct location
- name: "python installation"
command: "which"
args: ["python"]
expectedOutput: ["/usr/bin/python\n"]

# setup a virtualenv, and verify the correct python binary is run
- name: "python in virtualenv"
setup: [["virtualenv", "/env"]]
command: "which"
args: ["python"]
expectedOutput: ["/env/bin/python\n"]

# setup a virtualenv, install gunicorn, and verify the installation
- name: "gunicorn flask"
setup: [["virtualenv", "/env"],
["pip", "install", "gunicorn", "flask"]]
command: "which"
args: ["gunicorn"]
expectedOutput: ["/env/bin/gunicorn"]
Regexes are used to match the expected output, and error, of each command (or excluded output/error if you want to make sure something didn’t happen). Additionally, setup and teardown commands can be run with each individual test, and environment variables can be specified to be set for each individual test, or globally for the entire test run (shown in the example).

File Tests

File Tests allow users to verify the contents of an image’s filesystem. We can check for existence of files, as well as examine the contents of individual files or directories. This can be particularly useful for ensuring that scripts, config files, or other runtime artifacts are in the correct places before shipping and running a container.
fileExistenceTests:

# check that the apt-packages text file exists and has the correct permissions
- name: 'apt-packages'
path: '/resources/apt-packages.txt'
shouldExist: true
permissions: '-rw-rw-r--'
Expected permissions and file mode can be specified for each file path in the form of a standard Unix permission string. As with the Command Tests’ “Excluded Output/Error,” a boolean can be provided to these tests to tell the framework to be sure a file is not present in a filesystem.

Additionally, the File Content Tests verify the contents of files and directories in the filesystem. This can be useful for checking package or repository versions, or config file contents among other things. Following the pattern of the previous tests, regexes are used to specify the expected or excluded contents.
fileContentTests:

# check that the default apt repository is set correctly
- name: 'apt sources'
path: '/etc/apt/sources.list'
expectedContents: ['.*httpredir\.debian\.org/debian jessie main.*']

# check that the retry policy is correctly specified
- name: 'retry policy'
path: '/etc/apt/apt.conf.d/apt-retry'
expectedContents: ['Acquire::Retries 3;']

Metadata Test

Unlike the previous tests which all allow any number to be specified, the Metadata test is a singleton test which verifies a container’s configuration. This is useful for making sure things specified in the Dockerfile (e.g. entrypoint, exposed ports, mounted volumes, etc.) are manifested correctly in a built container.
metadataTest:
env:
- key: "VIRTUAL_ENV"
value: "/env"
exposedPorts: ["8080", "2345"]
volumes: ["/test"]
entrypoint: []
cmd: ["/bin/bash"]
workdir: ["/app"]

Tiny Images

One interesting case that we’ve put focus on supporting is “tiny images.” We think keeping container sizes small is important, and sometimes the bare minimum in a container image might even exclude a shell. Users might be used to running something like:
`docker run -d "cat /etc/apt/sources.list && grep -rn 'httpredir.debian.org' image"`
… but this breaks without a working shell in a container. With the structure test framework, we convert images to in-memory filesystem representations, so no shell is needed to examine the contents of an image!

Dockerless Test Runs

At their core, Docker images are just bundles of tarballs. One of the major use cases for these tests is running in CI systems, and often we can't guarantee that we'll have access to a working Docker daemon in these environments. To address this, we created a tar-based test driver, which can handle the execution of all file-related tests through simple tar manipulation. Command tests are currently not supported in this mode, since running commands in a container requires a container runtime.

This means that using the tar driver, we can retrieve images from a remote registry, convert them into filesystems on disk, and verify file contents and metadata all without a working Docker daemon on the host! Our container-diff library is leveraged here to do all the image processing; see our previous blog post for more information.
structure-test -test.v -driver tar -image gcr.io/google-appengine/python:latest structure-test-examples/python/python_file_tests.yaml

Running in Bazel

Structure tests can also be run natively through Bazel, using the “container_test” rule. Bazel provides convenient build rules for building Docker images, so the structure tests can be run as part of a build to ensure any new built images are up to snuff before being released. Check out this example repo for a quick demonstration of how to incorporate these tests into a Bazel build.

We think this framework can be useful for anyone building and deploying their own containers in the wild, and hope that it can promote their usage everywhere through increasing the robustness of containers. For more detailed information on the test specifications, check out the documentation in our GitHub repository.

By Nick Kubala, Container Tools team

TFGAN: A Lightweight Library for Generative Adversarial Networks

Crossposted on the Google Research Blog

Training a neural network usually involves defining a loss function, which tells the network how close or far it is from its objective. For example, image classification networks are often given a loss function that penalizes them for giving wrong classifications; a network that mislabels a dog picture as a cat will get a high loss. However, not all problems have easily-defined loss functions, especially if they involve human perception, such as image compression or text-to-speech systems. Generative Adversarial Networks (GANs), a machine learning technique that has led to improvements in a wide range of applications including generating images from text, superresolution, and helping robots learn to grasp, offer a solution. However, GANs introduce new theoretical and software engineering challenges, and it can be difficult to keep up with the rapid pace of GAN research.

A video of a generator improving over time. It begins by producing random noise, and eventually learns to generate MNIST digits.
In order to make GANs easier to experiment with, we’ve open sourced TFGAN, a lightweight library designed to make it easy to train and evaluate GANs. It provides the infrastructure to easily train a GAN, provides well-tested loss and evaluation metrics, and gives easy-to-use examples that highlight the expressiveness and flexibility of TFGAN. We’ve also released a tutorial that includes a high-level API to quickly get a model trained on your data.
This demonstrates the effect of an adversarial loss on image compression. The top row shows image patches from the ImageNet dataset. The middle row shows the results of compressing and uncompressing an image through an image compression neural network trained on a traditional loss. The bottom row shows the results from a network trained with a traditional loss and an adversarial loss. The GAN-loss images are sharper and more detailed, even if they are less like the original.
TFGAN supports experiments in a few important ways. It provides simple function calls that cover the majority of GAN use-cases so you can get a model running on your data in just a few lines of code, but is built in a modular way to cover more exotic GAN designs as well. You can just use the modules you want -- loss, evaluation, features, training, etc. are all independent.. TFGAN’s lightweight design also means you can use it alongside other frameworks, or with native TensorFlow code. GAN models written using TFGAN will easily benefit from future infrastructure improvements, and you can select from a large number of already-implemented losses and features without having to rewrite your own. Lastly, the code is well-tested, so you don’t have to worry about numerical or statistical mistakes that are easily made with GAN libraries.
Most neural text-to-speech (TTS) systems produce over-smoothed spectrograms. When applied to the Tacotron TTS system, a GAN can recreate some of the realistic-texture, which reduces artifacts in the resulting audio.
When you use TFGAN, you’ll be using the same infrastructure that many Google researchers use, and you’ll have access to the cutting-edge improvements that we develop with the library. Anyone can contribute to the github repositories, which we hope will facilitate code-sharing among ML researchers and users.

By Joel Shor, Senior Software Engineer, Machine Perception

Announcing the S2 Library: Geometry on the Sphere

Google has always embraced new approaches to organizing all the world's information, and this includes all the world's geography. Today we are announcing the open source release of Google's S2 library, the core geometric library on which Google's global geographic database is built.

A unique feature of the S2 library is that unlike traditional geographic information systems, which represent data as flat two-dimensional projections (similar to an atlas), the S2 library represents all data on a three-dimensional sphere (similar to a globe). This makes it possible to build a worldwide geographic database with no seams or singularities, using a single coordinate system, and with low distortion everywhere compared to the true shape of the Earth. While the Earth is not quite spherical, it is much closer to being a sphere than it is to being flat!

Notable features of the library include:
  • Flexible support for spatial indexing, including the ability to approximate arbitrary regions as collections of discrete S2 cells. This feature makes it easy to build large distributed spatial indexes. (The image above illustrates the S2 space-filling curve, an important tool used for spatial indexing.)
  • Fast in-memory spatial indexing of collections of points, polylines, and polygons.
  • Robust constructive operations (such as intersection, union, and simplification) and boolean predicates (such as testing for containment).
  • Efficient query operations for finding nearby objects, measuring distances, computing centroids, etc.
  • A flexible and robust implementation of snap rounding (a geometric technique that allows operations to be implemented 100% robustly while using small and fast coordinate representations).
  • A collection of efficient yet exact mathematical predicates for testing relationships among geometric primitives.
  • Extensive testing on Google's vast collection of geographic data.
  • Flexible Apache 2.0 license.
The reference implementation of the S2 library is written in C++, and subsets have been ported to Go, Java, and Python. An early version of the code was released in 2011, but today's announcement represents a major update along with a commitment to maintain the library going forward. The code is under active development and new features will be released regularly. (The Java port is based on the 2011 code and does not have the same robustness, performance, or features as the current C++ version.)

Our C++ code repository is here: https://github.com/google/s2geometry
And check out our documentation here: https://s2geometry.io

To learn more, start by reading the overview and quick start documents, then explore the documentation site. The library also has extensive documentation in the header files, which is where the most authoritative information can be found. More introductions and tutorials will be added over time - contributions are welcome!

The S2 library was written primarily by Eric Veach. Other significant contributors include Jesse Rosenstock, Eric Engle (Java port lead), Robert Snedegar (Go port lead), Julien Basch, and Tom Manshreck.

By Eric Veach, Software Engineer

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.


DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.


DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team