Tag Archives: data science

Data-centric ML benchmarking: Announcing DataPerf’s 2023 challenges

Machine learning (ML) offers tremendous potential, from diagnosing cancer to engineering safe self-driving cars to amplifying human productivity. To realize this potential, however, organizations need ML solutions to be reliable with ML solution development that is predictable and tractable. The key to both is a deeper understanding of ML data — how to engineer training datasets that produce high quality models and test datasets that deliver accurate indicators of how close we are to solving the target problem.

The process of creating high quality datasets is complicated and error-prone, from the initial selection and cleaning of raw data, to labeling the data and splitting it into training and test sets. Some experts believe that the majority of the effort in designing an ML system is actually the sourcing and preparing of data. Each step can introduce issues and biases. Even many of the standard datasets we use today have been shown to have mislabeled data that can destabilize established ML benchmarks. Despite the fundamental importance of data to ML, it’s only now beginning to receive the same level of attention that models and learning algorithms have been enjoying for the past decade.

Towards this goal, we are introducing DataPerf, a set of new data-centric ML challenges to advance the state-of-the-art in data selection, preparation, and acquisition technologies, designed and built through a broad collaboration across industry and academia. The initial version of DataPerf consists of four challenges focused on three common data-centric tasks across three application domains: vision, speech, and natural language processing (NLP). In this blog post, we outline dataset development bottlenecks confronting researchers and discuss the role of benchmarks and leaderboards in incentivizing researchers to address these challenges. We invite innovators in academia and industry who seek to measure and validate breakthroughs in data-centric ML to demonstrate the power of their algorithms and techniques to create and improve datasets through these benchmarks.


Data is the new bottleneck for ML

Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense the model is a lossy compiler for the data. Though high-quality training datasets are vital to continued advancement in the field of ML, much of the data on which the field relies today is nearly a decade old (e.g., ImageNet or LibriSpeech) or scraped from the web with very limited filtering of content (e.g., LAION or The Pile).

Despite the importance of data, ML research to date has been dominated by a focus on models. Before modern deep neural networks (DNNs), there were no ML models sufficient to match human behavior for many simple tasks. This starting condition led to a model-centric paradigm in which (1) the training dataset and test dataset were “frozen” artifacts and the goal was to develop a better model, and (2) the test dataset was selected randomly from the same pool of data as the training set for statistical reasons. Unfortunately, freezing the datasets ignored the ability to improve training accuracy and efficiency with better data, and using test sets drawn from the same pool as training data conflated fitting that data well with actually solving the underlying problem.

Because we are now developing and deploying ML solutions for increasingly sophisticated tasks, we need to engineer test sets that fully capture real world problems and training sets that, in combination with advanced models, deliver effective solutions. We need to shift from today’s model-centric paradigm to a data-centric paradigm in which we recognize that for the majority of ML developers, creating high quality training and test data will be a bottleneck.

Shifting from today’s model-centric paradigm to a data-centric paradigm enabled by quality datasets and data-centric algorithms like those measured in DataPerf.

Enabling ML developers to create better training and test datasets will require a deeper understanding of ML data quality and the development of algorithms, tools, and methodologies for optimizing it. We can begin by recognizing common challenges in dataset creation and developing performance metrics for algorithms that address those challenges. For instance:

  • Data selection: Often, we have a larger pool of available data than we can label or train on effectively. How do we choose the most important data for training our models?
  • Data cleaning: Human labelers sometimes make mistakes. ML developers can’t afford to have experts check and correct all labels. How can we select the most likely-to-be-mislabeled data for correction?
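As a rough illustration of the data cleaning challenge, the sketch below ranks training examples by how strongly a trained model disagrees with their assigned labels. This is a common heuristic with hypothetical data, not a DataPerf baseline.

```python
import numpy as np

def rank_suspected_mislabels(probs, labels):
    """Rank training examples by how unlikely their assigned label looks.

    probs: (n_examples, n_classes) predicted class probabilities from a trained model.
    labels: (n_examples,) integer labels as provided by the (possibly noisy) labelers.
    Returns example indices sorted from most to least likely mislabeled.
    """
    label_confidence = probs[np.arange(len(labels)), labels]
    # Lowest confidence in the assigned label = strongest candidate for relabeling.
    return np.argsort(label_confidence)

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
labels = np.array([0, 0, 1])                     # the second example looks mislabeled
print(rank_suspected_mislabels(probs, labels))   # [1 2 0]
```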

We can also create incentives that reward good dataset engineering. We anticipate that high quality training data, which has been carefully selected and labeled, will become a valuable product in many industries, but we presently lack a way to assess the relative value of different datasets without actually training on the datasets in question. How do we solve this problem and enable quality-driven “data acquisition”?


DataPerf: The first leaderboard for data

We believe good benchmarks and leaderboards can drive rapid progress in data-centric technology. ML benchmarks in academia have been essential to stimulating progress in the field. Consider the following graph which shows progress on popular ML benchmarks (MNIST, ImageNet, SQuAD, GLUE, Switchboard) over time:

Performance over time for popular benchmarks, normalized with initial performance at minus one and human performance at zero. (Source: Douwe, et al. 2021; used with permission.)

Online leaderboards provide official validation of benchmark results and catalyze communities intent on optimizing those benchmarks. For instance, Kaggle has over 10 million registered users. The MLPerf official benchmark results have helped drive more than a 16x improvement in training performance on key benchmarks.

DataPerf is the first community and platform to build leaderboards for data benchmarks, and we hope to have an analogous impact on research and development for data-centric ML. The initial version of DataPerf consists of leaderboards for four challenges focused on three data-centric tasks (data selection, cleaning, and acquisition) across three application domains (vision, speech and NLP):

  • Training data selection (Vision): Design a data selection strategy that chooses the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech): Design a data selection strategy that chooses the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision): Design a data cleaning strategy that chooses samples to relabel from a “noisy” training set where some of the labels are incorrect.
  • Training dataset evaluation (NLP): Quality datasets can be expensive to construct, and are becoming valuable commodities. Design a data acquisition strategy that chooses which training dataset to “buy” based on limited information about the data.

For each challenge, the DataPerf website provides design documents that define the problem, test model(s), quality target, rules and guidelines on how to run the code and submit. The live leaderboards are hosted on the Dynabench platform, which also provides an online evaluation framework and submission tracker. Dynabench is an open-source project, hosted by the MLCommons Association, focused on enabling data-centric leaderboards for both training and test data and data-centric algorithms.


How to get involved

We are part of a community of ML researchers, data scientists and engineers who strive to improve data quality. We invite innovators in academia and industry to measure and validate data-centric algorithms and techniques to create and improve datasets through the DataPerf benchmarks. The deadline for the first round of challenges is May 26th, 2023.


Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive.ai, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Factored, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

Source: Google AI Blog


Introducing Discovery Ad Performance Analysis

Posted by Manisha Arora, Nithya Mahadevan, and Aritra Biswas, gPS Data Science team

Overview of Discovery Ads and need for Ad Performance Analysis

Discovery ads, launched in May 2019, allow advertisers to easily extend their reach to social ads users across YouTube, Google Feed and Gmail worldwide. They provide brands a new opportunity to reach 3 billion people as they explore their interests and search for inspiration across their favorite Google feeds (YouTube, Gmail, and Discover) -- all with a single campaign. Learn more about Discovery ads here.


Because of these unique characteristics, customers need a data-driven method to identify the textual and imagery elements in Discovery ad copies that drive the Interaction Rate of their campaigns, where an interaction is defined as the main user action associated with an ad format: clicks and swipes for text and Shopping ads, views for video ads, calls for call extensions, and so on.

Interaction Rate = interactions / impressions


“Customers need a data driven method to identify textual & imagery elements in Discovery Ad copies that drive Interaction Rate of their campaigns.”

- Manisha Arora, Data Scientist



Our analysis approach:

The Data Science team at Google is investing in a machine learning approach to uncover insights from complex unstructured data and provide machine learning based recommendations to our customers. Machine Learning helps us study what works in ads at scale and these insights can greatly benefit the advertisers.

We follow a six-step approach for Discovery Ad Performance Analysis:
  • Understand Business Goals
  • Build Creative Hypothesis
  • Data Extraction
  • Feature Engineering
  • Machine Learning Modeling
  • Analysis & Insight Generation

To begin with, we work closely with the advertisers to understand their business goals, current ad strategy, and future goals. We closely map this to industry insights to draw a larger picture and provide a customized analysis for each advertiser. As a next step, we build hypotheses that best describe the problem we are trying to solve. An example of a hypothesis is: “Do superlatives (words like ‘top’ and ‘best’) in the ad copy drive performance?”


“Machine Learning helps us study what works in ads at scale and these insights can greatly benefit the advertisers.”

- Manisha Arora, Data Scientist


Once we have a hypothesis we are working towards, the next step is to deep-dive into the technical analysis.

Data Extraction & Pre-processing


Our initial dataset includes raw ad text, imagery, performance KPIs & target audience details from historic ad campaigns in the industry. Each Discovery ad contains two text assets (Headline and Description) and one image asset. We then apply ML to extract text and image features from these assets.

Text Feature Extraction

We apply NLP to extract the text features from the ad text. We pass the raw text in the ad headline & description through Google Cloud’s Language API which parses the raw text into our feature set: commonly used keywords, sentiments etc.
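As a rough sketch of this step (using the Cloud Natural Language client library; the exact feature set extracted in DisCat is broader than shown, and the helper below is hypothetical):

```python
# Extract simple text features (sentiment, salient entities, word count) from an
# ad headline and description using the Cloud Natural Language API.
from google.cloud import language_v1

def extract_text_features(headline, description):
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=f"{headline}. {description}",
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
    entities = client.analyze_entities(request={"document": document}).entities
    return {
        "sentiment_score": sentiment.score,
        "sentiment_magnitude": sentiment.magnitude,
        "entities": [entity.name for entity in entities],
        "word_count": len(f"{headline} {description}".split()),
    }
```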

Image Feature Extraction

We apply Image Processing to extract image features from the ad copy imagery. We pass the raw images through Google Cloud’s Vision API & extract image components including objects, person, background, lighting etc.
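A minimal sketch of this extraction step with the Cloud Vision client library (the production feature set is richer than shown, and the helper below is hypothetical):

```python
# Extract basic image features (labels, detected objects, faces) from an ad image
# using the Cloud Vision API.
from google.cloud import vision

def extract_image_features(image_path):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    labels = client.label_detection(image=image).label_annotations
    objects = client.object_localization(image=image).localized_object_annotations
    faces = client.face_detection(image=image).face_annotations
    return {
        "labels": [label.description for label in labels],
        "objects": [obj.name for obj in objects],
        "num_faces": len(faces),
    }
```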
A holistic set of text and image features is extracted from the ad content, as described in the following sections.

Feature Design


Text Feature Design

There are two types of text features included in DisCat:
1. Generic text feature
a. These are features returned by Google Cloud’s Language API including sentiment, word / character count, tone (imperative vs indicative), symbols, most frequent words and so on.

2. Industry-specific value propositions
a. These are features that only apply to a specific industry (e.g. finance) that are manually curated by the data science developer in collaboration with specialists and other industry experts.
  • For example, for the finance industry, one value proposition can be “Price Offer”. A list of keywords / phrases that are related to price offers (e.g. “discount”, “low rate”, “X% off”) will be curated based on domain knowledge to identify this value proposition in the ad copies. NLP techniques (e.g. wordnet synset) and manual examination will be used to make sure this list is inclusive and accurate.
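A minimal sketch of how a curated seed list of value-proposition keywords might be expanded with WordNet synsets before manual review (seed terms are illustrative):

```python
# Expand a curated seed list of "Price Offer" keywords with WordNet synonyms;
# the expanded list is then reviewed manually before being used for matching.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_keywords(seed_terms):
    expanded = set(seed_terms)
    for term in seed_terms:
        for synset in wn.synsets(term.replace(" ", "_")):
            expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    return sorted(expanded)

print(expand_keywords(["discount", "low rate", "bargain"]))
```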
Image Feature Design

Like the text features, image features can largely be grouped into two categories:
1. Generic image features
a. These features apply to all images and include the color profile, whether any logos were detected, how many human faces are included, etc.
b. The face-related features also include some advanced aspects: we look for prominent smiling faces looking directly at the camera, we differentiate between individuals vs. small groups vs. crowds, etc.
2. Object-based features
a. These features are based on the list of objects and labels detected in all the images in the dataset, which can often be a massive list including generic objects like “Person” and specific ones like particular dog breeds.
b. The biggest challenge here is dimensionality: we have to cluster together related objects into logical themes like natural vs. urban imagery.
c. We currently have a hybrid approach to this problem: we use unsupervised clustering approaches to create an initial clustering, but we manually revise it as we inspect sample images. The process is:
  • Extract object and label names (e.g. Person, Chair, Beach, Table) from the Vision API output and filter out the most uncommon objects
  • Convert these names to 50-dimensional semantic vectors using a Word2Vec model trained on the Google News corpus
  • Using PCA, extract the top 5 principal components from the semantic vectors. This step takes advantage of the fact that each Word2Vec neuron encodes a set of commonly adjacent words, and different sets represent different axes of similarity and should be weighted differently
  • Use an unsupervised clustering algorithm, namely either k-means or DBSCAN, to find semantically similar clusters of words
  • We are also exploring augmenting this approach with a combined distance metric:
d(w1, w2) = a * (semantic distance) + b * (co-appearance distance)
where the latter is a Jaccard distance metric
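The sketch below illustrates the clustering steps above under simplifying assumptions: a pretrained Word2Vec model is available locally (path and dimensionality are placeholders), and the label list is a toy example rather than real Vision API output.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load pretrained word vectors (placeholder path; the post mentions a Word2Vec
# model trained on the Google News corpus).
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# Object and label names extracted from the Vision API output, after filtering
# out the most uncommon objects.
labels = ["person", "chair", "beach", "table", "dog", "car", "tree"]
kept = [word for word in labels if word in vectors]
embeddings = np.array([vectors[word] for word in kept])

# Reduce to the top principal components before clustering.
reduced = PCA(n_components=5).fit_transform(embeddings)

# Cluster semantically similar object names into themes.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

for word, cluster in zip(kept, clusters):
    print(word, "->", cluster)
```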

Each of these components represents a choice the advertiser made when creating the messaging for an ad. Now that we have a variety of ads broken down into components, we can ask: which components are associated with ads that perform well or not so well?

We use a fixed effects¹ model to control for unobserved differences in the context in which different ads were served. This is because the features we are measuring are observed multiple times in different contexts, i.e., ad copy, audience group, time of year, and device on which the ad is served.

The trained model will seek to estimate the impact of individual keywords, phrases & image components in the discovery ad copies. The model form estimates Interaction Rate (denoted as ‘IR’ in the following formulas) as a function of individual ad copy features + controls:
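A generic fixed effects specification of this form (a sketch of the general shape, not the exact production model) is:

$$\mathrm{IR}_{it} \;=\; \alpha_i \;+\; \sum_{k} \beta_k\, x_{k,it} \;+\; \varepsilon_{it}$$

where $\alpha_i$ is a fixed effect for the serving context $i$ (ad copy, audience group, time of year, device), $x_{k,it}$ indicates whether feature $k$ is present, and $\beta_k$ is its estimated effect on the interaction rate.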



We use ElasticNet to spread the effect of features in the presence of multicollinearity and to improve the explanatory power of the model:
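In its standard form (shown as a sketch rather than the exact production objective), the ElasticNet estimate combines L1 and L2 penalties on the coefficients:

$$\hat{\beta} \;=\; \arg\min_{\beta}\; \lVert \mathrm{IR} - X\beta \rVert_2^2 \;+\; \lambda_1 \lVert \beta \rVert_1 \;+\; \lambda_2 \lVert \beta \rVert_2^2$$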


“Machine Learning model estimates the impact of individual keywords, phrases, and image components in discovery ad copies.”

- Manisha Arora, Data Scientist

 

Outputs & Insights


Outputs from the machine learning model help us determine the significant features. The coefficient of each feature represents its percentage-point effect on CTR.

In other words, if the mean CTR without a given feature is X% and the feature ‘xx’ has a coefficient of Y, then the mean CTR with feature ‘xx’ included will be (X + Y)%. For example, if the mean CTR without a feature is 2.0% and that feature’s coefficient is 0.3, the estimated mean CTR with the feature included is 2.3%. This can help us determine the expected CTR if the most important features are included as part of the ad copies.

Key-takeaways (sample insights):

We analyze keywords and imagery tied to the unique value propositions of the product being advertised. There are six key value propositions we study in the model, and the analyses surface sample insights for each of them.
Shortcomings:

Although insights from DisCat are quite accurate and highly actionable, the model does have a few limitations:
1. The current model does not consider groups of keywords that might be driving ad performance instead of individual keywords (Example - “Buy Now” phrase instead of “Buy” and “Now” individual keywords).
2. Inference and predictions are based on historical data and aren’t necessarily an indication of future success.
3. Insights are based on industry insights and may need to be tailored for a given advertiser.

DisCat breaks down exactly which features are working well for the ad and which ones have scope for improvement. These insights can help us identify high-impact keywords in the ads, which can then be used to improve ad quality, thus improving business outcomes. As next steps, we recommend testing out the new ad copies with experiments to provide a more robust analysis. The Google Ads A/B testing feature also allows you to create and run experiments to test these insights in your own campaigns.

Summary


Discovery Ads are a great way for advertisers to extend their social outreach to millions of people across the globe. DisCat helps break down discovery ads by analyzing text and images separately and using advanced ML/AI techniques to identify the key aspects of the ad that drive greater performance. These insights help advertisers identify room for growth, identify high-impact keywords, and design better creatives that drive business outcomes.

Acknowledgement


Thank you to Shoresh Shafei and Jade Zhang for their contributions. Special mention to Nikhil Madan for facilitating the publishing of this blog.

Notes

  1. Greene, W. H. (2011). Econometric Analysis, 7th ed. Prentice Hall.

     Cameron, A. Colin; Trivedi, Pravin K. (2005). Microeconometrics: Methods and Applications.

Prediction Framework, a time saver for Data Science prediction projects

Posted by Álvaro Lamas, Héctor Parra, Jaime Martínez, Julia Hernández, Miguel Fernandes, Pablo Gil

Acquiring high value customers using predicted Lifetime Value, taking specific actions on users with a high propensity to churn, generating and activating audiences based on machine learning processed signals… all of those marketing scenarios require analyzing first-party data, performing predictions on the data, and activating the results into the different marketing platforms like Google Ads as frequently as possible to keep the data fresh.

Feeding marketing platforms like Google Ads on a regular and frequent basis requires a robust, report-oriented, and cost-efficient ETL & prediction pipeline. These pipelines are very similar regardless of the use case, and it’s very easy to fall into reinventing the wheel every time or manually copying & pasting structural code, increasing the risk of introducing errors.

Wouldn't it be great to have a common reusable structure and just add the specific code for each of the stages?

Here is where Prediction Framework plays a key role in helping you implement and accelerate your first-party data prediction projects by providing the backbone elements of the predictive process.

Prediction Framework is a fully customizable pipeline that allows you to simplify the implementation of prediction projects. You only need to have the input data source, the logic to extract and process the data and a Vertex AutoML model ready to use along with the right feature list, and the framework will be in charge of creating and deploying the required artifacts. With a simple configuration, all the common artifacts of the different stages of this type of projects will be created and deployed for you: data extraction, data preparation (aka feature engineering), filtering, prediction and post-processing, in addition to some other operational functionality including backfilling, throttling (for API limits), synchronization, storage and reporting.

The Prediction Framework was built to be hosted in the Google Cloud Platform and it makes use of Cloud Functions to do all the data processing (extraction, preparation, filtering and post-prediction processing), Firestore, Pub/Sub and Schedulers for the throttling system and to coordinate the different phases of the predictive process, Vertex AutoML to host your machine learning model and BigQuery as the final storage of your predictions.

Prediction Framework Architecture

To get involved and start using the Prediction Framework, a configuration file needs to be prepared with some environment variables about the Google Cloud Project to be used, the data sources, the ML model to make the predictions and the scheduler for the throttling system. In addition, custom queries for the data extraction, preparation, filtering and post-processing need to be added in the deploy files customization. Then, the deployment is done automatically using a deployment script provided by the tool.
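As a rough illustration of the kind of values such a configuration carries (the field names below are hypothetical, not the framework's actual schema; consult the repository for the real configuration files):

```python
# Hypothetical configuration values (illustrative only).
config = {
    "project_id": "my-gcp-project",                   # Google Cloud project to deploy into
    "bq_source_table": "source_project.dataset.transactions",   # input data source
    "model_id": "vertex-automl-model-id",             # Vertex AutoML model used for predictions
    "prediction_schedule": "0 4 * * *",               # cron schedule driving the throttling system
}
```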

Once deployed, all the stages will be executed one after the other, storing the intermediate and final data in the BigQuery tables:

  • Extract: this step will, on a timely basis, query the transactions from the data source, corresponding to the run date (scheduler or backfill run date) and will store them in a new table into the local project BigQuery.
  • Prepare: immediately after the extract of the transactions for one specific date is available, the data will be picked up from the local BigQuery and processed according to the specs of the model. Once the data is processed, it will be stored in a new table into the local project BigQuery.
  • Filter: this step will query the data stored by the prepare process, filter the required data, and store it into the local project BigQuery (e.g., only taking into consideration new customers’ transactions; what counts as a new customer is up to the instantiation of the framework for the specific use case, and is covered later).
  • Predict: once the new customers are stored, this step will read them from BigQuery and call the prediction using Vertex API. A formula based on the result of the prediction could be applied to tune the value or to apply thresholds. Once the data is ready, it will be stored into the BigQuery within the target project.
  • Post_process: A formula could be applied to the AutoML batch results to tune the value or to apply thresholds. Once the data is ready, it will be stored into the BigQuery within the target project.

One of the powerful features of the prediction framework is that it allows backfilling directly from the BigQuery user interface, so in case you’d need to reprocess a whole period of time, it could be done in literally 4 clicks.

In summary: Prediction Framework simplifies the implementation of first-party data prediction projects, saving time and minimizing errors of manual deployments of recurrent architectures.

For additional information and to start experimenting, you can visit the Prediction Framework repository on Github.


Finding Complex Metal Oxides for Technology Advancement

A crystalline material has atoms systematically arranged in repeating units, with this structure and the elements it contains determining the material’s properties. For example, silicon’s crystal structure allows it to be widely used in the semiconductor industry, whereas graphite’s soft, layered structure makes for great pencils. One class of crystalline materials that are critical for a wide range of applications, ranging from battery technology to electrolysis of water (i.e., splitting H2O into its component hydrogen and oxygen), are crystalline metal oxides, which have repeating units of oxygen and metals. Researchers suspect that there is a significant number of crystalline metal oxides that could prove to be useful, but their number and the extent of their useful properties is unknown.

In “Discovery of complex oxides via automated experiments and data science”, a collaborative effort with partners at the Joint Center for Artificial Photosynthesis (JCAP), a Department of Energy (DOE) Energy Innovation Hub at Caltech, we present a systematic search for new complex crystalline metal oxides using a novel approach for rapid materials synthesis and characterization. Using a customized inkjet printer to print samples with different ratios of metals, we were able to generate more than 350k distinct compositions, a number of which we discovered had interesting properties. One example, based on cobalt, tantalum and tin, exhibited tunable transparency, catalytic activity, and stability in strong acid electrolytes, a rare combination of properties of importance for renewable energy technologies. To stimulate continued research in this field, we are releasing a database consisting of nine channels of optical absorption measurements, which can be used as an indicator of interesting properties, across 376,752 distinct compositions of 108 3-metal oxide systems, along with model results that identify the most promising compositions for a variety of technical applications.

Background
There are on the order of 100 properties of interest in materials science that are relevant to enhancing existing technologies and to creating new ones, ranging from electrical, optical, and magnetic to thermal and mechanical. Traditionally, exploring materials for a target technology involves considering only one or a few such properties at a time, resulting in many parallel efforts where the same materials are being evaluated. Machine learning (ML) for material properties prediction has been successfully deployed in many of these parallel efforts, but the models are inherently specialized and fail to capture the universality of the prediction problem. Instead of asking traditional questions of how ML can help find a suitable material for a particular property, we instead apply ML to find a short-list of materials that may be exceptional for any given property. This strategy combines high throughput materials experiments with a physics-aware data science workflow.

A challenge in realizing this strategy is that the search space for new crystalline metal oxides is enormous. For example, the Inorganic Crystal Structure Database (ICSD) lists 73 metals that exist in oxides composed of a single metal and oxygen. Generating novel compounds simply by making various combinations of these metals would yield 62,196 possible 3-metal oxide systems, some of which will contain several unique structures. If, in addition, one were to vary the relative quantities of each metal, the set of possible combinations would be orders of magnitude larger.
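For reference, the 62,196 figure is simply the number of ways of choosing 3 of the 73 metals:

$$\binom{73}{3} \;=\; \frac{73 \times 72 \times 71}{3!} \;=\; 62{,}196$$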

However, while this search space is large, only a small fraction of these novel compositions will form new crystalline structures, with the majority simply resulting in combinations of existing structures. While these combinations of structures may be interesting for some applications, the goal is to find the core single-structure compositions. Of the possible 3-metal oxide systems, the ICSD reports only 2,205 with experimentally confirmed compositions, indicating that the vast majority of possible compositions either have not been explored or have yielded negative results and have not been published. In the present work we do not directly measure the crystal structures of new materials, but instead use high throughput experiments to enable ML-based inferences of where new structures can be found.

Synthesis
Our goal was to explore a large swath of chemical space as quickly as possible. Whereas traditional synthesis techniques like physical vapor deposition can create high quality thin films, we decided to reuse an existing technology that was already optimized to mix and deposit small amounts of material very quickly: an inkjet printer. We made each metal element printable by dissolving a metal nitrate or metal chloride into an ink solution. We then printed a series of lines on glass plates, where the ratios of the elements used in the printing varied along each line according to our experiment design so that we could generate thousands of unique compositions per plate. Several such plates were then dried and baked together in a series of ovens to oxidize the metals. Due to the inherent variability in the printing, drying, and baking of the plates, we opted to print 10 duplicates of each composition. Even with this level of replication, we still were able to generate novel compositions 100x faster than traditional vapor deposition techniques.

The modified professional grade inkjet printer.
Top: A printed and baked plate that is 10 x 15 cm. Bottom: A close-up of a portion of the plate. Since the optical properties vary with composition, the gradient in composition appears as a color gradient along each line.

Characterization
When making samples at this rate, it is hard to find a characterization technique that can keep up. A traditional approach to design a material for a specific purpose would require significant time to measure the pertinent properties of each combination, but for the analysis to keep up with our high-throughput printing method, we needed something faster. So, we built a custom microscope capable of taking pictures at nine discrete wavelengths ranging from the ultraviolet (385 nm), through the visible, to the infrared (850 nm). This microscope produced over 20 TB of image data over the course of the project, which we used to calculate the optical absorption coefficients of each sample at each wavelength. While optical absorption itself is important for technologies such as solar energy harvesting, in our work we are interested in optical absorption vs. wavelength as a fingerprint of each material.

Analysis
After generating 376,752 distinct compositions, we needed to know which ones were actually interesting. We hypothesized that since the structure of a material determines its properties, when a material property (in this case, the optical absorption spectrum) changes in a nontrivial way, that could indicate a structural change. To test this, we built two ML models to identify potentially interesting compositions.

As the composition of metals changes in a metal oxide, the crystal structure of the resulting material may change. The map of the compositions that crystallize into the same structure, which we call the phase, is the “phase diagram”. The first model, the ‘phase diagram’ model, is a physics-based model that assumes thermodynamic equilibrium, which imposes limits on the number of phases that can coexist. Assuming that the optical properties of a combination of crystalline phases vary linearly with the ratio of each crystalline phase, the model generates a set of phases that best fit the optical absorption spectra. The phase diagram model involved a comprehensive search through the space of thermodynamically allowed phase diagrams. The second model seeks to identify “emergent properties” by identifying 3-metal oxide absorption spectra that can not be explained by a linear combination of 1-metal or 2-metal oxide signals.

Phase analysis of compounds with different relative fractions of the metals iron (Fe), tin (Sn) and yttrium (Y). Left: Panels showing the absorption coefficient at different wavelengths: a) 375 nm; b) 530 nm; c) 660 nm, d) 850 nm. Right: Based on the absorption, the phase diagram model identifies the boundaries at which changes in the relative composition in the compound lead to different optical properties and hence suggest compositions with potentially interesting behavior. In panels e), f) and g), red points are candidate phases, and vertices where blue lines meet indicate interesting phase behavior. Panel h) shows the emergent property model, where compositions are colored by the log-likelihood of their properties being explainable by lower-order compositions (darker colors are more likely to represent more interesting compounds).

Experimental Verification
In the end, our systematic, combinatorial sweep of 108 3-metal oxide systems found that 51 of these systems exhibited interesting behavior. Of these 108 systems, only one has an experimentally reported entry in the ICSD. We performed an in-depth experimental study of one unexplored system, the Co-Ta-Sn oxides. With guidance from the high throughput workflow, we validated the discovery of a new family of solid solutions by x-ray diffraction, successfully resynthesized the new materials using a common technique (physical vapor deposition), validated the surprisingly high transparency in compositions with up to 30% Co, and performed follow-up electrochemical testing that demonstrated electrocatalytic activity for water oxidation (a critical step in hydrogen fuel synthesis from water). Catalyst testing for water oxidation is far more expensive than the optical screening from our high throughput workflow, and even though there is no known connection between the optical properties and the catalytic properties, we used the analysis of optical properties to select a small number of compositions for catalyst testing, demonstrating our high-level concept of using one high throughput workflow to down-select materials for practically any target technology.

Conclusions
The Co-Ta-Sn oxide example illustrates how finding new materials quickly is an important step in developing improved technologies, such as those critical for hydrogen production. We hope this work inspires the materials community — for the experimentalists, we hope to inspire creativity in aggressively scaling high-throughput techniques, and for computationalists, we hope to provide a rich dataset with plenty of negative results to better inform ML and other data science models.

Acknowledgements
It was a pleasure and a privilege to work with John Gregoire and Joel Haber at Caltech for this complex, long-running project. Additionally, we would like to thank Zan Armstrong, Sam Yang, Kevin Kan, Lan Zhou, Matthias Richter, Chris Roat, Nick Wagner, Marc Coram, Marc Berndl, Pat Riley, and Ted Baltz for their contributions.

Source: Google AI Blog


13 Most Common Google Cloud Reference Architectures

Posted by Priyanka Vergadia, Developer Advocate

Google Cloud is a cloud computing platform that can be used to build and deploy applications. It allows you to take advantage of the flexibility of development while scaling the infrastructure as needed.

I'm often asked by developers to provide a list of Google Cloud architectures that help to get started on the cloud journey. Last month, I decided to start a mini-series on Twitter called “#13DaysOfGCP" where I shared the most common use cases on Google Cloud. I have compiled the list of all 13 architectures in this post. Some of the topics covered are hybrid cloud, mobile app backends, microservices, serverless, CICD and more. If you were not able to catch it, or if you missed a few days, here we bring to you the summary!

Series kickoff: #13DaysOfGCP

  • #1: How to set up hybrid architecture in Google Cloud and on-premises
  • #2: How to mask sensitive data in chatbots using the Data Loss Prevention (DLP) API
  • #3: How to build mobile app backends on Google Cloud
  • #4: How to migrate Oracle Database to Spanner
  • #5: How to set up hybrid architecture for cloud bursting
  • #6: How to build a data lake in Google Cloud
  • #7: How to host websites on Google Cloud
  • #8: How to set up a Continuous Integration and Continuous Delivery (CICD) pipeline on Google Cloud
  • #9: How to build serverless microservices in Google Cloud
  • #10: Machine Learning on Google Cloud
  • #11: Serverless image, video or text processing in Google Cloud
  • #12: Internet of Things (IoT) on Google Cloud
  • #13: How to set up the BeyondCorp zero trust security model
  • Wrap up with a puzzle

We hope you enjoy this list of the most common reference architectures. Please let us know your thoughts in the comments below!

Federated Analytics: Collaborative Data Science without Data Collection



Federated learning, introduced in 2017, enables developers to train machine learning (ML) models across many devices without centralized data collection, ensuring that only the user has a copy of their data. It is used to power experiences like suggesting next words and expressions in Gboard for Android and improving the quality of smart replies in Android Messages. Following the success of these applications, there is a growing interest in using federated technologies to answer more basic questions about decentralized data — like computing counts or rates — that often don’t involve ML at all. Analyzing user behavior through these techniques can lead to better products, but it is essential to ensure that the underlying data remains private and secure.

Today we’re introducing federated analytics, the practice of applying data science methods to the analysis of raw data that is stored locally on users’ devices. Like federated learning, it works by running local computations over each device’s data, and only making the aggregated results — and never any data from a particular device — available to product engineers. Unlike federated learning, however, federated analytics aims to support basic data science needs. This post describes the basic methodologies of federated analytics that were developed in the pursuit of federated learning, how we extended those insights into new domains, and how recent advances in federated technologies enable better accuracy and privacy for a growing range of data science needs.

Origin of Federated Analytics
The first exploration into federated analytics was in support of federated learning: how can engineers measure the quality of federated learning models against real-world data when that data is not available in a data center? The answer was to re-use the federated learning infrastructure but without the learning part. In federated learning, the model definition can include not only the loss function that is to be optimized, but also code to compute metrics that indicate the quality of the model’s predictions. We could use this code to directly evaluate model quality on phones’ data.

As an example, Gboard engineers measured the overall quality of next word prediction models against raw typing data held on users’ phones. Participating phones downloaded a candidate model, locally computed a metric of how well the model’s predictions matched the words that were actually typed, and then uploaded the metric without any adjustment to the model’s weights or any change to the Gboard typing experience. By averaging the metrics uploaded by many phones, engineers learned a population-level summary of model performance. The technique also easily extended to estimate basic statistics like dataset sizes.
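A minimal sketch of that aggregation step, with hypothetical metric values (real rounds involve device orchestration and, as described below, secure aggregation):

```python
# Each participating phone reports only an aggregate metric and an example count;
# the server computes a population-level, example-weighted summary.
def aggregate_metrics(client_reports):
    """client_reports: list of (metric_value, num_examples) tuples uploaded by phones."""
    total_examples = sum(n for _, n in client_reports)
    return sum(metric * n for metric, n in client_reports) / total_examples

reports = [(0.62, 120), (0.58, 80), (0.70, 200)]   # hypothetical next-word accuracies
print(aggregate_metrics(reports))                  # weighted mean = 0.652
```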

Federated Analytics for Song Recognition Measurement
Beyond model evaluation, federated analytics is used to support the Now Playing feature on Google’s Pixel phones, a tool that shows you what song is playing in the room around you. Under the hood, Now Playing uses an on-device database of song fingerprints to identify music playing near the phone without the need for a network connection. The architecture is good for privacy and for users — it is fast, works offline, and no raw or processed audio data leaves the phone. Because every phone in a region receives the same database, and only songs in the database can be recognized, it’s important for the database to hold the right songs.

To measure and improve each regional database quality, engineers needed to answer a basic question: which of its songs are most often recognized? Federated analytics provides an answer without revealing which songs are heard by any individual phone. It is enabled for users who agreed to send device related usage and diagnostics information to Google.

When Now Playing recognizes a song, it records the track name into the on-device Now Playing history, where users can see recently recognized songs and add them to a music app’s playlist. Later, when the phone is idle, plugged in, and connected to WiFi, Google’s federated learning and analytics server may invite the phone to join a “round” of federated analytics computation, along with several hundred other phones. Each phone in the round computes the recognition rate for the songs in its Now Playing History, and uses the secure aggregation protocol to encrypt the results. The encrypted rates are sent to the federated analytics server, which does not have the keys to decrypt them individually. But when combined with the encrypted counts from the other phones in the round, the final tally of all song counts (and nothing else) can be decrypted by the server.

The result enables Google engineers to improve the song database (for example, by making sure the database contains truly popular songs), without any phone revealing which songs were heard. In its first improvement iteration, this resulted in a 5% increase in overall song recognition across all Pixel phones globally.

Protecting Federated Analytics with Secure Aggregation
Secure aggregation can enable stronger privacy properties for federated analytics applications. For intuition about the secure aggregation protocol, consider a simpler version of the song recognition measurement problem. Let’s say that Rakshita wants to know how often her friends Emily and Zheng have listened to a particular song. Emily has heard it S_Emily times and Zheng S_Zheng times, but neither is comfortable sharing their counts with Rakshita or each other. Instead, the trio could perform a secure aggregation: Emily and Zheng meet to decide on a random number M, which they keep secret from Rakshita. Emily reveals to Rakshita the sum S_Emily + M, while Zheng reveals the difference S_Zheng - M. Rakshita sees two numbers that are effectively random (they are masked by M), but she can add them together, (S_Emily + M) + (S_Zheng - M) = S_Emily + S_Zheng, to reveal the total number of times that the song was heard by both Emily and Zheng.
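A minimal sketch of this masked-sum idea, with hypothetical counts (illustrative only; production secure aggregation derives pairwise masks via key agreement and handles many parties and dropouts):

```python
import secrets

MODULUS = 2**32  # work in a finite group so the masked values reveal nothing on their own

s_emily, s_zheng = 4, 7          # private play counts (hypothetical values)
m = secrets.randbelow(MODULUS)   # shared random mask, hidden from the aggregator

# Each party reveals only a masked value.
masked_emily = (s_emily + m) % MODULUS
masked_zheng = (s_zheng - m) % MODULUS

# The aggregator sums the masked values; the masks cancel out.
total = (masked_emily + masked_zheng) % MODULUS
print(total)  # 11 == s_emily + s_zheng
```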

The privacy properties of this approach can be strengthened by summing over more people or by adding small random values to the counts (e.g. in support of differential privacy). For Now Playing, song recognition rates from hundreds of devices are summed together, before the result is revealed to the engineers.
An illustration of the secure aggregation protocol, from the federated learning comic book.
Toward Learning and Analytics with Greater Privacy
The methods of federated analytics are an active area of research and already go beyond analyzing metrics and counts. Sometimes, training ML models with federated learning can be used for obtaining aggregate insights about on-device data, without any of the raw data leaving the devices. For example, Gboard engineers wanted to discover new words commonly typed by users and add them to dictionaries used for spell-checking and typing suggestions, all without being able to see any words that users typed. They did it by training a character-level recurrent neural network on phones, using only the words typed on these phones that were not already in the global dictionary. No typed words ever left the phones, but the resulting model could then be used in the datacenter to generate samples of frequently typed character sequences - the new words!

We are also developing techniques for answering even more ambiguous questions on decentralized datasets like “what patterns in the data are difficult for my model to recognize?” by training federated generative models. And we’re exploring ways to apply user-level differentially private model training to further ensure that these models do not encode information unique to any one user.

Google’s commitment to our privacy principles means pushing the state of the art in safeguarding user data, be it through differential privacy in the data center or advances in privacy during data collection. Google’s earliest system for decentralized data analysis, RAPPOR, was introduced in 2014, and we’ve learned a lot about making effective decisions even with a great deal of noise (often introduced for local differential privacy) since. Federated analytics continues this line of work.

It’s still early days for the federated analytics approach and more progress is needed to answer many common data science questions with good accuracy. The recent Advances and Open Problems in Federated Learning paper offers a comprehensive survey of federated research, while Federated Heavy Hitters Discovery with Differential Privacy introduces a federated analytics method for the discovery of most frequent items in the dataset. Federated analytics enables us to think about data science differently, with decentralized data and privacy-preserving aggregation in a central role. We welcome new contributions and extensions in this emerging field.

Acknowledgments
This post reflects the work of many people, including Blaise Agüera y Arcas, Galen Andrew, Sean Augenstein, Françoise Beaufays, Kallista Bonawitz, Mingqing Chen, Hubert Eichner, Úlfar Erlingsson, Christian Frank, Anna Goralska, Marco Gruteser, Alex Ingerman, Vladimir Ivanov, Peter Kairouz, Chloé Kiddon, Ben Kreuter, Alison Lentz, Wei Li, Xu Liu, Antonio Marcedone, Rajiv Mathews, Brendan McMahan, Tom Ouyang, Sarvar Patel, Swaroop Ramaswamy, Aaron Segal, Karn Seth, Haicheng Sun, Timon Van Overveldt, Sergei Vassilvitskii, Scott Wegner, Yuanbo Zhang, Li Zhang, and Wennan Zhu.

Source: Google AI Blog


New Insights into Human Mobility with Privacy Preserving Aggregation



Understanding human mobility is crucial for predicting epidemics, urban and transit infrastructure planning, understanding people’s responses to conflict and natural disasters and other important domains. Formerly, the state-of-the-art in mobility data was based on cell carrier logs or location "check-ins", and was therefore available only in limited areas — where the telecom provider is operating. As a result, cross-border movement and long-distance travel were typically not captured, because users tend not to use their SIM card outside the country covered by their subscription plan and datasets are often bound to specific regions. Additionally, such measures involved considerable time lags and were available only within limited time ranges and geographical areas.

In contrast, de-identified aggregate flows of populations around the world can now be computed from phones' location sensors at a uniform spatial resolution. This metric has the potential to be extremely useful for urban planning since it can be measured in a direct and timely way. The use of de-identified and aggregated population flow data collected at a global level via smartphones could shed additional light on city organization, for example, while requiring significantly fewer resources than existing methods.

In “Hierarchical Organization of Urban Mobility and Its Connection with City Livability”, we show that these mobility patterns — statistics on how populations move about in aggregate — indicate a higher use of public transportation, improved walkability, lower pollutant emissions per capita, and better health indicators, including easier accessibility to hospitals. This work, which appears in Nature Communications, contributes to a better characterization of city organization and supports a stronger quantitative perspective in the efforts to improve urban livability and sustainability.
Visualization of privacy-first computation of the mobility map. Individual data points are automatically aggregated together with differential privacy noise added. Then, flows of these aggregate and obfuscated populations are studied.
Computing a Global Mobility Map While Preserving User Privacy
In line with our AI principles, we have designed a method for analyzing population mobility with privacy-preserving techniques at its core. To ensure that no individual user’s journey can be identified, we create representative models of aggregate data by employing a technique called differential privacy, together with k-anonymity, to aggregate population flows over time. Initially implemented in 2014, this approach to differential privacy intentionally adds random “noise” to the data in a way that maintains both users' privacy and the data's accuracy at an aggregate level. We use this method to aggregate data collected from smartphones of users who have deliberately chosen to opt-in to Location History, in order to better understand global patterns of population movements.

The model only considers de-identified location readings aggregated to geographical areas of predetermined sizes (e.g., S2 cells). It "snaps" each reading into a spacetime bucket by discretizing time into longer intervals (e.g., weeks) and latitude/longitude into a unique identifier of the geographical area. Aggregating into these large spacetime buckets goes beyond protecting individual privacy — it can even protect the privacy of communities.

Finally, for each pair of geographical areas, the system computes the relative flow between the areas over a given time interval, applies differential privacy filters, and outputs the global, anonymized, and aggregated mobility map. The dataset is generated only once and only mobility flows involving a sufficiently large number of accounts are processed by the model. This design is limited to heavily aggregated flows of populations, such as that already used as a vital source of information for estimates of live traffic and parking availability, which protects individual data from being manually identified. The resulting map is indexed for efficient lookup and used to fuel the modeling described below.
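The sketch below illustrates the general shape of such a pipeline under toy assumptions (a coarse lat/lng grid instead of S2 cells, placeholder parameter values, and not Google's production system): readings are snapped to spacetime buckets, flows between bucket pairs are counted, Laplace noise is added, and only sufficiently large flows are released.

```python
from collections import Counter
import numpy as np

def spacetime_bucket(lat, lng, week, cell_deg=0.5):
    # A coarse lat/lng grid stands in for S2 cells; time is discretized into weeks.
    return (round(lat / cell_deg), round(lng / cell_deg), week)

def aggregate_flows(trips, epsilon=1.0, min_count=100):
    """trips: iterable of ((lat, lng, week), (lat, lng, week)) origin/destination pairs,
    de-identified and limited to one contribution per account."""
    flows = Counter()
    for origin, dest in trips:
        flows[(spacetime_bucket(*origin), spacetime_bucket(*dest))] += 1

    rng = np.random.default_rng(0)
    released = {}
    for pair, count in flows.items():
        # Laplace noise with scale = sensitivity / epsilon (sensitivity 1 per account).
        noisy = count + rng.laplace(scale=1.0 / epsilon)
        if noisy >= min_count:        # only release flows backed by many accounts
            released[pair] = noisy
    return released
```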

Mobility Map Applications
Aggregate mobility of people in cities around the globe defines the city and, in turn, its impact on the people who live there. We define a metric, the flow hierarchy (Φ), derived entirely from the mobility map, that quantifies the hierarchical organization of cities. While hierarchies across cities have been extensively studied since Christaller’s work in the 1930s, for individual cities, the focus has been primarily on the differences between core and peripheral structures, as well as whether cities are mono- or poly-centric. Our results instead show that the reality is much more rich than previously thought. The mobility map enables a quantitative demonstration that cities lie across a spectrum of hierarchical organization that strongly correlates with a series of important quality of life indicators, including health and transportation.

Below we see an example of two cities — Paris and Los Angeles. Though they have almost the same population size, those two populations move in very different ways. Paris is mono-centric, with an "onion" structure that has a distinct high-mobility city center (red), which progressively decreases as we move away from the center (in order: orange, yellow, green, blue). On the other hand, Los Angeles is truly poly-centric, with a large number of high-mobility areas scattered throughout the region.
Mobility maps of Paris (left) and Los Angeles (right). Both cities have similar population sizes, but very different mobility patterns. Paris has an "onion" structure exhibiting a distinct center with a high degree of mobility (red) that progressively decreases as we move away from the center (in order: orange, yellow, green, blue). In contrast, Los Angeles has a large number of high-mobility areas scattered throughout the region.
More hierarchical cities — in terms of flows being primarily between hotspots of similar activity levels — have values of flow hierarchy Φ closer to the upper limit of 1 and tend to have greater levels of uniformity in their spatial distribution of movements, wider use of public transportation, higher levels of walkability, lower pollution emissions, and better indicators of various measures of health. Returning to our example, the flow hierarchy of Paris is Φ=0.93 (in the top quartile across all 174 cities sampled), while that of Los Angeles is 0.86 (bottom quartile).

We find that existing measures of urban structure, such as population density and sprawl composite indices, correlate with the flow hierarchy, but the flow hierarchy conveys additional information, including behavioral and socioeconomic factors.
Connecting flow hierarchy Φ with urban indicators in a sample of US cities. Proportion of trips as a function of Φ, broken down by mode share: private car, public transportation, and walking. Sample city names that appear in the plot: ATL (Atlanta), CHA (Charlotte), CHI (Chicago), HOU (Houston), LA (Los Angeles), MIN (Minneapolis), NY (New York City), and SF (San Francisco). We see that cities with higher flow hierarchy exhibit significantly higher rates of public transportation use, less car use, and more walkability.
Measures of urban sprawl require composite indices built up from much more detailed information on land use, population, density of jobs, and street geography, among others (sometimes up to 20 different variables). In addition to the extensive data requirements, such metrics are also costly to obtain. For example, censuses and surveys require a massive deployment of resources in terms of interviews, and are only standardized at a country level, hindering the correct quantification of sprawl indices at a global scale. On the other hand, the flow hierarchy, being constructed from mobility information alone, is significantly less expensive to compile (involving only computer processing cycles) and is available in real time.

Given the ongoing debate on the optimal structure of cities, the flow hierarchy introduces a different conceptual perspective compared to existing measures, and can shed new light on the organization of cities. From a public-policy point of view, we see that cities with a greater degree of mobility hierarchy tend to have more desirable urban indicators. Given that this hierarchy is a measure of proximity and direct connectivity between socioeconomic hubs, a possible direction could be to shape opportunity and demand in a way that favors hub-to-hub movement over a hub-and-spoke architecture. The proximity of hubs can be generated through appropriate land use, which can be shaped by data-driven zoning laws in terms of business, residence, or service areas. The presence of efficient public transportation and lower use of cars is another important factor. Perhaps a combination of policies, such as congestion pricing to disincentivize private transportation to socioeconomic hubs, along with building public transportation in a targeted fashion to directly connect the hubs, may well prove useful.

Next Steps
This work is part of our larger AI for Social Good efforts, a program that focuses Google's expertise on addressing humanitarian and environmental challenges. These mobility maps are only the first step toward making an impact in epidemiology, infrastructure planning, and disaster response, while ensuring high privacy standards.

The work discussed here goes to great lengths to ensure privacy is maintained. We are also working on newer techniques, such as on-device federated learning, to go a step further and enable computing aggregate flows without personal data leaving the device at all. By using distributed secure aggregation protocols or randomized responses, global flows can be computed without even the aggregator having knowledge of individual data points being aggregated. This technique has also been applied to help secure Chrome from malicious attacks.
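As one deliberately simplified example of this family of techniques, a basic randomized-response scheme lets an aggregator estimate a population-level rate without learning any individual's true value. The sketch below is generic, not Google's production protocol.

    import random

    def randomized_response(true_bit, p_truth=0.75):
        """Generic randomized response: report the true bit with probability
        p_truth, otherwise report a fair coin flip."""
        return true_bit if random.random() < p_truth else random.random() < 0.5

    def estimate_rate(reports, p_truth=0.75):
        """Invert the randomization to get an unbiased estimate of the true rate."""
        observed = sum(reports) / len(reports)
        return (observed - 0.5 * (1 - p_truth)) / p_truth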

Acknowledgements
This work resulted from a collaboration of Aleix Bassolas and José J. Ramasco from the Institute for Cross-Disciplinary Physics and Complex Systems (IFISC, CSIC-UIB), Brian Dickinson, Hugo Barbosa-Filho, Gourab Ghoshal, Surendra A. Hazarie, and Henry Kautz from the Computer Science Department and Ghoshal Lab at the University of Rochester, Riccardo Gallotti from the Bruno Kessler Foundation, and Xerxes Dotiwalla, Paul Eastham, Bryant Gipson, Onur Kucuktunc, Allison Lieber, Adam Sadilek at Google.

The differential privacy library used in this work is open source and available on our GitHub repo.

Source: Google AI Blog


Understanding Bias in Peer Review



In the 1600s, a series of practices came into being known collectively as the “scientific method.” These practices encoded verifiable experimentation as a path to establishing scientific fact. Scientific literature arose as a mechanism to validate and disseminate findings, and standards of scientific peer review developed as a means to control the quality of entrants into this literature. Over the course of the development of peer review, one key structural question remains unresolved to the current day: should the reviewers of a piece of scientific work be made aware of the identity of the authors? Those in favor argue that such additional knowledge may allow the reviewer to set the work in perspective and evaluate it more completely. Those opposed argue instead that the reviewer may form an opinion based on past performance rather than the merit of the work at hand.

Existing academic literature on this subject describes specific forms of bias that may arise when reviewers are aware of the authors. In 1968, Merton proposed the Matthew effect, whereby credit goes to the best established researchers. More recently, Knobloch-Westerwick et al. proposed a Matilda effect, whereby papers from male-first authors were considered to have greater scientific merit than those from female-first authors. But with the exception of one classical study performed by Rebecca Blank in 1991 at the American Economic Review, there have been few controlled experimental studies of such effects on reviews of academic papers.

Last year we had the opportunity to explore this question experimentally, resulting in “Reviewer bias in single- versus double-blind peer review,” a paper that just appeared in the Proceedings of the National Academy of Sciences. Working with Professor Min Zhang of Tsinghua University, we performed an experiment during the peer review process of the 10th ACM Web Search and Data Mining Conference (WSDM 2017) to compare the behavior of reviewers under single-blind and double-blind review. Our experiment ran as follows:
  1. We invited a number of experts to join the conference Program Committee (PC).
  2. We randomly split these PC members into a single-blind cadre and a double-blind cadre.
  3. We asked all PC members to “bid” for papers they were qualified to review, but only the single-blind cadre had access to the names and institutions of the paper authors.
  4. Based on the resulting bids, we then allocated two single-blind and two double-blind PC members to each paper.
  5. Each PC member read his or her assigned papers and entered reviews, again with only single-blind PC members able to see the authors and institutions.
At this point, we closed our experiment and performed the remainder of the conference reviewing process under the single-blind model. As a result, we were able to assess the difference in bidding and reviewing behavior of single-blind and double-blind PC members on the same papers. We discovered a number of surprises.

Our first finding shows that compared to their double-blind counterparts, single-blind PC members tend to enter higher scores for papers from top institutions (the finding holds for both universities and companies) and for papers written by well-known authors. This suggests that a paper authored by an up-and-coming researcher might be reviewed more negatively (by a single-blind PC member) than exactly the same paper written by an established star of the field.

Digging a little deeper, we show some additional findings related to the “bidding process,” in which PC members indicate which papers they would like to review. We found that single-blind PC members (a) bid for about 22% fewer papers than their double-blind counterparts, and (b) bid preferentially for papers from top schools and companies. Finding (a) is especially intriguing; with no author information, reviewers have less to go on, arguably making the job of weighing the merit of each paper more difficult. Yet the double-blind reviewers bid for more work, not less, than their single-blind counterparts. This suggests that double-blind reviewers become more engaged in the review process. Finding (b) is less surprising, but nonetheless enlightening: in the presence of author names and institutions, this information is incorporated into the reviewers’ bids. All else being equal, the odds that a single-blind reviewer bids on a paper from a top institution are about 15 percent above parity.
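For readers who want the arithmetic behind “above parity,” the sketch below computes a plain odds ratio from hypothetical bid counts; the actual analysis in the paper controls for additional factors.

    def bidding_odds_ratio(sb_top, sb_other, db_top, db_other):
        """Compare the odds of a top-institution bid for single-blind vs.
        double-blind reviewers. The counts used below are hypothetical."""
        return (sb_top / sb_other) / (db_top / db_other)

    print(bidding_odds_ratio(sb_top=460, sb_other=800, db_top=400, db_other=800))  # ~1.15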

We also studied whether the actual or perceived gender of authors influenced the behavior of single-blind versus double-blind reviewers. Here the results are a little more nuanced. Compared to double-blind reviewers, we saw about a 22% decrease in the odds that a single-blind reviewer would give a female-authored paper a favorable review, but due to the smaller count of female-authored papers this result was not statistically significant. In an extended version of our paper, we consider our study as well as a range of other studies in the literature and perform a “meta-analysis” of all these results. From this larger pool of observations, the combined results do show a significant finding for the gender effect.
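A meta-analysis of this kind pools effect estimates across studies. A minimal inverse-variance (“fixed effect”) pooling of per-study log odds ratios looks roughly like the sketch below, though the analysis in the extended paper may differ in its details.

    import math

    def fixed_effect_meta(log_odds_ratios, std_errors):
        """Minimal inverse-variance ('fixed effect') pooling of per-study
        log odds ratios; returns the pooled estimate and its standard error."""
        weights = [1.0 / se ** 2 for se in std_errors]
        pooled = sum(w * lor for w, lor in zip(weights, log_odds_ratios)) / sum(weights)
        return pooled, math.sqrt(1.0 / sum(weights))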

To conclude, we see that the practice of double-blind reviewing yields a denser landscape of bids, which may result in a better allocation of papers to qualified reviewers. We also see that reviewers who see author and institution information tend to bid more for papers from top institutions, and are more likely to vote to accept papers from top institutions or famous authors than their double-blind counterparts. This offers some evidence to suggest that a particular piece of work might be accepted under single-blind review if the authors are famous or come from top institutions, but rejected otherwise. Of course, the situation remains complex: double-blind review imposes an administrative burden on conference organizers, reduces the opportunity to detect several varieties of conflict of interest, and may in some cases be difficult to implement due to the existence of pre-prints or long-running research agendas that are well-known to experts in the field. Nonetheless, we recommend that journal editors and conference chairs carefully consider the merits of double-blind review.

Please take a look at our full paper for more details of our study.

Revisiting the Unreasonable Effectiveness of Data



There has been remarkable success in the field of computer vision over the past decade, much of which can be directly attributed to the application of deep learning models to this machine perception task. Furthermore, since 2012 there have been significant advances in the representation capabilities of these systems due to (a) deeper models with high complexity, (b) increased computational power, and (c) the availability of large-scale labeled data. And while every year we get further increases in computational power and model complexity (from the 7-layer AlexNet to the 101-layer ResNet), available datasets have not scaled accordingly. A 101-layer ResNet with significantly more capacity than AlexNet is still trained with the same 1M images from ImageNet circa 2011. As researchers, we have always wondered: if we scale up the amount of training data 10x, will the accuracy double? How about 100x or maybe even 300x? Will the accuracy plateau, or will we continue to see increasing gains with more and more data?
While GPU computation power and model sizes have continued to increase over the last five years, the size of the largest training dataset has surprisingly remained constant.
In our paper, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, we take the first steps toward clearing the clouds of mystery surrounding the relationship between “enormous data” and deep learning. Our goal was to explore: (a) whether visual representations can still be improved by feeding more and more images with noisy labels to currently existing algorithms; (b) the nature of the relationship between data and performance on standard vision tasks such as classification, object detection and image segmentation; and (c) state-of-the-art models for all the tasks in computer vision using large-scale learning.

Of course, the elephant in the room is where we can obtain a dataset that is 300x larger than ImageNet. At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18,291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses a complex mixture of raw web signals, connections between web pages, and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize the label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.

Our experimental results validate some of the hypotheses but also generate some unexpected surprises:
  • Better Representation Learning Helps. Our first observation is that large-scale data helps in representation learning, which in turn improves the performance on each vision task we study. Our findings suggest that a collective effort to build a large-scale dataset for pretraining is important. It also suggests a bright future for unsupervised and semi-supervised representation learning approaches. It seems the scale of data continues to overpower noise in the label space.
  • Performance Increases Linearly with Orders of Magnitude of Training Data. Perhaps the most surprising finding is the relationship between performance on vision tasks and the amount of training data (log-scale) used for representation learning. We find that this relationship is still linear! Even at 300M training images, we do not observe any plateauing effect for the tasks studied (see the sketch after this list).
Object detection performance when pre-trained on different subsets of JFT-300M from scratch. The x-axis is the dataset size in log-scale; the y-axis is the detection performance in mAP@[.5,.95] on the COCO minival subset.
  • Capacity is Crucial. We also observe that to fully exploit 300M images, one needs higher-capacity (deeper) models. For example, in the case of ResNet-50 the gain on the COCO object detection benchmark is much smaller (1.87%) than the gain when using ResNet-152 (3%).
  • New state of the art results. Our paper presents new state-of-the-art results on several benchmarks using the models learned from JFT-300M. For example, a single model (without any bells and whistles) can now achieve 37.4 AP as compared to 34.3 AP on the COCO detection benchmark.
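To make the log-linear trend from the second bullet concrete, here is a minimal curve fit on entirely hypothetical numbers (none of them are taken from the paper); it shows the shape of the relationship, not the paper's actual measurements.

    import numpy as np

    # Hypothetical points only, to show the shape of a log-linear fit:
    # performance = a * log10(dataset size) + b.
    sizes = np.array([10e6, 30e6, 100e6, 300e6])     # pre-training images
    scores = np.array([31.0, 33.1, 35.0, 37.1])      # illustrative mAP values

    a, b = np.polyfit(np.log10(sizes), scores, deg=1)
    print(f"~{a:.1f} mAP gained per 10x more pre-training data")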
It is important to highlight that the training regime, learning schedules and parameters we used are based on our understanding of training ConvNets with 1M images from ImageNet. Since we do not search for the optimal set of hyper-parameters in this work (which would have required considerable computational effort), it is highly likely that these results are not the best ones you can obtain when using this scale of data. Therefore, we consider the quantitative performance reported to be an underestimate of the actual impact of data.

This work does not focus on task-specific data, such as exploring whether more bounding boxes affect model performance. We believe that, although challenging, obtaining large-scale task-specific data should be the focus of future study. Furthermore, building a dataset of 300M images should not be a final goal; as a community, we should explore whether models continue to improve in the regime of even larger (1 billion+ image) datasets.

Core Contributors
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta

Acknowledgments
This work would not have been possible without the significant efforts of the Image Understanding and Expander teams at Google who built the massive JFT dataset. We would specifically like to thank Tom Duerig, Neil Alldrin, Howard Zhou, Lu Chen, David Cai, Gal Chechik, Zheyun Feng, Xiangxin Zhu and Rahul Sukthankar for their help. Also big thanks to the VALE team for APIs and specifically, Jonathan Huang, George Papandreou, Liang-Chieh Chen and Kevin Murphy for helpful discussions.