Tag Archives: Collaboration

Croissant: a metadata format for ML-ready datasets

Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.

ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.

There are general purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test and validation sets.

Today, we're introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn't change how the actual data is represented (e.g., image or text file formats) — it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML relevant metadata, data resources, data organization, and default ML semantics.
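To give a feel for the format, here is a minimal, illustrative sketch of Croissant-style metadata expressed as a Python dictionary and serialized to JSON-LD. The field names and values are simplified and not an authoritative example; consult the Croissant specification for the exact schema.

```python
import json

# Illustrative sketch only: field names follow the Croissant/schema.org style,
# but the authoritative structure is defined by the Croissant specification.
croissant_metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "toy_image_dataset",  # hypothetical dataset
    "description": "A small example dataset described with Croissant-style metadata.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_image_dataset",  # placeholder URL
    # Data resources: the raw files that make up the dataset.
    "distribution": [
        {
            "@type": "FileObject",
            "name": "images.zip",
            "contentUrl": "https://example.org/data/images.zip",  # placeholder
            "encodingFormat": "application/zip",
        }
    ],
    # Data organization and default ML semantics: how records and fields are laid out.
    "recordSet": [
        {
            "@type": "RecordSet",
            "name": "images",
            "field": [
                {"@type": "Field", "name": "image", "dataType": "ImageObject"},
                {"@type": "Field", "name": "label", "dataType": "Text"},
            ],
        }
    ],
}

print(json.dumps(croissant_metadata, indent=2))
```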

In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.


Croissant

This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.
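As a rough sketch of how the Python library can be used to consume a dataset, the snippet below assumes the library is the open-source mlcroissant package and that it exposes a Dataset class with a records() iterator, as in its public examples; the URL and record-set name are placeholders.

```python
# Hedged sketch: assumes the open-source `mlcroissant` package and its
# Dataset/records() API; the URL and record set name below are placeholders.
import mlcroissant as mlc

dataset = mlc.Dataset(jsonld="https://example.org/datasets/toy/croissant.json")

# Iterate over a few records from a (hypothetical) record set named "images".
for i, record in enumerate(dataset.records(record_set="images")):
    print(record)  # a dict-like mapping of field names to values
    if i >= 4:
        break
```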

Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.


Why a shared format for ML data?

The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.

The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.

Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets, while only requiring a minimal effort, thanks to the available creation tools and support from ML data platforms.


What can Croissant do today?

The Croissant ecosystem: users can search for Croissant datasets, download them from major repositories, and easily load them into their favorite ML frameworks. They can also create, inspect, and modify Croissant metadata using the Croissant editor.

Today, users can find Croissant datasets at:

With a Croissant dataset, it is possible to:

To publish a Croissant dataset, users can:

  • Use the Croissant editor UI (GitHub) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill in important metadata fields such as RAI properties.
  • Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
  • Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face and OpenML, and automatically generate Croissant metadata.

Future direction

We are excited about Croissant's potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and to embed Croissant metadata in dataset Web pages so that they can be made discoverable by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.

We encourage the community to join us in contributing to the effort.


Acknowledgements

Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, Kings College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.

Source: Google AI Blog


Supporting benchmarks for AI safety with MLCommons

Standard benchmarks are agreed upon ways of measuring important product qualities, and they exist in many fields. Some standard benchmarks measure safety: for example, when a car manufacturer touts a “five-star overall safety rating,” they’re citing a benchmark. Standard benchmarks already exist in machine learning (ML) and AI technologies: for instance, the MLCommons Association operates the MLPerf benchmarks that measure the speed of cutting edge AI hardware such as Google’s TPUs. However, though there has been significant work done on AI safety, there are as yet no similar standard benchmarks for AI safety.

We are excited to support a new effort by the non-profit MLCommons Association to develop standard AI safety benchmarks. Developing benchmarks that are effective and trusted will require advancing AI safety testing technology and incorporating a broad range of perspectives. The MLCommons effort aims to bring together expert researchers across academia and industry to develop standard benchmarks that measure the safety of AI systems and translate the results into scores that everyone can understand. We encourage the whole community, from AI researchers to policy experts, to join us in contributing to the effort.


Why AI safety benchmarks?

Like most advanced technologies, AI has the potential for tremendous benefits but could also lead to negative outcomes without appropriate care. For example, AI technology can boost human productivity in a wide range of activities (e.g., improve health diagnostics and research into diseases, analyze energy usage, and more). However, without sufficient precautions, AI could also be used to support harmful or malicious activities and respond in biased or offensive ways.

By providing standard measures of safety across categories such as harmful use, out-of-scope responses, AI-control risks, etc., standard AI safety benchmarks could help society reap the benefits of AI while ensuring that sufficient precautions are being taken to mitigate these risks. Initially, nascent safety benchmarks could help drive AI safety research and inform responsible AI development. With time and maturity, they could help inform users and purchasers of AI systems. Eventually, they could be a valuable tool for policy makers.

In computer hardware, benchmarks (e.g., SPEC, TPC) have shown an amazing ability to align research, engineering, and even marketing across an entire industry in pursuit of progress, and we believe standard AI safety benchmarks could help do the same in this vital area.


What are standard AI safety benchmarks?

Academic and corporate research efforts have experimented with a range of AI safety tests (e.g., RealToxicityPrompts, Stanford HELM fairness, bias, toxicity measurements, and Google’s guardrails for generative AI). However, most of these tests focus on providing a prompt to an AI system and algorithmically scoring the output, which is a useful start but limited to the scope of the test prompts. Further, they usually use open datasets for the prompts and responses, which may already have been (often inadvertently) incorporated into training data.
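As a concrete, deliberately simplified illustration of this prompt-and-score pattern, the sketch below runs a set of test prompts through a model and scores each response with an automated classifier. Both generate_response and score_toxicity are hypothetical placeholders standing in for a real model endpoint and a real scoring function; this is not any specific benchmark's implementation.

```python
from typing import Callable, List

def prompt_and_score(
    prompts: List[str],
    generate_response: Callable[[str], str],   # placeholder for a model under test
    score_toxicity: Callable[[str], float],    # placeholder for an automated scorer
    threshold: float = 0.5,
) -> float:
    """Returns the fraction of responses flagged as unsafe by the scorer."""
    flagged = 0
    for prompt in prompts:
        response = generate_response(prompt)
        if score_toxicity(response) >= threshold:
            flagged += 1
    return flagged / max(len(prompts), 1)

# Example usage with trivial stand-ins (a real benchmark would use held-out,
# non-public prompts and a carefully validated scoring model).
if __name__ == "__main__":
    prompts = ["Tell me about your day.", "How do I stay safe online?"]
    rate = prompt_and_score(prompts, lambda p: "I'm happy to help.", lambda r: 0.0)
    print(f"Unsafe response rate: {rate:.2%}")
```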

MLCommons proposes a multi-stakeholder process for selecting tests and grouping them into subsets to measure safety for particular AI use-cases, and for translating the highly technical results of those tests into scores that everyone can understand. MLCommons is proposing to create a platform that brings these existing tests together in one place and encourages the creation of more rigorous tests that move the state of the art forward. Users will be able to access these tests both through online testing, where they can generate and review scores, and through offline testing with an engine for private testing.


AI safety benchmarks should be a collective effort

Responsible AI developers use a diverse range of safety measures, including automatic testing, manual testing, red teaming (in which human testers attempt to produce adversarial outcomes), software-imposed restrictions, data and model best-practices, and auditing. However, determining that sufficient precautions have been taken can be challenging, especially as the community of companies providing AI systems grows and diversifies. Standard AI benchmarks could provide a powerful tool for helping the community grow responsibly, both by helping vendors and users measure AI safety and by encouraging an ecosystem of resources and specialist providers focused on improving AI safety.

At the same time, development of mature AI safety benchmarks that are both effective and trusted is not possible without the involvement of the community. This effort will need researchers and engineers to come together and provide innovative yet practical improvements to safety testing technology that make testing both more rigorous and more efficient. Similarly, companies will need to come together and provide test data, engineering support, and financial support. Some aspects of AI safety can be subjective, and building trusted benchmarks supported by a broad consensus will require incorporating multiple perspectives, including those of public advocates, policy makers, academics, engineers, data workers, business leaders, and entrepreneurs.


Google’s support for MLCommons

Grounded in our AI Principles that were announced in 2018, Google is committed to specific practices for the safe, secure, and trustworthy development and use of AI (see our 2019, 2020, 2021, 2022 updates). We’ve also made significant progress on key commitments, which will help ensure AI is developed boldly and responsibly, for the benefit of everyone.

Google is supporting the MLCommons Association's efforts to develop AI safety benchmarks in a number of ways.

  1. Testing platform: We are joining with other companies in providing funding to support the development of a testing platform.
  2. Technical expertise and resources: We are providing technical expertise and resources, such as the Monk Skin Tone Examples Dataset, to help ensure that the benchmarks are well-designed and effective.
  3. Datasets: We are contributing an internal dataset for multilingual representational bias, as well as already externalized tests for stereotyping harms, such as SeeGULL and SPICE. Moreover, we are sharing our datasets that focus on collecting human annotations responsibly and inclusively, like DICES and SRP.

Future direction

We believe that these benchmarks will be very useful for advancing research in AI safety and ensuring that AI systems are developed and deployed in a responsible manner. AI safety is a collective-action problem. Groups like the Frontier Model Forum and Partnership on AI are also leading important standardization initiatives. We’re pleased to have been part of these groups and MLCommons since their beginning. We look forward to additional collective efforts to promote the responsible development of new generative AI tools.


Acknowledgements

Many thanks to the Google team that contributed to this work: Peter Mattson, Lora Aroyo, Chris Welty, Kathy Meier-Hellstern, Parker Barnes, Tulsee Doshi, Manvinder Singh, Brian Goldman, Nitesh Goyal, Alice Friend, Nicole Delange, Kerry Barker, Madeleine Elish, Shruti Sheth, Dawn Bloxwich, William Isaac, Christina Butterfield.

Source: Google AI Blog


Improving traffic evacuations: A case study

Some cities or communities develop an evacuation plan to be used in case of an emergency. There are a number of reasons why city officials might enact their plan, a primary one being a natural disaster, such as a tornado, flood, or wildfire. An evacuation plan can help the community more effectively respond to an emergency, and so could help save lives. However, it can be difficult for a city to evaluate such a plan because it is not practical to have an entire town or city rehearse a full-blown evacuation. For example, Mill Valley, a city in northern California, created a wildfire evacuation plan but lacked an estimate for how long the evacuation would take.

Today we describe a case study in which we teamed up with the city of Mill Valley to test and improve their evacuation plan. We outline our approach in our paper, “Mill Valley Evacuation Study”. We started by using a traffic simulator to model a citywide evacuation. The research goal was to provide the city with detailed estimates for how long it would take to evacuate the city, and, by studying the egress pattern, to find modifications to make the plan more effective. While our prior work on this subject provided an estimate for the evacuation time and showed how the time could be reduced if certain road changes were implemented, it turns out the recommendations in that paper — such as changing the number of outgoing lanes on an arterial — were not feasible. The current round of research improves upon the initial study by more accurately modeling the number and starting locations of vehicles, by using a more realistic map, and by working closely with city officials to ensure that recommended changes to the plan are deemed viable.



Geography and methodology

Mill Valley is in Marin County, California, north of San Francisco. Many of the residences are located on the steep hillsides of several valleys surrounded by dense redwood forests.

Aerial views of Mill Valley, courtesy of the City of Mill Valley.

Many of those residences are in areas that have only one exit direction, toward the town center. From there the best evacuation route is toward Highway 101, which is in the flat part of the city and is the most likely area to be far from potential wildfires. Some neighborhoods have other routes that lead away from both the city and Highway 101, but those routes pass through hilly forested areas, which could be dangerous or impassable during a wildfire. So, the evacuation plan directs all vehicles west of Highway 101 to head east, to the highway (see map below). The neighborhoods east of Highway 101 are not included in the simulation because they are away from areas with a high fire hazard rating, and are close to the highway.

Mill Valley has about 11,400 households west of Highway 101. Most Mill Valley households have two vehicles. Evacuation times scale with the number of vehicles, so it is in the common interest to minimize the number of vehicles used during an evacuation. To that end, Mill Valley has a public awareness campaign aimed at having each household evacuate in one vehicle. While no one knows how many vehicles would be used during an evacuation, it is safe to assume it is on average between one and two per household. The basic evacuation problem, then, is how to efficiently get between 11 and 23 thousand vehicles from the various residences onto one of the three sets of Highway 101 on-ramps.

The simulated part of Mill Valley west of Highway 101 is inside the blue border. Highway 101 is shown in green. The red squares indicate the three sets of Highway 101 on-ramps. The pink area has the highest fire hazard rating.

The current work uses the same general methodology as the previous research, namely, running the open source SUMO agent-based traffic simulator on a map of Mill Valley. The traffic simulator models traffic by simulating each vehicle individually. The detailed behaviors of vehicles are dictated by a car-following model. Each vehicle is given a point and time at which to start and an initial route. The routes of most vehicles are updated throughout the simulation, depending on conditions. To consider potential changes in driver behavior under the high-stress conditions of an evacuation, the effects of the “aggressiveness” of each car are also investigated, but in our case the impacts are minimal. Some simplifying assumptions are that vehicles originate at residential addresses and that the roads and highways are initially empty. These assumptions correspond approximately to conditions that could be encountered if an evacuation happens in the middle of the night. The main inputs to the simulation are the road network, the household locations, the average number of vehicles per household, and a departure time distribution. We have to make assumptions about the departure distribution; after consulting with city officials, we chose a distribution in which most vehicles depart within an hour.
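To make the simulation inputs concrete, here is a minimal sketch of how one might generate per-household vehicle counts and departure times and write them out as a SUMO trip file. The edge IDs, the 1.5-vehicles-per-household split, and the exponential departure distribution (with most departures inside an hour) are assumptions for illustration, not the parameters used in the study; consult the SUMO documentation for the exact trip-file schema.

```python
import random

random.seed(0)

NUM_HOUSEHOLDS = 11_400            # households west of Highway 101 (from the study)
ORIGIN_EDGES = ["res_edge_a", "res_edge_b", "res_edge_c"]  # placeholder residential edges
DEST_EDGE = "hwy101_onramp_north"  # placeholder destination edge

trips = []
for h in range(NUM_HOUSEHOLDS):
    # Assume roughly 1.5 vehicles per household on average (each household uses 1 or 2).
    vehicles = 1 if random.random() < 0.5 else 2
    for v in range(vehicles):
        # Departure time in seconds; most vehicles leave within the first hour.
        depart = min(random.expovariate(1 / 1500.0), 7200.0)
        trips.append((f"veh_{h}_{v}", depart, random.choice(ORIGIN_EDGES), DEST_EDGE))

# SUMO expects trips sorted by departure time.
trips.sort(key=lambda t: t[1])

with open("evacuation.trips.xml", "w") as f:
    f.write("<routes>\n")
    for trip_id, depart, origin, dest in trips:
        f.write(f'  <trip id="{trip_id}" depart="{depart:.1f}" from="{origin}" to="{dest}"/>\n')
    f.write("</routes>\n")
print(f"Wrote {len(trips)} trips.")
```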


Four bottlenecks

Mill Valley has three sets of Highway 101 on-ramps: northern, middle, and southern. All the vehicles must use one of these sets of on-ramps to reach their destination (either the northernmost or southernmost segment of Highway 101 included in our map). Given that we are only concerned with the majority of Mill Valley that lies west of the highway, there are two lanes that approach the northern on-ramps, and one lane that approaches each of the middle and southern on-ramps. Since every vehicle has to pass over one of these four lanes to reach the highway, they are the bottlenecks. Given the geography and existing infrastructure, adding more lanes is infeasible. The aim of this research, then, is to try to modify traffic patterns to maximize the rate of traffic on each of the four lanes.


Evacuation plan

When we started this research, Mill Valley had a preliminary evacuation plan. It included modifying traffic patterns — disabling traffic lights and changing traffic rules — on a few road segments, as well as specifying the resources (traffic officers, signage) necessary to implement the changes. As an example, a two-way road may be changed to a one-way road to double the number of outgoing lanes. Temporarily changing the direction of traffic is called contraflow.

The plot below shows the simulated fraction of vehicles that have departed or reached their destinations versus time, for 1, 1.5, and 2 vehicles per household (left to right). The dashed line on the far left shows the fraction that have departed. The solid black lines show the preliminary evacuation plan results and the dotted lines indicate the normal road network (baseline) results. The preliminary evacuation plan significantly speeds up the evacuation.

The cumulative fraction of vehicles vs. time in hours. The demand curve is shown in the dashed line on the far left. The solid lines show the preliminary evacuation plan curves for 1, 1.5 and 2 vehicles per household (left to right). The dotted lines show the same for the baseline case.

We can understand how effective the preliminary evacuation plan is by measuring the rates at the bottlenecks. The below plots show the rate of traffic on each of the four lanes leading to the highway on-ramps for the case of 1.5 vehicles per household for both the baseline case (the normal road rules; shown shaded in gray) and the preliminary evacuation plan (shown outlined in black). The average rate per lane varies greatly in the different cases. It is clear that, while the evacuation plan leads to increased evacuation rates, there is room for improvement. In particular, the middle on-ramps are quite underutilized.

The rates of traffic on the four lanes leading to Highway 101 on-ramps for both the baseline case (normal road rules; shown shaded in gray) and the preliminary evacuation plan (shown outlined in black).

Final evacuation plan

After studying the map and investigating different alternatives, we, working together with city officials, found a minimal set of new road changes that substantially lower the evacuation time compared to the preliminary evacuation plan (shown below). We call this the final evacuation plan. It extends the contraflow section of the preliminary plan 1000 feet further west, to a main intersection. Crucially, this allows for one of the (normally) two outgoing lanes to be dedicated to routing traffic to the middle on-ramps. It also creates two outgoing lanes from that main intersection clear through to the northern on-ramps, over ¾ of a mile to the east.

A map of the main changes in the final evacuation plan. The red line shows that traffic heading north on Camino Alto gets diverted to the middle Highway 101 on-ramps. The blue line shows traffic in the northern lane of E Blithedale Ave gets routed on the new contraflow section.

The rate per lane plots comparing the preliminary and final evacuation plans are shown below for 1.5 vehicles per household. The simulation indicates that the final plan increases the average rate of traffic on the lane leading to the middle on-ramps from about 4 vehicles per minute to about 18. It also increases the through rate of the northern on-ramps by over 60%.

The rates of traffic on the four lanes leading to Highway 101 on-ramps for both the preliminary case (shown shaded in gray) and the final evacuation plan (shown outlined in black).

The below plot shows the cumulative fraction of vehicles vs. time, comparing the cases of 1, 1.5 and 2 vehicles per household for the preliminary and final evacuation plans. The speedup is quite significant, on the scale of hours. For example, with 1.5 vehicles per household, it took 5.3 hours to evacuate the city using the preliminary evacuation plan, and only 3.5 hours using the final plan.

The cumulative fraction of vehicles vs. time in hours. The demand curve is shown in the dashed line on the far left. The solid lines show the final evacuation plan curves for 1, 1.5 and 2 vehicles per household (left to right). The dotted lines show the same for the preliminary evacuation plan.

Conclusion

Evacuation plans can be crucial in quickly getting many people to safety in emergency situations. While some cities have traffic evacuation plans in place, it can be difficult for officials to learn how well the plan works or whether it can be improved. Google Research helped Mill Valley test and evaluate their evacuation plan by running traffic simulations. We found that, while the preliminary plan did speed up the evacuation time, some minor changes to the plan significantly expedited evacuation. We worked closely with the city during this research, and Mill Valley has adopted the final plan. We were able to provide the city with more simulation details, including results for evacuating the city one area at a time. Full details can be found in the paper.

Detailed recommendations for a particular evacuation plan are necessarily specific to the area under study. So, the specific road network changes we found for Mill Valley are not directly applicable for other cities. However, we used only public data (road network from OpenStreetMap; household information from census data) and an open source simulator (SUMO), so any city or agency could use the methodology used in our paper to obtain results for their area.


Acknowledgements

We thank former Mayor John McCauley and City of Mill Valley personnel Tom Welch, Lindsay Haynes, Danielle Staude, Rick Navarro and Alan Piombo for numerous discussions and feedback, and Carla Bromberg for program management.

Source: Google AI Blog


Data-centric ML benchmarking: Announcing DataPerf’s 2023 challenges

Machine learning (ML) offers tremendous potential, from diagnosing cancer to engineering safe self-driving cars to amplifying human productivity. To realize this potential, however, organizations need ML solutions to be reliable with ML solution development that is predictable and tractable. The key to both is a deeper understanding of ML data — how to engineer training datasets that produce high quality models and test datasets that deliver accurate indicators of how close we are to solving the target problem.

The process of creating high quality datasets is complicated and error-prone, from the initial selection and cleaning of raw data, to labeling the data and splitting it into training and test sets. Some experts believe that the majority of the effort in designing an ML system is actually the sourcing and preparing of data. Each step can introduce issues and biases. Even many of the standard datasets we use today have been shown to have mislabeled data that can destabilize established ML benchmarks. Despite the fundamental importance of data to ML, it’s only now beginning to receive the same level of attention that models and learning algorithms have been enjoying for the past decade.

Towards this goal, we are introducing DataPerf, a set of new data-centric ML challenges to advance the state-of-the-art in data selection, preparation, and acquisition technologies, designed and built through a broad collaboration across industry and academia. The initial version of DataPerf consists of four challenges focused on three common data-centric tasks across three application domains: vision, speech, and natural language processing (NLP). In this blog post, we outline dataset development bottlenecks confronting researchers and discuss the role of benchmarks and leaderboards in incentivizing researchers to address these challenges. We invite innovators in academia and industry who seek to measure and validate breakthroughs in data-centric ML to demonstrate the power of their algorithms and techniques to create and improve datasets through these benchmarks.


Data is the new bottleneck for ML

Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense the model is a lossy compiler for the data. Though high-quality training datasets are vital to continued advancement in the field of ML, much of the data on which the field relies today is nearly a decade old (e.g., ImageNet or LibriSpeech) or scraped from the web with very limited filtering of content (e.g., LAION or The Pile).

Despite the importance of data, ML research to date has been dominated by a focus on models. Before modern deep neural networks (DNNs), there were no ML models sufficient to match human behavior for many simple tasks. This starting condition led to a model-centric paradigm in which (1) the training dataset and test dataset were “frozen” artifacts and the goal was to develop a better model, and (2) the test dataset was selected randomly from the same pool of data as the training set for statistical reasons. Unfortunately, freezing the datasets ignored the ability to improve training accuracy and efficiency with better data, and using test sets drawn from the same pool as training data conflated fitting that data well with actually solving the underlying problem.

Because we are now developing and deploying ML solutions for increasingly sophisticated tasks, we need to engineer test sets that fully capture real world problems and training sets that, in combination with advanced models, deliver effective solutions. We need to shift from today’s model-centric paradigm to a data-centric paradigm in which we recognize that for the majority of ML developers, creating high quality training and test data will be a bottleneck.

Shifting from today’s model-centric paradigm to a data-centric paradigm enabled by quality datasets and data-centric algorithms like those measured in DataPerf.

Enabling ML developers to create better training and test datasets will require a deeper understanding of ML data quality and the development of algorithms, tools, and methodologies for optimizing it. We can begin by recognizing common challenges in dataset creation and developing performance metrics for algorithms that address those challenges. For instance:

  • Data selection: Often, we have a larger pool of available data than we can label or train on effectively. How do we choose the most important data for training our models?
  • Data cleaning: Human labelers sometimes make mistakes. ML developers can’t afford to have experts check and correct all labels. How can we select the most likely-to-be-mislabeled data for correction? (A minimal sketch of one such strategy follows this list.)
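For instance, one simple (and far from state-of-the-art) cleaning strategy is to rank examples by how strongly a trained model disagrees with the assigned label and send the most suspicious ones to human reviewers. The sketch below assumes you already have model-predicted class probabilities for each example; it is illustrative only, not a DataPerf baseline.

```python
import numpy as np

def rank_relabel_candidates(pred_probs: np.ndarray, noisy_labels: np.ndarray, budget: int):
    """Return indices of the `budget` examples most likely to be mislabeled.

    pred_probs:   (n_examples, n_classes) model-predicted class probabilities.
    noisy_labels: (n_examples,) integer labels assigned by (possibly noisy) labelers.
    """
    # Confidence the model assigns to the given (possibly wrong) label.
    label_confidence = pred_probs[np.arange(len(noisy_labels)), noisy_labels]
    # Low confidence in the assigned label means more likely to be mislabeled.
    suspicious_order = np.argsort(label_confidence)
    return suspicious_order[:budget]

# Toy usage: 4 examples, 2 classes; example 2 looks mislabeled.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.6, 0.4]])
labels = np.array([0, 1, 1, 0])
print(rank_relabel_candidates(probs, labels, budget=1))  # -> [2]
```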

We can also create incentives that reward good dataset engineering. We anticipate that high quality training data, which has been carefully selected and labeled, will become a valuable product in many industries, but we presently lack a way to assess the relative value of different datasets without actually training on the datasets in question. How do we solve this problem and enable quality-driven “data acquisition”?


DataPerf: The first leaderboard for data

We believe good benchmarks and leaderboards can drive rapid progress in data-centric technology. ML benchmarks in academia have been essential to stimulating progress in the field. Consider the following graph which shows progress on popular ML benchmarks (MNIST, ImageNet, SQuAD, GLUE, Switchboard) over time:

Performance over time for popular benchmarks, normalized with initial performance at minus one and human performance at zero. (Source: Douwe, et al. 2021; used with permission.)

Online leaderboards provide official validation of benchmark results and catalyze communities intent on optimizing those benchmarks. For instance, Kaggle has over 10 million registered users. The MLPerf official benchmark results have helped drive an over 16x improvement in training performance on key benchmarks.

DataPerf is the first community and platform to build leaderboards for data benchmarks, and we hope to have an analogous impact on research and development for data-centric ML. The initial version of DataPerf consists of leaderboards for four challenges focused on three data-centric tasks (data selection, cleaning, and acquisition) across three application domains (vision, speech and NLP):

  • Training data selection (Vision): Design a data selection strategy that chooses the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech): Design a data selection strategy that chooses the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision): Design a data cleaning strategy that chooses samples to relabel from a “noisy” training set where some of the labels are incorrect.
  • Training dataset evaluation (NLP): Quality datasets can be expensive to construct, and are becoming valuable commodities. Design a data acquisition strategy that chooses which training dataset to “buy” based on limited information about the data.

For each challenge, the DataPerf website provides design documents that define the problem, test model(s), quality target, rules and guidelines on how to run the code and submit. The live leaderboards are hosted on the Dynabench platform, which also provides an online evaluation framework and submission tracker. Dynabench is an open-source project, hosted by the MLCommons Association, focused on enabling data-centric leaderboards for both training and test data and data-centric algorithms.


How to get involved

We are part of a community of ML researchers, data scientists and engineers who strive to improve data quality. We invite innovators in academia and industry to measure and validate data-centric algorithms and techniques to create and improve datasets through the DataPerf benchmarks. The deadline for the first round of challenges is May 26th, 2023.


Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive.ai, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Factored, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

Source: Google AI Blog


Connecting and collaborating with Google Meet & G Suite

In this unprecedented time in history, we’ve seen millions of businesses across India adapt to the new normal of working remotely and learning from home. It is human nature to connect, and video conferencing plays a pivotal role here. 
We saw more than 3 million new users connecting on Google Meet every day as of this month, spending over 3 billion minutes a day together. That’s a massive 30x jump from the numbers we saw earlier in January. And today, we’re making our premium video conferencing tool, Google Meet, free for everyone, with availability rolling out over the coming weeks. 
Stay home, stay safe, stay connected 
Starting May, anyone with an email address can create a Google Account and use Meet to schedule, join or start secure video meetings with anyone — whether it’s a virtual dance class, a weekly book club, neighborhood meetings, or any other reason to connect with your community, friends, and family. 
Until today, Google Meet was only available as part of G Suite, our collaboration and productivity solution for businesses, organizations and schools. Now, it is freely available on the web at meet.google.com and via mobile apps for iOS or Android. And if you use Google Calendar, you can easily start or join from there too. 
We’re also rolling out new features including tiled layout for larger calls, the option to present a Chrome tab (instead of just presenting your window or entire screen), low-light mode, and eventually noise cancellation. 
Secure by design
Meet is designed, built and operated to be secure at scale — for everyone. We employ a vast array of safe-by-default measures to keep your meetings safe, every time, without you having to do a thing. We don't require or ask for any plugins to be installed, reducing the amount of software users and businesses need to patch with security updates on their machines. 
We also ensure that only authorised users can use and access Meet services by offering a 2-Step Verification option for accounts — making them secure and convenient. Google Meet users can enroll their accounts in our Advanced Protection Program (APP), which provides our strongest protections available against phishing and account hijacking, and is specifically designed for the highest-risk accounts.
Helping businesses collaborate with G Suite
We’re not just connecting over video. We’ve also seen huge spikes in the use of our entire G Suite offering as more people create, share, and connect together while working remotely. Earlier this year we marked another major milestone — surpassing six million paying businesses and organisations that use G Suite. 
“Employees are able to access every business application via Google Cloud Platform and continue to communicate as usual not only between themselves but also with customers, vendors and other stakeholders with G Suite. The Google Cloud team is always accessible and supportive to help us ease things. The use of collaborative tools has facilitated important human contact and responsiveness in an unprecedented time of remote work,” said V M Samir, Group CIO, Rustomjee, a leading real estate company in Mumbai.
Mathan Babu Kasilingam, CISO of National Payments Corporation of India, an umbrella organisation for all retail payment systems in India says, “Google Meet has played a good role in helping our teams stay connected. It’s great to see that it is possible to work across various remote locations and manage to carry on business as usual through video conferencing.” 
Here is what TR Chadha & Co, one of India’s prominent chartered accountancy firms, had to say. “G Suite has been a lifeline for the teams for the past month since we have transitioned to a work from home set-up due to the pandemic. With G Suite, our teams can securely log-in from any device to work wherever they are at any time,” said Gautam Kumar, IT Manager, TR Chadha.
Securely stay connected and productive not just today but also in the future
We’re carefully rolling out Meet incrementally over the coming weeks to ensure we can provide everyone with the reliability and security they expect from Google. This means you might not be able to create meetings right away, but you can sign up to be notified when it’s available.
Meetings are limited to 60 minutes for the free product, though we will not enforce this time limit until after September 30.  Creating a trusted meeting space is important, so being mindful when sharing meeting links in public forums can help create a safe experience for all attendees. For more tips on how to use Meet securely and effectively, visit our Help Center.

Posted by Karan Bajwa, Managing Director, Google Cloud India

A Summary of the Google Flood Forecasting Meets Machine Learning Workshop



Recently, we hosted the Google Flood Forecasting Meets Machine Learning workshop in our Tel Aviv office, which brought together hydrology and machine learning experts from Google and the broader research community to discuss existing efforts in this space, build a common vocabulary between these groups, and catalyze promising collaborations. In line with our belief that machine learning has the potential to significantly improve flood forecasting efforts and help the hundreds of millions of people affected by floods every year, this workshop discussed improving flood forecasting by aggregating and sharing large data sets, automating calibration and modeling processes, and applying modern statistical and machine learning tools to the problem.

Panel on challenges and opportunities in flood forecasting, featuring (from left to right): Prof. Paolo Burlando (ETH Zürich), Dr. Tyler Erickson (Google Earth Engine), Dr. Peter Salamon (Joint Research Centre) and Prof. Dawei Han (University of Bristol).
The event was kicked off by Google's Yossi Matias, who discussed recent machine learning work and its potential relevance for flood forecasting, crisis response and AI for Social Good. This was followed by two introductory sessions aimed at bridging some of the knowledge gap between the two fields: an introduction to hydrology for computer scientists by Prof. Peter Molnar of ETH Zürich, and an introduction to machine learning for hydrologists by Prof. Yishay Mansour of Tel Aviv University and Google.

Included in the 2-day event was a wide range of fascinating talks and posters across the flood forecasting landscape, from both hydrologic and machine learning points of view.

An overview of research areas in flood forecasting addressed in the workshop.
Presentations from the research community included:
Alongside these talks, we presented the various efforts across Google to try and improve flood forecasting and foster collaborations in the field, including:
Additionally, at this workshop we piloted an experimental "ML Consultation" panel, where Googlers Gal Elidan, Sasha Goldshtein and Doron Kukliansky gave advice on how to best use machine learning in several hydrology-related tasks. Finally, we concluded the workshop with a moderated panel on the greatest challenges and opportunities in flood forecasting, with hydrology experts Prof. Paolo Burlando of ETH Zürich, Prof. Dawei Han of the University of Bristol, Dr. Peter Salamon of the Joint Research Centre and Dr. Tyler Erickson of Google Earth Engine.
Flood forecasting is an incredibly important and challenging task that is one part of our larger AI for Social Good efforts. We believe that effective global-scale solutions can be achieved by combining modern techniques with the domain expertise already existing in the field. The workshop was a great first step towards creating much-needed understanding, communication and collaboration between the flood forecasting community and the machine learning community, and we look forward to our continued engagement with the broad research community to tackle this challenge.

Acknowledgements
We would like to thank Avinatan Hassidim, Carla Bromberg, Doron Kukliansky, Efrat Morin, Gal Elidan, Guy Shalev, Jennifer Ye, Nadav Rabani and Sasha Goldshtein for their contributions to making this workshop happen.

Source: Google AI Blog


Harnessing Organizational Knowledge for Machine Learning



One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled datasets used to train modern ML models. Creating these datasets involves the investment of significant time and expense, requiring annotators with the right expertise. Moreover, due to the evolution of real-world applications, labeled datasets often need to be thrown out or re-labeled.

In collaboration with Stanford and Brown University, we present "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale," which explores how existing knowledge in an organization can be used as noisier, higher-level supervision—or, as it is often termed, weak supervision—to quickly label large training datasets. In this study, we use an experimental internal system, Snorkel DryBell, which adapts the open-source Snorkel framework to use diverse organizational knowledge resources—like internal models, ontologies, legacy rules, knowledge graphs and more—in order to generate training data for machine learning models at web scale. We find that this approach can match the efficacy of hand-labeling tens of thousands of data points, and it reveals some core lessons about how training datasets for modern machine learning models can be created in practice.

Rather than labeling training data by hand, Snorkel DryBell enables writing labeling functions that label training data programmatically. In this work, we explored how these labeling functions can capture engineers' knowledge about how to use existing resources as heuristics for weak supervision. As an example, suppose our goal is to identify content related to celebrities. One can leverage an existing named-entity recognition (NER) model for this task by labeling any content that does not contain a person as not related to celebrities. This illustrates how existing knowledge resources (in this case, a trained model) can be combined with simple programmatic logic to label training data for a new model. Note also, importantly, that this labeling function returns None (i.e., abstains) in many cases, and thus only labels some small part of the data; our overall goal is to use these labels to train a modern machine learning model that can generalize to new data.

In our example of a labeling function, rather than hand-labeling a data point (1), one utilizes an existing knowledge resource—in this case, a NER model (2)—together with some simple logic expressed in code (3) to heuristically label data.
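Here is a minimal sketch of such a labeling function in Python. The ner_model object and the celebrity task are placeholders following the example above, not DryBell's actual code; the only real requirement is that the function either returns a label or abstains (here, by returning None).

```python
from typing import Optional

NOT_CELEBRITY = 0  # label constant for this hypothetical task
ABSTAIN = None     # the labeling function abstains by returning None

def lf_no_person_mentioned(content: str, ner_model) -> Optional[int]:
    """Labels content as NOT_CELEBRITY if an existing NER model finds no person in it.

    `ner_model` is a placeholder for any pre-existing named-entity recognizer
    that returns a list of (entity_text, entity_type) pairs.
    """
    entities = ner_model.extract_entities(content)
    if not any(entity_type == "PERSON" for _, entity_type in entities):
        return NOT_CELEBRITY
    return ABSTAIN  # a person is mentioned, so this heuristic says nothing
```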
This programmatic interface for labeling training data is much faster and more flexible than hand-labeling individual data points, but the resulting labels are obviously of much lower quality than manually-specified labels. The labels generated by these labeling functions will often overlap and disagree, as the labeling functions may not only have arbitrary unknown accuracies, but may also be correlated in arbitrary ways (for example, from sharing a common data source or heuristic).

To solve the problem of noisy and correlated labels, Snorkel DryBell uses a generative modeling technique to automatically estimate the accuracies and correlations of the labeling functions in a provably consistent way—without any ground truth training labels—then uses this to re-weight and combine their outputs into a single probabilistic label per data point. At a high level, we rely on the observed agreements and disagreements between the labeling functions (the covariance matrix), and learn the labeling function accuracy and correlation parameters that best explain this observed output using a new matrix completion-style approach. The resulting labels can then be used to train an arbitrary model (e.g. in TensorFlow), as shown in the system diagram below.

Using Diverse Knowledge Sources as Weak Supervision
To study the efficacy of Snorkel DryBell, we used three production tasks and corresponding datasets, aimed at classifying topics in web content, identifying mentions of certain products, and detecting certain real-time events. Using Snorkel DryBell, we were able to make use of various existing or quickly specified sources of information such as:
  • Heuristics and rules: e.g. existing human-authored rules about the target domain.
  • Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain.
  • Aggregate statistics: e.g. tracked metrics about the target domain.
  • Knowledge or entity graphs: e.g. databases of facts about the target domain.
In Snorkel DryBell, the goal is to train a machine learning model (C), for example to do content or event classification over web data. Rather than hand-labeling training data to do this, in Snorkel DryBell users write labeling functions that express various organizational knowledge resources (A), which are then automatically reweighted and combined (B).
We used these organizational knowledge resources to write labeling functions in a MapReduce template-based pipeline. Each labeling function takes in a data point and either abstains, or outputs a label. The result is a large set of programmatically-generated training labels. However, many of these labels were very noisy (e.g. from the heuristics), conflicted with each other, or were far too coarse-grained (e.g. the topic models) for our task, leading to the next stage of Snorkel DryBell, aimed at automatically cleaning and integrating the labels into a final training set.

Modeling the Accuracies to Combine & Repurpose Existing Sources
To handle these noisy labels, the next stage of Snorkel DryBell combines the outputs from the labeling functions into a single, confidence-weighted training label for each data point. The challenging technical aspect is that this must be done without any ground-truth labels. We use a generative modeling technique that learns the accuracy of each labeling function using only unlabeled data. This technique learns by observing the matrix of agreements and disagreements between the labeling functions' outputs, taking into account known (or statistically estimated) correlation structures between them. In Snorkel DryBell, we also implement a new faster, sampling-free version of this modeling approach, implemented in TensorFlow, in order to handle web-scale data.
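DryBell's sampling-free TensorFlow implementation is internal, but the open-source Snorkel library exposes the same idea through its LabelModel. The sketch below is a rough stand-in assuming the Snorkel 0.9-style API (labels are integers, with -1 meaning abstain); it is not the DryBell code.

```python
import numpy as np
from snorkel.labeling.model import LabelModel  # assumes the open-source snorkel package

# Toy label matrix: rows are data points, columns are labeling functions.
# Entries are class labels (0 or 1), with -1 meaning the function abstained.
L = np.array([
    [1, 1, -1],
    [0, -1, 0],
    [1, -1, 1],
    [-1, 0, 0],
])

# Learn labeling-function accuracies from agreements/disagreements alone
# (no ground-truth labels), then combine them into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)
probabilistic_labels = label_model.predict_proba(L)
print(probabilistic_labels)  # one probability distribution per data point
```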

By combining and modeling the output of the labeling functions using this procedure in Snorkel DryBell, we were able to generate high-quality training labels. In fact, on the two applications where hand-labeled training data was available for comparison, we achieved the same predictive accuracy training a model with Snorkel DryBell's labels as we did when training that same model with 12,000 and 80,000 hand-labeled training data points.

Transferring Non-Servable Knowledge to Servable Models
In many settings, there is also an important distinction between servable features, which can be used in production, and non-servable features, which are too slow or expensive to be used in production. These non-servable features may carry very rich signal, but a general question is how to use them to train, or otherwise help, servable models that can be deployed in production.


In many settings, users write labeling functions that leverage organizational knowledge resources that are not servable in production (a)—e.g. aggregate statistics, internal models, or knowledge graphs that are too slow or expensive to use in production—in order to train models that are only defined over production-servable features (b), e.g. cheap, real-time web signals.
In Snorkel DryBell, we found that users could write the labeling functions—i.e., express their organizational knowledge—over one feature set that was not servable, and then use the resulting training labels output by Snorkel DryBell to train a model defined over a different, servable feature set. This cross-feature transfer boosted our performance by an average of 52% on the benchmark datasets we created. More broadly, it represents a simple but powerful way to use resources that are too slow (e.g., expensive models or aggregate statistics), private (e.g., entity or knowledge graphs), or otherwise unsuitable for deployment to train servable models over cheap, real-time features. This approach can be viewed as a new type of transfer learning, where instead of transferring a model between different datasets, we're transferring domain knowledge between different feature sets, an approach that has potential use cases not just in industry, but in medical settings and beyond.

Next Steps
Moving forward, we're excited to see what other types of organizational knowledge can be used as weak supervision, and how the approach used by Snorkel DryBell can enable new modes of information reuse and sharing across organizations. For more details, check out our paper, and for further technical details, blog posts, and tutorials, check out the open-source Snorkel implementation at snorkel.stanford.edu.

Acknowledgments
This research was done in collaboration between Google, Stanford, and Brown. We would like to thank all the people who were involved, including Stephen Bach (Brown), Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Souvik Sen, Braden Hancock (Stanford), Houman Alborzi, Rahul Kuchhal, Christopher Ré (Stanford), Rob Malkin.

Source: Google AI Blog


Investing in France’s AI Ecosystem



Recently, we announced the launch of a new AI research team in our Paris office. And today DeepMind has also announced a new AI research presence in Paris. We are excited about expanding Google’s research presence in Europe, which bolsters the efforts of the existing groups in our Zürich and London offices. As strong supporters of academic research, we are also excited to foster collaborations with France’s vibrant academic ecosystem.

Our research teams in Paris will focus on fundamental AI research, as well as important applications of these ideas to areas such as Health, Science or Arts. They will publish and open-source their results to advance the state-of-the-art in core areas such as Deep Learning and Reinforcement Learning.

Our approach to research is based on building a strong connection with the academic community; contributing to training the next generation of scientists and establishing a bridge between academic and industrial research. We believe that both objectives are key to fostering a healthy research ecosystem that will flourish in the long term. These ideas are very much aligned with some of the recommendations that Fields Medalist and member of French Parliament Cédric Villani is putting forward in his report on AI to the French government.

As we expand our teams in France, we have several initiatives that illustrate our commitment to these goals:
  • We are sponsoring the “Artificial Intelligence and Visual Computing” Chair at École Polytechnique (one of the leading higher education institutions in France), which will support its education initiatives in AI
  • We just established a partnership with INRIA for conducting collaborative research projects
  • We are funding academic research with unrestricted grants mostly dedicated to the support of PhD and postdoc positions through our Faculty Research Awards and PhD Fellowship programs, as well as our Focused Research Awards. As one example, we have recently funded a project on large scale optimization of neural networks led by Francis Bach (INRIA and ENS) and Alexandre d’Aspremont (CNRS and ENS)
  • We are working on offering CIFRE PhD positions (joint PhD positions between Google and an academic lab) as well as internships for PhD students
Additionally, we are pleased to announce that one of the world’s leading experts in computer vision, Cordelia Schmid, will begin a dual appointment at INRIA and Google Paris. These kinds of appointments, together with our Visiting Faculty program, are a great way to share ideas and research challenges, and to utilize Google's world-class computing infrastructure to explore new projects at industrial scale.

France has a long tradition of research and educational excellence, and has a very dynamic and active machine learning community. This makes it a great place to pursue our goal of building AI-enabled technologies that can benefit everyone, through fundamental advances in machine learning and related fields.

Investing in France’s AI Ecosystem



Recently, we announced the launch of a new AI research team in our Paris office. And today DeepMind has also announced a new AI research presence in Paris. We are excited about expanding Google’s research presence in Europe, which bolsters the efforts of the existing groups in our Zürich and London offices. As strong supporters of academic research, we are also excited to foster collaborations with France’s vibrant academic ecosystem.

Our research teams in Paris will focus on fundamental AI research, as well as important applications of these ideas to areas such as Health, Science or Arts. They will publish and open-source their results to advance the state-of-the-art in core areas such as Deep Learning and Reinforcement Learning.

Our approach to research is based on building a strong connection with the academic community; contributing to training the next generation of scientists and establishing a bridge between academic and industrial research. We believe that both objectives are key to fostering a healthy research ecosystem that will flourish in the long term. These ideas are very much aligned with some of the recommendations that Fields Medalist and member of French Parliament Cédric Villani is putting forward in his report on AI to the French government.

As we expand our teams in France, we have several initiatives that illustrate our commitment to these goals:
  • We are sponsoring the “Artificial Intelligence and Visual Computing” Chair at École Polytechnique (one of the leading higher education institutions in France), which will support its education initiatives in AI
  • We have just established a partnership with INRIA to conduct collaborative research projects
  • We are funding academic research with unrestricted grants, mostly dedicated to supporting PhD and postdoc positions, through our Faculty Research Awards and PhD Fellowship programs, as well as our Focused Research Awards. As one example, we recently funded a project on large-scale optimization of neural networks led by Francis Bach (INRIA and ENS) and Alexandre d’Aspremont (CNRS and ENS)
  • We are working on offering CIFRE PhD positions (joint PhD positions between Google and an academic lab) as well as internships for PhD students
Additionally, we are pleased to announce that one of the world’s leading experts in computer vision, Cordelia Schmid, will begin a dual appointment at INRIA and Google Paris. These kinds of appointments, together with our Visiting Faculty program, are a great way to share ideas and research challenges, and to use Google's world-class computing infrastructure to explore new projects at industrial scale.

France has a long tradition of research and educational excellence, and has a very dynamic and active machine learning community. This makes it a great place to pursue our goal of building AI-enabled technologies that can benefit everyone, through fundamental advances in machine learning and related fields.

Source: Google AI Blog


Federated Learning: Collaborative Machine Learning without Centralized Training Data



Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we're introducing an additional approach: Federated Learning.

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.

It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.
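To make the round concrete, here is a minimal sketch of the client side, using a toy linear model in plain NumPy rather than the actual on-device implementation; the function and variable names are illustrative only.

```python
import numpy as np

def client_round(global_weights, local_x, local_y, lr=0.1):
    """One on-device round (toy linear model): start from the downloaded
    weights, learn from the data that stays on the phone, and return only
    a compact weight delta."""
    w = global_weights.copy()
    for x, y in zip(local_x, local_y):
        # Squared-error gradient step; stands in for the real on-device
        # model and loss.
        w -= lr * (w @ x - y) * x
    # Only this focused update leaves the device (over an encrypted
    # channel); the raw examples never do.
    return w - global_weights
```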
Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. And this approach has another immediate benefit: in addition to providing an update to the shared model, the improved model on your phone can also be used immediately, powering experiences personalized by the way you use your phone.

We're currently testing Federated Learning in Gboard on Android, the Google Keyboard. When Gboard shows a suggested query, your phone locally stores information about the current context and whether you clicked the suggestion. Federated Learning processes that history on-device to suggest improvements to the next iteration of Gboard’s query suggestion model.
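That on-device history can be pictured as a log of records like the following; the field names are hypothetical, purely to illustrate the kind of locally stored examples the training step consumes.

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    """Hypothetical on-device record: the context in which a query was
    suggested and whether the user tapped it. Records like this never
    leave the phone; they serve as training examples for the next model."""
    context_features: dict   # e.g., recent input, app, locale (illustrative)
    suggested_query: str
    clicked: bool            # the label the local training step learns from
```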
To make Federated Learning possible, we had to overcome many algorithmic and technical challenges. In a typical machine learning system, an optimization algorithm like Stochastic Gradient Descent (SGD) runs on a large dataset partitioned homogeneously across servers in the cloud. Such highly iterative algorithms require low-latency, high-throughput connections to the training data. But in the Federated Learning setting, the data is distributed across millions of devices in a highly uneven fashion. In addition, these devices have significantly higher-latency, lower-throughput connections and are only intermittently available for training.

These bandwidth and latency limitations motivate our Federated Averaging algorithm, which can train deep networks using 10-100x less communication compared to a naively federated version of SGD. The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps. Since it takes fewer iterations of high-quality updates to produce a good model, training can use much less communication. As upload speeds are typically much slower than download speeds, we also developed a novel way to reduce upload communication costs by up to another 100x by compressing updates using random rotations and quantization. While these approaches are focused on training deep networks, we've also designed algorithms for high-dimensional sparse convex models, which excel on problems like click-through-rate prediction.
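A rough sketch of the Federated Averaging idea follows (simplified; the real system adds the update compression described above, among many other details): each selected device runs several local epochs instead of a single gradient step, and the server combines the resulting models weighted by how much data each device holds. The class and function names are illustrative.

```python
import numpy as np

class Device:
    """Stand-in for a participating phone (illustrative only)."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def train(self, global_weights, lr=0.1, epochs=5):
        # Several local passes per round: more work on the device means far
        # fewer communication rounds than naively federated SGD.
        w = global_weights.copy()
        for _ in range(epochs):
            for xi, yi in zip(self.x, self.y):
                w -= lr * (w @ xi - yi) * xi   # toy squared-error model
        return w, len(self.x)

def federated_averaging_round(global_weights, devices):
    """Server side of one round: average the locally trained models,
    weighted by each device's number of examples."""
    total, n_total = np.zeros_like(global_weights), 0
    for d in devices:
        w_k, n_k = d.train(global_weights)
        total, n_total = total + n_k * w_k, n_total + n_k
    return total / n_total
```

Weighting by example count means a device with lots of data pulls the shared model more strongly than one with only a few examples, which is what makes the averaged model behave like one trained on the pooled data.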

Deploying this technology to millions of heterogeneous phones running Gboard requires a sophisticated technology stack. On-device training uses a miniature version of TensorFlow. Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance.
Your phone participates in Federated Learning only when it won't negatively impact your experience.
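That scheduling gate boils down to a conjunction of device conditions; a simplified sketch, with hypothetical field names, might look like this:

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    """Hypothetical snapshot of the conditions the scheduler checks."""
    idle: bool
    charging: bool
    on_free_wifi: bool

def eligible_for_training(state: DeviceState) -> bool:
    # Train only when it cannot degrade the user's experience.
    return state.idle and state.charging and state.on_free_wifi
```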
The system then needs to communicate and aggregate the model updates in a secure, efficient, scalable, and fault-tolerant way. It's only the combination of research with this infrastructure that makes the benefits of Federated Learning possible.

Federated Learning works without the need to store user data in the cloud, but we're not stopping there. We've developed a Secure Aggregation protocol that uses cryptographic techniques so a coordinating server can only decrypt the average update if hundreds or thousands of users have participated: no individual phone's update can be inspected before averaging. It's the first protocol of its kind that is practical for deep-network-sized problems and real-world connectivity constraints. We designed Federated Averaging so that the coordinating server only needs the average update, which allows Secure Aggregation to be used; however, the protocol is general and can be applied to other problems as well. We're working hard on a production implementation of this protocol and expect to deploy it for Federated Learning applications in the near future.
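The core trick behind such protocols can be illustrated with pairwise masks. This is a toy sketch, not the actual Secure Aggregation protocol (which additionally handles cryptographic key agreement, dropped-out devices, and stronger threat models): each pair of clients agrees on a random mask that one adds and the other subtracts, so any single masked update looks random, yet all masks cancel when the server sums over every client.

```python
import numpy as np

MOD = 2**31 - 1  # work over integers modulo a large value; updates assumed quantized

def pairwise_mask(seed, dim):
    # Both members of a pair derive the identical mask from a shared seed.
    return np.random.default_rng(seed).integers(0, MOD, size=dim)

def masked_update(update, my_id, all_ids, shared_seeds, dim):
    """Client side (toy version): add the mask shared with every higher-id
    peer and subtract the one shared with every lower-id peer. Individually
    the result looks random; summed over all clients, every mask cancels."""
    masked = update.copy()
    for peer in all_ids:
        if peer == my_id:
            continue
        mask = pairwise_mask(shared_seeds[frozenset((my_id, peer))], dim)
        masked = (masked + mask) % MOD if my_id < peer else (masked - mask) % MOD
    return masked

# Three clients with integer-quantized updates (illustrative values).
ids = [0, 1, 2]
seeds = {frozenset(p): hash(p) % MOD for p in [(0, 1), (0, 2), (1, 2)]}
updates = [np.array([5, 7]), np.array([1, 2]), np.array([3, 3])]
masked = [masked_update(u, i, ids, seeds, dim=2) for i, u in zip(ids, updates)]
# The server can recover only the sum of the updates, never an individual one.
assert np.array_equal(sum(masked) % MOD, sum(updates) % MOD)
```

In the real protocol the shared seeds come from a cryptographic key agreement rather than a lookup table, and secret sharing lets the server cancel the masks of devices that drop out mid-round.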

Our work has only scratched the surface of what is possible. Federated Learning can't solve all machine learning problems (for example, learning to recognize different dog breeds by training on carefully labeled examples), and for many other models the necessary training data is already stored in the cloud (like training spam filters for Gmail). So Google will continue to advance the state-of-the-art for cloud-based ML, but we are also committed to ongoing research to expand the range of problems we can solve with Federated Learning. Beyond Gboard query suggestions, for example, we hope to improve the language models that power your keyboard based on what you actually type on your phone (which can have a style all its own) and photo rankings based on what kinds of photos people look at, share, or delete.

Applying Federated Learning requires machine learning practitioners to adopt new tools and a new way of thinking: model development, training, and evaluation with no direct access to or labeling of raw data, with communication cost as a limiting factor. We believe the user benefits of Federated Learning make tackling the technical challenges worthwhile, and are publishing our work with hopes of a widespread conversation within the machine learning community.

Acknowledgements
This post reflects the work of many people in Google Research, including Blaise Agüera y Arcas, Galen Andrew, Dave Bacon, Keith Bonawitz, Chris Brumme, Arlie Davis, Jac de Haan, Hubert Eichner, Wolfgang Grieskamp, Wei Huang, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Nicholas Kong, Ben Kreuter, Alison Lentz, Stefano Mazzocchi, Sarvar Patel, Martin Pelikan, Aaron Segal, Karn Seth, Ananda Theertha Suresh, Iulia Turc, Felix Yu, and our partners in the Gboard team.