
Croissant: a metadata format for ML-ready datasets

Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.

ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.

There are general-purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test, and validation sets.

Today, we're introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn't change how the actual data is represented (e.g., image or text file formats) — it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML-relevant metadata, data resources, data organization, and default ML semantics.

In addition, we are announcing support from major tools and repositories: three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.


Croissant

This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.
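
For a sense of what working with the library looks like, here is a minimal sketch of loading a dataset from its Croissant description with the mlcroissant Python package; the metadata URL and the record set name are placeholders, and the exact API may differ between library versions.

import mlcroissant as mlc

# Placeholder URL pointing at a dataset's Croissant JSON-LD description.
dataset = mlc.Dataset(jsonld="https://example.org/my_dataset/croissant.json")

# Stream records from one of the dataset's record sets (the name is a placeholder).
for record in dataset.records(record_set="default"):
    print(record)
    break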

Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.


Why a shared format for ML data?

The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.

The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.

Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets while requiring only minimal effort, thanks to the available creation tools and support from ML data platforms.


What can Croissant do today?

The Croissant ecosystem: Users can search for Croissant datasets, download them from major repositories, and easily load them into their favorite ML frameworks. They can create, inspect, and modify Croissant metadata using the Croissant editor.

Today, users can find Croissant datasets at:

  • Kaggle, Hugging Face, and OpenML, which host Croissant datasets for the datasets they store
  • Dataset Search, which lets users find Croissant datasets across the Web

With a Croissant dataset, it is possible to:

  • Load it into popular ML frameworks, including TensorFlow, PyTorch, and JAX, with a few lines of code via the TensorFlow Datasets (TFDS) package

To publish a Croissant dataset, users can:

  • Use the Croissant editor UI (github) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill important metadata fields such as RAI properties.
  • Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
  • Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face, and OpenML, which will generate the Croissant metadata automatically.

Future direction

We are excited about Croissant's potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and embed Croissant metadata in dataset Web pages so that they can be made discoverable by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.

We encourage the community to join us in contributing to the effort.


Acknowledgements

Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, Kings College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.

Source: Google AI Blog


Can large language models identify and correct their mistakes?

LLMs are increasingly popular for reasoning tasks, such as multi-turn QA, task completion, code generation, or mathematics. Yet much like people, they do not always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.

This has led to a surge in methods related to self-correction, where an LLM is used to identify problems in its own output, and then produce improved results based on the feedback. Self-correction is generally thought of as a single process, but we decided to break it down into two components, mistake finding and output correction.

In “LLMs cannot find reasoning errors, but can correct them!”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake, an evaluation benchmark dataset for mistake identification, which we use to address the following questions:

  1. Can LLMs find logical mistakes in Chain-of-Thought (CoT) style reasoning?
  2. Can mistake-finding be used as a proxy for correctness?
  3. Knowing where the mistake is, can LLMs then be prompted to backtrack and arrive at the correct answer?
  4. Can mistake finding as a skill generalize to tasks the LLMs have never seen?

About our dataset

Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this domain. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are unambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.

To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake. This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench. Each trace is annotated with the location of the first logical mistake.

To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α). The labeling was done for all tasks except the Dyck Languages task, which involves predicting the sequence of closing parentheses for a given input sequence and which we labeled algorithmically instead.

The logical errors made in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before moving on to harder, more ambiguous tasks.


Core questions about mistake identification


1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?

First, we want to find out if LLMs can identify mistakes independently of their ability to correct them. We attempt multiple prompting methods to test GPT series models for their ability to locate mistakes (prompts here) under the assumption that they are generally representative of modern LLM performance.

Generally, we found these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.

In our experiments, we try three different prompting methods: direct (trace), direct (step), and CoT (step). In direct (trace), we provide the LLM with the whole trace and ask it to name the step containing the first mistake, or to state that there is no mistake. In direct (step), we prompt the LLM to make this judgment for each individual step. In CoT (step), we prompt the LLM to give its reasoning for whether each step is or is not a mistake.

A diagram showing the three prompting methods direct (trace), direct (step) and CoT (step).
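
To make the distinction concrete, the sketch below shows roughly how the three prompt styles could be phrased. These are illustrative templates only; the exact prompts used in the paper are linked above and differ in wording.

# Illustrative prompt shapes only; not the prompts used in the paper.
direct_trace_prompt = (
    "Below is a Chain-of-Thought trace.\n"
    "{trace}\n"
    "Which step contains the first logical mistake? "
    "Answer with the step number, or 'No mistake'."
)

direct_step_prompt = (
    "Here is the reasoning so far.\n"
    "{steps_so_far}\n"
    "Is the last step a mistake? Answer Yes or No."
)

cot_step_prompt = (
    "Here is the reasoning so far.\n"
    "{steps_so_far}\n"
    "Explain whether the last step follows logically from the previous steps, "
    "then conclude with 'Mistake' or 'Not a mistake'."
)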

Our finding is in line with prior results, and goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters without prior expertise solve the problem with a high degree of agreement). We hypothesize that this is a big reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.


2. Can mistake-finding be used as a proxy for correctness of the answer?

When we are confronted with a problem and are unsure of the answer, we can work through our solution step by step. If no error is found, we can assume that we did the right thing.

While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.

A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.
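
As a quick sanity check, the weighted average F1 of 78 for the always-incorrect baseline follows directly from the 85/15 split. The snippet below reproduces it, assuming scikit-learn is available; the labels and counts are illustrative stand-ins for the real traces.

from sklearn.metrics import f1_score

# 85% of traces have incorrect answers, 15% correct; label 1 means "answer is incorrect".
y_true = [1] * 85 + [0] * 15
y_pred = [1] * 100  # naive baseline: always predict "incorrect"

# Per-class F1 is ~0.92 for the incorrect class and 0 for the correct class,
# so the support-weighted average is ~0.78.
print(f1_score(y_true, y_pred, average="weighted"))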

3. Can LLMs backtrack knowing where the error is?

Since we’ve shown that LLMs exhibit poor performance in finding reasoning errors in CoT traces, we want to know whether LLMs can correct errors at all, once they are told where the error is.

Note that knowing the mistake location is different from knowing the right answer: CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.

We propose the following backtracking method (a minimal code sketch follows the list):

  1. Generate CoT traces as usual, at temperature = 0. (Temperature is a parameter that controls the randomness of generated responses, with higher values producing more diverse and creative outputs, usually at the expense of quality.)
  2. Identify the location of the first logical mistake (for example with a classifier, or here we just use labels from our dataset).
  3. Re-generate the mistake step at temperature = 1 and produce a set of eight outputs. Since the original output is known to lead to incorrect results, the goal is to find an alternative generation at this step that is significantly different from the original.
  4. From these eight outputs, select one that is different from the original mistake step. (We just use exact matching here, but in the future this can be something more sophisticated.)
  5. Using the new step, generate the rest of the trace as normal at temperature = 0.
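
A minimal sketch of this procedure in Python is shown below. The sample_step and complete_trace callables are hypothetical wrappers around an LLM sampling interface; they are not part of the paper's code, and the eight-sample and exact-match choices simply mirror the steps above.

def backtrack(question, trace, mistake_index, sample_step, complete_trace):
    """Re-generate a CoT trace from its first mistaken step (sketch).

    sample_step(question, prefix, temperature) returns one candidate next step;
    complete_trace(question, prefix, temperature) returns the remaining steps.
    Both are hypothetical LLM wrappers.
    """
    prefix = trace[:mistake_index]  # steps before the mistake, kept as-is
    original_step = trace[mistake_index]

    # Sample eight alternatives for the mistaken step at temperature 1 and keep
    # the first one that differs (exact string match) from the original step.
    candidates = [sample_step(question, prefix, temperature=1.0) for _ in range(8)]
    new_step = next((c for c in candidates if c != original_step), original_step)

    # Continue the rest of the trace deterministically at temperature 0.
    rest = complete_trace(question, prefix + [new_step], temperature=0.0)
    return prefix + [new_step] + rest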

It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.

Recent work showed that self-correction methods, like Reflexion and RCI, cause deterioration in accuracy scores because there are more correct answers becoming incorrect than vice versa. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).

We also compare our method with a random baseline, where we randomly assume a step to be a mistake. Our results show that this random baseline does produce some gains, but far fewer than backtracking with the correct mistake location, and with more losses.

A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.

4. Can mistake finding generalize to tasks the LLMs have never seen?

To answer this question, we fine-tuned a small model on four of the BIG-Bench tasks and tested it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. Then we compare the results with just zero-shot prompting PaLM 2-L-Unicorn, a much larger model.

Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.

Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.

This is a very promising result as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.

An illustration showing how our backtracking method works.

Conclusion

In this work, we created an evaluation benchmark dataset that the wider academic community can use to evaluate future LLMs. We further showed that LLMs currently struggle to find logical errors, but that when the mistake location is known, backtracking is an effective strategy that provides gains on these tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and used to improve out-of-domain mistake finding, which suggests that mistake-finding can generalize.


Acknowledgements

Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.


Source: Google AI Blog


Navigating AI Safety & Compliance: A guide for CTOs

Posted by Fergus Hurley – Co-Founder & GM, Checks, and Pedro Rodriguez – Head of Engineering, Checks

The rapid advances in generative artificial intelligence (GenAI) have brought about transformative opportunities across many industries. However, these advances have raised concerns about risks, such as privacy, misuse, bias, and unfairness. Responsible development and deployment are, therefore, a must.

AI applications are becoming more sophisticated, and developers are integrating them into critical systems. Therefore, the onus is on technology leaders, particularly CTOs and Heads of Engineering and AI – those responsible for leading the adoption of AI across their products and stacks – to ensure they use AI safely, ethically, and in compliance with relevant policies, regulations, and laws.

While comprehensive AI safety regulations are nascent, CTOs cannot wait for regulatory mandates before they act. Instead, they must adopt a forward-thinking approach to AI governance, incorporating safety and compliance considerations into the entire product development cycle.

This article is the first in a series to explore these challenges. To start, this article presents four key proposals for integrating AI safety and compliance practices into the product development lifecycle:


1.     Establish a robust AI governance framework

Formulate a comprehensive AI governance framework that clearly defines the organization’s principles, policies, and procedures for developing, deploying, and operating AI systems. This framework should establish clear roles, responsibilities, accountability mechanisms, and risk assessment protocols.

Examples of emerging frameworks include the US National Institute of Standards and Technology’s AI Risk Management Framework, the OSTP Blueprint for an AI Bill of Rights, the EU AI Act, as well as Google’s Secure AI Framework (SAIF).

As your organization adopts an AI governance framework, it is crucial to consider the implications of relying on third-party foundation models. These considerations include the data from your app that the foundation model uses and your obligations based on the foundation model provider's terms of service.


2.     Embed AI safety principles into the design phase

Incorporate AI safety principles, such as Google’s responsible AI principles, into the design process from the outset.

AI safety principles involve identifying and mitigating potential risks and challenges early in the development cycle, for example, mitigating bias in training data or model inferences and ensuring the explainability of model behavior. Use techniques such as adversarial training (red-team testing of LLMs with prompts that look for unsafe outputs) to help ensure that AI models operate in a fair, unbiased, and robust manner.


3.     Implement continuous monitoring and auditing

Track the performance and behavior of AI systems in real time with continuous monitoring and auditing. The goal is to identify and address potential safety issues or anomalies before they escalate into larger problems.

Look for key metrics like model accuracy, fairness, and explainability, and establish a baseline for your app and its monitoring. Beyond traditional metrics, look for unexpected changes in user behavior and AI model drift using a tool such as Vertex AI Model Monitoring. Do this using data logging, anomaly detection, and human-in-the-loop mechanisms to ensure ongoing oversight.


4.     Foster a culture of transparency and explainability

Drive AI decision-making through a culture of transparency and explainability. Encourage this culture by defining clear documentation guidelines, metrics, and roles so that all the team members developing AI systems participate in the design, training, deployment, and operations.

Also, provide clear and accessible explanations to cross-functional stakeholders about how AI systems operate, their limitations, and the available rationale behind their decisions. This information fosters trust among users, regulators, and stakeholders.


Final word

As AI's role in core and critical systems grows, proper governance is essential for its success and that of the systems and organizations using AI. The four proposals in this article should be a good start in that direction.

However, this is a broad and complex domain, which is what this series of articles is about. So, look out for deeper dives into the tools, techniques, and processes you need to safely integrate AI into your development and the apps you create.

7 dos and don’ts of using ML on the web with MediaPipe

Posted by Jen Person, Developer Relations Engineer

If you're a web developer looking to bring the power of machine learning (ML) to your web apps, then check out MediaPipe Solutions! With MediaPipe Solutions, you can deploy custom tasks to solve common ML problems in just a few lines of code. View the guides in the docs and try out the web demos on Codepen to see how simple it is to get started. While MediaPipe Solutions handles a lot of the complexity of ML on the web, there are still a few things to keep in mind that go beyond the usual JavaScript best practices. I've compiled them here in this list of seven dos and don'ts. Do read on to get some good tips!


❌ DON'T bundle your model in your app

As a web developer, you're accustomed to making your apps as lightweight as possible to ensure the best user experience. When you have larger items to load, you already know that you want to download them in a thoughtful way that allows the user to interact with the content quickly rather than having to wait for a long download. Strategies like quantization have made ML models smaller and accessible to edge devices, but they're still large enough that you don't want to bundle them in your web app. Store your models in the cloud storage solution of your choice. Then, when you initialize your task, the model and WebAssembly binary will be downloaded and initialized. After the first page load, use local storage or IndexedDB to cache the model and binary so future page loads run even faster. You can see an example of this in this touchless ATM sample app on GitHub.


✅ DO initialize your task early

Task initialization can take a bit of time depending on model size, connection speed, and device type. Therefore, it's a good idea to initialize the solution before user interaction. In the majority of the code samples on Codepen, initialization takes place on page load. Keep in mind that these samples are meant to be as simple as possible so you can understand the code and apply it to your own use case. Initializing your model on page load might not make sense for you. Just focus on finding the right place to spin up the task so that processing is hidden from the user.

After initialization, you should warm up the task by passing a placeholder image through the model. This example shows a function for running a 1x1 pixel canvas through the Pose Landmarker task:

function dummyDetection(poseLandmarker: PoseLandmarker) {
  const width = 1;
  const height = 1;
  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext('2d');
  ctx.fillStyle = 'rgba(0, 0, 0, 1)';
  ctx.fillRect(0, 0, width, height);
  poseLandmarker.detect(canvas);
}


✅ DO clean up resources

One of my favorite parts of JavaScript is automatic garbage collection. In fact, I can't remember the last time memory management crossed my mind. Hopefully you've cached a little information about memory in your own memory, as you'll need just a bit of it to make the most of your MediaPipe task. MediaPipe Solutions for web uses WebAssembly (WASM) to run C++ code in-browser. You don't need to know C++, but it helps to know that C++ makes you take out your own garbage. If you don't free up unused memory, you will find that your web page uses more and more memory over time, and it can develop performance issues or even crash.

When you're done with your solution, free up resources using the .close() method.

For example, I can create a gesture recognizer using the following code:

const createGestureRecognizer = async () => {
  const vision = await FilesetResolver.forVisionTasks(
    // Replace <version> with the @mediapipe/tasks-vision release you are using.
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@<version>/wasm"
  );
  gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
    baseOptions: {
      modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
      delegate: "GPU"
    },
  });
};
createGestureRecognizer();

Once I'm done recognizing gestures, I dispose of the gesture recognizer using the close() method:

gestureRecognizer.close();

Each task has a close method, so be sure to use it where relevant! Some tasks have close() methods for the returned results, so refer to the API docs for details.


✅ DO try out tasks in MediaPipe Studio

When deciding on or customizing your solution, it's a good idea to try it out in MediaPipe Studio before writing your own code. MediaPipe Studio is a web-based application for evaluating and customizing on-device ML models and pipelines for your applications. The app lets you quickly test MediaPipe solutions in your browser with your own data, and your own customized ML models. Each solution demo also lets you experiment with model settings for the total number of results, minimum confidence threshold for reporting results, and more. You'll find this especially useful when customizing solutions so you can see how your model performs without needing to create a test web page.

Screenshot of Image Classification page in MediaPipe Studio


✅ DO test on different devices

It's always important to test your web apps on various devices and browsers to ensure they work as expected, but I think it's worth adding a reminder here to test early and often on a variety of platforms. You can use MediaPipe Studio to test devices as well so you know right away that a solution will work on your users' devices.


❌ DON'T default to the biggest model

Each task lists one or more recommended models. For example, the Object Detection task lists three different models, each with benefits and drawbacks based on speed, size and accuracy. It can be tempting to think that the most important thing is to choose the model with the very highest accuracy, but if you do so, you will be sacrificing speed and increasing the size of your model. Depending on your use case, your users might benefit from a faster result rather than a more accurate one. The best way to compare model options is in MediaPipe Studio. I realize that this is starting to sound like an advertisement for MediaPipe Studio, but it really does come in handy here!

photo of a whale breaching against a background of clouds in a deep, vibrant blue sky

✅ DO reach out!

Do you have any dos or don'ts of ML on the web that you think I missed? Do you have questions about how to get started? Or do you have a cool project you want to share? Reach out to me on LinkedIn and tell me all about it!

WeatherBench 2: A benchmark for the next generation of data-driven weather models

In 1950, weather forecasting started its digital revolution when researchers used the first programmable, general-purpose computer ENIAC to solve mathematical equations describing how weather evolves. In the more than 70 years since, continuous advancements in computing power and improvements to the model formulations have led to steady gains in weather forecast skill: a 7-day forecast today is about as accurate as a 5-day forecast in 2000 and a 3-day forecast in 1980. While improving forecast accuracy at the pace of approximately one day per decade may not seem like a big deal, every day improved is important in far-reaching use cases, such as for logistics planning, disaster management, agriculture and energy production. This “quiet” revolution has been tremendously valuable to society, saving lives and providing economic value across many sectors.

Now we are seeing the start of yet another revolution in weather forecasting, this time fueled by advances in machine learning (ML). Rather than hard-coding approximations of the physical equations, the idea is to have algorithms learn how weather evolves from looking at large volumes of past weather data. Early attempts at doing so go back to 2018, but the pace picked up considerably in the last two years when several large ML models demonstrated weather forecasting skill comparable to the best physics-based models. Google’s MetNet [1, 2], for instance, demonstrated state-of-the-art capabilities for forecasting regional weather one day ahead. For global prediction, Google DeepMind created GraphCast, a graph neural network to make 10-day predictions at a horizontal resolution of 25 km, competitive with the best physics-based models in many skill metrics.

Apart from potentially providing more accurate forecasts, one key advantage of such ML methods is that, once trained, they can create forecasts in a matter of minutes on inexpensive hardware. In contrast, traditional weather forecasts require large supercomputers that run for hours every day. Clearly, ML represents a tremendous opportunity for the weather forecasting community. This has also been recognized by leading weather forecasting centers, as reflected in the European Centre for Medium-Range Weather Forecasts’ (ECMWF) machine learning roadmap and the National Oceanic and Atmospheric Administration’s (NOAA) artificial intelligence strategy.

To ensure that ML models are trusted and optimized for the right goal, forecast evaluation is crucial. Evaluating weather forecasts isn’t straightforward, however, because weather is an incredibly multi-faceted problem. Different end-users are interested in different properties of forecasts, for example, renewable energy producers care about wind speeds and solar radiation, while crisis response teams are concerned about the track of a potential cyclone or an impending heat wave. In other words, there is no single metric to determine what a “good” weather forecast is, and the evaluation has to reflect the multi-faceted nature of weather and its downstream applications. Furthermore, differences in the exact evaluation setup — e.g., which resolution and ground truth data is used — can make it difficult to compare models. Having a way to compare novel and established methods in a fair and reproducible manner is crucial to measure progress in the field.

To this end, we are announcing WeatherBench 2 (WB2), a benchmark for the next generation of data-driven, global weather models. WB2 is an update to the original benchmark published in 2020, which was based on initial, lower-resolution ML models. The goal of WB2 is to accelerate the progress of data-driven weather models by providing a trusted, reproducible framework for evaluating and comparing different methodologies. The official website contains scores from several state-of-the-art models (at the time of writing, these are Keisler (2022), an early graph neural network, Google DeepMind’s GraphCast and Huawei's Pangu-Weather, a transformer-based ML model). In addition, forecasts from ECMWF’s high-resolution and ensemble forecasting systems are included, which represent some of the best traditional weather forecasting models.


Making evaluation easier

The key component of WB2 is an open-source evaluation framework that allows users to evaluate their forecasts in the same manner as other baselines. Weather forecast data at high-resolutions can be quite large, making even evaluation a computational challenge. For this reason, we built our evaluation code on Apache Beam, which allows users to split computations into smaller chunks and evaluate them in a distributed fashion, for example using DataFlow on Google Cloud. The code comes with a quick-start guide to help people get up to speed.

Additionally, we provide most of the ground-truth and baseline data on Google Cloud Storage in cloud-optimized Zarr format at different resolutions, for example, a comprehensive copy of the ERA5 dataset used to train most ML models. This is part of a larger Google effort to provide analysis-ready, cloud-optimized weather and climate datasets to the research community and beyond. Since downloading these data from the respective archives and converting them can be time-consuming and compute-intensive, we hope that this should considerably lower the entry barrier for the community.
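
As an illustration, one might open one of these Zarr stores lazily with xarray as sketched below; the bucket path, variable name, and coordinate names are placeholders, so check the WB2 documentation for the actual dataset locations and schemas (reading gs:// paths also requires the gcsfs package).

import xarray as xr

# Placeholder path; the real store locations are listed in the WB2 documentation.
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5/<resolution>.zarr")

# Lazily select one variable and a short time range without downloading the rest.
# Variable and coordinate names are assumptions; inspect the store to confirm them.
t850 = era5["temperature"].sel(level=850, time=slice("2020-01-01", "2020-01-10"))
print(t850)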


Assessing forecast skill

Together with our collaborators from ECMWF, we defined a set of headline scores that best capture the quality of global weather forecasts. As the figure below shows, several of the ML-based forecasts have lower errors than the state-of-the-art physical models on deterministic metrics. This holds for a range of variables and regions, and underlines the competitiveness and promise of ML-based approaches.

This scorecard shows the skill of different models compared to ECMWF’s Integrated Forecasting System (IFS), one of the best physics-based weather forecasts, for several variables. IFS forecasts are evaluated against IFS analysis. All other models are evaluated against ERA5. The order of ML models reflects publication date.
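
As one concrete example of a deterministic headline metric, a latitude-weighted RMSE can be computed as sketched below; dimension and coordinate names are assumptions, and WB2's own evaluation code should be preferred for actual comparisons.

import numpy as np
import xarray as xr

def latitude_weighted_rmse(forecast: xr.DataArray, truth: xr.DataArray) -> xr.DataArray:
    """Area-weighted RMSE over a regular latitude/longitude grid (sketch)."""
    # Weight each latitude row by cos(latitude) so that equal areas of the
    # globe contribute equally to the average.
    weights = np.cos(np.deg2rad(forecast["latitude"]))
    weights = weights / weights.mean()
    squared_error = (forecast - truth) ** 2
    return np.sqrt(squared_error.weighted(weights).mean(dim=["latitude", "longitude"]))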

Toward reliable probabilistic forecasts

However, a single forecast often isn’t enough. Weather is inherently chaotic because of the butterfly effect. For this reason, operational weather centers now run ~50 slightly perturbed realizations of their model, called an ensemble, to estimate the forecast probability distribution across various scenarios. This is important, for example, if one wants to know the likelihood of extreme weather.

Creating reliable probabilistic forecasts will be one of the next key challenges for global ML models. Regional ML models, such as Google’s MetNet, already estimate probabilities. To anticipate this next generation of global models, WB2 already provides probabilistic metrics and baselines, among them ECMWF’s IFS ensemble, to accelerate research in this direction.

As mentioned above, weather forecasting has many aspects, and while the headline metrics try to capture the most important aspects of forecast skill, they are by no means sufficient. One example is forecast realism. Currently, many ML forecast models tend to “hedge their bets” in the face of the intrinsic uncertainty of the atmosphere. In other words, they tend to predict smoothed out fields that give lower average error but do not represent a realistic, physically consistent state of the atmosphere. An example of this can be seen in the animation below. The two data-driven models, Pangu-Weather and GraphCast (bottom), predict the large-scale evolution of the atmosphere remarkably well. However, they also have less small-scale structure compared to the ground truth or the physical forecasting model IFS HRES (top). In WB2 we include a range of these case studies and also a spectral metric that quantifies such blurring.

Forecasts of a front passing through the continental United States initialized on January 3, 2020. Maps show temperature at a pressure level of 850 hPa (roughly equivalent to an altitude of 1.5km) and geopotential at a pressure level of 500 hPa (roughly 5.5 km) in contours. ERA5 is the corresponding ground-truth analysis, IFS HRES is ECMWF’s physics-based forecasting model.

Conclusion

WeatherBench 2 will continue to evolve alongside ML model development. The official website will be updated with the latest state-of-the-art models. (To submit a model, please follow these instructions). We also invite the community to provide feedback and suggestions for improvements through issues and pull requests on the WB2 GitHub page.

Designing evaluation well and targeting the right metrics is crucial in order to make sure ML weather models benefit society as quickly as possible. WeatherBench 2 as it is now is just the starting point. We plan to extend it in the future to address key issues for the future of ML-based weather forecasting. Specifically, we would like to add station observations and better precipitation datasets. Furthermore, we will explore the inclusion of nowcasting and subseasonal-to-seasonal predictions to the benchmark.

We hope that WeatherBench 2 can aid researchers and end-users as weather forecasting continues to evolve.


Acknowledgements

WeatherBench 2 is the result of collaboration across many different teams at Google and external collaborators at ECMWF. From ECMWF, we would like to thank Matthew Chantry, Zied Ben Bouallegue and Peter Dueben. From Google, we would like to thank the core contributors to the project: Stephan Rasp, Stephan Hoyer, Peter Battaglia, Alex Merose, Ian Langmore, Tyler Russell, Alvaro Sanchez, Antonio Lobato, Laurence Chiu, Rob Carver, Vivian Yang, Shreya Agrawal, Thomas Turnbull, Jason Hickey, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. We also would like to thank Kunal Shah, Rahul Mahrsee, Aniket Rawat, and Satish Kumar. Thanks to John Anderson for sponsoring WeatherBench 2. Furthermore, we would like to thank Kaifeng Bi from the Pangu-Weather team and Ryan Keisler for their help in adding their models to WeatherBench 2.

Source: Google AI Blog


Accelerate AI development for Digital Pathology using EZ WSI DICOMWeb Python library

Overview

Digital pathology is changing the way pathology is practiced by making it easier to share images, collaborate with colleagues, and develop new AI algorithms that can improve the quality and reduce the cost of medical care. One of the biggest challenges of digital pathology is storing and managing the large volume of data generated. The Google Cloud Healthcare API provides a solution for this with a managed DICOM store, which is a secure, scalable, and performant way to store digital pathology images in a manner that is both standardized and interoperable.

However, performing image retrieval of specific patches (i.e. regions of interest) of a whole slide image (WSI) from the managed DICOM store using DICOMweb can be complex and requires DICOM format expertise. To address this, we are open sourcing EZ WSI (Whole Slide Image) DICOMWeb, a Python library that makes fetching these patches both efficient and easy-to-use.

How EZ WSI DICOMWeb works

EZ WSI DICOMweb facilitates the retrieval of arbitrary and sequential patches of a DICOM WSI from a DICOMWeb compliant Google Cloud Healthcare API DICOM store. Unlike downloading the entire DICOM series WSI and extracting patches locally from that file, which can increase network traffic, latency and storage space usage, EZ WSI DICOMweb retrieves only the necessary tiles for the desired patch directly through the DICOMweb APIs. This is simpler to use and abstracts away the following:

  • The need to fetch many tiles, which requires an understanding of DICOM data structure (e.g. offset & data hierarchy).
  • The need for a detailed understanding of the DICOMWeb APIs, REST payloads, and authentication, as well as addressing the possibility of redundant requests if several patches are fetched and there are overlapping tiles.
  • The need to decode images on the server if client side decoding is not supported, which increases the time it takes to transfer data and the size of the data being transferred.

EZ WSI DICOMWeb allows researchers and developers to focus on their ML tasks rather than the intricacies of DICOM. Developers do not need an in-depth understanding of DICOM data structuring or the DICOM API. The library provides simple and intuitive functionality that allows developers to efficiently fetch DICOM images using only the Google Cloud Platform (GCP) Resource Name and DICOM Series path, without any pixel recompression.

Case Study: Generating Patches for AI Workflows

A typical pathology WSI could be on the order of 40,000 pixels in length or width. However, an AI model that is trained to assess that WSI may only analyze a patch that is 512 x 512 pixels at a time. The way the model can operate over the entire WSI is by using a sliding window approach. We demonstrate how that can be done using EZ WSI DICOMWeb.

First, we create a DicomSlide object using the DICOMweb client and interface. This can be done with just a few lines of code.

dicom_web_client = dicom_web.DicomWebClientImpl()
dwi = dicom_web_interface.DicomWebInterface(dicom_web_client)
ds = dicom_slide.DicomSlide(
    dwi=dwi,
    path=gcp_resource_name + dicom_series_path,
    enable_client_slide_frame_decompression=True
)
ds.get_image(desired_magnification)  # e.g. '0.625X'

This DicomSlide represents the entire WSI, as illustrated below.

Image of a WSI at a magnification of 0.625X, rendered by matplotlib

The above image leverages EZ WSI’s DicomSlide module to fetch an entire WSI at the requested magnification of 0.625X and uses matplotlib to render it, see the sample code for more details.

By providing coordinates, DicomSlide’s get_patch() method allows us to manually extract just the two sections of tissue, at a supported magnification, as pictured below.

tissue_patch = ds.get_patch(
    desired_magnification,
    x=x_origin,
    y=y_origin,
    width=patch_width,
    height=patch_height
)
Left tissue sample and right tissue sample at 0.625X magnification, rendered by matplotlib

We can effectively zoom in on patches programmatically by reducing the window size and increasing the magnification, using the same get_patch() method from above.

image of three panels showing the same interesting patch at 0.625X, 2.5X, and 40X magnification, rendered by matplotlib

Our ultimate goal is to generate a set of patches that can be used in a downstream AI application from this WSI.

image showing patch generation at 10X with 0.625X mask, rendered by matplotlib

To do this, we call PatchGenerator. It works by sliding a window of a specified size with a specified stride size across the image, heuristically ignoring tissue-less regions at a specified magnification level.

patch_gen = patch_generator.PatchGenerator(
    slide=ds,
    stride_size=stride_size,  # the number of pixels between patches
    patch_size=patch_size,  # the length and width of the patch in pixels
    magnification=patch_magnification,  # magnification to generate patches at
    max_luminance=0.8,  # defaults to 0.8; heuristic to evaluate where tissue is
    tissue_mask_magnification=mask_magnification,
)

The result is a list of patches that can be used as input into a machine learning algorithm.

image showing patch generation at 40X with 0.625X mask, rendered by matplotlib

Conclusion

We have built this library to make it easy to directly interact with DICOM WSIs that are stored in Google's DICOMWeb compliant Healthcare API DICOM store and extract image patches for AI workflows. Our hope is that by making this available, we can help accelerate the development of cutting edge AI for digital pathology in Google Cloud and beyond.

Links: Github, GCP-DICOMWeb

By Google HealthAI and Google Cloud Healthcare teams

Using reinforcement learning for dynamic planning in open-ended conversations

As virtual assistants become ubiquitous, users increasingly interact with them to learn about new topics or obtain recommendations and expect them to deliver capabilities beyond narrow dialogues of one or two turns. Dynamic planning, namely the capability to look ahead and replan based on the flow of the conversation, is an essential ingredient for the making of engaging conversations with the deeper, open-ended interactions that users expect.

While large language models (LLMs) are now beating state-of-the-art approaches in many natural language processing benchmarks, they are typically trained to output the next best response, rather than planning ahead, which is required for multi-turn interactions. However, in the past few years, reinforcement learning (RL) has delivered incredible results addressing specific problems that involve dynamic planning, such as winning games and protein folding.

Today, we are sharing our recent advances in dynamic planning for human-to-assistant conversations, in which we enable an assistant to plan a multi-turn conversation towards a goal and adapt that plan in real-time by adopting an RL-based approach. Here we look at how to improve long interactions by applying RL to compose answers based on information extracted from reputable sources, rather than relying on content generated by a language model. We expect that future versions of this work could combine LLMs and RL in multi-turn dialogues. The deployment of RL “in the wild” in a large-scale dialogue system proved a formidable challenge due to the modeling complexity, tremendously large state and action spaces, and significant subtlety in designing reward functions.


What is dynamic planning?

Many types of conversations, from gathering information to offering recommendations, require a flexible approach and the ability to modify the original plan for the conversation based on its flow. This ability to shift gears in the middle of a conversation is known as dynamic planning, as opposed to static planning, which refers to a more fixed approach. In the conversation below, for example, the goal is to engage the user by sharing interesting facts about cool animals. To begin, the assistant steers the conversation to sharks via a sound quiz. Given the user's lack of interest in sharks, the assistant then develops an updated plan and pivots the conversation to sea lions, lions, and then cheetahs.

The assistant dynamically modifies its original plan to talk about sharks and shares facts about other animals.

Dynamic composition

To cope with the challenge of conversational exploration, we separate the generation of assistant responses into two parts: 1) content generation, which extracts relevant information from reputable sources, and 2) flexible composition of such content into assistant responses. We refer to this two-part approach as dynamic composition. Unlike LLM methods, this approach gives the assistant the ability to fully control the source, correctness, and quality of the content that it may offer. At the same time, it can achieve flexibility via a learned dialogue manager that selects and combines the most appropriate content.

In an earlier paper, “Dynamic Composition for Conversational Domain Exploration”, we describe a novel approach which consists of: (1) a collection of content providers, which offer candidates from different sources, such as news snippets, knowledge graph facts, and questions; (2) a dialogue manager; and (3) a sentence fusion module. Each assistant response is incrementally constructed by the dialogue manager, which selects candidates proposed by the content providers. The selected sequence of utterances is then fused into a cohesive response.


Dynamic planning using RL

At the core of the assistant response composition loop is a dialogue manager trained using off-policy RL, namely an algorithm that evaluates and improves a policy that is different from the policy used by the agent (in our case, the latter is based on a supervised model). Applying RL to dialogue management presents several challenges, including a large state space (as the state represents the conversation state, which needs to account for the whole conversation history) and an effectively unbounded action space (that may include all existing words or sentences in natural language).

We address these challenges using a novel RL construction. First, we leverage powerful supervised models — specifically, recurrent neural networks (RNNs) and transformers — to provide a succinct and effective dialogue state representation. These state encoders are fed with the dialogue history, composed of a sequence of user and assistant turns, and output a representation of the dialogue state in the form of a latent vector.

Second, we use the fact that a relatively small set of reasonable candidate utterances or actions can be generated by content providers at each conversation turn, and limit the action space to these. Whereas the action space is typically fixed in RL settings, because all states share the same action space, ours is a non-standard space in which the candidate actions may differ with each state, since content providers generate different actions depending on the dialogue context. This puts us in the realm of stochastic action sets, a framework that formalizes cases where the set of actions available in each state is governed by an exogenous stochastic process, which we address using Stochastic Action Q-Learning, a variant of the Q-learning approach. Q-learning is a popular off-policy RL algorithm, which does not require a model of the environment to evaluate and improve the policy. We trained our model on a corpus of crowd-compute–rated conversations obtained using a supervised dialogue manager.

Given the current dialogue history and a new user query, content providers generate candidates from which the assistant selects one. This process runs in a loop, and at the end the selected utterances are fused into a cohesive response.
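
To make the setup more concrete, below is a minimal sketch of a Q-function that scores a per-state set of candidate utterances, which is the core of the stochastic action set formulation; the encoder, candidate featurization, and network sizes are assumptions that stand in for the supervised models described above.

import torch
import torch.nn as nn

class CandidateQNetwork(nn.Module):
    """Scores each candidate utterance given the encoded dialogue state (sketch)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # state: [state_dim]; candidates: [num_candidates, action_dim].
        expanded = state.expand(candidates.shape[0], -1)
        return self.net(torch.cat([expanded, candidates], dim=-1)).squeeze(-1)

def q_target(q_net, reward, next_state, next_candidates, gamma=0.9):
    # The max is taken only over the candidates actually available in the next
    # state; this per-state candidate set is what makes the action set stochastic.
    with torch.no_grad():
        return reward + gamma * q_net(next_state, next_candidates).max()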

Reinforcement learning model evaluation

We compared our RL dialogue manager with a launched supervised transformer model in an experiment using Google Assistant, which conversed with users about animals. A conversation starts when a user triggers the experience by asking an animal-related query (e.g., “How does a lion sound?”). The experiment was conducted using an A/B testing protocol, in which a small percentage of Assistant users were randomly sampled to interact with our RL-based assistant while other users interacted with the standard assistant.

We found that the RL dialogue manager conducts longer, more engaging conversations. It increases conversation length by 30% while improving user engagement metrics. We see an increase of 8% in cooperative responses to the assistant’s questions — e.g., “Tell me about lions,” in response to “Which animal do you want to hear about next?” Although there is also a large increase in nominally “non-cooperative” responses (e.g., “No,” as a reply to a question proposing additional content, such as “Do you want to hear more?”), this is expected as the RL agent takes more risks by asking pivoting questions. While a user may not be interested in the conversational direction proposed by the assistant (e.g., pivoting to another animal), the user will often continue to engage in a dialogue about animals.

From the non-cooperative user response in the 3rd turn (“No.”) and the query “Make a dog sound,” in the 5th turn, the assistant recognizes that the user is mostly interested in animal sounds and modifies its plan, providing sounds and sound quizzes.

In addition, some user queries contain explicit positive (e.g., “Thank you, Google,” or “I’m happy.”) or negative (e.g., “Shut up,” or “Stop.”) feedback. While an order of magnitude fewer than other queries, they offer a direct measure of user (dis)satisfaction. The RL model increases explicit positive feedback by 32% and reduces negative feedback by 18%.


Learned dynamic planning characteristics and strategies

We observe several characteristics of the (unseen) RL plan to improve user engagement while conducting longer conversations. First, the RL-based assistant ends 20% more turns in questions, prompting the user to choose additional content. It also better harnesses content diversity, including facts, sounds, quizzes, yes/no questions, open questions, etc. On average, the RL assistant uses 26% more distinct content providers per conversation than the supervised model.

Two observed RL planning strategies are related to the existence of sub-dialogues with different characteristics. Sub-dialogues about animal sounds are poorer in content and exhibit entity pivoting at every turn (i.e., after playing the sound of a given animal, we can either suggest the sound of a different animal or quiz the user about other animal sounds). In contrast, sub-dialogues involving animal facts typically contain richer content and have greater conversation depth. We observe that RL favors the richer experience of the latter, selecting 31% more fact-related content. Lastly, when restricting analysis to fact-related dialogues, the RL assistant exhibits 60% more focus-pivoting turns, that is, conversational turns that change the focus of the dialogue.

Below, we show two example conversations, one conducted by the supervised model (left) and the second by the RL model (right), in which the first three user turns are identical. With a supervised dialogue manager, after the user declined to hear about “today’s animal”, the assistant pivots back to animal sounds to maximize the immediate user satisfaction. While the conversation conducted by the RL model begins identically, it exhibits a different planning strategy to optimize the overall user engagement, introducing more diverse content, such as fun facts.

In the left conversation, conducted by the supervised model, the assistant maximizes the immediate user satisfaction. The right conversation, conducted by the RL model, shows different planning strategies to optimize the overall user engagement.

Future research and challenges

In the past few years, LLMs trained for language understanding and generation have demonstrated impressive results across multiple tasks, including dialogue. We are now exploring the use of an RL framework to empower LLMs with the capability of dynamic planning so that they can dynamically plan ahead and delight users with a more engaging experience.


Acknowledgements

The work described is co-authored by: Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor and Gal Elidan. We would like to thank: Roee Aharoni, Moran Ambar, John Anderson, Ido Cohn, Mohammad Ghavamzadeh, Lotem Golany, Ziv Hodak, Adva Levin, Fernando Pereira, Shimi Salant, Shachar Shimoni, Ronit Slyper, Ariel Stolovich, Hagai Taitelbaum, Noam Velan, Avital Zipori and the CrowdCompute team led by Ashwin Kakarla. We thank Sophie Allweis for her feedback on this blogpost and Tom Small for the visualization.

Source: Google AI Blog


Get ready for Google I/O

Posted by Timothy Jordan, Director, Developer Relations & Open Source

I/O is just a few days away and we couldn’t be more excited to share the latest updates across Google’s developer products, solutions, and technologies. From keynotes to technical sessions and hands-on workshops, these announcements aim to help you build smarter and ship faster.

Here are some helpful tips to maximize your experience online.


Start building your personal I/O agenda

Starting now, you can save the Google and developer keynotes to your calendar and explore the program to preview content. Here are just a few noteworthy examples of what you’ll find this year:

What's new in Android
Get the latest news in Android development: Android 14, form factors, Jetpack + Compose libraries, Android Studio, and performance.
What’s new in Web
Explore new features and APIs that became stable across browsers on the Web Platform this year.
What’s new in Generative AI
Discover a new suite of tools that make it easy for developers to leverage and build on top of Google's large language models.
What’s new in Google Cloud
Learn how Google Cloud and generative AI will help you develop faster and more efficiently.

For the best experience, create or connect a developer profile and start saving content to My I/O to build your personal agenda. With over 200 sessions and other learning material, there’s a lot to cover, so we hope this will help you get organized.

This year we’ve introduced development focus filters to help you navigate content faster across mobile, web, AI, and cloud technologies. You can also peruse content by topic, type, or experience level so you can find what you’re interested in, faster.


Connect with the community

After the keynotes, you can talk to Google experts and other developers online in I/O Adventure chat. Here you can ask questions about new releases and learn best practices from the global developer community.

If you’re craving community now, visit the Community page to meet people with similar interests in your area or find a watch party to attend.

We hope these updates are useful, and we can’t wait to connect online in May!

Natural Language Assessment: A New Framework to Promote Education

Whether it's a professional honing their skills or a child learning to read, coaches and educators play a key role in assessing the learner's answer to a question in a given context and guiding them towards a goal. These interactions have unique characteristics that set them apart from other forms of dialogue, yet they are not available when learners practice alone at home. In the field of natural language processing, this type of capability has not received much attention and is technologically challenging. We set out to explore how we can use machine learning to assess answers in a way that facilitates learning.

In this blog, we introduce an important natural language understanding (NLU) capability called Natural Language Assessment (NLA), and discuss how it can be helpful in the context of education. While typical NLU tasks focus on the user's intent, NLA allows for the assessment of an answer from multiple perspectives. In situations where a user wants to know how good their answer is, NLA can offer an analysis of how close the answer is to what is expected. In situations where there may not be a “correct” answer, NLA can offer subtle insights that include topicality, relevance, verbosity, and beyond. We formulate the scope of NLA, present a practical model for carrying out topicality NLA, and showcase how NLA has been used to help job seekers practice answering interview questions with Google's new interview prep tool, Interview Warmup.


Overview of Natural Language Assessment (NLA)

The goal of NLA is to evaluate the user's answer against a set of expectations. Consider the following components for an NLA system interacting with students:

  • A question presented to the student
  • Expectations that define what we expect to find in the answer (e.g., a concrete textual answer, a set of topics we expect the answer to cover, conciseness)
  • An answer provided by the student
  • An assessment output (e.g., correctness, missing information, too specific or general, stylistic feedback, pronunciation, etc.)
  • [Optional] A context (e.g., a chapter in a book or an article)
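
To make these components concrete, here is a minimal sketch of how an NLA request and its assessment output might be represented as plain data structures. The class and field names are hypothetical illustrations for this blog post, not an actual API.

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical containers mirroring the NLA components listed above.
@dataclass
class NLAInput:
    question: str                                   # the question presented to the student
    answer: str                                     # the answer provided by the student
    expected_answer: Optional[str] = None           # a concrete textual expectation, if any
    expected_topics: list[str] = field(default_factory=list)  # topics the answer should cover
    context: Optional[str] = None                   # e.g., a chapter in a book or an article

@dataclass
class NLAAssessment:
    covered_topics: list[str]                       # expected topics detected in the answer
    missing_topics: list[str]                       # expected topics not detected
    notes: list[str]                                # e.g., "answer is too general", "student sounds uncertain"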

With NLA, both the expectations about the answer and the assessment of the answer can be very broad. This enables teacher-student interactions that are more expressive and subtle. Here are two examples:

  1. A question with a concrete correct answer: Even in situations where there is a clear correct answer, it can be helpful to assess the answer more subtly than simply correct or incorrect. Consider the following:

    Context: Harry Potter and the Philosopher's Stone
    Question: “What is Hogwarts?”
    Expectation: “Hogwarts is a school of Witchcraft and Wizardry” [expectation is given as text]
    Answer: “I am not exactly sure, but I think it is a school.”

    The answer may be missing salient details but labeling it as incorrect wouldn’t be entirely true or useful to a user. NLA can offer a more subtle understanding by, for example, identifying that the student’s answer is too general, and also that the student is uncertain.

    Illustration of the NLA process from input question, answer and expectation to assessment output

    This kind of subtle assessment, along with noting the uncertainty the student expressed, can be important in helping students build skills in conversational settings.

  2. Topicality expectations: There are many situations in which a concrete answer is not expected. For example, if a student is asked an opinion question, there is no concrete textual expectation. Instead, there's an expectation of relevance and opinionation, and perhaps some level of succinctness and fluency. Consider the following interview practice setup:

    Question: “Tell me a little about yourself?”
    Expectations: { “Education”, “Experience”, “Interests” } (a set of topics)
    Answer: “Let’s see. I grew up in the Salinas valley in California and went to Stanford where I majored in economics but then got excited about technology so next I ….”

    In this case, a useful assessment output would map the user’s answer to a subset of the topics covered, possibly along with a markup of which parts of the text relate to which topic. This can be challenging from an NLP perspective as answers can be long, topics can be mixed, and each topic on its own can be multi-faceted.


A Topicality NLA Model

In principle, topicality NLA is a standard multi-class task for which one can readily train a classifier using standard techniques. However, training data for such scenarios is scarce, and it would be costly and time-consuming to collect it for each question and topic. Our solution is to break each topic into granular components that can be identified using large language models (LLMs) with straightforward generic tuning.

We map each topic to a list of underlying questions and define that if a sentence contains an answer to one of those underlying questions, then it covers that topic. For the topic “Experience” we might choose underlying questions such as:

  • Where did you work?
  • What did you study?

While for the topic “Interests” we might choose underlying questions such as:

  • What are you interested in?
  • What do you enjoy doing?

These underlying questions are designed through an iterative manual process. Importantly, since these questions are sufficiently granular, current language models (see details below) can capture their semantics. This allows us to offer a zero-shot setting for the NLA topicality task: once trained (more on the model below), it is easy to add new questions and new topics, or adapt existing topics by modifying their underlying content expectation, without the need to collect topic-specific data. See below the model’s predictions for the sentence “I’ve worked in retail for 3 years” for the two topics described above:

A diagram of how the model uses underlying questions to predict the topic most likely to be covered by the user’s answer.

Since an underlying question for the topic “Experience” was matched, the sentence would be classified as “Experience”.
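
A minimal sketch of how this zero-shot matching could be wired up is shown below, assuming a scoring function score_fn(question, answer) backed by a trained compatibility model (described in the next section). The function name, mapping, and threshold are illustrative assumptions, not the actual implementation.

# Hypothetical topic-to-underlying-question mapping from the examples above.
UNDERLYING_QUESTIONS = {
    "Experience": ["Where did you work?", "What did you study?"],
    "Interests": ["What are you interested in?", "What do you enjoy doing?"],
}

def detect_topics(sentence: str, score_fn, threshold: float = 0.5) -> list[str]:
    """Return every topic with at least one underlying question the sentence answers.

    score_fn(question, answer) is assumed to return a compatibility score in [0, 1].
    """
    detected = []
    for topic, questions in UNDERLYING_QUESTIONS.items():
        if any(score_fn(q, sentence) >= threshold for q in questions):
            detected.append(topic)
    return detected

# Example: for "I've worked in retail for 3 years", a reasonable scorer should match
# "Where did you work?", so the sentence would be classified under "Experience".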


Application: Helping Job Seekers Prepare for Interviews

Interview Warmup is a new tool developed in collaboration with job seekers to help them prepare for interviews in fast-growing fields of employment such as IT Support and UX Design. It allows job seekers to practice answering questions selected by industry experts and to become more confident and comfortable with interviewing. Working with job seekers to understand their challenges in preparing for interviews, and how an interview practice tool could be most useful, inspired our research and the application of topicality NLA.

We build the topicality NLA model (once for all questions and topics) as follows: we train an encoder-only T5 model (EncT5 architecture) with 350 million parameters on question-answer data to predict the compatibility of an <underlying question, answer> pair. We rely on data from SQuAD 2.0, which we processed to produce <question, answer, label> triplets.
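
As an illustration of the kind of preprocessing involved, here is a rough sketch of turning SQuAD 2.0 records into <question, answer, label> compatibility triplets. The exact pairing and negative-sampling strategy used for the real model is not described in this post, so treat this as an assumption; the simplified input format (dicts with "question" and "answers" keys) is also hypothetical.

import random

def squad_to_triplets(squad_examples):
    """Turn simplified SQuAD 2.0 examples into <question, answer, label> triplets (illustrative only).

    Positive triplets pair a question with one of its gold answers; negative triplets
    pair it with an answer drawn from a different question. The real preprocessing may differ.
    """
    all_answers = [a for ex in squad_examples for a in ex["answers"]]
    triplets = []
    for ex in squad_examples:
        for answer in ex["answers"]:
            triplets.append((ex["question"], answer, 1))   # compatible pair
        negatives = [a for a in all_answers if a not in ex["answers"]]
        if negatives:
            triplets.append((ex["question"], random.choice(negatives), 0))  # incompatible pair
    return triplets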

In the Interview Warmup tool, users can switch between talking points to see which ones were detected in their answer.

The tool does not grade or judge answers. Instead, it enables users to practice and identify ways to improve on their own. After a user replies to an interview question, their answer is parsed sentence by sentence with the topicality NLA model. They can then switch between different talking points to see which ones were detected in their answer. We know that there are many potential pitfalls in signaling to a user that their response is “good”, especially as we only detect a limited set of topics. Instead, we keep the control in the user’s hands and only use ML to help users make their own discoveries about how to improve.
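
Conceptually, the per-answer flow might look like the following sketch, reusing the hypothetical detect_topics helper from the earlier example. The naive period-based sentence splitter is a placeholder purely for illustration; a production system would use a proper sentence segmenter.

def detected_talking_points(answer: str, talking_points: list[str], score_fn) -> dict[str, list[str]]:
    """Map each expected talking point to the sentences (if any) that cover it."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    coverage = {tp: [] for tp in talking_points}
    for sentence in sentences:
        for topic in detect_topics(sentence, score_fn):
            if topic in coverage:
                coverage[topic].append(sentence)
    return coverage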

So far, the tool has had great results helping job seekers around the world, including in the US, and we have recently expanded it to Africa. We plan to continue working with job seekers to iterate and make the tool even more helpful to the millions of people searching for new jobs.

A short film showing how Interview Warmup and its NLA capabilities were developed in collaboration with job seekers.

Conclusion

Natural Language Assessment (NLA) is a technologically challenging and interesting research area. It paves the way for new conversational applications that promote learning by enabling the nuanced assessment and analysis of answers from multiple perspectives. Working together with communities, from job seekers and businesses to classroom teachers and students, we can identify situations where NLA has the potential to help people learn, engage, and develop skills across an array of subjects, and we can build applications in a responsible way that empower users to assess their own abilities and discover ways to improve.


Acknowledgements

This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Google Research Israel, Google Creative Lab, and Grow with Google teams among others.

Source: Google AI Blog


Stepping up as a Machine Learning Developer — My Experience With the Google Machine Learning Bootcamp

Posted by Hyunkil Kim, Software Quality Engineer at Line Corp.

banner image that includes math chart, brain, and GDS logo

This article is written by Hyunkil Kim, who participated in the Machine Learning Bootcamp, a machine learning training program conducted in Korea to nurture next-generation ML engineers and help them find jobs.

banner image with text that reads google developers machine learning bootcamp

As a developer, I had developed a certain level of curiosity about machine learning. I had also heard that many former developers were switching their specialization over to machine learning. Thus, I signed up for the <Google Machine Learning Bootcamp>, thinking it would be a good chance to get my feet wet.

I was a bit nervous and excited at the same time after getting the acceptance notification. Wondering if I should go over my Python skills one more time in preparation, I installed the newest version of TensorFlow on my machine. I also skimmed through documents on the basics of machine learning. Those were all unnecessary. To put it bluntly, I had to relearn everything from scratch over the course of the bootcamp. It was quite challenging to be introduced to concepts I wasn't familiar with, such as the functional API and functional programming in general, various visualization libraries, and data processing frameworks and services. I worked very hard with the mindset of starting fresh.

Journey to Becoming a Machine Learning Engineer

There were three main objectives for the participants: completing the Deep Learning Specialization on Coursera, which is based on TensorFlow; acquiring an ML certification (the TensorFlow certificate or the Google Cloud ML (or Data Science) Engineer certification); and participating in Kaggle competitions. The Google Developers team covered the Coursera course fee and the certification fee, and offered many benefits to those who completed the course. You could really make it worth your while as long as you took the initiative and applied your passion.

<Coursera Deep Learning Specialization>

The Coursera class is based on TensorFlow 2.x and requires watching a set amount of instructor Andrew Ng's lectures on AI every week and submitting screenshots as proof. It was pretty tough at first as the lectures were not in Korean. However, because the class was so famous, I was able to find posts on the internet that broke down the lectures and made them easier to understand. The class also provided reference links, so you could study more on your own once you got used to it.

While this is not really related to the Coursera class, I also participated in online coding meetups organized by the bootcamp participants in between classes, as in the picture below, and it was a memorable experience. In normal times, these are sessions held in coffee shops or study rooms where people get together and work individually on their own coding projects. Because of the pandemic, we obviously could not meet in person, so we used Google Meet or Gather town and left our cameras on as we coded. It felt like I was studying with other people, and I liked the solidarity of relating to others.

animated image of cartoon figures in a dining room

<Machine Learning Certifications>

You were required to acquire at least one certification during the bootcamp. I chose to work on the GCP ML Engineer certification. As I had used Google Cloud, I had wondered how ML services could be used in the cloud. Coursera happened to have a specialization program for the GCP ML certification, so I took it, too. However, in the end, Google's own website covering GCP AI operations and use cases helped me more with the certification than the Coursera course.

Image of Google Cloud certification awarded to Hyunkil Kim

<Kaggle Competition>

I didn't get to spend as much time on Kaggle. I didn't see any current projects that interested me, so I tried the TPS to review what I had learned so far. TPS stands for Tabular Playground Series, a beginner-to-intermediate level competition for new-ish Kagglers who are just getting the hang of it. You're required to predict the value of the target from the provided tabular data. It is slightly more difficult than Titanic Survival Predictions, which is a beginner competition. I chose this competition because I figured it would be good practice for the things I had learned so far, like data analysis, feature engineering, and hyperparameter tuning.

Image of duck shown as Hyunkil Kim's profile picture on the Kaggle dashboard

This was the part where I personally felt like I could have done better. I had many ideas for improving the model or enhancing its performance, but it took far more time to apply and experiment with them than I had expected. If I had known that training models would take this much time, I would have started working on Coursera, the certification, and the Kaggle competition all at once from the beginning. Maybe I was too nervous about entering a Kaggle competition and put it off until the end. I should have just tried without getting so nervous. I hesitated too long and ended up regretting it a little too late.

<Tech Talk and Career Talk>

The bootcamp also included many other activities, including a weekly Tech Talk on specific themes and recruiting sessions with potential employers. Companies looking for ML talent were invited and had a chance to introduce themselves, explain the available positions, and take questions about joining their workforce. Some companies sent their current machine learning engineers to explain which models and what kinds of data they used to solve business problems. Other companies focused more on describing, in detail, the type of people they were looking for. I didn't know at the time, but I heard that some of the speakers were big names in the industry. Personally, I found these talks very helpful in terms of both finding employment and familiarizing myself with industry trends. The sessions were very inspiring: new ideas kept flowing as I heard about applications of technologies I only knew in theory and thought about what kinds of AI investments would be promising.

Besides the Tech Talks, there were also more relaxed sessions for things like career consultation and resume/CV reviews. There were even sessions by Googlers, where they personally answered participants' questions and offered advice. As I attended various sessions, I noticed that the bootcamp crew and many Tech Talk speakers from hiring companies offered authentic, valuable advice and were very eager to help out the bootcamp participants. Nobody talked about the cold reality of the world out there. Knowing how rare it is to find mentors who offer genuinely constructive feedback and guidance, I was personally very touched and grateful for that.

Concluding the Machine Learning Bootcamp

The Google Machine Learning Bootcamp captured the essence of what it would be like to work for Google. I felt like they expected you to take your own initiative to do what you wanted. They showed that they were willing and able to help you grow as much as possible as long as you did your best. For example, Jeff Dean, one of the world's most famous programmers, was at the kickoff session, and there was even an AMA session with Laurence Moroney, who had developed the training course for TensorFlow. They also allowed maximum freedom in finding teammates for the Kaggle competitions, so that you didn't have to worry about having to carry your team. Things covered in the Tech Talks or recruitment sessions were not included in assignments. They let the participants do their own thing freely while promising the best support possible in the industry if needed. I could see how some people might find it too lax that Google lets you study on your own at your own pace.

Image of video conference call with Andrew Ng, Jeff Dean, and Laurence Moroney

I think this was a rare chance to meet people from various backgrounds with the common goal of becoming machine learning engineers or developers. It was a unique experience where I got to talk and study with good people and even do something unusual like the online coding meetup. There were also times when I was vainly taking pride in what little knowledge I had, but I ended up putting a lot of work into the bootcamp, wanting to make the most of it and to come out ahead of others.

In the end, the take-home message is to "try anything."

Personally, I was very happy with the experience. I got to be a little more comfortable with machine learning. As a result, I'm able to pay more attention to details related to machine learning at my new job. The challenge of facing something new is a constant of a developer's life. Still, participating in this bootcamp felt especially meaningful to me, and I enjoyed it thoroughly.

While the bootcamp is over, I heard that some participants are still continuing with their study groups or projects. Wanting to study as a group myself, I also had asked around and volunteered to join a study group, but I ended up studying alone because none of the groups covered the area I was interested in. Even so, many people sharing useful information on Slack helped me as I studied alone, and they are still helping me even after the bootcamp.

At any rate, I keep coming up with various ideas that I want to try in my current job or as a personal project. It feels like I found a new toy that I can have fun with for a while without getting tired of it. I think I'll start slowly with a small toy project.