Tag Archives: Environment

Using AI to expand global access to reliable flood forecasts

Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-related disasters has more than doubled since the year 2000, partly due to climate change. Nearly 1.5 billion people, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations can save thousands of lives per year.

Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. Through this multi-year journey, we advanced research hand-in-hand with building a real-time operational flood forecasting system that provides alerts on Google Search, Google Maps, Android notifications, and the Flood Hub. However, in order to scale globally, especially in places where accurate local data is not available, more research advances were required.

In “Global prediction of extreme floods in ungauged watersheds”, published in Nature, we demonstrate how machine learning (ML) technologies can significantly improve global-scale flood forecasting relative to the current state-of-the-art for countries where flood-related data is scarce. With these AI-based technologies, we extended the reliability of currently available global nowcasts, on average, from zero to five days, and improved forecasts across regions in Africa and Asia to be similar to what is currently available in Europe. The evaluation of the models was conducted in collaboration with the European Centre for Medium-Range Weather Forecasts (ECMWF).

These technologies also enable Flood Hub to provide real-time river forecasts up to seven days in advance, covering river reaches across over 80 countries. This information can be used by people, communities, governments and international organizations to take anticipatory action to help protect vulnerable populations.




Flood forecasting at Google

The ML models that power the Flood Hub tool are the product of many years of research, conducted in collaboration with several partners, including academics, governments, international organizations, and NGOs.

In 2018, we launched a pilot early warning system in the Ganges-Brahmaputra river basin in India, with the hypothesis that ML could help address the challenging problem of reliable flood forecasting at scale. The pilot was further expanded the following year via the combination of an inundation model, real-time water level measurements, the creation of an elevation map and hydrologic modeling.

In collaboration with academics, and in particular with the JKU Institute for Machine Learning, we explored ML-based hydrologic models, showing that LSTM-based models could produce more accurate simulations than traditional conceptual and physics-based hydrology models. This research led to flood forecasting improvements that enabled the expansion of our forecasting coverage to include all of India and Bangladesh. We also worked with researchers at Yale University to test technological interventions that increase the reach and impact of flood warnings.

Our hydrological models predict river floods by processing publicly available weather data like precipitation and physical watershed information. Such models must be calibrated to long data records from streamflow gauging stations in individual rivers. A low percentage of global river watersheds (basins) have streamflow gauges, which are expensive but necessary to supply relevant data, and it’s challenging for hydrological simulation and forecasting to provide predictions in basins that lack this infrastructure. Lower gross domestic product (GDP) is correlated with increased vulnerability to flood risks, and there is an inverse correlation between national GDP and the amount of publicly available data in a country. ML helps to address this problem by allowing a single model to be trained on all available river data and to be applied to ungauged basins where no data are available. In this way, models can be trained globally, and can make predictions for any river location.

There is an inverse (log-log) correlation between the amount of publicly available streamflow data in a country and national GDP. Streamflow data from the Global Runoff Data Center.

Our academic collaborations led to ML research that developed methods to estimate uncertainty in river forecasts and showed how ML river forecast models synthesize information from multiple data sources. They demonstrated that these models can simulate extreme events reliably, even when those events are not part of the training data. In an effort to contribute to open science, in 2023 we open-sourced a community-driven dataset for large-sample hydrology in Nature Scientific Data.


The river forecast model

Most hydrology models used by national and international agencies for flood forecasting and river modeling are state-space models, which depend only on daily inputs (e.g., precipitation, temperature, etc.) and the current state of the system (e.g., soil moisture, snowpack, etc.). LSTMs are a variant of state-space models and work by defining a neural network that represents a single time step, where input data (such as current weather conditions) are processed to produce updated state information and output values (streamflow) for that time step. LSTMs are applied sequentially to make time-series predictions, and in this sense, behave similarly to how scientists typically conceptualize hydrologic systems. Empirically, we have found that LSTMs perform well on the task of river forecasting.

A diagram of the LSTM, which is a neural network that operates sequentially in time. An accessible primer can be found here.

Our river forecast model uses two LSTMs applied sequentially: (1) a “hindcast” LSTM ingests historical weather data (dynamic hindcast features) up to the present time (or rather, the issue time of a forecast), and (2) a “forecast” LSTM ingests states from the hindcast LSTM along with forecasted weather data (dynamic forecast features) to make future predictions. One year of historical weather data are input into the hindcast LSTM, and seven days of forecasted weather data are input into the forecast LSTM. Static features include geographical and geophysical characteristics of watersheds that are input into both the hindcast and forecast LSTMs and allow the model to learn different hydrological behaviors and responses in various types of watersheds.

Output from the forecast LSTM is fed into a “head” layer that uses mixture density networks to produce a probabilistic forecast (i.e., predicted parameters of a probability distribution over streamflow). Specifically, the model predicts the parameters of a mixture of heavy-tailed probability density functions, called asymmetric Laplacian distributions, at each forecast time step. The result is a mixture density function, called a Countable Mixture of Asymmetric Laplacians (CMAL) distribution, which represents a probabilistic prediction of the volumetric flow rate in a particular river at a particular time.

LSTM-based river forecast model architecture. Two LSTMs are applied in sequence, one ingesting historical weather data and one ingesting forecasted weather data. The model outputs are the parameters of a probability distribution over streamflow at each forecasted timestep.
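
For readers who want a more concrete picture, the Keras sketch below wires up the hindcast LSTM, the forecast LSTM initialized from its final states, and a dense head that emits raw CMAL parameters. All sequence lengths, layer sizes, and the number of mixture components are illustrative assumptions, and the probabilistic loss over those parameters is omitted; this is a sketch of the architecture described above, not the production model.

```python
import tensorflow as tf

# A minimal sketch of the hindcast/forecast LSTM encoder with a CMAL
# parameter head. Shapes and sizes below are assumptions for illustration.
HINDCAST_DAYS, FORECAST_DAYS = 365, 7
N_DYNAMIC, N_STATIC, HIDDEN, N_MIX = 8, 16, 64, 3

hindcast_in = tf.keras.Input((HINDCAST_DAYS, N_DYNAMIC))  # historical weather
forecast_in = tf.keras.Input((FORECAST_DAYS, N_DYNAMIC))  # forecasted weather
static_in = tf.keras.Input((N_STATIC,))                   # watershed attributes

def with_static(seq, static, steps):
    # Broadcast the static watershed attributes onto every time step.
    static_seq = tf.keras.layers.RepeatVector(steps)(static)
    return tf.keras.layers.Concatenate(-1)([seq, static_seq])

h_seq = with_static(hindcast_in, static_in, HINDCAST_DAYS)
f_seq = with_static(forecast_in, static_in, FORECAST_DAYS)

# The hindcast LSTM's final states initialize the forecast LSTM, carrying a
# year of hydrologic "memory" into the seven-day forecast.
_, h_state, c_state = tf.keras.layers.LSTM(HIDDEN, return_state=True)(h_seq)
f_out = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(
    f_seq, initial_state=[h_state, c_state])

# 4 parameters per mixture component: weight, location, scale, asymmetry.
cmal_params = tf.keras.layers.Dense(4 * N_MIX)(f_out)
model = tf.keras.Model([hindcast_in, forecast_in, static_in], cmal_params)
```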


Input and training data

The model uses three types of publicly available data inputs, mostly from governmental sources:

  1. Static watershed attributes representing geographical and geophysical variables: From the HydroATLAS project, including data like long-term climate indexes (precipitation, temperature, snow fractions), land cover, and anthropogenic attributes (e.g., a nighttime lights index as a proxy for human development).
  2. Historical meteorological time-series data: Used to spin up the model for one year prior to the issue time of a forecast. The data comes from NASA IMERG, NOAA CPC Global Unified Gauge-Based Analysis of Daily Precipitation, and the ECMWF ERA5-land reanalysis. Variables include daily total precipitation, air temperature, solar and thermal radiation, snowfall, and surface pressure.
  3. Forecasted meteorological time series over a seven-day forecast horizon: Used as input for the forecast LSTM. These data are the same meteorological variables listed above, and come from the ECMWF HRES atmospheric model.

Training data are daily streamflow values from the Global Runoff Data Center over the time period 1980 - 2023. A single streamflow forecast model is trained using data from 5,680 diverse watershed streamflow gauges (shown below) to improve accuracy.

Location of 5,680 streamflow gauges that supply training data for the river forecast model from the Global Runoff Data Center.


Improving on the current state-of-the-art

We compared our river forecast model with GloFAS version 4, the current state-of-the-art global flood forecasting system. These experiments showed that ML can provide accurate warnings earlier and for larger, more impactful events.

The figure below shows the distribution of F1 scores when predicting different severity events at river locations around the world, with plus or minus 1 day accuracy. F1 scores are the harmonic mean of precision and recall, and event severity is measured by return period. For example, a 2-year return period event is a volume of streamflow that is expected to be exceeded on average once every two years. Our model achieves reliability scores at up to 4-day or 5-day lead times that are similar to or better, on average, than the reliability of GloFAS nowcasts (0-day lead time).

Distributions of F1 scores over 2-year return period events in 2,092 watersheds globally during the time period 2014-2023 from GloFAS (blue) and our model (orange) at different lead times. On average, our model is statistically as accurate as GloFAS nowcasts (0-day lead time) up to 5 days in advance over 2-year (shown) and 1-year, 5-year, and 10-year events (not shown).
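
As a rough illustration of this scoring, the snippet below computes an F1 score for one site by matching predicted extreme-event days to observed ones within a ±1 day tolerance. It is a simplified stand-in for the paper's evaluation protocol, with hypothetical inputs, not the code used to produce the published results.

```python
import numpy as np

def event_f1(pred_days, obs_days, tolerance_days=1):
    """F1 over extreme events: a prediction counts as a hit if an observed
    event falls within +/- tolerance_days, and vice versa for recall."""
    pred_days, obs_days = np.asarray(pred_days), np.asarray(obs_days)
    hits_pred = sum(np.any(np.abs(obs_days - p) <= tolerance_days) for p in pred_days)
    hits_obs = sum(np.any(np.abs(pred_days - o) <= tolerance_days) for o in obs_days)
    precision = hits_pred / max(len(pred_days), 1)
    recall = hits_obs / max(len(obs_days), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# Hypothetical day-of-year indices of 2-year return period exceedances.
print(event_f1(pred_days=[45, 160, 301], obs_days=[44, 161, 290]))
```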

Additionally (not shown), our model remains accurate for larger and rarer extreme events, with precision and recall scores on 5-year return period events that are similar to or better than GloFAS scores on 1-year return period events. See the paper for more information.


Looking into the future

The flood forecasting initiative is part of our Adaptation and Resilience efforts and reflects Google's commitment to address climate change while helping global communities become more resilient. We believe that AI and ML will continue to play a critical role in helping advance science and research towards climate action.

We actively collaborate with several international aid organizations (e.g., the Centre for Humanitarian Data and the Red Cross) to provide actionable flood forecasts. Additionally, in an ongoing collaboration with the World Meteorological Organization (WMO) to support early warning systems for climate hazards, we are conducting a study to help understand how AI can help address real-world challenges faced by national flood forecasting agencies.

While the work presented here demonstrates a significant step forward in flood forecasting, future work is needed to further expand flood forecasting coverage to more locations globally and other types of flood-related events and disasters, including flash floods and urban floods. We are looking forward to continuing collaborations with our partners in the academic and expert communities, local governments and the industry to reach these goals.

Source: Google AI Blog


Directing ML toward natural hazard mitigation through collaboration

Floods are the most common type of natural disaster, affecting more than 250 million people globally each year. As part of Google's Crisis Response and our efforts to address the climate crisis, we are using machine learning (ML) models for Flood Forecasting to alert people in areas that are impacted before disaster strikes.

Collaboration between researchers in the industry and academia is essential for accelerating progress towards mutual goals in ML-related research. Indeed, Google's current ML-based flood forecasting approach was developed in collaboration with researchers (1, 2) at the Johannes Kepler University in Linz, Austria, the University of Alabama, and the Hebrew University of Jerusalem, among others.

Today we discuss our recent Machine Learning Meets Flood Forecasting Workshop, which highlights efforts to bring together researchers from Google and other universities and organizations to advance our understanding of flood behavior and prediction, and build more robust solutions for early detection and warning. We also discuss the Caravan project, which is helping to create an open-source repository for global streamflow data, and is itself an example of a collaboration that developed from the previous Flood Forecasting Meets Machine Learning Workshop.


2023 Machine Learning Meets Flood Forecasting Workshop

The fourth annual Google Machine Learning Meets Flood Forecasting Workshop was held in January. This 2-day virtual workshop hosted over 100 participants from 32 universities, 20 governmental and non-governmental agencies, and 11 private companies. This forum provided an opportunity for hydrologists, computer scientists, and aid workers to discuss challenges and efforts toward improving global flood forecasts, to keep up with state-of-the-art technology advances, and to integrate domain knowledge into ML-based forecasting approaches.

The event included talks from six invited speakers, a series of small-group discussion sessions focused on hydrological modeling, inundation mapping, and hazard alerting–related topics, as well as a presentation by Google on the FloodHub, which provides free, public access to Google’s flood forecasts, up to 7 days in advance.

The presentations can be viewed on YouTube:

2023 Flood Forecasting Meets Machine Learning Talks Day 1



2023 Flood Forecasting Meets Machine Learning Talks Day 2



Some of the top challenges highlighted during the workshop were related to the integration of physical and hydrological science with ML to help build trust and reliability; filling gaps in observations of inundated areas with models and satellite data; measuring the skill and reliability of flood warning systems; and improving the communication of flood warnings to diverse, global populations. In addition, participants stressed that addressing these and other challenges will require collaboration between a number of different organizations and scientific disciplines.


The Caravan project

One of the main challenges in conducting successful ML research and creating advanced tools for flood forecasting is the need for large amounts of data for computationally expensive training and evaluation. Today, many countries and organizations collect streamflow data (typically either water levels or flow rates), but it is not standardized or held in a central repository, which makes it difficult for researchers to access.

During the 2019 Machine Learning Meets Flood Forecasting Workshop, a group of researchers identified the need for an open source, global streamflow data repository, and developed ideas around leveraging free computational resources from Google Earth Engine to address the flood forecasting community’s challenge of data collection and accessibility. Following two years of collaborative work between researchers from Google, the school of Geography at the University of Exeter, the Institute for Machine Learning at Johannes Kepler University, and the Institute for Atmospheric and Climate Science at ETH Zurich, the Caravan project was created.

In “Caravan - A global community dataset for large-sample hydrology”, published in Nature Scientific Data, we describe the project in more detail. Caravan is a global dataset for the development and training of hydrological models (see figure below). It provides open-source Python scripts that leverage essential weather and geographical data, previously made public on Google Earth Engine, to match streamflow data that users upload to the repository. The repository originally contained data from more than 13,000 watersheds in Central Europe, Brazil, Chile, Australia, the United States, Canada, and Mexico. It has since benefited from community contributions, including streamflow data from most of the watersheds in Denmark provided by the Geological Survey of Denmark and Greenland. The goal is to continue to develop and grow this repository to enable researchers to access most of the world’s streamflow data. For more information regarding contributing to the Caravan dataset, reach out to [email protected].

Locations of the 13,000 streamflow gauges in the Caravan dataset and the distribution of those gauges in GEnS global climate zones.

The path forward

Google plans to continue to host these workshops to help broaden and deepen collaboration between industry and academia in the development of environmental AI models. We are looking forward to seeing what advances might come out of the most recent workshop. Hydrologists and researchers interested in participating in future workshops are encouraged to contact [email protected].

Source: Google AI Blog


The BirdCLEF 2023 Challenge: Pushing the frontiers of biodiversity monitoring

Worldwide bird populations are declining at an alarming rate, with approximately 48% of existing bird species known or suspected to be experiencing population declines. For instance, the U.S. and Canada have reported 29% fewer birds since 1970.

Effective monitoring of bird populations is essential for the development of solutions that promote conservation. Monitoring allows researchers to better understand the severity of the problem for specific bird populations and evaluate whether existing interventions are working. To scale monitoring, bird researchers have started analyzing ecosystems remotely via passive acoustic monitoring, using bird sound recordings instead of in-person surveys. Researchers can gather thousands of hours of audio with remote recording devices, and then use machine learning (ML) techniques to process the data. While this is an exciting development, existing ML models struggle with tropical ecosystem audio data due to higher bird species diversity and overlapping bird sounds.

Annotated audio data is needed to understand model quality in the real world. However, creating high-quality annotated datasets — especially for areas with high biodiversity — can be expensive and tedious, often requiring tens of hours of expert analyst time to annotate a single hour of audio. Furthermore, existing annotated datasets are rare and cover only a small geographic region, such as Sapsucker Woods or the Peruvian rainforest. Thousands of unique ecosystems in the world still need to be analyzed.

In an effort to tackle this problem, over the past 3 years, we've hosted ML competitions on Kaggle in partnership with specialized organizations focused on high-impact ecologies. In each competition, participants are challenged with building ML models that can take sounds from an ecology-specific dataset and accurately identify bird species by sound. The best entries can train reliable classifiers with limited training data. Last year’s competition focused on Hawaiian bird species, which are some of the most endangered in the world.


The 2023 BirdCLEF ML competition

This year we partnered with The Cornell Lab of Ornithology's K. Lisa Yang Center for Conservation Bioacoustics and NATURAL STATE to host the 2023 BirdCLEF ML competition focused on Kenyan birds. The total prize pool is $50,000, the entry deadline is May 17, 2023, and the final submission deadline is May 24, 2023. See the competition website for detailed information on the dataset to be used, timelines, and rules.

Kenya is home to over 1,000 species of birds, covering a wide range of ecosystems, from the savannahs of the Maasai Mara to the Kakamega rainforest, and even alpine regions on Kilimanjaro and Mount Kenya. Tracking this vast number of species with ML can be challenging, especially with minimal training data available for many species.

NATURAL STATE is working in pilot areas around Northern Mount Kenya to test the effect of various management regimes and states of degradation on bird biodiversity in rangeland systems. By using the ML algorithms developed within the scope of this competition, NATURAL STATE will be able to demonstrate the efficacy of this approach in measuring the success and cost-effectiveness of restoration projects. In addition, the ability to cost-effectively monitor the impact of restoration efforts on biodiversity will allow NATURAL STATE to test and build some of the first biodiversity-focused financial mechanisms to channel much-needed investment into the restoration and protection of this landscape upon which so many people depend. These tools are necessary to scale this cost-effectively beyond the project area and achieve their vision of restoring and protecting the planet at scale.

In previous competitions, we used metrics like the F1 score, which requires choosing specific detection thresholds for the models. This requires significant effort, and makes it difficult to assess the underlying model quality: A bad thresholding strategy on a good model may underperform. This year we are using a threshold-free model quality metric: class mean average precision. This metric treats each bird species output as a separate binary classifier to compute an average AUC score for each, and then averages these scores. Switching to an uncalibrated metric should increase the focus on core model quality by removing the need to choose a specific detection threshold.
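
As a sketch of what such a threshold-free, macro-averaged metric looks like in code, the snippet below scores each species column as an independent binary classifier with scikit-learn and averages the per-class scores. The exact metric used by the competition may be computed differently (for example, in how species without positive labels are handled), so treat this as illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def class_mean_average_precision(y_true, y_score):
    """Macro-average a per-class, threshold-free score across species.
    y_true: (n_clips, n_species) binary labels; y_score: model outputs."""
    per_class = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()  # skip species with no positive examples
    ]
    return float(np.mean(per_class))
```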


How to get started

This will be the first Kaggle competition where participants can use the recently launched Kaggle Models platform that provides access to over 2,300 public, pre-trained models, including most of the TensorFlow Hub models. This new resource will have deep integrations with the rest of Kaggle, including Kaggle notebooks, datasets, and competitions.

If you are interested in participating in this competition, a great place to get started quickly is to use our recently open-sourced Bird Vocalization Classifier model that is available on Kaggle Models. This global bird embedding and classification model provides output logits for more than 10k bird species and also creates embedding vectors that can be used for other tasks. Follow the steps shown in the figure below to use the Bird Vocalization Classifier model on Kaggle.

To try the model on Kaggle, navigate to the model here. 1) Click “New Notebook”; 2) click on the "Copy Code" button to copy the example lines of code needed to load the model; 3) click on the "Add Model" button to add this model as a data source to your notebook; and 4) paste the example code in the editor to load the model.

Alternatively, the competition starter notebook includes the model and extra code to more easily generate a competition submission.

We invite the research community to consider participating in the BirdCLEF competition. We hope this effort will make it easier for researchers and conservation practitioners to survey bird population trends and build effective conservation strategies.


Acknowledgements

Compiling these extensive datasets was a major undertaking, and we are very thankful to the many domain experts who helped to collect and manually annotate the data for this competition. Specifically, we would like to thank (institutions and individual contributors in alphabetic order): Julie Cattiau and Tom Denton on the Brain team, Maximilian Eibl and Stefan Kahl at Chemnitz University of Technology, Stefan Kahl and Holger Klinck from the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology, Alexis Joly and Henning Müller at LifeCLEF, Jonathan Baillie from NATURAL STATE, Hendrik Reers, Alain Jacot and Francis Cherutich from OekoFor GbR, and Willem-Pier Vellinga from xeno-canto. We would also like to thank Ian Davies from the Cornell Lab of Ornithology for allowing us to use the hero image in this post.

Source: Google AI Blog


Real-time tracking of wildfire boundaries using satellite imagery

As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke.

As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Canada and Australia. They are displayed, with additional information from local authorities, on Google Search and Google Maps, allowing people to keep safe and stay informed about potential dangers near them, their homes or loved ones.

Real-time boundary tracking of the 2021-2022 Wrattonbully bushfire, shown as a red polygon in Google Maps.

Inputs

Wildfire boundary tracking requires balancing spatial resolution and update frequency. The most scalable method to obtain frequent boundary updates is to use geostationary satellites, i.e., satellites that orbit the Earth once every 24 hours. These satellites remain at a fixed point above Earth, providing continual coverage of the area surrounding that point. Specifically, our wildfire tracker models use the GOES-16 and GOES-18 satellites to cover North America, and the Himawari-9 and GK2A satellites to cover Australia. These provide continent-scale images every 10 minutes. The spatial resolution is 2 km at nadir (the point directly below the satellite), and lower as one moves away from nadir. The goal here is to provide people with warnings as soon as possible, and refer them to authoritative sources for spatially precise, on-the-ground data, as necessary.

Smoke plumes obscuring the 2018 Camp Fire in California. [Image from NASA Worldview]

Determining the precise extent of a wildfire is nontrivial, since fires emit massive smoke plumes, which can spread far from the burn area and obscure the flames. Clouds and other meteorological phenomena further obscure the underlying fire. To overcome these challenges, it is common to rely on infrared (IR) frequencies, particularly in the 3–4 μm wavelength range. This is because wildfires (and similar hot surfaces) radiate considerably at this frequency band, and these emissions diffract with relatively minor distortions through smoke and other particulates in the atmosphere. This is illustrated in the figure below, which shows a multispectral image of a wildfire in Australia. The visible channels (blue, green, and red) mostly show the triangular smoke plume, while the 3.85 μm IR channel shows the ring-shaped burn pattern of the fire itself. Even with the added information from the IR bands, however, determining the exact extent of the fire remains challenging, as the fire has variable emission strength, and multiple other phenomena emit or reflect IR radiation.

Himawari-8 hyperspectral image of a wildfire. Note the smoke plume in the visible channels (blue, green, and red), and the ring indicating the current burn area in the 3.85μm band.

Model

Prior work on fire detection from satellite imagery is typically based on physics-based algorithms for identifying hotspots from multispectral imagery. For example, the National Oceanic and Atmospheric Administration (NOAA) fire product identifies potential wildfire pixels in each of the GOES satellites, primarily by relying on the 3.9 μm and 11.2 μm frequencies (with auxiliary information from two other frequency bands).

In our wildfire tracker, the model is trained on all satellite inputs, allowing it to learn the relative importance of different frequency bands. The model receives a sequence of the three most recent images from each band so as to compensate for temporary obstructions such as cloud cover. Additionally, the model receives inputs from two geostationary satellites, achieving a super-resolution effect whereby the detection accuracy improves upon the pixel size of either satellite. In North America, we also supply the aforementioned NOAA fire product as input. Finally, we compute the relative angles of the sun and the satellites, and provide these as additional input to the model.

All inputs are resampled to a uniform 1 km–square grid and fed into a convolutional neural network (CNN). We experimented with several architectures and settled on a CNN followed by a 1x1 convolutional layer to yield separate classification heads for fire and cloud pixels (shown below). The number of layers and their sizes are hyperparameters, which are optimized separately for Australia and North America. When a pixel is identified as a cloud, we override any fire detection since heavy clouds obscure underlying fires. Even so, separating the cloud classification task improves the performance of fire detection as we incentivize the system to better identify these edge cases.

CNN architecture for the Australia model; a similar architecture was used for North America. Adding a cloud classification head improves fire classification performance.
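
A minimal Keras sketch of this kind of fully convolutional model with two 1x1-convolution heads is shown below. The input shape, number of layers, and filter counts are assumptions for illustration; as noted above, the real hyperparameters are tuned separately for Australia and North America.

```python
import tensorflow as tf

# Stacked input bands and time steps on a 1 km grid (sizes are illustrative).
inputs = tf.keras.Input((64, 64, 24))
x = inputs
for filters in (32, 64, 64):  # depth/width are hyperparameters in practice
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

# Separate 1x1-convolution heads for per-pixel fire and cloud probabilities.
fire = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="fire")(x)
cloud = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="cloud")(x)
model = tf.keras.Model(inputs, {"fire": fire, "cloud": cloud})

# At inference time, a positive cloud prediction overrides any fire detection
# for the same pixel, since heavy clouds obscure underlying fires.
```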

To train the network, we used thermal anomalies data from the MODIS and VIIRS polar-orbiting satellites as labels. MODIS and VIIRS have higher spatial accuracy (750–1000 meters) than the geostationary satellites we use as inputs. However, they cover a given location only once every few hours, which occasionally causes them to miss rapidly-advancing fires. Therefore, we use MODIS and VIIRS to construct a training set, but at inference time we rely on the high-frequency imagery from geostationary satellites.

Even when limiting attention to active fires, most pixels in an image are not currently burning. To reduce the model's bias towards non-burning pixels, we upsampled fire pixels in the training set and applied focal loss to encourage improvements in the rare misclassified fire pixels.
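
For reference, a per-pixel binary focal loss can be written in a few lines of TensorFlow, as sketched below. The focusing parameter gamma is an assumed value; the post does not state the one used in training.

```python
import tensorflow as tf

def binary_focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Down-weights easy (mostly non-burning) pixels so training focuses on
    hard, misclassified fire pixels. gamma=2.0 is an assumed setting."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    return tf.reduce_mean(-tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
```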

The progressing boundary of the 2022 McKinney fire, and a smaller nearby fire.

Evaluation

High-resolution fire signals from polar-orbiting satellites are a plentiful source for training data. However, such satellites use sensors that are similar to those on geostationary satellites, which increases the risk of systemic labeling errors (e.g., cloud-related misdetections) being incorporated into the model. To evaluate our wildfire tracker model without such bias, we compared it against fire scars (i.e., the shape of the total burnt area) measured by local authorities. Fire scars are obtained after a fire has been contained and are more reliable than real-time fire detection techniques. We compare each fire scar to the union of all fire pixels detected in real time during the wildfire to obtain an image such as the one shown below. In this image, green represents correctly identified burn areas (true positive), yellow represents unburned areas detected as burn areas (false positive), and red represents burn areas that were not detected (false negative).

Example evaluation for a single fire. Pixel size is 1km x 1km.

We compare our models to official fire scars using the precision and recall metrics. To quantify the spatial severity of classification errors, we take the maximum distance between a false positive or false negative pixel and the nearest true positive fire pixel. We then average each metric across all fires. The results of the evaluation are summarized below. Most severe misdetections were found to be a result of errors in the official data, such as a missing scar for a nearby fire.
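
The snippet below shows how these per-fire metrics could be computed from boolean rasters of real-time detections and the official fire scar, using a distance transform for the maximum error distance. It is a simplified reconstruction of the evaluation described above, not the code behind the published numbers.

```python
import numpy as np
from scipy import ndimage

def fire_scar_metrics(pred, scar, pixel_km=1.0):
    """pred, scar: boolean 2D grids (union of real-time detections vs. the
    official fire scar). Returns precision, recall, and the maximum distance
    (km) from any false positive/negative pixel to the nearest true positive."""
    tp, fp, fn = pred & scar, pred & ~scar, ~pred & scar
    precision = tp.sum() / max(pred.sum(), 1)
    recall = tp.sum() / max(scar.sum(), 1)
    # Distance from every pixel to the nearest true-positive pixel.
    dist_to_tp = ndimage.distance_transform_edt(~tp) * pixel_km
    errors = fp | fn
    max_error_km = float(dist_to_tp[errors].max()) if errors.any() else 0.0
    return precision, recall, max_error_km
```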

Test set metrics comparing our models to official fire scars.

We performed two additional experiments on wildfires in the United States (see table below). First, we evaluated an earlier model that relies only on NOAA's GOES-16 and GOES-17 fire products. Our model outperforms this approach in all metrics considered, demonstrating that the raw satellite measurements can be used to enhance the existing NOAA fire product.

Next, we collected a new test set consisting of all large fires in the United States in 2022. This test set was not available during training because the model launched before the fire season began. Evaluating the performance on this test set shows performance in line with expectations from the original test set.

Comparison between models on fires in the United States.


Conclusion

Boundary tracking is part of Google’s wider commitment to bring accurate and up-to-date information to people in critical moments. This work demonstrates how we use satellite imagery and ML to track wildfires and provide real-time support to affected people in times of crisis. In the future, we plan to keep improving the quality of our wildfire boundary tracking, to expand this service to more countries, and to continue our work helping fire authorities access critical information in real time.


Acknowledgements

This work is a collaboration between teams from Google Research, Google Maps and Crisis Response, with support from our partnerships and policy teams. We would also like to thank the fire authorities whom we partner with around the world.



Source: Google AI Blog


Optimizing Airline Tail Assignments for Cleaner Skies

Airlines around the world are exploring several tactics to meet aggressive CO2 commitments set by the International Civil Aviation Organization (ICAO). This effort has been emphasized in Europe, where aviation accounts for 13.9% of the transportation industry’s carbon emissions. The largest push comes from the European Green Deal, which aims to decrease carbon emissions from transportation by 90% by 2050. The Lufthansa Group has gone even further, committing to a 50% reduction in emissions compared to 2019 by the year 2030 and to reach net-zero emissions by 2050.

One unexpected approach that airlines can use to lower carbon emissions is through optimizing their tail assignment, i.e., how to assign aircraft (identified by the aircraft registration painted on their tails) to legs in a way that minimizes the total operating cost, of which fuel is a major contributor. More fuel needed to operate the aircraft means higher operating costs and more carbon emitted into the atmosphere. For example, a typical long-haul flight (longer than ~4,100 km or ~2,500 mi) emits about a ton of CO2 per passenger.

The amount of fuel needed to fly between origin and destination can vary widely — e.g., larger aircraft weigh more and therefore require more fuel, while modern and younger aircraft tend to be more fuel-efficient because they use newer technology. The mass of the fuel itself is also significant. Aircraft are less fuel-efficient early in their flights when their fuel tanks are full than later when the volume of fuel is reduced. Another important factor for the tail assignment is the number of passengers on board; as the number of bookings changes, a smaller or larger aircraft might be required. Other factors can affect fuel consumption either negatively (e.g., headwinds or the age of the engines) or positively (e.g., tailwinds, sharklets, skin).

During the past year, Google’s Operations Research team has been working with the Lufthansa Group to optimize their tail assignment to reduce carbon emissions and the cost of operating their flights. As part of this collaboration, we developed and launched a mathematical tail assignment solver that has been fully integrated to optimize the fleet schedule for SWISS International Air Lines (a Lufthansa Group subsidiary), which we estimate will result in significant reductions in carbon emissions. This solver is the first step of a multi-phase project that started at SWISS.

A Mathematical Model for Tail Assignment
We structure the task of tail assignment optimization as a network flow problem, which is essentially a directed graph characterized by a set of nodes and a set of arcs, with additional constraints related to the problem at hand. Nodes may have either a supply or a demand for a commodity, while arcs have a flow capacity and a cost per unit of flow. The goal is to determine flows for every arc that minimize the total flow cost of each commodity, while maintaining flow balance in the network.

We decided to use a flow network because it is the most common way of modeling this problem in literature, and the commodities, arcs, and nodes of the flow network have a simple one-to-one correspondence to tails, legs, and airports in the real-life problem. In this case, the arcs of the network correspond to each leg of the flight schedule, and each individual tail is a single instance of a commodity that “flows” along the network. Each leg and tail pair in the network has an associated assignment cost, and the model’s objective is to pick valid leg and tail pairs such that these assignment costs are minimized.

A simple example of the tail assignment problem. There are four legs in this schedule and four possible tails that one can assign to those legs. Each tail and leg pair has an associated operational cost. For example, for Leg 1, it costs $50 to assign Tail 1 to it but $100 to assign Tail 2. The optimal solution, with the minimum cost, is to assign Tail 4 to Legs 3 and 2 and Tail 1 to Legs 1 and 4.
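
To make the toy example concrete, the sketch below solves a stripped-down version of it with OR-Tools (the same open source suite mentioned at the end of this post): binary variables pick one tail per leg so that total assignment cost is minimized. Only Leg 1's two costs come from the caption; the remaining costs are placeholders, and the turnaround, maintenance, and connectivity constraints of the real network flow model are deliberately omitted.

```python
from ortools.linear_solver import pywraplp

LEGS, TAILS = [1, 2, 3, 4], [1, 2, 3, 4]
# Hypothetical per-(leg, tail) operating costs; only Leg 1's costs ($50 for
# Tail 1, $100 for Tail 2) come from the example above, the rest are made up.
cost = {(l, t): 75 for l in LEGS for t in TAILS}
cost[(1, 1)], cost[(1, 2)] = 50, 100

solver = pywraplp.Solver.CreateSolver("CBC")
x = {(l, t): solver.BoolVar(f"x_{l}_{t}") for l in LEGS for t in TAILS}

# Every leg must be flown by exactly one tail. (The real model also enforces,
# via flow balance on a time-space network, that the legs assigned to a tail
# form a physically flyable sequence.)
for l in LEGS:
    solver.Add(sum(x[l, t] for t in TAILS) == 1)

solver.Minimize(sum(cost[l, t] * x[l, t] for l in LEGS for t in TAILS))
if solver.Solve() == pywraplp.Solver.OPTIMAL:
    for (l, t), var in x.items():
        if var.solution_value() > 0.5:
            print(f"Leg {l} -> Tail {t}")
    print("Total cost:", solver.Objective().Value())
```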

Aside from the standard network flow constraints, the model takes into account additional airline-specific constraints so that the solution is tailored to Lufthansa Group airlines. For example, aircraft turnaround times — i.e., the amount of time an aircraft spends on the ground between two consecutive flights — are airline-specific and can vary for a number of reasons. Catering might be loaded at an airline's hub, reducing the turnaround time needed at outstations, or a route could have a higher volume of vacation travelers who often take longer to board and disembark than business travelers. Another constraint is that each aircraft must be on the ground for a nightly check at a specified airport’s maintenance hub to receive mandated maintenance work or cleaning. Furthermore, each airline has their own maintenance schedule, which can require aircraft to undergo routine maintenance checks every few nights, in part to help maintain the aircraft’s fuel efficiency.

Preliminary Results & Next Steps
After using our solver to optimize their fleet schedule in Europe, SWISS Airlines estimates an annual savings of over 3.5 million Swiss Francs and a 6500 ton reduction in CO2 emitted. We expect these savings will multiply when the model is rolled out to the rest of the airlines in the Lufthansa Group and again when traffic returns to pre-COVID levels. Future work will include ensuring this model is usable with larger sets of data, and adding crew and passenger assignment to the optimization system to improve the flight schedules for both passengers and flight crew.

If you are interested in experimenting with your own network flow models, check out OR-Tools, our open source software suite that can be used to build optimization solutions similar to the solver presented in this post. Refer to OR-Tools related documentation for more information.

Acknowledgements
Thanks to Jon Orwant for collaborating extensively on this blog post and for establishing the partnership with Lufthansa and SWISS, along with Alejandra Estanislao. Thanks also to the Operations Research team and the folks at SWISS; this work would not have been possible without their hard work and contributions.

Source: Google AI Blog


Separating Birdsong in the Wild for Classification

Birds are all around us, and just by listening, we can learn many things about our environment. Ecologists use birds to understand food systems and forest health — for example, if there are more woodpeckers in a forest, that means there’s a lot of dead wood. Because birds communicate and mark territory with songs and calls, it’s most efficient to identify them by ear. In fact, experts may identify up to 10x as many birds by ear as by sight.

In recent years, autonomous recording units (ARUs) have made it easy to capture thousands of hours of audio in forests that could be used to better understand ecosystems and identify critical habitat. However, manually reviewing the audio data is very time consuming, and experts in birdsong are rare. But an approach based on machine learning (ML) has the potential to greatly reduce the amount of expert review needed for understanding a habitat.

However, ML-based audio classification of bird species can be challenging for several reasons. For one, birds often sing over one another, especially during the “dawn chorus” when many birds are most active. Also, there aren’t clear recordings of individual birds to learn from — almost all of the available training data is recorded in noisy outdoor conditions, where other sounds from the wind, insects, and other environmental sources are often present. As a result, existing birdsong classification models struggle to identify quiet, distant and overlapping vocalizations. Additionally, some of the most common species often appear unlabeled in the background of training recordings for less common species, leading models to discount the common species. These difficult cases are very important for ecologists who want to identify endangered or invasive species using automated systems.

To address the general challenge of training ML models to automatically separate audio recordings without access to examples of isolated sounds, we recently proposed a new unsupervised method called mixture invariant training (MixIT) in our paper, “Unsupervised Sound Separation Using Mixture Invariant Training”. Moreover, in our new paper, “Improving Bird Classification with Unsupervised Sound Separation,” we use MixIT training to separate birdsong and improve species classification. We found that including the separated audio in the classification improves precision and classification quality on three independent soundscape datasets. We are also happy to announce the open-source release of the birdsong separation models on GitHub.

Bird Song Audio Separation
MixIT learns to separate single-channel recordings into multiple individual tracks, and can be trained entirely with noisy, real-world recordings. To train the separation model, we create a “mixture of mixtures” (MoM) by mixing together two real-world recordings. The separation model then learns to take the MoM apart into many channels to minimize a loss function that uses the two original real-world recordings as ground-truth references. The loss function uses these references to group the separated channels such that they can be mixed back together to recreate the two original real-world recordings. Since there’s no way to know how the different sounds in the MoM were grouped together in the original recordings, the separation model has no choice but to separate the individual sounds themselves, and thus learns to place each singing bird in a different output audio channel, also separate from wind and other background noise.
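
The core of the MixIT objective is compact enough to sketch directly. The NumPy snippet below brute-forces every assignment of the separated channels to the two reference recordings and keeps the best-scoring remix, using negative SNR as the per-reference loss; the actual training implementation is more efficient and operates on batches, so treat this as an illustration of the idea rather than the paper's code.

```python
import itertools
import numpy as np

def neg_snr(ref, est, eps=1e-8):
    # Negative signal-to-noise ratio in dB (lower is better).
    err = ref - est
    return -10.0 * np.log10((np.sum(ref**2) + eps) / (np.sum(err**2) + eps))

def mixit_loss(refs, est_sources):
    """Brute-force MixIT loss.
    refs: (2, T) array of the two original real-world recordings.
    est_sources: (M, T) channels separated from their sum (the MoM)."""
    num_sources = est_sources.shape[0]
    best = np.inf
    # Try every way of assigning each separated channel to one of the two refs.
    for assignment in itertools.product([0, 1], repeat=num_sources):
        remix = np.zeros_like(refs)
        for m, a in enumerate(assignment):
            remix[a] += est_sources[m]
        loss = neg_snr(refs[0], remix[0]) + neg_snr(refs[1], remix[1])
        best = min(best, loss)
    return best
```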

We trained a new MixIT separation model using birdsong recordings from Xeno-Canto and the Macaulay Library. We found that for separating birdsong, this new model outperformed a MixIT separation model trained on a large amount of general audio from the AudioSet dataset. We measure the quality of the separation by mixing two recordings together, applying separation, and then remixing the separated audio channels such that they reconstruct the original two recordings. We measure the signal-to-noise ratio (SNR) of the remixed audio relative to the original recordings. We found that the model trained specifically for birds achieved 6.1 decibels (dB) better SNR than the model trained on AudioSet (10.5 dB vs 4.4 dB). Subjectively, we also found many examples where the system worked incredibly well, separating very difficult to distinguish calls in real-world data.

The following videos demonstrate separation of birdsong from two different regions (Caples and the High Sierras). The videos show the mel-spectrogram of the mixed audio (a 2D image that shows the frequency content of the audio over time) and highlight the audio separated into different tracks.

High Sierras
  
Caples

Classifying Bird Species
To classify birds in real-world audio captured with ARUs, we first split the audio into five-second segments and then create a mel-spectrogram of each segment. We then train an EfficientNet classifier to identify bird species from the mel-spectrogram images, training on audio from Xeno-Canto and the Macaulay Library. We trained two separate classifiers, one for species in the Sierra Nevada mountains and one for upstate New York. Note that these classifiers are not trained on separated audio; that’s an area for future improvement.
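
The preprocessing is straightforward to sketch: split a recording into five-second chunks and convert each chunk into a log-mel spectrogram image for the classifier. The snippet below uses librosa with illustrative parameter values (sample rate, number of mel bins) that are assumptions rather than the exact settings behind the published models.

```python
import numpy as np
import librosa

def audio_to_mel_segments(path, segment_s=5.0, sample_rate=32000, n_mels=128):
    """Split a recording into fixed-length segments and return one log-mel
    spectrogram per segment, ready to feed to an image classifier."""
    audio, sr = librosa.load(path, sr=sample_rate)
    seg_len = int(segment_s * sr)
    segments = []
    for start in range(0, max(len(audio) - seg_len + 1, 1), seg_len):
        chunk = audio[start:start + seg_len]
        mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=n_mels)
        segments.append(librosa.power_to_db(mel, ref=np.max))
    return np.stack(segments)
```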

We also introduced some new techniques to improve classifier training. Taxonomic training asks the classifier to provide labels for each level of the species taxonomy (genus, family, and order), which allows the model to learn groupings of species before learning the sometimes-subtle differences between similar species. Taxonomic training also allows the model to benefit from expert information about the taxonomic relationships between different species. We also found that random low-pass filtering was helpful for simulating distant sounds during training: As an audio source gets further away, the high-frequency parts fade away before the low-frequency parts. This was particularly effective for identifying species from the High Sierras region, where bird songs cover very long distances, unimpeded by trees.
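
Random low-pass filtering is simple to add as a training-time augmentation. The sketch below applies a Butterworth low-pass filter with a randomly drawn cutoff to mimic a distant singer; the filter order and cutoff range are assumed values, not the ones used in the paper.

```python
import numpy as np
from scipy.signal import butter, lfilter

def random_lowpass(audio, sample_rate, rng=np.random.default_rng()):
    """Attenuate high frequencies with a random cutoff to simulate distance:
    far-away songs lose their high-frequency content first."""
    cutoff_hz = rng.uniform(1000.0, 0.45 * sample_rate)  # stay below Nyquist
    b, a = butter(4, cutoff_hz, btype="low", fs=sample_rate)
    return lfilter(b, a, audio)
```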

Classifying Separated Audio
We found that separating audio with the new MixIT model before classification improved the classifier performance on three independent real-world datasets. The separation was particularly successful for identification of quiet and background birds, and in many cases helped with overlapping vocalizations as well.

Top: A mel-spectrogram of two birds, an American pipit (amepip) and gray-crowned rosy finch (gcrfin), from the Sierra Nevadas. The legend shows the log-probabilities for the two species given by the pre-trained classifiers. Higher values indicate more confidence, and values greater than -1.0 are usually correct classifications. Bottom: A mel-spectrogram for the automatically separated audio, with the classifier log probabilities from the separated channels. Note that the classifier only identifies the gcrfin once the audio is separated.
Top: A complex mixture with three vocalizations: A golden-crowned kinglet (gockin), mountain chickadee (mouchi), and Steller’s jay (stejay). Bottom: Separation into three channels, with classifier log probabilities for the three species. We see good visual separation of the Steller’s jay (shown by the distinct pink marks), even though the classifier isn’t sure what it is.

The separation model does have some potential limitations. Occasionally we observe over-separation, where a single song is broken into multiple channels, which can cause misclassifications. We also notice that when multiple birds are vocalizing, the most prominent song often gets a lower score after separation. This may be due to loss of environmental context or other artifacts introduced by separation that do not appear during classifier training. For now, we get the best results by running the classifier on the separated channels and the original audio, and taking the maximum score for each species. We expect that further work will allow us to reduce over-separation and find better ways to combine separation and classification. You can see and hear more examples of the full system at our GitHub repo.

Future Directions
We are currently working with partners at the California Academy of Sciences to understand how habitat and species mix changes after prescribed fires and wildfires, applying these models to ARU audio collected over many years.

We also foresee many potential applications for the unsupervised separation models in ecology, beyond just birds. For example, the separated audio can be used to create better acoustic indices, which could measure ecosystem health by tracking the total activity of birds, insects, and amphibians without identifying particular species. Similar methods could also be adapted for use underwater to track coral reef health.

Acknowledgements
We would like to thank Mary Clapp, Jack Dumbacher, and Durrell Kapan from the California Academy of Sciences for providing extensive annotated soundscapes from the Sierra Nevadas. Stefan Kahl and Holger Klinck from the Cornell Lab of Ornithology provided soundscapes from Sapsucker Woods. Training data for both the separation and classification models came from Xeno-Canto and the Macaulay Library. Finally, we would like to thank Julie Cattiau, Lauren Harrell, Matt Harvey, and our co-author, John Hershey, from the Google Bioacoustics and Sound Separation teams.

Source: Google AI Blog


Environmental Insights Explorer Expands: 100 Australian councils and counting

Environmental Insights Explorer 

Reducing greenhouse gas (GHG) emissions is a crucial step in fighting the climate crisis. And cities now account for more than 70 percent of global emissions. But measuring exactly which activities—whether it’s buildings, cars, or public transport—are contributing to emissions, and by how much, is complex. Without this information, cities can neither understand the challenges they face, nor the impact of their environmental policies. 

This is the problem we’re working to solve with Environmental Insights Explorer (EIE), an online platform that provides building and transportation emissions, as well as solar potential analysis to make it easier for cities to measure progress against their climate action plans. Launched in 2018 for a handful of cities around the world including Melbourne, with Sydney, Canberra and Adelaide then added in 2019, EIE has helped councils accelerate GHG reduction efforts. Today, we’ve expanded EIE data access to thousands of cities worldwide, including 100+ Australian councils. 

To scale data access to local governments, policy makers and community groups, we’re also developing partnerships with leading Australian organisations, councils, and climate change experts. This includes a new partnership with Ironbark Sustainability and Beyond Zero Emissions to make EIE transportation data available for 100+ councils in Snapshot—a free tool that calculates major sources of carbon emissions, including stationary energy, transport, waste, agriculture, and land-use change. Snapshot allows municipalities to easily compare their sources of carbon emissions. This data integration will provide updated GHG profiles and support local government policy decisions covering more than 86 percent of the country's population, putting councils on a fast track for delivering commitments, building local resilience, and ensuring economic recovery.

Accelerated city-wide analysis

By analysing Google’s comprehensive global mapping data together with GHG emission factors, EIE estimates city-scale building and transportation carbon emissions data with the ability to drill down into more specific data such as vehicle-kilometres travelled by mode (automobiles, public transit, biking, etc.) and the percentage of emissions generated by residential or non-residential buildings. 
The insights that EIE provides have traditionally required many months of research, and a lot of resources for cities undertaking a climate action plan. Using Google’s proprietary data coupled with machine learning capabilities, we can produce a complete survey of a city that can be assessed very quickly. In this way, EIE allows cities to leapfrog tedious and costly data collection and analysis. 
EIE transport data now available in Snapshot for 100+ councils 

The next chapter

Over the next few months, we’ll be working together to help Australian councils learn more about data insights from EIE and expand data coverage to more councils. We hope that by making EIE data accessible to more councils across Australia, we’ll help nurture an ecosystem that can bring climate action plans to life. 


Leveraging Temporal Context for Object Detection



Ecological monitoring helps researchers to understand the dynamics of global ecosystems, quantify biodiversity, and measure the effects of climate change and human activity, including the efficacy of conservation and remediation efforts. In order to monitor effectively, ecologists need high-quality data, often expending significant efforts to place monitoring sensors, such as static cameras, in the field. While it is increasingly cost effective to build and operate networks of such sensors, the manual analysis of global biodiversity data remains a bottleneck to accurate, global, real-time ecological monitoring. While there are ways to automate this analysis via machine learning, the data from static cameras, widely used to monitor the world around us for purposes ranging from mountain pass road conditions to ecosystem phenology, still pose a strong challenge for traditional computer vision systems: due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes irregular due to the use of a motion trigger.

In order to perform well in this setting, computer vision models must be robust to objects of interest that are often off-center, out of focus, poorly lit, or at a variety of scales. In addition, a static camera will always take images of the same scene unless it is moved, which causes the data from any one camera to be highly repetitive. Without sufficient data variability, machine learning models may learn to focus on correlations in the background, leading to poor generalization to novel deployments. The machine learning and ecological communities have been working together through venues like LILA BC and Wildlife Insights to curate expert-labeled training data from many research groups, each of which may operate anywhere from one to hundreds of camera traps, in order to increase data variability. This process of data collection and annotation is slow, and is confounded by the need to have diverse, representative data across geographic regions and taxonomic groups.
What’s in this image? Objects in images from static cameras can be very challenging to detect and categorize. Here, a foggy morning has made it very difficult to see a herd of wildebeest walking along the crest of a hill. [Image from Snapshot Serengeti]
In Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection, we present a complementary approach that increases global scalability by improving generalization to novel camera deployments algorithmically. This new object detection architecture leverages contextual clues across time for each camera deployment in a network, improving recognition of objects in novel camera deployments without relying on additional training data from a large number of cameras. Echoing the approach a person might use when faced with challenging images, Context R-CNN leverages up to a month’s worth of images from the same camera for context to determine what objects might be present and identify them. Using this method, the model outperforms a single-frame Faster R-CNN baseline by significant margins across multiple domains, including wildlife camera traps. We have open sourced the code and models for this work as part of the TF Object Detection API to make it easy to train and test Context R-CNN models on new static camera datasets.
Here, we can see how additional examples from the same scene help experts determine that the object is an animal and not background. Context such as the shape & size of the object, its attachment to a herd, and habitual grazing at certain times of day help determine that the species is a wildebeest. Useful examples occur throughout the month.
The Context R-CNN Model
Context R-CNN is designed to take advantage of the high degree of correlation within images taken by a static camera to boost performance on challenging data and improve generalization to new camera deployments without additional human data labeling. It is an adaptation of Faster R-CNN, a popular two-stage object detection architecture. To extract context for a camera, we first use a frozen feature extractor to build up a contextual memory bank from images across a large time horizon (up to a month or more). Next, objects are detected in each image using Context R-CNN, which aggregates relevant context from the memory bank to help detect objects under challenging conditions (such as the heavy fog obscuring the wildebeests in our previous example). This aggregation is performed using attention, which is robust to the sparse and irregular sampling rates often seen in static monitoring cameras.
High-level architecture diagram, showing how Context R-CNN incorporates long-term context within the Faster R-CNN model architecture.
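As a rough sketch of the memory-bank step described above (not the open-sourced TF Object Detection API implementation), the snippet below runs a frozen feature extractor over a single camera's images and stacks the resulting box features into one memory bank M. The `frozen_feature_extractor` here is a random-feature stand-in for a pre-trained, frozen detector backbone.

```python
import numpy as np

def frozen_feature_extractor(image):
    """Stand-in for a frozen, pre-trained single-frame detector.
    Returns per-box features; in practice these would come from a detector
    whose weights are not updated while the memory bank is built."""
    num_boxes = np.random.randint(1, 4)      # hypothetical: 1-3 boxes per image
    return np.random.randn(num_boxes, 256).astype(np.float32)

def build_memory_bank(camera_images):
    """Concatenate frozen box features from up to a month of images into memory bank M."""
    features = [frozen_feature_extractor(img) for img in camera_images]
    return np.concatenate(features, axis=0)  # shape: (total_boxes, feature_dim)

# Example: placeholder "images" from one static camera over a month.
month_of_images = [None] * 30
M = build_memory_bank(month_of_images)
print(M.shape)
```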
The first stage of Faster R-CNN proposes potential objects, and the second stage categorizes each proposal as either background or one of the target classes. In Context R-CNN, we take the proposed objects from the first stage of Faster R-CNN and, for each one, use similarity-based attention to determine how relevant each feature in our memory bank (M) is to the current object. We then construct a per-object context feature by taking a relevance-weighted sum over M and adding it back to the original object features. Each object, now enriched with contextual information, is finally categorized using the second stage of Faster R-CNN.
Context R-CNN is able to leverage context (spanning up to 1 month) to correctly categorize the challenging wildebeest example we saw above. The green values are the corresponding attention weights for each boxed object.
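The relevance-weighted aggregation can be illustrated as plain dot-product attention: each first-stage proposal feature is scored against every entry in the memory bank, the scores are softmaxed into attention weights (analogous to the green values in the figure above), and the weighted sum of memory features is added back to the proposal feature. This is a simplified sketch, not the released implementation; it omits the learned query/key/value projections and other details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def add_context(proposal_features, memory_bank):
    """Add a relevance-weighted sum over the memory bank to each proposal feature.

    proposal_features: (num_proposals, d) box features from the first stage.
    memory_bank:       (num_memory, d) frozen features from the same camera (M).
    """
    # Dot-product similarity between each proposal and each memory entry.
    scores = proposal_features @ memory_bank.T    # (num_proposals, num_memory)
    attention = softmax(scores, axis=-1)          # attention weights per proposal
    context = attention @ memory_bank             # (num_proposals, d)
    # Contextual features are then classified by the second stage.
    return proposal_features + context

# Toy example with random features.
rng = np.random.default_rng(0)
proposals = rng.standard_normal((5, 256)).astype(np.float32)
M = rng.standard_normal((1000, 256)).astype(np.float32)
print(add_context(proposals, M).shape)  # (5, 256)
```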
Compared to a Faster R-CNN baseline (left), Context R-CNN (right) is able to capture challenging objects such as an elephant occluded by a tree, two poorly-lit impala, and a vervet monkey leaving the frame. [Images from Snapshot Serengeti]
Results
We have tested Context R-CNN on Snapshot Serengeti (SS) and Caltech Camera Traps (CCT), both ecological datasets of animal species in camera traps but from highly different geographic regions (Tanzania vs. the Southwestern United States). Improvements over the Faster R-CNN baseline for each dataset can be seen in the table below. Notably, we see a 47.5% relative increase in mean average precision (mAP) on SS, and a 34.3% relative mAP increase on CCT. We also compare Context R-CNN to S3D (a 3D-convolution-based baseline) and see performance improve from 44.7% mAP to 55.9% mAP (a 25.1% relative increase). Finally, we find that performance increases as the contextual time horizon grows, from a minute of context to a month.
Comparison to a single frame Faster R-CNN baseline, showing both mean average precision (mAP) and average recall (AR) detection metrics.
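For clarity, the relative gains quoted above are computed from the absolute mAP values; for example, for the S3D comparison:

$$\text{relative increase} = \frac{55.9 - 44.7}{44.7} \approx 0.251 = 25.1\%,$$

i.e., an 11.2-point absolute improvement in mAP.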
Ongoing and Future Work
We are working to implement Context R-CNN within the Wildlife Insights platform, to facilitate large-scale, global ecological monitoring via camera traps. We also host competitions such as the yearly iWildCam species identification competition at the CVPR Fine-Grained Visual Recognition Workshop to help bring these challenges to the attention of the computer vision community. The challenges seen in automatic species identification in static cameras are shared by numerous applications of static cameras outside of the ecological monitoring domain, as well as other static sensors used to monitor biodiversity, such as audio and sonar devices. Our method is general, and we anticipate the per-sensor context approach taken by Context R-CNN would be beneficial for any static sensor.

Acknowledgements
This post reflects the work of the authors as well as the following group of core contributors: Vivek Rathod, Guanhang Wu, Ronny Votel. We are also grateful to Zhichao Lu, David Ross, Tanya Birch and the Wildlife Insights AI team, and Pietro Perona and the Caltech Computational Vision Lab.

Source: Google AI Blog

