According to the National Center for Atmospheric Research (NCAR), routine weather events such as rain and unusually warm and cool days directly affect 3.4% of the US Gross Domestic Product, impacting everyone from ice-cream stores, clothing retailers, delivery services, farmers, resorts and business travelers. The NCAR estimate considers routine weather only — it doesn’t take into account, for example, how weather impacts people’s moods, nor the impact of destructive weather such as tornadoes and hurricanes. If you analyze data to make better business decisions (or if you build machine learning models to provide such guidance automatically), weather should be one of your inputs.
The GHCN data has long been freely available from the National Oceanic and Atmospheric Association (NOAA) website to download and analyze. However, because the dataset changes daily, anyone wishing to analyze that data over time would need to repeat the process the following day. Having the data already loaded and continually refreshed in BigQuery makes it easier for researchers and data scientists to incorporate weather information in analytics and machine learning projects. The fact that BigQuery analysis can be done using standard SQL makes it very convenient to start analyzing the data.
Let’s explore the GHCN dataset and how to interact with it using BigQuery.
Where are the GHCN weather stations?
The GHCN data is global. For example, let’s look at all the stations from which we have good minimum-temperature data on August 15, 2016:
SELECT name, value/10 AS min_temperature, latitude, longitude FROM [bigquery-public-data:ghcn_d.ghcnd_stations] AS stn JOIN [bigquery-public-data:ghcn_d.ghcnd_2016] AS wx ON wx.id = stn.id WHERE wx.element = 'TMIN' AND wx.qflag IS NULL AND STRING(wx.date) = '2016-08-15'
Google Cloud Datalab, we notice that the density of stations is very good in North America, Europe and Japan and quite reasonable in most of Asia. Most of the gaps correspond to sparsely populated areas such as the Australian outback, Siberia and North Africa. Brazil is the only gaping hole. (For the rest of this post, I’ll show only code snippets — for complete BigQuery queries and Python plotting commands, please see the full Datalab notebook on github.)
|Blue dots represent GHCN weather stations around the world.|
Using GHCN weather data in your applicationsHere’s a simple example of how to incorporate GHCN data into an application. Let’s say you're a pizza chain based in Chicago and want to explore some weather variables that might affect demand for pizza and pizza delivery times. The first thing to do is to find the GHCN station closest to you. You go to Google Maps and find that your latitude and longitude is 42 degrees latitude and -87.9 degrees longitude, and run a BigQuery query that computes the great-circle distance between a station and (42, -87.9) to get the distance from your pizza shop in kilometers (see the Datalab notebook for what this query looks like). The result looks like this:
SELECT wx.date, wx.value/10.0 AS prcp FROM [bigquery-public-data:ghcn_d.ghcnd_2015] AS wx WHERE id = 'USW00094846' AND qflag IS NULL AND element = 'PRCP' ORDER BY wx.date
Note that we divide
wx.valueby 10 because the GHCN reports rainfall in tenths of millimeters. We ensure that the quality-control flag (qflag) associated with the data is null, indicating that the observation passed spatio-temporal quality-control checks.
Typically, though, you’d want a few more weather variables. Here’s a more complete query that pulls rainfall amount, minimum temperature, maximum temperature and the presence of some weather phenomenon (fog, hail, rain, etc.) on each day:
SELECT wx.date, MAX(prcp) AS prcp, MAX(tmin) AS tmin, MAX(tmax) AS tmax, IF(MAX(haswx) = 'True', 'True', 'False') AS haswx FROM ( SELECT wx.date, IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp, IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin, IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax, IF (SUBSTR(wx.element, 0, 2) = 'WT', 'True', NULL) AS haswx FROM [bigquery-public-data:ghcn_d.ghcnd_2015] AS wx WHERE id = 'USW00094846' AND qflag IS NULL ) GROUP BY wx.date ORDER BY wx.date
The query returns rainfall amounts in millimeters, maximum and minimum temperatures in degrees Celsius and a column that indicates whether there was impactful weather on that day:
Pandas DataFrame and easily graph them in Datalab (see notebook in github for queries and plotting code):
BigQuery Views and Data Studio 360 dashboardsSince the previous query pivoted and transformed some fields, you can save the query as a View. Simply copy-paste this query into the BigQuery console and select “Save View”:
SELECT REPLACE(date,"-","") AS date, MAX(prcp) AS prcp, MAX(tmin) AS tmin, MAX(tmax) AS tmax FROM ( SELECT STRING(wx.date) AS date, IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp, IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin, IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax FROM [bigquery-public-data:ghcn_d.ghcnd_2016] AS wx WHERE id = 'USW00094846' AND qflag IS NULL AND value IS NOT NULL AND DATEDIFF(CURRENT_DATE(), date) < 15 ) GROUP BY date ORDER BY date ASC
Notice my use of DATEDIFF and CURRENT_DATE functions to get weather data from the past two weeks. Saving this query as a View allows me to query and visualize this View as if it were a BigQuery table.
Since visualization is on my mind, I can go over to Data Studio and easily create a dashboard from this View, for example:
Mashing datasets in BigQueryIt’s quite easy to execute a weather query from your analytics program and merge the result with other corporate data.
If that other data is on BigQuery, you can combine it all in a single query! For example, another BigQuery dataset that’s publicly available is airline on-time arrival data. Let’s mash the GHCN and on-time arrivals datasets together:
SELECT wx.date, wx.prcp, f.departure_delay, f.arrival_airport FROM ( SELECT STRING(date) AS date, value/10 AS prcp FROM [bigquery-public-data:ghcn_d.ghcnd_2005] WHERE id = 'USW00094846' AND qflag IS NULL AND element = 'PRCP') AS wx JOIN [bigquery-samples:airline_ontime_data.flights] AS f ON f.date = wx.date WHERE f.departure_airport = 'ORD' LIMIT 100
This yields a table with both flight delay and weather information:
GHCN data in BigQuery democratizes weather data and opens it up to all sorts of data analytics and machine learning applications. We can’t wait to see how you use this data to build what’s next.