
Live from the Firebase Dev Summit in Berlin: Firebase, six months after I/O

Posted by Francis Ma, Firebase Product Manager

Originally posted to the Firebase blog

Our goal with Firebase is to help developers build better apps and grow them into successful businesses. Six months ago at Google I/O, we took our well-loved backend-as-a-service (BaaS) and expanded it with 15 features to make it Google’s unified app development platform, available across iOS, Android, and the web.

We launched many new features at Google I/O, but our work didn’t stop there. Since then, we’ve learned a lot from you (750,000+ projects created on Firebase to date!) about how you’re using our platform and how we can improve it. Thanks to your feedback, today we’re launching a number of enhancements to Crash Reporting, Analytics, support for game developers and more. For more information on our announcements, tune in to the livestream video from the Firebase Dev Summit in Berlin. They’re also listed here:

Improve App Quality to Deliver Better User Experiences

Firebase Crash Reporting comes out of Beta and adds a new feature that helps you diagnose and reproduce app crashes.

Often the hardest part about fixing an issue is reproducing it, so we’ve added rich context to each crash to make the process simple. Firebase Crash Reporting now shows Firebase Analytics event data in the logs for each crash. This gives you clarity into the state of your app leading up to an error. Things like which screens of your app were visited are automatically logged with no instrumentation code required. Crash logs will also display any custom events and parameters you explicitly log using Firebase Analytics. Firebase Crash Reporting works for both iOS and Android apps.

Glide, a popular live video messaging app, relies on Firebase Crash Reporting to ensure user quality and release agility. “No matter how much effort you put into testing, it will never be as thorough as millions of active users in different locations, experiencing a variety of network conditions and real life situations. Firebase allows us to rapidly gain trust in our new version during phased release, as well as accelerate the process of identifying core issues and providing quick solutions.” - Roi Ginat, Founder, Glide.

Firebase Test Lab for Android supports more devices and introduces a free tier.

We want to help you deliver high-quality experiences, so testing your app before it goes into the wild is incredibly important. Firebase Test Lab allows you to easily test your app on many physical and virtual devices in the cloud, without writing a single line of test code. Beginning today, developers on the Spark service tier (which is free!) can run five tests per day on physical devices and ten tests per day on virtual devices—with no credit card setup required. We’ve also heard that you want more device options, so we’ve added 11 new popular Android device models to Test Lab, available today.

Illustration of Firebase Crash Reporting

Make Faster Data Driven Decisions with Firebase Analytics

Firebase Analytics now offers live reporting, a new integration with Google Data Studio, and real-time exporting to BigQuery.

We know that your data is most actionable when you can see and process it as quickly as possible. Therefore, we’re announcing a number of features to help you maximize the potential of your analytics events:

  1. Real-time uploading of conversion events
  2. Real-time exporting to BigQuery
  3. DebugView for validation of your analytics instrumentation
  4. StreamView, which will offer a live, dynamic view of your analytics data as we receive it

To further enhance your targeting options, we’ve improved the connection between Firebase Analytics and other Firebase features, such as Dynamic Links and Remote Config. For example, you can now use Dynamic Links on your Facebook business page, and we can identify Facebook as a source in Firebase Analytics reporting. You can also now target Remote Config changes by User Properties, in addition to Audiences.

Build Better Games using Firebase

Firebase now has a Unity plugin!

Game developers are building great apps, and we want Firebase to work for you, too. We’ve built an entirely new plugin for Unity that supports Analytics, the Realtime Database, Authentication, Dynamic Links, Remote Config, Notifications and more. We've also expanded our C++ SDK with Realtime Database support.

Integrate Firebase Even Easier with Open-Sourced UI Library

FirebaseUI is updated to v1.0.

FirebaseUI is a library that provides common UI elements when building apps, and it’s a quick way to integrate with Firebase. FirebaseUI 1.0 includes a drop-in UI flow for Firebase Authentication, with common identity providers such as Google, Facebook, and Twitter. FirebaseUI 1.0 also added features such as client-side joins and intersections for the Realtime Database, plus integrations with Glide and SDWebImage that make downloading and displaying images from Firebase Storage a cinch. Follow our progress or contribute to our Android, iOS, and Web components on GitHub.

Learn More via Udacity and Join the Firebase Community

We want to provide the best tool for developers, but it’s also important that we give resources and training to help you get more out of the platform. As such, we’ve created a new Udacity course: Firebase in a Weekend! It’s an instructor-led video course to help all developers get up and running with Firebase on iOS and Android, in two days.

Finally, to help wrap your head around all our announcements, we’ve created a new demo app. This is an easy way to see how Analytics, Crash Reporting, Test Lab, Notifications, and Remote Config work in a live environment, without having to write a line of code.

Helping developers build better apps and successful businesses is at the core of Firebase. We work hard on it every day. We love hearing your feedback and ideas for new features and improvements—and we hope you can see from the length of this post that we take them to heart! Follow us on Twitter, join our Slack channel, participate in our Google Group, and let us know what you think. We’re excited to see what you’ll build next!

Using BigQuery and Firebase Analytics to understand your mobile app

Originally posted on Google Cloud Platform Blog

At Google I/O this May, Firebase announced a new suite of products to help developers build mobile apps. Firebase Analytics, a part of the new Firebase platform, is a tool that automatically captures data on how people are using your iOS and Android app, and lets you define your own custom app events. When the data's captured, it’s available through a dashboard in the Firebase console. One of my favorite cloud integrations with the new Firebase platform is the ability to export raw data from Firebase Analytics to Google BigQuery for custom analysis. This custom analysis is particularly useful for aggregating data from the iOS and Android versions of your app, and accessing custom parameters passed in your Firebase Analytics events. Let’s take a look at what you can do with this powerful combination.

How does the BigQuery export work?


After linking your Firebase project to BigQuery, Firebase automatically exports a new table to an associated BigQuery dataset every day. If you have both iOS and Android versions of your app, Firebase exports the data for each platform into a separate dataset. Each table contains the user activity and demographic data automatically captured by Firebase Analytics, along with any custom events you’re capturing in your app. Thus, after exporting one week’s worth of data for a cross-platform app, your BigQuery project would contain two datasets, each with seven tables.


Diving into the data


The schema for every Firebase Analytics export table is the same, and we’ve created two datasets (one for iOS and one for Android) with sample user data for you to run the example queries below. The datasets are for a sample cross-platform iOS and Android gaming app. Each dataset contains seven tables: one week’s worth of analytics data.

The following query will return some basic user demographic and device data for one day of usage on the iOS version of our app:

SELECT
  user_dim.app_info.app_instance_id,
  user_dim.device_info.device_category,
  user_dim.device_info.user_default_language,
  user_dim.device_info.platform_version,
  user_dim.device_info.device_model,
  user_dim.geo_info.country,
  user_dim.geo_info.city,
  user_dim.app_info.app_version,
  user_dim.app_info.app_store,
  user_dim.app_info.app_platform
FROM
  [firebase-analytics-sample-data:ios_dataset.app_events_20160601]

Since the schema for every BigQuery table exported from Firebase Analytics is the same, you can run any of the queries in this post on your own Firebase Analytics data by replacing the dataset and table names with the ones for your project.

The schema has user data and event data. All user data is automatically captured by Firebase Analytics, and the event data is populated by any custom events you add to your app. Let’s take a look at the specific records for both user and event data.

User data


The user records contain a unique app instance ID for each user (user_dim.app_info.app_instance_id in the schema), along with data on their location, device and app version. In the Firebase console, there are separate dashboards for the app’s Android and iOS analytics. With BigQuery, we can run a query to find out where our users are accessing our app around the world across both platforms. The query below makes use of BigQuery’s union feature, which lets you use a comma as a UNION ALL operator. Since a row is created in our table for each bundle of events a user triggers, we use EXACT_COUNT_DISTINCT to make sure each user is only counted once:

SELECT
  user_dim.geo_info.country as country,
  EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id) as users
FROM
  [firebase-analytics-sample-data:android_dataset.app_events_20160601],
  [firebase-analytics-sample-data:ios_dataset.app_events_20160601]
GROUP BY
  country
ORDER BY
  users DESC

User data also includes a user_properties record, which includes attributes you define to describe different segments of your user base, like language preference or geographic location. Firebase Analytics captures some user properties by default, and you can create up to 25 of your own.

A user’s language preference is one of the default user properties. To see which languages our users speak across platforms, we can run the following query:

SELECT
  user_dim.user_properties.value.value.string_value as language_code,
  EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id) as users
FROM
  [firebase-analytics-sample-data:android_dataset.app_events_20160601],
  [firebase-analytics-sample-data:ios_dataset.app_events_20160601]
WHERE
  user_dim.user_properties.key = "language"
GROUP BY
  language_code
ORDER BY
  users DESC

Event data


Firebase Analytics makes it easy to log custom events such as tracking item purchases or button clicks in your app. When you log an event, you pass an event name and up to 25 parameters to Firebase Analytics and it automatically tracks the number of times the event has occurred. The following query shows the number of times each event in our app has occurred on Android for a particular day:

SELECT
  event_dim.name,
  COUNT(event_dim.name) as event_count
FROM
  [firebase-analytics-sample-data:android_dataset.app_events_20160601]
GROUP BY
  event_dim.name
ORDER BY
  event_count DESC

If you have another type of value associated with an event (like item prices), you can pass it through as an optional value parameter and filter by this value in BigQuery. In our sample tables, there is a spend_virtual_currency event. We can write the following query to see how much virtual currency players spend at one time:

SELECT
  event_dim.params.value.int_value as virtual_currency_amt,
  COUNT(*) as num_times_spent
FROM
  [firebase-analytics-sample-data:android_dataset.app_events_20160601]
WHERE
  event_dim.name = "spend_virtual_currency"
  AND event_dim.params.key = "value"
GROUP BY
  1
ORDER BY
  num_times_spent DESC
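
Since the same parameter record also carries the amount, a small variation on this query (a sketch against the same sample table) sums the total virtual currency spent rather than counting spend events:

SELECT
  SUM(event_dim.params.value.int_value) as total_currency_spent
FROM
  [firebase-analytics-sample-data:android_dataset.app_events_20160601]
WHERE
  event_dim.name = "spend_virtual_currency"
  AND event_dim.params.key = "value"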

Building complex queries


What if we want to run a query across both platforms of our app over a specific date range? Since Firebase Analytics data is split into tables for each day, we can do this using BigQuery’s TABLE_DATE_RANGE function. This query returns a count of the cities users are coming from over a one week period:

SELECT
  user_dim.geo_info.city,
  COUNT(user_dim.geo_info.city) as city_count
FROM
  TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
  TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY
  user_dim.geo_info.city
ORDER BY
  city_count DESC

We can also write a query to compare mobile vs. tablet usage across platforms over a one week period:

SELECT
  user_dim.app_info.app_platform as appPlatform,
  user_dim.device_info.device_category as deviceType,
  COUNT(user_dim.device_info.device_category) AS device_type_count
FROM
  TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
  TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY
  1, 2
ORDER BY
  device_type_count DESC

Getting a bit more complex, we can write a query to generate a report of unique user events across platforms over the past week. Here we use PARTITION BY and EXACT_COUNT_DISTINCT to de-dupe our event report by users, making use of user properties and the user_dim.user_id field:

SELECT
  STRFTIME_UTC_USEC(eventTime, "%Y%m%d") as date,
  appPlatform,
  eventName,
  COUNT(*) as totalEvents,
  EXACT_COUNT_DISTINCT(IF(userId IS NOT NULL, userId, fullVisitorid)) as users
FROM (
  SELECT
    fullVisitorid,
    openTimestamp,
    FORMAT_UTC_USEC(openTimestamp) as firstOpenedTime,
    userIdSet,
    MAX(userIdSet) OVER(PARTITION BY fullVisitorid) as userId,
    appPlatform,
    eventTimestamp,
    FORMAT_UTC_USEC(eventTimestamp) as eventTime,
    eventName
  FROM FLATTEN(
    (
    SELECT
      user_dim.app_info.app_instance_id as fullVisitorid,
      user_dim.first_open_timestamp_micros as openTimestamp,
      user_dim.user_properties.value.value.string_value,
      IF(user_dim.user_properties.key = 'user_id', user_dim.user_properties.value.value.string_value, null) as userIdSet,
      user_dim.app_info.app_platform as appPlatform,
      event_dim.timestamp_micros as eventTimestamp,
      event_dim.name AS eventName,
      event_dim.params.key,
      event_dim.params.value.string_value
    FROM
      TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
      TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
    ), user_dim.user_properties)
)
GROUP BY
  date, appPlatform, eventName

If you have data in Google Analytics for the same app, it’s also possible to export your Google Analytics data to BigQuery and do a JOIN with your Firebase Analytics BigQuery tables.
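
What that JOIN looks like depends on a key shared by both exports. As a minimal sketch, assuming you log the same user ID in both products (the ga_sessions table name below is hypothetical), you could combine per-user counts like this:

SELECT
  fa.user_id as user_id,
  fa.firebase_events,
  ga.ga_sessions
FROM (
  SELECT
    user_dim.user_id as user_id,
    COUNT(*) as firebase_events
  FROM
    [firebase-analytics-sample-data:android_dataset.app_events_20160601]
  WHERE
    user_dim.user_id IS NOT NULL
  GROUP BY
    user_id) fa
JOIN (
  SELECT
    userId as user_id,
    COUNT(*) as ga_sessions
  FROM
    [your-project:ga_dataset.ga_sessions_20160601]  -- hypothetical Google Analytics export table
  WHERE
    userId IS NOT NULL
  GROUP BY
    user_id) ga
ON
  fa.user_id = ga.user_id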


Visualizing analytics data


Now that we’ve gathered new insights from our mobile app data using the raw BigQuery export, let’s visualize it using Google Data Studio. Data Studio can read directly from BigQuery tables, and we can even pass it a custom query like the ones above. Data Studio can generate many different types of charts depending on the structure of your data, including time series, bar charts, pie charts and geo maps.

For our first visualization, let’s create a bar chart to compare the device types from which users are accessing our app on each platform. We can paste the mobile vs. tablet query above directly into Data Studio to generate such a chart. From that chart, it’s easy to see that iOS users are much more likely to access our game from a tablet.

Getting a bit more complex, we can use the above event report query to create a bar chart comparing the number of events across platforms.

Check out this post for detailed instructions on connecting your BigQuery project to Data Studio.

What’s next?

If you’re new to Firebase, get started here. If you’re already building a mobile app on Firebase, check out this detailed guide on linking your Firebase project to BigQuery. For questions, take a look at the BigQuery reference docs and use the firebase-analytics and google-bigquery tags on Stack Overflow. And let me know if there are any particular topics you’d like me to cover in an upcoming post.

GitHub on BigQuery: Analyze all the code



Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
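
As a sketch of that kind of lookup against the public dataset’s sample_contents table (the import string here is just an illustration; substitute your own library), you could count matching files per repository:

SELECT
  sample_repo_name,
  COUNT(*) as matching_files
FROM
  [bigquery-public-data:github_repos.sample_contents]
WHERE
  content CONTAINS 'import requests'  -- illustrative: match your library's import line
GROUP BY
  sample_repo_name
ORDER BY
  matching_files DESC
LIMIT 10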

On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.

Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.

Stay curious!

Build a mobile gaming analytics platform

Popular mobile games can attract millions of players and generate terabytes of game-related data in a short burst of time. This places extraordinary pressure on the infrastructure powering these games and requires scalable data analytics services to provide timely, actionable insights in a cost-effective way.

To address these needs, a growing number of successful gaming companies use Google’s web-scale analytics services to create personalized experiences for their players. They use telemetry and smart instrumentation to gain insight into how players engage with the game and to answer questions like: At what game level are players stuck? What virtual goods did they buy? And what's the best way to tailor the game to appeal to both casual and hardcore players?


A new reference architecture describes how you can collect, archive and analyze vast amounts of gaming telemetry data using Google Cloud Platform’s data analytics products. The architecture demonstrates two patterns for analyzing mobile game events:

  • Batch processing: This pattern helps you process game logs and other large files in a fast, parallelized manner. For example, leading mobile gaming company DeNA moved to BigQuery from Hadoop to get faster query responses for their log file analytics pipeline. In this GDC Lightning Talk video they explain the speed benefits of Google’s analytics tools and how the team was able to process large gaming datasets without the need to manage any infrastructure.
  • Real-time processing: Use this pattern when you want to understand what's happening in the game right now. Cloud Pub/Sub and Cloud Dataflow provide a fully managed way to perform a number of data-processing tasks like data cleansing and fraud detection in real-time. For example, you can highlight a player with maximum hit-points outside the valid range. Real-time processing is also a great way to continuously update dashboards of key game metrics, like how many active users are currently logged in or which in-game items are most popular.

Some Cloud Dataflow features are especially useful in a mobile context, since messages may be delayed from the source due to mobile Internet connection issues or batteries running out. Cloud Dataflow's built-in session windowing functionality and triggers aggregate events based on the actual time they occurred (event time) rather than the time they're processed, so you can still group events by user session even when arrival from the source is delayed.

But why choose between one or the other pattern? A key benefit of this architecture is that you can write your data pipeline processing once and execute it in either batch or streaming mode without modifying your codebase. So if you start processing your logs in batch mode, you can easily move to real-time processing in the future. This is an advantage of the high-level Cloud Dataflow model that was released as open source by Google.



Cloud Dataflow loads the processed data into one or more BigQuery tables. BigQuery is built for very large scale, and allows you to run aggregation queries against petabyte-scale datasets with fast response times. This is great for interactive analysis and data exploration; for example, a simple BigQuery SQL query can dynamically create a Daily Active Users (DAU) graph using Google Cloud Datalab.
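
The query behind such a graph can be very small. As an illustrative sketch (the table and field names below are hypothetical, not part of the reference architecture), counting distinct players per day is enough to drive a DAU chart:

SELECT
  DATE(event_time) as day,
  EXACT_COUNT_DISTINCT(player_id) as dau
FROM
  [your-project:game_analytics.events]  -- hypothetical table loaded by Cloud Dataflow
GROUP BY
  day
ORDER BY
  day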


And what about player engagement and in-game dynamics? Another BigQuery example ranks the ten toughest game bosses in a bar chart. It looks like boss10 killed players more than 75% of the time, much more than the next toughest. Perhaps it would make sense to lower the strength of this boss? Or maybe give the player some more powerful weapons? The choice is yours, but with this reference architecture you'll see the results of your changes straight away (a sketch of the underlying query follows below). Review the new reference architecture to jumpstart your data-driven quest to engage your players and make your games more successful, contact us, or sign up for a free trial of Google Cloud Platform to get started.
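
As a sketch of the query behind that boss chart (again with hypothetical table and field names), you could compute the share of encounters each boss wins:

SELECT
  boss_id,
  SUM(IF(outcome = 'player_killed', 1, 0)) / COUNT(*) as kill_rate
FROM
  [your-project:game_analytics.boss_fights]  -- hypothetical table of boss encounters
GROUP BY
  boss_id
ORDER BY
  kill_rate DESC
LIMIT 10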

- Posted by Oyvind Roti, Solutions Architect

Cloud Audit Logs to help you with audit and compliance needs

Not having a full view of administrative actions in your Google Cloud Platform projects can make troubleshooting slow and challenging when an important application breaks or stops working. It can also make it difficult to monitor access to sensitive data and resources managed by your project. That’s why we created Google Cloud Audit Logs, and today they’re available in beta for App Engine and BigQuery. Cloud Audit Logs help you with your audit and compliance needs by enabling you to track the actions of administrators in your Google Cloud Platform projects. They consist of two log streams: Admin Activity and Data Access.

Admin Activity audit logs contain an entry for every administrative action or API call that modifies the configuration or metadata for the related application, service or resource, for example, adding a user to a project, deploying a new version in App Engine or creating a BigQuery dataset. You can inspect these actions across your projects on the Activity page in the Google Cloud Platform Console.


Data Access audit logs contain an entry for every one of the following events:
  • API calls that read the configuration or metadata of an application, service or resource
  • API calls that create, modify or read user-provided data managed by a service (e.g. inserting data into a dataset or launching a query in BigQuery)

Currently, only BigQuery generates a Data Access log as it manages user-provided data, but ultimately all Cloud Platform services will provide a Data Access log.

There are many additional uses of Audit Logs beyond audit and compliance needs. In particular, the BigQuery team has put together a collection of examples that show how you can use Audit Logs to better understand your utilization and spending on BigQuery. We’ll be sharing more examples in future posts.


Accessing the Logs

Both of these logs are available in Google Cloud Logging, which means that you’ll be able to view the individual log entries in the Logs Viewer as well as take advantage of the many logs management capabilities available. These include exporting the logs to Google Cloud Storage for long-term retention, streaming to BigQuery for real-time analysis, and publishing to Google Cloud Pub/Sub to enable processing via Google Cloud Dataflow. The specific content and format of the logs can be found in the Cloud Logging documentation for Audit Logs.
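
As a sketch of the kind of analysis this enables (the dataset name and date suffix below are illustrative; check the Audit Logs schema for exact field paths), once Data Access logs are streaming into BigQuery you could count completed query jobs per user:

SELECT
  protoPayload.authenticationInfo.principalEmail as user,
  COUNT(*) as completed_jobs
FROM
  [your-project:auditlog_dataset.cloudaudit_googleapis_com_data_access_20160601]  -- illustrative export table name
WHERE
  protoPayload.methodName = 'jobservice.jobcompleted'
GROUP BY
  user
ORDER BY
  completed_jobs DESC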

Audit Logs are available to you at no additional charge. Applicable charges for using other Google Cloud Platform services (such as BigQuery and Cloud Storage) as well as streaming logs to BigQuery will still apply. As we find more ways to provide greater insight into administrative actions in GCP projects, we’d love to hear your feedback. Share it here: [email protected].

Posted by Joe Corkery, Product Manager, Google Cloud Platform


BigQuery cost controls now let you set a daily maximum for query costs

Today we’re giving you better cost controls in BigQuery to help you manage your spend, along with improvements to the streaming API, a performance diagnostic tool, and a new way to capture detailed usage logs.

BigQuery is a Google-powered supercomputer that lets you derive meaningful analytics in SQL while paying only for what you use. This makes BigQuery an analytics data warehouse that’s both powerful and flexible. Those accustomed to a traditional fixed-size cluster – where cost is fixed, performance degrades with increased load, and scaling is complex – may find granular cost controls helpful in budgeting your BigQuery usage.

In addition, we’re announcing availability of BigQuery access logs in Audit Logs Beta, improvements to the Streaming API, and a number of UI enhancements. We’re also launching Query Explain to provide insight on how BigQuery executes your queries, how to optimize your queries and how to troubleshoot them.

Custom Quotas: No fear of surprise when the bill comes


Custom quotas allow you to set daily quotas that will help prevent runaway query costs. There are two ways you can set the quota:

  • Project wide: an entire BigQuery project cannot exceed the daily custom quota.
  • Per user: each individual user within a BigQuery project is subject to the daily custom quota.


Query Explain: understand and optimize your queries

Query Explain shows, stage by stage, how BigQuery executes your queries. You can now see if your queries are write-, read- or compute-heavy, and where any performance bottlenecks might be. You can use Query Explain to optimize queries, troubleshoot errors or understand if BigQuery Slots might benefit you.

In the BigQuery Web UI, use the “Explanation” button next to “Results” to see this information.

Improvements to the Streaming API

Data is most valuable when it’s fresh, but loading data into an analytics data warehouse usually takes time. BigQuery is unique among warehouses in that it can easily ingest a stream of up to 100,000 rows per second per table, available for immediate analysis. Some customers even stream 4.5 million rows per second by sharding ingest across tables. Today we’re bringing several improvements to BigQuery Streaming API.

  • Streaming API in EU locations. It’s not just for the US anymore: you may now use the Streaming API to load data into your BigQuery datasets residing in the EU.
  • Template tables are a new way to manage related tables used for streaming. A template table allows an existing table to serve as a template for a streaming insert request. The generated table will have the same schema, and be created in the same dataset and project as the template table. Better yet, when the schema of the template table is updated, the schema of the tables generated from this template will also be updated.
  • No more “warm-up” delay. After streaming the first row into a table, we no longer require a warm-up period of a couple of minutes before the table becomes available for analysis. Your data is available immediately after the first insertion.

Create a paper trail of queries with Audit Logs Beta


BigQuery Audit Logs form an audit trail of every query, every job and every action taken in your project, helping you analyze BigQuery usage and access at the project level, or down to individual users or jobs. Please note that Audit Logs is currently in Beta.

Audit Logs can be filtered in Cloud Logging, or exported back to BigQuery with one click, allowing you to analyze your usage and spend in real-time in SQL.
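
For example, a sketch of a spend analysis over exported audit logs might total billed bytes per user (the table name is illustrative, and the jobStatistics field path should be checked against the BigQuery audit log schema):

SELECT
  protoPayload.authenticationInfo.principalEmail as user,
  SUM(protoPayload.serviceData.jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POW(2, 40) as tib_billed
FROM
  [your-project:auditlog_dataset.cloudaudit_googleapis_com_data_access_20160601]  -- illustrative export table name
WHERE
  protoPayload.methodName = 'jobservice.jobcompleted'
GROUP BY
  user
ORDER BY
  tib_billed DESC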

With today’s announcements, BigQuery gives you more control and visibility. BigQuery is already very easy to use, and with recently launched products like Datalab (a data science notebook integrated with BigQuery), just about anyone in your organization can become a big data expert. If you’re new to BigQuery, take a look at the Quickstart Guide, and the first 1TB of data processed per month is on us. To fully understand the power of BigQuery, check out the documentation and feel free to ask your questions using the “google-bigquery” tag on Stack Overflow.

-Posted by Tino Tereshko, Technical Program Manager

Processing logs at scale using Cloud Dataflow

Logs generated by applications and services can provide an immense amount of information about how your deployment is running and the experiences your users are having as they interact with your products and services. But as deployments grow more complex, gleaning insights from this data becomes more challenging. Logs come from an increasing number of sources, so they can be hard to collate and query for useful information. And building, operating and maintaining your own infrastructure to analyze log data at scale requires extensive expertise in running distributed systems and storage. Today, we’re introducing a new solution paper and reference implementation that shows how you can process logs from multiple sources and extract meaningful information by using Google Cloud Platform and Google Cloud Dataflow.

Log processing typically involves some combination of the following activities:

  • Configuring applications and services
  • Collecting and capturing log files
  • Storing and managing log data
  • Processing and extracting data
  • Persisting insights

Each of those components has its own scaling and management challenges, often requiring different approaches at different times. These sorts of challenges can slow down the generation of meaningful, actionable information from your log data.

Cloud Platform provides a number of services that can help you to address these challenges. You can use Cloud Logging to collect logs from applications and services, and then store them in Google Cloud Storage buckets or stream them to Pub/Sub topics. Dataflow can read from Cloud Storage or Pub/Sub (and many more), process log data, extract and transform metadata and compute aggregations. You can persist the output from Dataflow in BigQuery, where it can be analyzed or reviewed anytime. These mechanisms are offered as managed services—meaning they can scale when needed. That also means that you don't need to worry about provisioning resources up front.
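
For instance, once the Dataflow pipeline has written its output, a sketch like the following (table and field names are hypothetical placeholders for whatever your pipeline persists) lets you review aggregated log metrics directly in BigQuery:

SELECT
  destination_page as page,
  AVG(response_time_ms) as avg_response_time_ms
FROM
  [your-project:logs_dataset.processed_logs]  -- hypothetical table written by the Dataflow job
GROUP BY
  page
ORDER BY
  avg_response_time_ms DESC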

The solution paper and reference implementation describe how you can use Dataflow to process log data from multiple sources and persist findings directly in BigQuery. You’ll learn how to configure Cloud Logging to collect logs from applications running in Container Engine, how to export those logs to Cloud Storage, and how to execute the Dataflow processing job. In addition, the solution shows you how to reconfigure Cloud Logging to use Pub/Sub to stream data directly to Dataflow, so you can process logs in real-time.


Check out the Processing Logs at Scale using Cloud Dataflow solution to learn how to combine logging, storage, processing and persistence into a scalable log-processing approach. Then take a look at the reference implementation tutorial on GitHub to deploy a complete end-to-end working example. Feedback is welcome and appreciated; comment here, submit a pull request, create an issue, or find me on Twitter @crcsmnky and let me know how I can help.

- Posted by Sandeep Parikh, Google Solutions Architect