Processing logs at scale using Cloud Dataflow

Logs generated by applications and services can provide an immense amount of information about how your deployment is running and how users experience your products and services. But as deployments grow more complex, gleaning insights from this data becomes more challenging. Logs come from an increasing number of sources, so they can be hard to collate and query for useful information, and building, operating, and maintaining your own infrastructure to analyze log data at scale requires extensive expertise in running distributed systems and storage. Today, we’re introducing a new solution paper and reference implementation that shows how you can process logs from multiple sources and extract meaningful information by using Google Cloud Platform and Google Cloud Dataflow.

Log processing typically involves some combination of the following activities:

  • Configuring applications and services
  • Collecting and capturing log files
  • Storing and managing log data
  • Processing and extracting data
  • Persisting insights

Each of these activities has its own scaling and management challenges, often requiring different approaches at different stages. Those challenges can slow down the generation of meaningful, actionable information from your log data.

Cloud Platform provides a number of services that can help you address these challenges. You can use Cloud Logging to collect logs from applications and services, and then store them in Google Cloud Storage buckets or stream them to Cloud Pub/Sub topics. Dataflow can read from Cloud Storage or Pub/Sub (among other sources), process the log data, extract and transform metadata, and compute aggregations. You can persist the output from Dataflow in BigQuery, where it can be analyzed or reviewed at any time. Because these mechanisms are offered as managed services, they scale when needed and you don’t need to worry about provisioning resources up front.
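
To make that flow concrete, here’s a minimal sketch of a batch pipeline built with the Dataflow Java SDK: it reads exported log lines from Cloud Storage, turns each line into a row, and writes the rows to BigQuery. The bucket, dataset, and table names are placeholders, and the parsing step is deliberately simplified; the reference implementation contains the complete pipeline.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class BatchLogPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Destination schema for the processed log rows.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("message").setType("STRING")));

    p.apply(TextIO.Read.from("gs://my-log-bucket/exported-logs/*"))  // placeholder bucket
     .apply(ParDo.of(new DoFn<String, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          // Parsing is simplified here; the reference implementation
          // parses structured Cloud Logging entries and extracts metadata.
          c.output(new TableRow().set("message", c.element()));
        }
      }))
     .apply(BigQueryIO.Write
        .to("my-project:logs_dataset.processed_logs")                // placeholder table
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}
```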

The solution paper and reference implementation describe how you can use Dataflow to process log data from multiple sources and persist the findings directly in BigQuery. You’ll learn how to configure Cloud Logging to collect logs from applications running in Container Engine, how to export those logs to Cloud Storage, and how to execute the Dataflow processing job. In addition, the solution shows you how to reconfigure Cloud Logging to use Pub/Sub to stream data directly to Dataflow, so you can process logs in real time.
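
As a rough sketch of that streaming variant, the pipeline shape stays the same; only the source changes and the job runs in streaming mode. The topic and table names below are again placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class StreamingLogPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // process log entries continuously as they arrive

    Pipeline p = Pipeline.create(options);

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/log-entries"))  // placeholder topic
     .apply(ParDo.of(new DoFn<String, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          // Same simplified parsing as the batch sketch above.
          c.output(new TableRow().set("message", c.element()));
        }
      }))
     .apply(BigQueryIO.Write
        .to("my-project:logs_dataset.streaming_logs")  // placeholder; table must already exist
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}
```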

Check out the Processing Logs at Scale using Cloud Dataflow solution to learn how to combine logging, storage, processing, and persistence into a scalable log-processing approach. Then take a look at the reference implementation tutorial on GitHub to deploy a complete, end-to-end working example. Feedback is welcome and appreciated; comment here, submit a pull request, create an issue, or find me on Twitter @crcsmnky and let me know how I can help.

- Posted by Sandeep Parikh, Google Solutions Architect