Tag Archives: Big Data & Analytics

What it looks like to process 3.5 million books in Google’s cloud

Today’s guest blog comes from Kalev Leetaru, founder of The GDELT Project, which monitors the world’s news media in nearly every country in over 100 languages to identify the events and narratives driving our global society.

This past September I published into Google BigQuery a massive new public dataset of metadata from 3.5 million digitized English-language books dating back more than two centuries (1800-2015), along with the full text of 1 million of these books. The archive, which draws from the English-language public domain book collections of the Internet Archive and HathiTrust, includes full publication details for every book, along with a wide array of computed content-based data. The entire archive is available as two public BigQuery datasets, and there’s a growing collection of sample queries to help users get started with the collection. You can even map two centuries of books with a single line of SQL.

What did it look like to process 3.5 million books? Data-mining and creating a public archive of 3.5 million books is an example of an application perfectly suited to the cloud, in which a large amount of specialized processing power is needed for only a brief period of time. Here are the five main steps that I took to make the invaluable learnings of millions of books more easily and speedily accessible in the cloud:

The project began with a single 8-core Google Compute Engine (GCE) instance with a 2TB SSD persistent disk that was used to download the 3.5 million books. I downloaded the books to the instance’s local disk, unzipped them, converted them into a standardized file format, and then uploaded them to Google Cloud Storage (GCS) in large batches, using the composite objects and parallel upload capability of GCS. Unlike traditional UNIX file systems, GCS performance does not degrade with large numbers of small files in a single directory, so I could upload all 3.5 million files into a common set of directories.

Figure 1: Visualization of two centuries of books

Once all books had been downloaded and stored into GCS, I launched ten 16-core High Mem (100GB RAM) GCE instances (160 cores total) to process the books, each with a 50GB persistent SSD root disk to achieve faster IO over traditional persistent disks. To launch all ten instances quickly, I launched the first instance and configured that with all of the necessary software libraries and tools, then created and used a disk snapshot to rapidly clone the other nine with just a few clicks. Each of the ten compute instances would download a batch of 100 books at a time to process from GCS.
Once the books had been processed, I uploaded back into GCS all of the computed metadata. In this way, GCS served as a central storage fabric connecting the compute nodes. Remarkably, even in worst-case scenarios when all 160 processors were either downloading new batches of books from GCS or uploading output files back to GCS in parallel, there was no measurable performance degradation.
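The batch-of-100 work pattern described above can be sketched roughly as follows. This is only an illustration, not the code used in the project: the function names `make_batches` and `assign_batches`, the round-robin assignment and the book IDs are all hypothetical, and the actual GCS download and upload steps are left out.

```python
from typing import Dict, Iterator, List

BATCH_SIZE = 100   # books fetched per batch, as described in the post
NUM_WORKERS = 10   # ten compute instances


def make_batches(book_ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` book IDs."""
    for start in range(0, len(book_ids), size):
        yield book_ids[start:start + size]


def assign_batches(
    book_ids: List[str], workers: int = NUM_WORKERS, size: int = BATCH_SIZE
) -> Dict[int, List[List[str]]]:
    """Assign batches to workers round-robin (an assumed scheduling policy)."""
    plan: Dict[int, List[List[str]]] = {w: [] for w in range(workers)}
    for index, batch in enumerate(make_batches(book_ids, size)):
        plan[index % workers].append(batch)
    return plan


if __name__ == "__main__":
    # Hypothetical IDs; the real collection has about 3.5 million entries.
    ids = [f"book-{n:07d}" for n in range(2_550)]
    plan = assign_batches(ids)
    print({worker: len(batches) for worker, batches in plan.items()})
```

Because each batch is independent, any worker can process any batch, which is what lets shared storage act as the only coordination point between nodes.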
With the books processed, I deleted the ten compute instances and launched a single 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks. I used this to reassemble the 3.5 million per-book output files into single output files, tab-delimited with data available for each year, merging in publication metadata and other information about each book. Disk IO of more than 750MB/s was observed on this machine.
I then uploaded the final per-year output files to a public GCS directory with web downloading enabled, allowing the public to download the files.
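The reassembly step, in which per-book outputs are merged into tab-delimited per-year files, might look something like the sketch below. The record layout `(book_id, year, value)` and the function names are assumptions made for illustration; the post does not describe the actual file formats.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, str]  # hypothetical: (book_id, publication_year, computed_value)


def group_by_year(records: Iterable[Record]) -> Dict[int, List[Tuple[str, str]]]:
    """Group per-book records by publication year, preserving input order."""
    by_year: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for book_id, year, value in records:
        by_year[year].append((book_id, value))
    return dict(by_year)


def to_tsv(rows: List[Tuple[str, str]]) -> str:
    """Serialize one year's rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [("b1", 1800, "x"), ("b2", 1801, "y"), ("b3", 1800, "z")]
    for year, rows in sorted(group_by_year(sample).items()):
        print(year, repr(to_tsv(rows)))
```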

Since very few researchers have the bandwidth, local storage or computing power to process even just the metadata of 3.5 million books, the entire collection was uploaded into Google BigQuery as a public dataset. Using standard SQL queries, you can explore the entire collection in tens of seconds at speeds of up to 45.5GB/s and perform complex analyses entirely in-database.

The entire project, from start to finish, took less than two weeks, a good portion of which consisted of human verification for issues with the publication metadata. This is significant because previous attempts to process even a subset of the collection on a modern HPC supercluster had taken over one month and completed only a fraction of the number of books examined here. The limiting factor was always the movement of data: transferring terabytes of books and their computed metadata across hundreds of processors.

This is where Google’s cloud offerings shine, seemingly purpose-built for data-first computing. In just two weeks, I was able to process 3.5 million books, spinning up a cluster of 160 cores and 1TB of RAM, followed by a single machine with 32 cores, 200GB of RAM, 10TB of SSD disk and 1TB of direct-attached scratch SSD disk. I was able to make the final results publicly accessible through BigQuery at query speeds of up to 45.5GB/s.

You can access the entire collection today in BigQuery, explore sample queries, and read more technical detail about the processing pipeline on the GDELT Blog.

I’d like to thank Google, Clemson University, the Internet Archive, HathiTrust, and OCLC for making this project possible, along with all of the contributing libraries and digitization sponsors that have made these digitized books available.

- Posted by Kalev Leetaru, founder of The GDELT Project

Source: Google Cloud Platform Blog

Dataflow and open source – proposal to join the Apache Incubator

Imagine if every time you upgrade your servers you had to learn a new programming framework and rewrite all your applications. That might sound crazy, but it’s what happens with big data pipelines.

It wasn't long ago that Apache Hadoop MapReduce was the obvious engine for all things big data. Then Apache Spark came along and, more recently, Apache Flink, a streaming-native engine. Unlike upgrading hardware, adopting these more modern engines has generally required rewriting pipelines to use engine-specific APIs, often with different implementations for streaming and batch scenarios. This can mean throwing away user code that had just been weathered enough to be considered (mostly) bug-free, and replacing it with immature new code. All of this just because the data pipelines needed to scale better, or have lower latency, or run more cheaply, or complete faster.

Adjusting such aspects should not require throwing away well-tested business logic. You should be able to move your application or data pipeline to the appropriate engine, or to the appropriate environment (e.g., from on-prem to cloud), while keeping the business logic intact. But to do this, two conditions need to be met. First, you need a portable SDK, which can produce programs that can execute on one of many pluggable execution environments. Second, that SDK has to expose a programming model whose semantics are focused on your workload and not on the capabilities of the underlying engine. For example, MapReduce as a programming model doesn’t fit the bill (even though MapReduce as an execution method might be appropriate in some cases) because it cannot productively express low-latency computations.

Google designed Dataflow specifically to address both of these issues. The Dataflow Java SDK has been architected to support pluggable “runners” to connect to execution engines, of which four currently exist: data Artisans created one for Apache Flink, Cloudera did it for Apache Spark, and Google implemented a single-node local execution runner as well as one for Google’s hosted Cloud Dataflow service.

That portability is possible because the Dataflow programming model is focused on real-life streaming semantics, like real event time (as opposed to the time at which the event arrives), and real sessions (as opposed to whatever arbitrary boundary the batch cycle imposes). This allows Dataflow programs to execute in either batch or stream mode as needed, and to switch from one pluggable execution engine to the other without needing to be rewritten.
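To make the event-time distinction concrete, here is a small sketch that counts the same events in fixed one-minute windows twice: once by when each event happened and once by when it arrived. It is plain Python rather than the Dataflow SDK, and the `(event_time, arrival_time)` tuple layout and function names are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[int, int]  # hypothetical: (event_time_sec, arrival_time_sec)


def window_start(timestamp: int, size: int) -> int:
    """Return the start of the fixed window containing `timestamp`."""
    return timestamp - (timestamp % size)


def count_by_event_time(events: Iterable[Event], size: int = 60) -> Counter:
    """Count events per window using the time each event actually occurred."""
    return Counter(window_start(event_time, size) for event_time, _ in events)


def count_by_arrival_time(events: Iterable[Event], size: int = 60) -> Counter:
    """Count events per window using the time each event was received."""
    return Counter(window_start(arrival, size) for _, arrival in events)


if __name__ == "__main__":
    # The third event happened at t=58 but arrived over a minute late.
    events = [(5, 6), (30, 31), (58, 125), (70, 71)]
    print(dict(count_by_event_time(events)))    # {0: 3, 60: 1}
    print(dict(count_by_arrival_time(events)))  # {0: 2, 120: 1, 60: 1}
```

Grouping by event time places the late event in the window where it belongs, so the answer does not depend on delivery delays or on whether the data is processed as a batch or a stream.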

Today we’re taking another step in this collaboration. Along with participants from Cloudera, data Artisans, Talend, Cask and PayPal, we sent a proposal for Dataflow to become an Apache Software Foundation (ASF) Incubator project. In this proposal the Dataflow model, Java SDK, and runners will be bundled into one incubating project with the Python SDK joining the project in the future. We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally. Google Cloud Dataflow will remain as a “no-ops” managed service to execute Dataflow pipelines quickly and cost-effectively in Google Cloud Platform.

With Dataflow, you can write one portable data pipeline, which can be used for either batch or stream, and executed in a number of runtimes including Flink, Spark, Google Cloud Dataflow or the local direct pipeline.
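The idea of one pipeline definition running on interchangeable runners can be illustrated with a toy model. Everything below (`Pipeline`, `InMemoryRunner`, `ChunkedRunner`) is invented for this sketch and is not the Dataflow SDK's API; it only shows the separation between what the pipeline computes and how a runner executes it.

```python
from typing import Callable, Iterable, List


class Pipeline:
    """A toy pipeline: an ordered list of per-element functions."""

    def __init__(self) -> None:
        self.steps: List[Callable[[int], int]] = []

    def apply(self, step: Callable[[int], int]) -> "Pipeline":
        self.steps.append(step)
        return self


def run_element(pipeline: Pipeline, value: int) -> int:
    """Push a single element through every step in order."""
    for step in pipeline.steps:
        value = step(value)
    return value


class InMemoryRunner:
    """Hypothetical runner that processes all elements in one pass."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        return [run_element(pipeline, x) for x in data]


class ChunkedRunner:
    """Hypothetical runner that processes the input in small chunks."""

    def __init__(self, chunk_size: int = 2) -> None:
        self.chunk_size = chunk_size

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        items = list(data)
        results: List[int] = []
        for start in range(0, len(items), self.chunk_size):
            chunk = items[start:start + self.chunk_size]
            results.extend(run_element(pipeline, x) for x in chunk)
        return results


if __name__ == "__main__":
    pipeline = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
    for runner in (InMemoryRunner(), ChunkedRunner()):
        print(type(runner).__name__, runner.run(pipeline, [1, 2, 3]))
```

The pipeline object never references a runner, so swapping execution strategies requires no change to the business logic.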

We're excited to propose Dataflow as an Apache Incubator project because we believe the Dataflow model, SDK and runners offer a number of unique features in the open-source data space.

Pipeline first, runtime second – With the Dataflow model and SDKs, you focus first on defining your data pipelines, not how they'll run or the characteristics of the particular runner executing them.
Portability – Data pipelines are portable across a number of runtime engines. You can choose a runtime based on any number of considerations, such as performance, cost or scalability.
Unified model – Batch and streaming are integrated into a unified model with powerful semantics, such as windowing, ordering and triggering.
Development tooling – The Dataflow SDK contains the tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools.

To understand the power of the Dataflow model, we recommend this article on the O’Reilly Radar: The World Beyond Batch: Streaming 102. For more information about Dataflow, you can also:

Watch the Dataflow overview presentation from the 2015 @Scale Conference
Take a look at the Dataflow Java SDK GitHub repository, which would be moved to the Apache Software Foundation as a part of our proposal
Read the Dataflow model VLDB paper, which provides a detailed overview of the Dataflow model

We're grateful to the Apache Software Foundation and community for their consideration of the Dataflow proposal and look forward to actively participating in open development of Dataflow.

- Posted by Frances Perry (Software Engineer) and James Malone (Product Manager)

Source: Google Cloud Platform Blog

Build a mobile gaming analytics platform

Popular mobile games can attract millions of players and generate terabytes of game-related data in a short burst of time. This places extraordinary pressure on the infrastructure powering these games and requires scalable data analytics services to provide timely, actionable insights in a cost-effective way.

To address these needs, a growing number of successful gaming companies use Google’s web-scale analytics services to create personalized experiences for their players. They use telemetry and smart instrumentation to gain insight into how players engage with the game and to answer questions like: At what game level are players stuck? What virtual goods did they buy? And what's the best way to tailor the game to appeal to both casual and hardcore players?

A new reference architecture describes how you can collect, archive and analyze vast amounts of gaming telemetry data using Google Cloud Platform’s data analytics products. The architecture demonstrates two patterns for analyzing mobile game events:

Batch processing: This pattern helps you process game logs and other large files in a fast, parallelized manner. For example, leading mobile gaming company DeNA moved to BigQuery from Hadoop to get faster query responses for their log file analytics pipeline. In this GDC Lightning Talk video they explain the speed benefits of Google’s analytics tools and how the team was able to process large gaming datasets without the need to manage any infrastructure.

Real-time processing: Use this pattern when you want to understand what's happening in the game right now. Cloud Pub/Sub and Cloud Dataflow provide a fully managed way to perform a number of data-processing tasks like data cleansing and fraud detection in real-time. For example, you can highlight a player with maximum hit-points outside the valid range. Real-time processing is also a great way to continuously update dashboards of key game metrics, like how many active users are currently logged in or which in-game items are most popular.
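As a rough illustration of the hit-points check mentioned above, the filter below flags events whose maximum hit-points fall outside an allowed range. The field names and the range itself are assumptions; in the reference architecture this kind of logic would run inside a Cloud Dataflow pipeline rather than a plain Python generator.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical valid range; a real game would define its own limits.
VALID_HP_RANGE: Tuple[int, int] = (0, 1000)


def flag_suspicious(
    events: Iterable[Dict[str, object]],
    valid_range: Tuple[int, int] = VALID_HP_RANGE,
) -> Iterator[Dict[str, object]]:
    """Yield only the events whose max_hit_points is outside the valid range."""
    low, high = valid_range
    for event in events:
        if not low <= int(event["max_hit_points"]) <= high:
            yield event


if __name__ == "__main__":
    stream = [
        {"player": "p1", "max_hit_points": 850},
        {"player": "p2", "max_hit_points": 99999},
        {"player": "p3", "max_hit_points": -5},
    ]
    for suspicious in flag_suspicious(stream):
        print(suspicious["player"])
```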

Some Cloud Dataflow features are especially useful in a mobile context since messages may be delayed from the source due to mobile Internet connection issues or batteries running out. Cloud Dataflow's built-in session windowing functionality and triggers aggregate events based on the actual time they occurred (event time) as opposed to the time they're processed so that you can still group events together by user session even if there's a delay from the source.
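Session windowing by event time can be approximated in a few lines of plain Python. This is a batch-style sketch of the concept only, not Cloud Dataflow's implementation: the ten-minute gap, the `(user, event_time)` layout and the function name `sessionize` are assumptions, and triggers are not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sessionize(
    events: Iterable[Tuple[str, int]], gap: int = 600
) -> Dict[str, List[List[int]]]:
    """Group each user's event times into sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds. Arrival order does not matter because
    events are sorted by the time they occurred.
    """
    per_user: Dict[str, List[int]] = defaultdict(list)
    for user, event_time in events:
        per_user[user].append(event_time)

    sessions: Dict[str, List[List[int]]] = {}
    for user, times in per_user.items():
        times.sort()
        user_sessions = [[times[0]]]
        for t in times[1:]:
            if t - user_sessions[-1][-1] <= gap:
                user_sessions[-1].append(t)
            else:
                user_sessions.append([t])
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    # The event at t=400 arrives after the one at t=1500, e.g. due to a
    # flaky mobile connection, but still lands in the first session.
    late = [("ann", 100), ("ann", 1500), ("ann", 400), ("bob", 50)]
    print(sessionize(late))
```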

But why choose between one or the other pattern? A key benefit of this architecture is that you can write your data pipeline processing once and execute it in either batch or streaming mode without modifying your codebase. So if you start processing your logs in batch mode, you can easily move to real-time processing in the future. This is an advantage of the high-level Cloud Dataflow model that was released as open source by Google.

Cloud Dataflow loads the processed data into one or more BigQuery tables. BigQuery is built for very large scale, and allows you to run aggregation queries against petabyte-scale datasets with fast response times. This is great for interactive analysis and data exploration, like the example screenshot above, where a simple BigQuery SQL query dynamically creates a Daily Active Users (DAU) graph using Google Cloud Datalab.
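For readers who want to see what a Daily Active Users calculation involves, here is the same aggregation sketched in plain Python instead of BigQuery SQL. The `(user_id, unix_timestamp)` event layout is an assumption, and days are bucketed in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Set, Tuple


def daily_active_users(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Count distinct users per UTC calendar day.

    `events` is an iterable of (user_id, unix_timestamp_seconds) pairs.
    """
    users_by_day: Dict[str, Set[str]] = defaultdict(set)
    for user_id, timestamp in events:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in sorted(users_by_day.items())}


if __name__ == "__main__":
    sample = [("a", 0), ("b", 10), ("a", 20), ("a", 86_400)]
    print(daily_active_users(sample))
```

In SQL this corresponds to counting distinct users grouped by day; the set-per-day structure above plays the role of the distinct count.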

And what about player engagement and in-game dynamics? The BigQuery example above shows a bar chart of the ten toughest game bosses. It looks like boss10 killed players more than 75% of the time, much more than the next toughest. Perhaps it would make sense to lower the strength of this boss? Or maybe give the player some more powerful weapons? The choice is yours, but with this reference architecture you'll see the results of your changes straight away. Review the new reference architecture to jumpstart your data-driven quest to engage your players and make your games more successful, contact us, or sign up for a free trial of Google Cloud Platform to get started.

- Posted by Oyvind Roti, Solutions Architect

Source: Google Cloud Platform Blog

Meeting the challenge of financial data transformation

Today’s guest post comes from Salvatore Sferrazza and Sebastian Just from FIS Global, an international provider of financial services and technology solutions. Salvatore and Sebastian tell us how Google Cloud Dataflow transforms fluctuating, large-scale financial services data so that it can be accurately captured and moved across systems.

Much software development in the capital markets (and enterprise systems in general) revolves around the transformation, enrichment and movement of data from one system to another. The unpredictable nature of financial market data volumes, often driven by volatility, exacerbates the pain of scaling and posting data when and where it’s needed for daily trade reconciliation, settlement and regulatory reporting. The implications of technology missteps within such crucial business processes range from missed business opportunities to undesired risk exposure to regulatory non-compliance. These activities must be relentlessly predictable, repeatable and measurable to yield maximum value to stakeholders.

While developers rely on the Extract, Transform and Load (ETL) activities that are so crucial to processing data, they now face limits on the speed and efficiency of ETL as the number of transactions grows faster than they can process them. As shortened settlement durations and the Consolidated Audit Trail (CAT) loom on the horizon, financial services institutions need simple, fast and powerful approaches to quickly scale and ultimately mitigate time-sensitive risks and operational costs.

Traditionally, developers have considered the activities around ETL data an unglamorous yet necessary dimension of building software products for encapsulating functions that are core to every tier of computing. So when data-driven enterprises are tasked with harvesting insights from massive data sets, it’s quite likely that ETL, in one form or another, is lurking nearby. But in today’s world, data can come from anywhere and in any format, creating a series of labor, time and intellectual challenges. While there may be hundreds of ways to solve the problem, few provide the efficiency and effectiveness so needed in our “big data” world — until recently.

The Google Cloud Dataflow service and its associated software development kit (SDK) provide a series of powerful tools for a myriad of data transformation duties. Designed to perform data processing tasks of any size in a managed services environment, Google Cloud Dataflow simplifies the mechanics of large-scale transformation and supports both batch and stream processing using the same programming model. In our latest white paper, we introduce some of the main concepts behind building and running applications that use Dataflow, then get “hands on” with a job to transform and ingest options market symbol data before storing the transformations within a Google BigQuery data set.

In short, Google Cloud Dataflow allows you to focus on data processing tasks and not cluster management. Rather than asking you to guess the right cluster size, Dataflow automatically scales up or down horizontally as much as needed for your exact processing requirements. This includes scaling all the way down to zero when there is no work, so you’re never paying for an idle cluster. Dataflow also alleviates the pain of writing ETL jobs by standardizing the process of implementing application requirements. As a result, you’ll be able to focus on the data transformations you need to make rather than on the processing mechanics themselves. This not only provides greater flexibility, lower latency and enhanced control of ETL jobs; it offers built-in cost management and ties together other useful Google Cloud services. Beyond common ETL, Dataflow pipelines may also include inline computation ranging from simple counting to highly complex, multi-step analysis. In our experience with the service so far, it can potentially remove much of the work from engineers within financial institutions and regulatory organizations, while providing elasticity to the entire process and ensuring accuracy, scale, performance and cost efficiency.

As market volatility and reporting requirements drive the need for accuracy, low latency and risk reduction, transforming and interpreting market data in a big data world is imperative to trading efficiency and accessibility. Every second counts. With a more cost-effective, real-time and scalable method of processing an ever-increasing volume of data, financial institutions will be able to address specific requirements and volumes at hand while keeping up with the demands of a rapidly evolving global financial system. We hope our experience, as captured in the technical white paper, will prove useful to others in their quest for a more effective way to process data.

Please see this paper’s GitHub page for the complete and buildable project source code.

- Posted by Salvatore Sferrazza, Principal at FIS and Sebastian Just, Manager at FIS

Source: Google Cloud Platform Blog

BigQuery cost controls now let you set a daily maximum for query costs

Today we’re giving you better cost controls in BigQuery to help you manage your spend, along with improvements to the streaming API, a performance diagnostic tool, and a new way to capture detailed usage logs.

BigQuery is a Google-powered supercomputer that lets you derive meaningful analytics in SQL while paying only for what you use. This makes BigQuery an analytics data warehouse that’s both powerful and flexible. If you’re accustomed to a traditional fixed-size cluster – where cost is fixed, performance degrades with increased load, and scaling is complex – you may find granular cost controls helpful in budgeting your BigQuery usage.

In addition, we’re announcing availability of BigQuery access logs in Audit Logs Beta, improvements to the Streaming API, and a number of UI enhancements. We’re also launching Query Explain to provide insight on how BigQuery executes your queries, how to optimize your queries and how to troubleshoot them.

Custom Quotas: No fear of surprise when the bill comes

Custom quotas allow you to set daily quotas that will help prevent runaway query costs. There are two ways you can set the quota:

Project wide: an entire BigQuery project cannot exceed the daily custom quota.
Per user: each individual user within a BigQuery project is subject to the daily custom quota.
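The difference between the two quota scopes can be shown with a small model. This is not BigQuery's implementation or API: the class `DailyQuota`, the use of bytes as the unit and the decision to enforce both limits at once are assumptions made purely to illustrate the behavior.

```python
from collections import defaultdict
from typing import Dict


class DailyQuota:
    """Toy model of a project-wide and a per-user daily quota."""

    def __init__(self, project_limit: int, per_user_limit: int) -> None:
        self.project_limit = project_limit
        self.per_user_limit = per_user_limit
        self.project_used = 0
        self.user_used: Dict[str, int] = defaultdict(int)

    def try_charge(self, user: str, amount: int) -> bool:
        """Record usage and return True only if both limits still hold."""
        if self.project_used + amount > self.project_limit:
            return False  # the whole project has hit its daily cap
        if self.user_used[user] + amount > self.per_user_limit:
            return False  # this user has hit their individual daily cap
        self.project_used += amount
        self.user_used[user] += amount
        return True


if __name__ == "__main__":
    quota = DailyQuota(project_limit=100, per_user_limit=60)
    print(quota.try_charge("alice", 50))  # True
    print(quota.try_charge("alice", 20))  # False: alice would exceed 60
    print(quota.try_charge("bob", 50))    # True: project total is now 100
    print(quota.try_charge("bob", 1))     # False: project would exceed 100
```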

Query Explain: understand and optimize your queries

Query Explain shows, stage by stage, how BigQuery executes your queries. You can now see if your queries are write, read or compute heavy, and where any performance bottlenecks might be. You can use Query Explain to optimize queries, troubleshoot errors or understand if BigQuery Slots might benefit you.

In the BigQuery Web UI, use the “Explanation” button next to “Results” to see this information.

Improvements to the Streaming API

Data is most valuable when it’s fresh, but loading data into an analytics data warehouse usually takes time. BigQuery is unique among warehouses in that it can easily ingest a stream of up to 100,000 rows per second per table, available for immediate analysis. Some customers even stream 4.5 million rows per second by sharding ingest across tables. Today we’re bringing several improvements to BigQuery Streaming API.

Streaming API in EU locations. It’s not just for the US anymore: you may now use the Streaming API to load data into your BigQuery datasets residing in EU.
Template tables are a new way to manage related tables used for streaming. They allow an existing table to serve as a template for a streaming insert request. The generated table will have the same schema, and be created in the same dataset and project as the template table. Better yet, when the schema of the template table is updated, the schema of the tables generated from this template will also be updated.
No more “warm-up” delay. After streaming the first row into a table, we no longer require a warm-up period of a couple of minutes before the table becomes available for analysis. Your data is available immediately after the first insertion.
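The sharded-ingest approach mentioned at the start of this section comes down to simple arithmetic plus a stable way of routing rows to tables. The sketch below works through it using the 100,000 rows-per-second figure quoted above; the table naming scheme and the CRC32-based routing are assumptions, not something the Streaming API prescribes.

```python
import zlib

PER_TABLE_ROWS_PER_SEC = 100_000  # per-table streaming rate quoted in the post


def tables_needed(target_rows_per_sec: int, per_table: int = PER_TABLE_ROWS_PER_SEC) -> int:
    """Return the minimum number of tables needed to sustain the target rate."""
    return -(-target_rows_per_sec // per_table)  # integer ceiling division


def shard_table(base_name: str, row_key: str, num_shards: int) -> str:
    """Pick a shard table for a row using a stable hash of its key."""
    shard = zlib.crc32(row_key.encode("utf-8")) % num_shards
    return f"{base_name}_{shard:03d}"


if __name__ == "__main__":
    shards = tables_needed(4_500_000)
    print(shards)  # 45
    print(shard_table("events", "player-42", shards))
```

A stable hash keeps all rows for a given key in the same table, which can simplify later queries that need to look at a single key's history.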

Create a paper trail of queries with Audit Logs Beta

BigQuery Audit Logs form an audit trail of every query, every job and every action taken in your project, helping you analyze BigQuery usage and access at the project level, or down to individual users or jobs. Please note that Audit Logs is currently in Beta.

Audit Logs can be filtered in Cloud Logging, or exported back to BigQuery with one click, allowing you to analyze your usage and spend in real-time in SQL.

With today’s announcements, BigQuery gives you more control and visibility. BigQuery is already very easy to use, and with recently launched products like Datalab (a data science notebook integrated with BigQuery), just about anyone in your organization can become a big data expert. If you’re new to BigQuery, take a look at the Quickstart Guide, and the first 1TB of data processed per month is on us. To fully understand the power of BigQuery, check out the documentation and feel free to ask your questions using the “google-bigquery” tag on Stack Overflow.

- Posted by Tino Tereshko, Technical Program Manager

Source: Google Cloud Platform Blog

The next generation of managed MySQL offerings on Cloud SQL

Google Cloud SQL is an easy-to-use service that delivers fully managed MySQL databases. It lets you hand off to Google the mundane but necessary and often time-consuming tasks (such as applying patches and updates, managing backups and configuring replication) so you can focus on building great applications. And because we use vanilla MySQL, it’s easy to connect from just about any application, anywhere.

The first generation of Cloud SQL was launched in October 2011 and has helped thousands of developers and companies build applications. As Compute Engine and Persistent Disk have made great advancements since their launch, the second generation of Cloud SQL builds on their innovation to deliver an even better, more performant MySQL solution at a better price/performance ratio. We’re excited to announce the beta availability of the second generation of Cloud SQL — a new and improved Cloud SQL for Google Cloud Platform.

Speed, more speed and scalability

The two principal goals of the second generation of Cloud SQL are: better performance and scalability per dollar. The performance graph below speaks for itself. Second generation Cloud SQL is more than seven times faster than the first generation of Cloud SQL. And it scales to 10TB of data, 15,000 IOPS and 104GB of RAM per instance — well beyond the first generation.

Source: Google internal testing

Yoga for your database (Cloud SQL is flexible)

Cloud users appreciate flexibility. And while flexibility is not a word frequently associated with relational databases, with Cloud SQL we’ve changed that. Flexibility means easily scaling a database up and down. For example, a database that’s growing in size and number of queries per day might require more CPU cores and RAM. A Cloud SQL instance can be changed to allocate additional resources to the database with minimal downtime. Scaling down is just as easy.

Flexibility means easily connecting to your database from any client with Internet access, including Compute Engine, Managed VMs, Container Engine and your workstation. Connectivity from App Engine is only offered for Cloud SQL First Generation right now, but that will change soon. Because we embrace open standards by supporting MySQL Wire Protocol, the standard connection protocol for MySQL databases, you can access your managed Cloud SQL database from just about any application, running anywhere. For example:

Use all your favorite tools, such as MySQL Workbench, Toad and the MySQL command-line tool to manage your Cloud SQL instances
Get low latency connections from applications running on Compute Engine and Managed VMs
Use standard drivers, such as Connector/J, Connector/ODBC, and Connector/NET, making it exceptionally easy to access Cloud SQL from most applications

Flexibility also means easily starting and stopping databases. Many databases must run 24x7, but some are used only occasionally for brief or infrequent tasks. Cloud SQL can be managed using the Cloud Console (our browser-based administration console), the gcloud command-line tool (part of the Google Cloud SDK) or a RESTful API. The command line interface (CLI) and API make Cloud SQL administration scriptable and help users maximize their budgets by running their databases only when they’re needed.

The graph below shows the number of active Cloud SQL database instances running over time. Notice the clusters of five sawtooth-like ridges and then a drop for two additional ridges. These clusters show an increased number of databases running during business hours on Monday through Friday each week. Database activity, measured by the number of active databases, falls outside of business hours, especially on the weekends. This repeated rise and fall of database instances is a great example of flexibility. Its magnitude is helped significantly by first generation Cloud SQL’s ability to automatically sleep when it is not being accessed. While this is not a design goal of the second generation of Cloud SQL, users can quickly create and delete, or start and stop databases that only need to run on occasion. Cloud SQL users get the most from their budget because of the service’s flexibility.

What is a "managed" MySQL database?

Cloud SQL delivers fully managed MySQL databases, but what does that really mean? It means Google will apply patches and updates to MySQL, manage your backups, configure replication and provide automatic failover for High Availability (HA) in the event of a zone outage. It also means that you get Google’s operational expertise for your MySQL database. Google’s team of MySQL experts makes configuring replication and automatic failover a breeze, so your data is protected and available. They also patch your database when important security updates are delivered. You choose when (day and time of week) the updates should be applied, and Google’s team takes care of the rest. This, combined with Cloud SQL’s automatic encryption of database tables, temporary files and backups, helps keep your data secure.

High Availability, replication and backups are configurable, so you can choose what's appropriate for each of your database instances. For development instances, you can choose to opt out of replication and automatic failover, while your production instances are fully protected. Even though we manage the database, you’re still in control.

Pricing: commitment issues

Getting the best Cloud SQL price doesn’t require you to commit to a one- or three-year contract. To get the best Cloud SQL price, just run your database 24x7 for the month. That’s it. If you use a database infrequently, you’ll be charged by the minute at the standard price. But there’s no need to decide upfront and Google helps find savings for you. No commitment, no strings attached. As a bonus, everyone gets the 100% sustained use discount during Beta, regardless of usage.

Ready to get started?

If you haven’t signed up for Google Cloud Platform, do so now and get a $300 credit to test drive Cloud SQL. The second generation Cloud SQL has inexpensive micro instances for small applications, and easily scales up and out to serve performance-intensive applications.

You can also take advantage of our growing partner ecosystem and tools to make working in Cloud SQL even easier. We’ve partnered with Talend, Attunity, Dbvisit and xPlenty to help you streamline the process of loading your data into Cloud SQL and with analytics products Tableau, Looker, YellowFin and Bime so you can easily create rich visualizations for meaningful insights. We’ve also integrated with ScaleArc and WebYog to help you monitor and manage your database and have partnered with service providers like Pythian, so you can have expert support during your Cloud SQL implementations. Reach out to any of our partners if you need help getting up and running.

Bottom Line

Cloud SQL Second Generation makes what customers love about Cloud SQL First Generation faster and more scalable, at a better price per performance.

- Posted by Brett Hesterberg, Product Manager, Google Cloud Platform

Source: Google Cloud Platform Blog

Processing logs at scale using Cloud Dataflow

Logs generated by applications and services can provide an immense amount of information about how your deployment is running and the experiences your users are having as they interact with the products and services. But as deployments grow more complex, gleaning insights from this data becomes more challenging. Logs come from an increasing number of sources, so they can be hard to collate and query for useful information. And building, operating and maintaining your own infrastructure to analyze log data at scale requires extensive expertise in running distributed systems and storage. Today, we’re introducing a new solution paper and reference implementation that will show how you can process logs from multiple sources and extract meaningful information by using Google Cloud Platform and Google Cloud Dataflow.

Log processing typically involves some combination of the following activities:

Configuring applications and services
Collecting and capturing log files
Storing and managing log data
Processing and extracting data
Persisting insights

Each of those components has its own scaling and management challenges, often using different approaches at different times. These sorts of challenges can slow down the generation of meaningful, actionable information from your log data.

Cloud Platform provides a number of services that can help you to address these challenges. You can use Cloud Logging to collect logs from applications and services, and then store them in Google Cloud Storage buckets or stream them to Pub/Sub topics. Dataflow can read from Cloud Storage or Pub/Sub (and many more), process log data, extract and transform metadata and compute aggregations. You can persist the output from Dataflow in BigQuery, where it can be analyzed or reviewed anytime. These mechanisms are offered as managed services—meaning they can scale when needed. That also means that you don't need to worry about provisioning resources up front.

The solution paper and reference implementation describe how you can use Dataflow to process log data from multiple sources and persist findings directly in BigQuery. You’ll learn how to configure Cloud Logging to collect logs from applications running in Container Engine, how to export those logs to Cloud Storage, and how to execute the Dataflow processing job. In addition, the solution shows you how to reconfigure Cloud Logging to use Pub/Sub to stream data directly to Dataflow, so you can process logs in real-time.
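The "process, extract and aggregate" stages can be pictured with a small standalone example. It is not taken from the reference implementation: the log line format, the regular expression and the function names are all assumptions, and a real deployment would run equivalent logic inside a Dataflow pipeline and write the results to BigQuery.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Hypothetical log format: "<timestamp> <METHOD> <path> <status> <latency_ms>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<latency_ms>\d+)$"
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Extract structured fields from one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record: Dict[str, object] = dict(match.groupdict())
    record["status"] = int(match.group("status"))
    record["latency_ms"] = int(match.group("latency_ms"))
    return record


def average_latency_by_path(lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate mean latency per request path, skipping unparseable lines."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [sum, count]
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        bucket = totals[str(record["path"])]
        bucket[0] += int(record["latency_ms"])
        bucket[1] += 1
    return {path: total / count for path, (total, count) in totals.items()}


if __name__ == "__main__":
    logs = [
        "2016-01-01T00:00:00Z GET /a 200 10",
        "2016-01-01T00:00:01Z GET /a 500 30",
        "not a log line",
        "2016-01-01T00:00:02Z POST /b 200 5",
    ]
    print(average_latency_by_path(logs))
```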

Check out the Processing Logs at Scale using Cloud Dataflow solution to learn how to combine logging, storage, processing and persistence into a scalable log processing approach. Then take a look at the reference implementation tutorial on GitHub to deploy a complete end-to-end working example. Feedback is welcome and appreciated; comment here, submit a pull request, create an issue, or find me on Twitter @crcsmnky and let me know how I can help.

- Posted by Sandeep Parikh, Google Solutions Architect
