Sometimes, when doing a roundup of the week’s news, no clear theme emerges, and you’re left with a disjointed list of unrelated tidbits. That wasn’t a problem this week; both on this blog and in the Google Cloud Platform world at large, people had big data and analytics on the brain.
The week started out with a bang, with big data consultancy Mammoth Data releasing the results of a benchmark test comparing Google Cloud Dataflow with Apache Spark. Google’s data processing service did really well, outperforming Spark by two to five times, depending on the number of cores in the test.
Cloud Dataflow is a paid service, of course, but the platform’s API was recently accepted as an incubator project at the Apache Software Foundation, under the name Apache Beam. The rationale, according to Tyler Akidau, Google staff software engineer for Apache Beam, is to “provide the world with an easy-to-use, but powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms.” You can read Tyler’s full post here. data Artisans’ Kostas Tzoumas also provides his organization’s take, and explains the relationship of Apache Beam to Apache Flink.
We were also treated to the next installment of big data guru Mark Litwintschik’s "A billion taxi rides" series, in which he analyzes data about 1.1 billion taxi and Uber rides in NYC using different data analytics tools. Up this week: Mark schooled us on how he got 33x Faster Queries on Google Cloud Dataproc; the Performance Impact of File Sizes on Presto Query Times; and how to build a 50-node Presto Cluster on Google Cloud's Dataproc.
If that’s not enough for you, be sure to register for a joint webinar with Bitnami, "Visualizing Big Data with Big Money," that uses election data from the Center for Responsive Politics. Using Google BigQuery and the open-source Re:Dash data visualization tool, citizens will be able to grok the enormity of this country’s campaign finance problems depressingly fast.