Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Custom Machine Types go GA, saving you up to 50% on compute costs

Today we’re announcing the general availability of Custom Machine Types for Google Compute Engine, which let you create virtual machines with vCPU and memory configurations that are perfect for your workloads.

Since our Beta launch, we've seen customers create virtual machines with novel vCPU and memory ratios that aren't available from any other major cloud provider. As a result, our customers have saved an average of 19% — and as much as 50% — on top of our already market-leading prices.

  • Wix has seen 18% savings on the compute powering its media platform, which now serves over 75 million users.
  • Lytics is saving 20% to 50% by accurately matching resources to each of the compute workloads it uses to unlock behavior-rich insights with its Customer Data Platform.
  • iRewind is saving up to 20% in processing costs for the pipeline that produced more than 500,000 movies last year.

Custom Machine Types extend Google Compute Engine’s tradition of making IaaS truly flexible and ensuring you only pay for the resources you use. Per-minute billing freed you from imposed hourly charges. Sustained Use Discounts gave you automatic discounts based on usage, without upfront commitments or prepayments. Now, Custom Machine Types let you configure your VMs to achieve the best price-performance for your specific workload.

You can create virtual machines with as few as 1 vCPU and as many as 32 vCPUs, with up to 6.5 GiB of memory per vCPU. You can use Custom Machine Types with CentOS, CoreOS, Debian, OpenSUSE and Ubuntu, and now with Red Hat and Windows operating systems as well. Or bring your own Linux variant to further customize your setup. Google Container Engine and Deployment Manager now also support Custom Machine Types.

Custom Machine Types have flat pricing based on the number of vCPUs and the GiB of memory you provision. A VM with 4 vCPUs and 10 GiB of memory, for example, costs exactly half as much as one with 8 vCPUs and 20 GiB of memory. You also get our standard customer-friendly pricing features, like per-minute billing and Sustained Use Discounts.

Give Custom Machine Types a try today and see how much you could save! Visit the Compute Engine section of the Google Cloud Platform Console and click Create Instance. On the Create instance page, you'll notice Machine type now has a Basic and a Customize view. Click Customize and build a virtual machine to fit your needs.



Custom Machine Types are supported by the gcloud command-line tool and through our API. Creating a VM is as easy as:

$ gcloud components update
$ gcloud compute instances create my-custom-vm \
    --custom-cpu 6 --custom-memory 12 --zone us-central1-f
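
If you script instance creation, the same configuration can also be expressed through the machine type name, which for custom VMs takes the form custom-CPUS-MEMORY_MB (memory in MB). A minimal sketch, assuming the same 6 vCPU, 12 GiB machine as above:

$ gcloud compute instances create my-custom-vm \
    --machine-type custom-6-12288 --zone us-central1-f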


For more info on Custom Machine Types, visit our website.

- Posted by Sami Iqram, Product Manager, Google Cloud Platform

What it looks like to process 3.5 million books in Google’s cloud

Today’s guest blog comes from Kalev Leetaru, founder of The GDELT Project, which monitors the
world’s news media in nearly every country in over 100 languages to identify the events and narratives driving our global society.

This past September I published into Google BigQuery a massive new public dataset of metadata from 3.5 million digitized English-language books dating back more than two centuries (1800-2015), along with the full text of 1 million of these books. The archive, which draws from the English-language public domain book collections of the Internet Archive and HathiTrust, includes full publication details for every book, along with a wide array of computed content-based data. The entire archive is available as two public BigQuery datasets, and there’s a growing collection of sample queries to help users get started with the collection. You can even map two centuries of books with a single line of SQL.
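
To give a flavor of that kind of exploration, a query along the following lines tallies one year's books by language with the bq command-line tool. This is only a sketch: the dataset lives in the public gdelt-bq project, but the table and column names shown here are illustrative assumptions, so check the published sample queries for the real schema.

$ bq query "SELECT BookMeta_Language, COUNT(*) AS books
            FROM [gdelt-bq:internetarchivebooks.1905]
            GROUP BY BookMeta_Language ORDER BY books DESC"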

What did it look like to process 3.5 million books? Data mining and creating a public archive of 3.5 million books is an application perfectly suited to the cloud, in which a large amount of specialized processing power is needed for only a brief period of time. Here are the five main steps I took to make the knowledge locked inside millions of books more easily and speedily accessible in the cloud:
  1. The project began with a single 8-core Google Compute Engine (GCE) instance with a 2 TB SSD persistent disk that was used to download the 3.5 million books. I downloaded the books to the instance’s local disk, unzipped them, converted them into a standardized file format, and then uploaded them to Google Cloud Storage (GCS) in large batches, using the composite objects and parallel upload capability of GCS (see the command sketch after this list). Unlike traditional UNIX file systems, GCS performance does not degrade with large numbers of small files in a single directory, so I could upload all 3.5 million files into a common set of directories.
    Figure 1: Visualization of two centuries of books
  2. Once all books had been downloaded and stored in GCS, I launched ten 16-core High Mem (100 GB RAM) GCE instances (160 cores total) to process the books, each with a 50 GB persistent SSD root disk for faster IO than traditional persistent disks. To launch all ten instances quickly, I launched the first instance, configured it with all of the necessary software libraries and tools, then created a disk snapshot and used it to rapidly clone the other nine with just a few clicks (also sketched below). Each of the ten compute instances would download and process a batch of 100 books at a time from GCS.
  3. Once the books had been processed, I uploaded back into GCS all of the computed metadata. In this way, GCS served as a central storage fabric connecting the compute nodes. Remarkably, even in worst-case scenarios when all 160 processors were either downloading new batches of books from GCS or uploading output files back to GCS in parallel, there was no measurable performance degradation.
  4. With the books processed, I deleted the ten compute instances and launched a single 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks. I used this to reassemble the 3.5 million per-book output files into single output files, tab-delimited with data available for each year, merging in publication metadata and other information about each book. Disk IO of more than 750MB/s was observed on this machine.
  5. I then uploaded the final per-year output files to a public GCS directory with web downloading enabled, allowing the public to download the files.
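
For reference, the bulk upload in step 1 and the instance cloning in step 2 come down to a handful of commands. A rough sketch, with bucket, disk and snapshot names invented for illustration:

$ # Step 1: parallel, composite-object upload of a batch of converted books
$ gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m \
    cp batch-0001/*.txt gs://my-books-bucket/converted/

$ # Step 2: snapshot the configured worker's boot disk, then clone it
$ gcloud compute disks snapshot worker-1 --snapshot-names worker-base \
    --zone us-central1-f
$ gcloud compute disks create worker-2-disk --source-snapshot worker-base \
    --zone us-central1-f
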
Since very few researchers have the bandwidth, local storage or computing power to process even just the metadata of 3.5 million books, the entire collection was uploaded into Google BigQuery as a public dataset. Using standard SQL queries, you can explore the entire collection in tens of seconds at speeds of up to 45.5GB/s and perform complex analyses entirely in-database.

The entire project, from start to finish, took less than two weeks, a good portion of which consisted of human verification for issues with the publication metadata. This is significant because previous attempts to process even a subset of the collection on a modern HPC supercluster had taken over one month and completed only a fraction of the number of books examined here. The limiting factor was always the movement of data: transferring terabytes of books and their computed metadata across hundreds of processors.

This is where Google’s cloud offerings shine, seemingly purpose-built for data-first computing. In just two weeks, I was able to process 3.5 million books, spinning up a cluster of 160 cores and 1TB of RAM, followed by a single machine with 32 cores, 200GB of RAM, 10TB of SSD disk and 1TB of direct-attached scratch SSD disk. I was able to make the final results publicly accessible through BigQuery at query speeds of over 45.5GB/s.

You can access the entire collection today in BigQuery, explore sample queries, and read more technical detail about the processing pipeline on the GDELT Blog.

I’d like to thank Google, Clemson University, the Internet Archive, HathiTrust and OCLC for making this project possible, along with all of the contributing libraries and digitization sponsors that have made these digitized books available.

- Posted by Kalev Leetaru, founder of The GDELT Project

How to build mobile apps on Google Cloud Platform

At some point in development, nearly every mobile app needs a backend service. With Google’s services you can rapidly build backend services that:

  • Scale automatically to meet demand
  • Automatically synchronize data across devices
  • Handle the offline case gracefully
  • Send notifications and messages

The following are design patterns you’ll find in Build mobile apps using Google Cloud Platform, which provides a side-by-side comparison of Google services, along with links to tutorials and sample code.

Real-time data synchronization with Firebase

Firebase is a fully managed platform for building iOS, Android and web apps that provides automatic data synchronization and authentication services.

To understand how using Firebase can simplify app development, consider a chat app. By storing the data in Firebase, you get the benefits of automatic synchronization of data across devices, minimal on-device storage, and an authentication service. All without having to write a backend service.

Add managed computation to Firebase apps with Google App Engine

If your app needs backend computation to process user data or orchestrate events, extending Firebase with App Engine gives you the benefit of automatic real-time data synchronization and an application platform that monitors, updates and scales the hosting environment.

An example of how you can use Firebase with App Engine is an app that implements a to-do list. Using Firebase to store the data ensures that the list is updated across devices. Connecting to your Firebase data from a backend service running on App Engine lets you process or act on that data; in the case of the to-do app, that might mean sending daily reminder emails.


Add flexible computation to Firebase with App Engine Managed VMs

If your mobile backend service needs to call native binaries, write to the file system, or make other system calls, extending Firebase with App Engine Managed VMs gives you the benefit of automatic real-time data synchronization and an application platform, with the flexibility to run code outside of the standard App Engine runtime.

Using Firebase and App Engine Managed VMs is similar to using Firebase with App Engine and adds additional options. For example, consider an app that converts chat messages into haikus using a pre-existing native binary. You can use Firebase to store and synchronize the data and connect to that data from a backend service running on App Engine Managed VMs. Your backend service can then detect new messages, call the native binaries to translate them into poetry, and push the new versions back to Firebase.


Automatically generate client libraries with App Engine and Google Cloud Endpoints

Using Cloud Endpoints means you don’t have to write wrappers to handle communication with App Engine. With the client libraries generated by Cloud Endpoints, you can simply make direct API calls from your mobile app.

If you're building an app that does not require real-time data synchronization, or if messaging and synchronization are already part of your backend service, using App Engine with Cloud Endpoints speeds development time by automatically generating client libraries. An example of an app where real-time synchronization is not needed is one that looks up information about retail products and finds nearby store locations.

Have full control with Compute Engine and REST or gRPC

With Google Compute Engine, you create and run virtual machines on Google infrastructure. You have administrator rights to the server and full control over its configuration.

If you have an existing backend service running on a physical or virtual machine, and that service requires a custom server configuration, moving your service to Compute Engine is the fastest way to get your code running on Cloud Platform. Keep in mind that you will be responsible for maintaining and updating your virtual machine.

An example of an app you might run on Compute Engine is a game with a backend service that uses third-party libraries and a custom server configuration to render in-game graphics.
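
To make that concrete, standing up such a VM takes only a couple of commands. A sketch, with the instance name, machine type and image chosen purely for illustration:

$ gcloud compute instances create game-backend --zone us-central1-f \
    --machine-type n1-standard-4 --image debian-8
$ gcloud compute ssh game-backend --zone us-central1-f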

For more information about these designs, as well as guidance on building, testing and monitoring your service and connecting to it from your mobile app (including sending push notifications), see How to build backend services for mobile apps.

- Posted by Syne Mitchell, Technical Writer, Google Cloud Platform


JGroups-based clustering and node discovery with Google Cloud Storage

The JGroups messaging toolkit is a popular solution for clustering Java-based application servers in a reliable manner. This post describes how to store, host and manage your JGroups cluster member data using Google Cloud Storage. The configuration provided here is particularly well-suited for the discovery of Google Compute Engine nodes; however, for testing purposes, it can also be used with your current on-premises virtual machines.

Overview of JGroups clustering on Cloud Storage


JGroups versions 3.5 and later enable the discovery of clustered members, or nodes, on GCP via a JGroups protocol called GOOGLE_PING. GOOGLE_PING stores information about each member in flat files in a Cloud Storage bucket, and then uses these files to discover initial members in a cluster. When new members are added, they read the addresses of the other cluster members from the Cloud Storage bucket, and then ping each member to announce themselves.

By default, JGroups members use multicast communication over UDP to broadcast their presence to other instances on a network. Google Cloud Platform, like most cloud providers and enterprise networks, does not support multicast; however, both GCP and JGroups support unicast communication over TCP as a viable fallback. In the unicast-over-TCP model, a new instance instead announces its arrival by iterating over the list of nodes already joined to a cluster, individually notifying each node.


Configure Cloud Storage to store JGroups configuration files


To allow JGroups to use Cloud Storage for file storage, begin by creating a Cloud Storage bucket:
  1. In the Cloud Platform Console, go to the Cloud Storage browser.
  2. Click Create bucket.
  3. In the Create bucket dialog, specify the following:
    • A bucket name, subject to the bucket name requirements
    • The Standard storage class
    • A location where bucket data will be stored
Next, set up interoperability and create a new Cloud Storage developer key. You'll need the developer key for authentication: GOOGLE_PING sends an authenticated request via the Cloud Storage XML API, which uses keyed-hash message authentication code (HMAC) authentication with Cloud Storage developer keys. To generate a developer key:
  1. Open the Storage settings page in the Google Cloud Platform Console.
  2. Select the Interoperability tab.
  3. If you have not set up interoperability before, click Enable interoperability access. Note: Interoperability access allows Cloud Storage to interoperate with tools written for other cloud storage systems. Because GOOGLE_PING is based on the Amazon-oriented S3_PING class in JGroups, it requires interoperability access.
  4. Click Create a new key.
  5. Make note of the Access key and Secret values—you'll need them later.
Important: Keep your developer keys secret. Your developer keys are linked to your Google account, and you should treat them as you would treat any set of access credentials.

Configure your clustered application to use GOOGLE_PING

Now that you've created your Cloud Storage bucket and developer keys, configure your application's JGroups configuration to use the GOOGLE_PING class. For most applications that use JGroups, you can do so as follows:
  1. Edit your JGroups XML configuration file (jgroups.xml in most cases).
  2. Modify the file to use TCP instead of UDP:

     <TCP bind_port="7800" />
  3. Locate the PING section and replace it with GOOGLE_PING, as shown in the following example. Replace your-jgroups-bucket with the name of your Cloud Storage bucket, and replace your-access-key and your-secret with the values of your access key and secret:
<!-- <PING timeout="2000" num_initial_members="3"/> -->

     <GOOGLE_PING
                location="your-jgroups-bucket"
                access_key="your-access-key"
                secret_access_key="your-secret"
                timeout="2000" num_initial_members="3"/>

Now GOOGLE_PING will use your Cloud Storage bucket and automatically create a folder that's named to match the cluster name.
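
Once a member joins, you can verify that the discovery files are being written with a quick gsutil check. A sketch, using the bucket name configured above and the cluster name from the demonstration below (the file names inside the folder are generated by JGroups):

$ gsutil ls gs://your-jgroups-bucket/JGROUPS_CLUSTER/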

Warning: By default, your virtual machines will communicate with your bucket insecurely through port 80. To set up an encrypted connection between the instances and the bucket, add the following attribute to the GOOGLE_PING element:

      <GOOGLE_PING ... port="443" />

If you use the JBoss WildFly application server, you can configure clustering by configuring the JGroups subsystem and adding the GOOGLE_PING protocol.

Demonstration

This section walks you through a concrete demonstration of GOOGLE_PING in action. This example sets up a cluster of Compute Engine instances that reside within the same Cloud Platform project, using their internal IPs as ping targets.

First, I start a sender application (using Vert.x) on a Compute Engine instance, making it the first member of my cluster:

$ java -Djava.net.preferIPv4Stack=true \
    -Djgroups.bind_addr=10.240.0.2 -jar \
    my-sender-fatjar-3.1.0-fat.jar -cluster -cluster-host 10.240.0.2


Note: In general, you should bind to your Compute Engine instances' internal IP addresses. If you would prefer to cluster your instances by using their externally routable IP addresses, add the following parameter to your java command, replacing <external_ip> with the external IP of the instance:

-Djgroups.external_addr=<external_ip>

When the application begins running, it displays "No reply," as no receiver nodes have been set up yet:


This sender node creates a folder and a .list file in my Cloud Storage bucket. My JGroups cluster is configured with the name JGROUPS_CLUSTER, so my Cloud Storage folder is also automatically named JGROUPS_CLUSTER:


The .list file lists all of the members in the JGROUPS_CLUSTER cluster. In JGroups, the first node to start is designated as the cluster coordinator; as such, the single node I've started has been marked with a T, meaning that the node's cluster-coordinator status is true.


Next, I start a receiver application, also using Vert.x, on a second Compute Engine instance:

$ java -Djava.net.preferIPv4Stack=true \
    -Djgroups.bind_addr=10.240.0.3 -jar \
    my-receiver-fatjar-3.1.0-fat.jar -cluster -cluster-host 10.240.0.3


This action adds an entry to the .list file for the new member node:

Once the node has been added to the .list file, the node begins receiving "ping!" messages from the first member node:
The second node responds to each "ping!" message with a "pong!" message. When the first node receives a "pong!" message, it displays "Received reply pong!" in the sender application's standard output:




Get started

You can try GOOGLE_PING today at no cost by signing up for a free trial.

- Posted by Grace Mollison, Solutions Architect

Another Big Data blog, in 2016? Really? Why?

In the time it took you to click on this post and start reading, Google Cloud Platform processed millions of big data analytics events and we’ll process billions more later today. We’re fans of distributed systems and large-scale data processing and we know many of you are too.

In almost every survey we’ve done you have told us you want to hear more about new features as well as what’s under the hood of our cloud services, in detail and in an ongoing way.

Today we’re taking a step in that direction with our first topic-focused blog. We’re starting with big data because we have a lot to share on this subject that we haven’t revealed yet and we know there’s tremendous interest in these technologies.

If debating the merits of the Spark and Dataflow programming models into the wee hours of the morning is something you could easily find yourself doing, if you get excited at the prospect of processing terabytes in seconds with zero setup for a few bucks, or if you simply want to learn how to use the infrastructure that powers Google for your own data processing work, this blog is for you.

The team contributing to it includes engineers, developer advocates, product managers, technical writers, technical program managers and support engineers at Google, all eager to share their excitement for these technologies with you. They also want to hear what you’re up to and what you need from us, so reach out on Twitter @GCPBigData.

We look forward to sharing stories!

Posted by Jo Maitland, Managing Editor, Google Cloud Platform

GCP NEXT 2016: A sneak peek behind the scenes

I recently joined the Google Cloud Platform team, but I’ve never really explained why I was attracted to Google in the first place. Before joining Google I’d been a strong advocate of two key technologies: the Go programming language and Kubernetes. Both happen to originate from Google, and I’m sure my investment in them helped me land a job here. Like many, I was attracted to Google because of the inspiring innovations that have shaped the last decade of computing and influenced countless open source projects.

I’ve spent several years poring over Google white papers and stitching together information from across the web trying to stay up to speed, and I’ll tell you, it’s pretty time consuming. This year I’ve got a better idea: I’ll be attending GCP NEXT 2016. Why? Because it’s the only conference with complete coverage of Google Cloud Platform technologies and, more importantly, the people behind them.

Today we’re announcing the GCP NEXT conference program, featuring in-depth technical sessions led by Google and the Google Cloud Platform community — developers, customers and partners. Dive into compute with us for two full days and come away with practical expertise in Google Cloud Platform. Sample sessions include:

  • "From idea to market in less than 6 months: Creating a new product with GCP," presented by CI&T — App Developer Track
  • "Painless container management with Google Container Engine & Kubernetes," presented by Brendan Burns & Tim Hockin, Google — Infrastructure & Operations Track
  • "Cloud data warehousing with BigQuery featuring Dropbox Nighthawk," presented by Jordan Tigani, Google & Dropbox — Data & Analytics Track
  • "Security analytics for today's cloud-ready enterprise," presented by Matt O’Connor, Google & PwC — Solutions Showcase

Curated from our Call for Speakers, an internal and external search for the very best content, demos and presenters, the NEXT technical tracks cover the most relevant topics in cloud, from machine learning to networking and IoT. They’ll also teach you best practices and how-tos directly from product leaders and developers who have implemented our platform, including speakers from Netflix, Atomic Fiction, FIS Global (Sungard), and many more to be announced.

If you want to know more about Google Cloud Platform, are thinking about moving to the cloud or want to sharpen your skills in compute, don’t miss GCP NEXT. Register today and get our early bird rate (available until February 5th).

To keep up to date on GCP NEXT 2016, follow us on Google+, Twitter, and LinkedIn.

- Posted by Kelsey Hightower, Developer Advocate, Google Cloud Platform

Compute Engine now with 3 TB of high-speed Local SSD and 64 TB of Persistent Disk per VM

To help your business grow, we're significantly increasing the size limits of all Google Compute Engine block storage products, including Local SSD and both types of Persistent Disk.

You can now attach up to 64 TB of Persistent Disk per VM for most machine types, for both Standard and SSD-backed Persistent Disk. The maximum size of a single volume has also increased to 64 TB, eliminating the need to stripe multiple disks to build larger volumes.
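
Creating and attaching a volume at the new maximum is the same workflow as before. A quick sketch, with the disk name, instance name and zone chosen for illustration:

$ gcloud compute disks create big-data-disk --size 64TB --type pd-ssd \
    --zone us-central1-f
$ gcloud compute instances attach-disk my-vm --disk big-data-disk \
    --zone us-central1-f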

Persistent Disk provides fantastic price-performance and excellent usability for workloads that rely on durable block storage. Persistent Disk SSD delivers 30 IOPS per GB provisioned, up to 15,000 IOPS per instance (a 500 GB volume, for example, already reaches that instance-level cap). Persistent Disk Standard is a great value at $0.04 per GB per month and provides 0.75 read IOPS and 1.5 write IOPS per GB. Performance limits are set at the instance level and can be reached with just a single Persistent Disk.

We have also increased the amount of Local SSD that can be attached to a single virtual machine to 3 TB. Available in Beta today, this doubles the number of Local SSD partitions you can attach to a Google Compute Engine instance: up to eight 375 GB partitions, or 3 TB of high-IOPS SSD, on any machine type with at least one virtual CPU (see the sketch below).
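
At the gcloud command line, each Local SSD partition is requested by repeating the --local-ssd flag. A sketch of a maximum-size configuration, with the instance name and zone invented for illustration:

$ gcloud compute instances create ssd-heavy-vm --zone us-central1-f \
    --local-ssd interface=SCSI --local-ssd interface=SCSI \
    --local-ssd interface=SCSI --local-ssd interface=SCSI \
    --local-ssd interface=SCSI --local-ssd interface=SCSI \
    --local-ssd interface=SCSI --local-ssd interface=SCSI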

We talked with Aaron Raddon, Founder and CTO at Lytics, who tested our larger Local SSDs. He found they improved Cassandra performance by 50% and provide provisioning flexibility that can lead to additional savings.

The new, larger SSD has the same IOPS performance we announced in January, topping out at 680,000 random 4K read IOPS and 360,000 random 4K write IOPS. With Local SSD you can achieve millions of operations per second for key-value stores, and a million writes per second on NoSQL databases using as few as 50 servers.

Local SSD retains its competitive pricing of $0.218 per GB per month while continuing to deliver extraordinary IOPS performance. As always, data stored on Local SSD is encrypted, and our live migration technology means no downtime during maintenance. Local SSD also retains the flexibility of attaching to any instance type.

Siddharth Choudhuri, Principal Engineer at Levyx, stated that doubling capacity on local SSDs with the same high IOPS is a game changer for businesses seeking low latency and high throughput on large datasets. It enables them to index billions of objects on a single, denser node in real time on Google Cloud Platform when paired with Levyx’s Helium data store.

To get started, head over to the Compute Engine console or read about Persistent Disk and Local SSD in the product documentation.

- Posted by John Barrus, Senior Product Manager, Google Cloud Platform

Cloud9 IDE now supports Google Cloud Platform

If you’re not familiar with Cloud9, you should be! Cloud9 is a development environment in the cloud that offers both a rich code editor and an Ubuntu command line with sudo rights. With Cloud9 your development environment lives entirely online, letting you code from any machine and freeing you from the hassle of managing a local environment.

Now, you can easily create a new Cloud9 workspace connected with a Cloud Platform project. Your GCP-ready Cloud9 workspace comes preinstalled with the Cloud SDK and gcloud command line tool, and allows you to build and deploy your application to Google App Engine directly within the IDE. To learn how, view Cloud9’s documentation.

Getting started

Getting started is easy; first, authenticate with Google in Cloud9. Then, create a workspace for your Cloud Platform project (make sure you’ve created a project in Cloud Platform first). The workspace is configured to store and access your remote source code in Cloud Source Repositories.

Using gcloud and Google Cloud SDK

The Google Cloud SDK comes pre-installed on your workspace's virtual machine and is automatically configured to access your project on Cloud Platform.


Edit, build and deploy directly from Cloud9

With Cloud9, you can edit your project’s code and push changes back to your cloud source repository. When you’re ready, build and deploy to App Engine directly from the IDE.
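
Under the hood this is ordinary Cloud SDK usage, so you can also work from the built-in terminal. A minimal sketch of what that might look like (on older SDK versions, the deploy command is gcloud preview app deploy):

$ gcloud config list                # confirm the project Cloud9 configured
$ gcloud app deploy app.yaml        # build and deploy to App Engine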



What’s next for Cloud9 and Cloud Platform


Cloud9 currently supports Java-based App Engine applications, and over the next few weeks it will add support for additional programming languages and features. If you have questions or comments, please visit Cloud9’s community site.

Want to see it in action? See how quickly you can set up a Cloud Platform project in Cloud9.


We’re very pleased to share Cloud9’s support for Cloud Platform and we’re excited for the languages and features to come. Stay tuned!

Posted by Chris Sells, Product Manager

Google and Red Hat integrate OpenShift Dedicated and Google Cloud Platform to make adopting containers easier

In the coming months, we will be working closely with Red Hat to integrate and deliver OpenShift Dedicated — Red Hat’s managed container application platform — to customers on Google Cloud Platform, making it easier for them to adopt containers.

We are committed to helping you get the most out of the cloud, whether it's purely public or a hybrid of public and private. Being the open cloud and growing our investments in open source tools like Kubernetes are two facets of this commitment. Our collaboration with Red Hat is another.

Together, we’ve made Google Compute Engine (GCE) a certified environment for Red Hat offerings, and have worked closely to unlock the power of containers through the Kubernetes project and the creation of the Cloud Native Computing Foundation. We’re now deepening this relationship to integrate OpenShift Dedicated with Google Cloud Platform (GCP) services. In this initial phase, you’ll have access to improved support for containers using Kubernetes and OpenShift, as well as access to powerful GCP services designed to help you make better use of data.

Helping Customers Adopt and Operationalize Containers


Both Google and Red Hat have been hearing a consistent story from enterprise customers, who’ve told us that they plan to move containers from experimental projects to supporting production workloads. In doing so, they’re looking for:

  • Improved security: confidence that containerized applications are developed, deployed and maintained on validated platforms with appropriate provenance and governance.
  • Services and ecosystem: lifecycle services and open interfaces for partners, giving developers and operators the ability to build and run a broad array of microservice-based applications.
  • Dynamic scheduling: frictionless resources and management that enable flexible deployment of containers as workloads change.
  • Storage: resilient access to application data regardless of where containers are deployed.
  • Cross-cloud portability and hybrid deployments: consistent container deployment frameworks, resources and platforms wherever development and deployment occur.

We’ve heard these requests and believe the combination of Google Cloud Platform and Kubernetes plus Red Hat OpenShift will help.

Because it builds on Kubernetes, OpenShift ports easily across environments, enabling hybrid cloud deployments. Red Hat plans to offer OpenShift Dedicated (its managed OpenShift cloud service) on Google Cloud Platform. This service is underpinned by Red Hat Enterprise Linux, and it marries Red Hat’s enterprise-grade container application platform with Google’s ten years of operational expertise around containers. This lets you accelerate development and enables your developers to focus on application creation rather than operational overhead.

Additionally, we’re pleased to announce that Google’s working on integrating Google Cloud Platform services (including big data, analytics and storage services) with OpenShift Dedicated, with the goal of enabling Red Hat customers to natively access these Google Cloud Platform offerings.

The Best of Open Source and Cloud


With the increasing breadth and maturity of Google Cloud Platform’s offerings, we’re well-suited to complement and integrate with on-premise enterprise infrastructure. Our expanding relationship with Red Hat makes this especially true for enterprise-focused developers in need of stable, more secure and open source solutions that include Google’s cloud services and global infrastructure footprint. If you’re interested in learning more about our plans for OpenShift Dedicated on Google Cloud Platform or becoming a beta tester, please let us know here.

- Posted by Martin Buhr, Product Manager, Google Cloud Platform 

Dataflow and open source – proposal to join the Apache Incubator

Imagine if every time you upgrade your servers you had to learn a new programming framework and rewrite all your applications. That might sound crazy, but it’s what happens with big data pipelines.

It wasn't long ago that Apache Hadoop MapReduce was the obvious engine for all things big data. Then Apache Spark came along, and more recently Apache Flink, a streaming-native engine. Unlike upgrading hardware, adopting these more modern engines has generally required rewriting pipelines against engine-specific APIs, often with different implementations for streaming and batch scenarios. This can mean throwing away user code that had just been weathered enough to be considered (mostly) bug-free and replacing it with immature new code, all because the data pipelines needed to scale better, have lower latency, run more cheaply or complete faster.

Adjusting such aspects should not require throwing away well-tested business logic. You should be able to move your application or data pipeline to the appropriate engine, or to the appropriate environment (e.g., from on-premises to cloud), while keeping the business logic intact. To do this, two conditions need to be met. First, you need a portable SDK that can produce programs able to execute on any one of many pluggable execution environments. Second, that SDK has to expose a programming model whose semantics are focused on your workload and not on the capabilities of the underlying engine. For example, MapReduce as a programming model doesn’t fit the bill (even though MapReduce as an execution method might be appropriate in some cases) because it cannot productively express low-latency computations.

Google designed Dataflow specifically to address both of these issues. The Dataflow Java SDK is architected to support pluggable “runners” that connect it to execution engines, of which four currently exist: data Artisans created one for Apache Flink, Cloudera created one for Apache Spark, and Google implemented a single-node local execution runner as well as one for Google’s hosted Cloud Dataflow service.

That portability is possible because the Dataflow programming model is focused on real-life streaming semantics, like real event time (as opposed to the time at which the event arrives), and real sessions (as opposed to whatever arbitrary boundary the batch cycle imposes). This allows Dataflow programs to execute in either batch or stream mode as needed, and to switch from one pluggable execution engine to the other without needing to be rewritten.

Today we’re taking another step in this collaboration. Along with participants from Cloudera, data Artisans, Talend, Cask and PayPal, we sent a proposal for Dataflow to become an Apache Software Foundation (ASF) Incubator project. Under this proposal, the Dataflow model, Java SDK and runners will be bundled into one incubating project, with the Python SDK joining the project in the future. We believe this proposal is a step toward the ability to define one data pipeline for multiple processing needs, without tradeoffs, that can run in a number of runtimes: on-premises, in the cloud, or locally. Google Cloud Dataflow will remain a “no-ops” managed service for executing Dataflow pipelines quickly and cost-effectively on Google Cloud Platform.



With Dataflow, you can write one portable data pipeline that can be used for either batch or stream processing and executed in a number of runtimes, including Flink, Spark, Google Cloud Dataflow and the local direct runner.
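
In practice, switching engines is a matter of changing the runner a pipeline is launched with. A rough sketch of what that looks like from the command line (the main class and project names are invented; the runner names follow the Dataflow Java SDK 1.x conventions):

$ # Run locally with the direct runner
$ mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline \
    -Dexec.args="--runner=DirectPipelineRunner"

$ # Run the same, unmodified pipeline on the hosted Cloud Dataflow service
$ mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline \
    -Dexec.args="--runner=DataflowPipelineRunner --project=my-gcp-project \
                 --stagingLocation=gs://my-bucket/staging"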

We're excited to propose Dataflow as an Apache Incubator project because we believe the Dataflow model, SDK and runners offer a number of unique features in the open-source data space.


  • Pipeline first, runtime second: with the Dataflow model and SDKs, you focus first on defining your data pipelines, not on how they'll run or on the characteristics of the particular runner executing them.
  • Portability: data pipelines are portable across a number of runtime engines. You can choose a runtime based on any number of considerations, such as performance, cost or scalability.
  • Unified model: batch and streaming are integrated into a unified model with powerful semantics, such as windowing, ordering and triggering.
  • Development tooling: the Dataflow SDK contains the tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools.


To understand the power of the Dataflow model, we recommend this article on the O’Reilly Radar: The World Beyond Batch: Streaming 102. For more information about Dataflow, visit the Cloud Dataflow documentation.




We're grateful to the Apache Software Foundation and community for their consideration of the Dataflow proposal and look forward to actively participating in open development of Dataflow.

- Posted by Frances Perry (Software Engineer) and James Malone (Product Manager)