
IAM best practice guides available now

The Google Cloud Identity & Access Management (IAM) service gives you additional capabilities to secure access to your Google Cloud Platform resources. To assist you in designing your IAM strategy, we've created a set of best practice guides.

The best practice guides include:

The “Using IAM Securely” guide will help you to implement IAM controls securely by providing a checklist of best practices for the most common areas of concern when using IAM. It categorizes best practices into four sections:

  • Least privilege - A set of checks that help you restrict your users and applications so that they can't do more than they're supposed to.
  • Managing service accounts and service account keys - Pointers to help you manage both securely.
  • Auditing - Practices such as using audit logs and Cloud Logging roles.
  • Policy management - Checks to ensure that you're implementing and managing your policies appropriately.

Cloud Platform resources are organized hierarchically and IAM policies can propagate down the structure. You're able to set IAM policies at the following levels of the resource hierarchy:

  • Organization level. The Organization resource represents your company. IAM roles granted at this level are inherited by all resources under the organization.
  • Project level. Projects represent a trust boundary within your company. Services within the same project have a default level of trust. For example, App Engine instances can access Cloud Storage buckets within the same project. IAM roles granted at the project level are inherited by resources within that project.
  • Resource level. In addition to the existing Google Cloud Storage and Google BigQuery ACL systems, additional resources such as Google Genomics Datasets and Google Cloud Pub/Sub topics support resource-level roles so that you can grant certain users permission to a single resource. 
The diagram below illustrates an example of a Cloud Platform resource hierarchy:
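Conceptually, inheritance down the hierarchy is a union of policy bindings along a resource's ancestor chain. Here's a toy Python sketch of that idea; the resource names, members and roles are made up for illustration and are not real IAM policies:

```python
HIERARCHY = {  # child -> parent
    "organizations/acme": None,
    "projects/web-app": "organizations/acme",
    "pubsub-topics/clickstream": "projects/web-app",
}

POLICIES = {  # bindings granted directly at each level
    "organizations/acme": {("alice@acme.com", "roles/viewer")},
    "projects/web-app": {("bob@acme.com", "roles/editor")},
    "pubsub-topics/clickstream": {("carol@acme.com", "roles/pubsub.publisher")},
}

def effective_policy(resource):
    """Walk up the hierarchy, accumulating inherited bindings."""
    bindings = set()
    while resource is not None:
        bindings |= POLICIES.get(resource, set())
        resource = HIERARCHY[resource]
    return bindings

# An organization-level grant is inherited all the way down to the topic...
print(("alice@acme.com", "roles/viewer") in effective_policy("pubsub-topics/clickstream"))  # True
# ...but a resource-level grant does not flow upward to the project.
print(("carol@acme.com", "roles/pubsub.publisher") in effective_policy("projects/web-app"))  # False
```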

The “Designing Resource Hierarchies” guide provides examples of what this means in practice and has a handy checklist to double-check that you're following best practice.

A Service Account is a special type of Google account that belongs to your application or a virtual machine (VM), instead of to an individual end user. The “Understanding Service Accounts” guide provides answers to the most common questions, like:

  • What resources can the service account access?
  • What permissions does it need?
  • Where will the code assuming the identity of the service account be running: on Google Cloud Platform or on-premises?

This guide discusses what the implications are of making certain decisions so that you have enough information to use Service Accounts safely and efficiently.

We’ll be producing more IAM best practice guides and are keen to hear from customers using IAM or wanting to use IAM on what additional content would be helpful. We’re also keen to hear if there are curated roles we haven’t thought of. We want Cloud Platform to be the most secure and the easiest cloud to use so your feedback is important to us and helps us shape our approach. Please share your feedback with us at:

[email protected]

- Posted by Grace Mollison, Solutions Architect

Getting started with Red Hat OpenShift on Google Cloud Platform

We recently announced that Red Hat’s container platform OpenShift Dedicated will run on Google Cloud Platform, letting you hook up your OpenShift clusters to the full portfolio of Google Cloud services. So what’s the best way to get started?

We recommend deploying a Kubernetes-based solution. In the example below, we'll analyze incoming tweets using Google Cloud Pub/Sub (Google’s fully-managed real-time messaging service that allows you to send and receive messages between independent applications) and Google BigQuery (Google's fully managed, no-ops, low cost analytics database). This can be the starting point for incorporating social insights into your own services.

Step 0: If you don’t have a GCP account already, please sign up for Cloud Platform, set up billing and activate APIs.

Step 1: Next you'll set up a service account. A service account is a way to interact with your GCP resources by using a different identity than your primary login and is generally intended for server-to-server interaction. From the GCP Navigation Menu, click on "Permissions."

Once there, click on "Service accounts."
Click on "Create service account," which will prompt you to enter a service account name. Provide a name relevant to your project and click on "Furnish a new private key." The default "JSON" Key type should be left selected.

Step 2: Once you click "Create," a service account JSON key file will be downloaded to your browser’s downloads location.

Important: Like any credential, this file provides an access mechanism to authenticate and use resources in your GCP account, so keep it safe! Never place this file in a publicly accessible source repo (e.g., public GitHub).

Step 3: We’ll be using the JSON credential via a Kubernetes secret deployed to your OpenShift cluster. To do so, first perform a base64 encoding of your JSON credential file:

$ base64 -i ~/path/to/downloads/credentials.json

Keep the output (a very long string) ready for use in the next step, where you’ll replace ‘BASE64_CREDENTIAL_STRING’ in the pod example (below) with the output just captured from base64 encoding.

Important: Note that base64 is encoded (not encrypted) and can be readily reversed, so this file (with the base64 string) is just as confidential as the credential file above.
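To see why the warning matters, note that anyone holding the base64 string can recover the credential byte-for-byte. A quick sketch, using a fake credential rather than a real key file:

```python
import base64

# base64 is an encoding, not encryption: the round trip is lossless.
fake_credential = b'{"type": "service_account", "private_key": "NOT-A-REAL-KEY"}'

encoded = base64.b64encode(fake_credential).decode("ascii")
decoded = base64.b64decode(encoded)

print(decoded == fake_credential)  # True
```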

Step 4: Next you’ll create the Kubernetes secret inside your OpenShift Cluster. A secret is the proper place to make sensitive information available to pods running in your cluster (like passwords or the credentials downloaded in the previous step). This is what your pod definition will look like (e.g., google-secret.yaml):

apiVersion: v1
kind: Secret
metadata:
  name: google-services-secret
type: Opaque
data:
  google-services.json: BASE64_CREDENTIAL_STRING

You’ll want to add this file to your source-control system (minus the credentials).

Replace ‘BASE64_CREDENTIAL_STRING’ with the base64 output from the prior step.

Step 5: Deploy the secret to the cluster:

$ oc create -f google-secret.yaml

Step 6: Now you’re in a position to use Google APIs from your OpenShift cluster. To take your GCP-enabled cluster for a spin, try going through the steps detailed in the write-up: https://cloud.google.com/solutions/real-time/kubernetes-pubsub-bigquery

You’ll need to make two minor tweaks for the solution to work on your OpenShift cluster:

  1. For any pods that need to access Google APIs, modify the pod definition to reference the secret and export the environment variable “GOOGLE_APPLICATION_CREDENTIALS”, which the Google Cloud SDKs use to locate the credential file:

    In the Pub/Sub-BigQuery solution, that means you’ll modify two pod definitions:
    • pubsub/bigquery-controller.yaml
    • pubsub/twitter-stream.yaml

    For example:
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: bigquery-controller
    spec:
      template:
        spec:
          containers:
          - name: bigquery
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secretspath/google-services.json
            volumeMounts:
            - name: secrets
              mountPath: /etc/secretspath
              readOnly: true
          volumes:
          - name: secrets
            secret:
              secretName: google-services-secret

  2. Finally, anywhere the solution instructs you to use "kubectl," replace that with the equivalent OpenShift command "oc." 

That’s it! If you follow along with the rest of the steps in the solution, you’ll soon be able to query (and see) tweets showing up in your BigQuery table, arriving via Cloud Pub/Sub. Going forward with your own deployments, all you need to do is follow the above steps to attach the credential secret to any pod where you use Google Cloud SDKs and/or access Google APIs.

Join us at GCP Next!

If you’re attending GCP Next and want to experience a live ‘hands-on’ walk-through of this and other solutions, please join us at the Red Hat OpenShift Workshop. Hope to see you there! If not, don’t miss all the Next sessions online.

- Posted by Sami Zuhuruddin, Solutions Architect, Google Cloud Platform

Google shares software network load balancer design powering GCP networking

At NSDI ‘16, we're revealing the details of Maglev[1], our software network load balancer that enables Google Compute Engine load balancing to serve a million requests per second with no pre-warming.

Google has a long history of building our own networking gear, and perhaps unsurprisingly, we build our own network load balancers as well, which have been handling most of the traffic to Google services since 2008. Unlike the custom Jupiter fabrics that carry traffic around Google’s data centers, Maglev load balancers run on ordinary servers, the same hardware that the services themselves use.

Hardware load balancers are often deployed in an active-passive configuration to provide failover, wasting at least half of the load balancing capacity. Maglev load balancers don't run in active-passive configuration. Instead, they use Equal-Cost Multi-Path routing (ECMP) to spread incoming packets across all Maglevs, which then use consistent hashing techniques to forward packets to the correct service backend servers, no matter which Maglev receives a particular packet. All Maglevs in a cluster are active, performing useful work. Should one Maglev become unavailable, the other Maglevs can carry the extra traffic. This N+1 redundancy is more cost effective than the active-passive configuration of traditional hardware load balancers, because fewer resources are intentionally sitting idle at all times.
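The key property is that every Maglev independently computes the same backend choice for a given connection. Below is a simplified Python sketch of Maglev-style consistent hashing; the real algorithm is described in the paper, and the table size, hash construction and backend names here are illustrative only:

```python
import hashlib

def _h(s, seed):
    return int(hashlib.md5((str(seed) + s).encode()).hexdigest(), 16)

def maglev_table(backends, size=97):
    # size should be a prime much larger than len(backends)
    offsets = {b: _h(b, 1) % size for b in backends}
    skips = {b: _h(b, 2) % (size - 1) + 1 for b in backends}
    table = [None] * size
    next_j = {b: 0 for b in backends}
    filled = 0
    while filled < size:
        for b in backends:
            # walk b's permutation of the table until a free slot is found
            pos = (offsets[b] + next_j[b] * skips[b]) % size
            next_j[b] += 1
            while table[pos] is not None:
                pos = (offsets[b] + next_j[b] * skips[b]) % size
                next_j[b] += 1
            table[pos] = b
            filled += 1
            if filled == size:
                break
    return table

def pick_backend(table, packet_5tuple):
    return table[_h(packet_5tuple, 0) % len(table)]

backends = ["backend-a", "backend-b", "backend-c"]
table_on_maglev_1 = maglev_table(backends)  # built independently...
table_on_maglev_2 = maglev_table(backends)  # ...on another Maglev
packet = "10.0.0.1:12345->203.0.113.9:443/tcp"
print(pick_backend(table_on_maglev_1, packet) == pick_backend(table_on_maglev_2, packet))  # True
```

Because both tables are derived deterministically from the backend list, any Maglev that receives the packet forwards it to the same backend, which is what makes the all-active ECMP arrangement work.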

Google’s highly flexible cluster management technology, called Borg, makes it possible for Google engineers to move service workloads between clusters as needed to take advantage of unused capacity, or other operational considerations. On Google Cloud Platform, our customers have similar flexibility to move their workloads between zones and regions. This means that the mix of services running in any particular cluster changes over time, which can also lead to changing demand for load balancing capacity.

With Maglev, it's easy to add or remove load balancing capacity, since Maglev is simply another way to use the same servers that are already in the cluster. Recently, the industry has been moving toward Network Function Virtualization (NFV), providing network functionality using ordinary servers. Google has invested a significant amount of effort over a number of years to make NFV work well in our infrastructure. As Maglev shows, NFV makes it easier to add and remove networking capacity, but having the ability to deploy NFV technology also makes it possible to add new networking services without adding new, custom hardware.

How does this benefit you, as a user of GCP? You may recall we were able to scale from zero to one million requests per second with no pre-warming or other provisioning steps. This is possible because Google clusters, via Maglev, are already handling traffic at Google scale. There's enough headroom available to add another million requests per second without bringing up new Maglevs. It just increases the utilization of the existing Maglevs.

Of course, when utilization of the Maglevs exceeds a threshold, more Maglevs are needed. Since the Maglevs are deployed on the same server hardware that's already present in the cluster, it's easy for us to add that capacity. As a developer on Cloud Platform, you don’t need to worry about load balancing capacity. Google’s Maglevs, and our team of Site Reliability Engineers who manage them, have that covered for you. You can focus on building an awesome experience for your users, knowing that when your traffic ramps up, we’ve got your back.

- Posted by Daniel E. Eisenbud, Technical Lead, Maglev and Paul Newson, Developer Advocate (Maglev fan)

[1] D. E. Eisenbud, C. Yi, C. Contavalli, C. Smith, R. Kononov, E. Mann-Hielscher, A. Cilingiroglu, B. Cheyney, W. Shang, and J. D. Hosein. Maglev: A Fast and Reliable Software Network Load Balancer, 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.

Three ways to build Slack integrations on Google Cloud Platform

Slack has become the hub of communications for many teams. It integrates with many services, like Google Drive, and there are ways to integrate with your own infrastructure. This blog post describes three ways to build Slack integrations on Google Cloud Platform using samples from our Slack samples repository on GitHub. Clone it with:

git clone https://github.com/GoogleCloudPlatform/slack-samples.git

Or make your own fork to use as the base for your own integrations. Since all the platforms we use in this tutorial support incoming HTTPS connections, all these samples could be extended into Slack Apps and distributed to other teams.

Using Slack for notifications from a Compute Engine instance

If you're using Google Compute Engine as a virtual private server, it can be useful to get an alert to know who's using a machine. This could be an audit log, but it's also useful to know when someone is using a shared machine so you don't step on each other's changes.

To get started, we assume you have a Linux Compute Engine instance. You can follow this guide to create one and follow along.

Create a Slack incoming webhook and save the webhook URL. It will look something like https://hooks.slack.com/services/YOUR/SLACK/INCOMING-WEBHOOK. Give the hook a nice name, like "SSH Bot" and a recognizable icon, like a lock emoji.

Next, SSH into the machine and clone the repository. We'll be using the notify sample for this integration.

git clone https://github.com/GoogleCloudPlatform/slack-samples.git
cd slack-samples/notify

Create a file slack-hook with the webhook URL and test your webhook out.

nano slack-hook
# paste in URL, write out, and exit
PAM_USER=$USER PAM_RHOST=testhost ./login-notify.sh

The script sends a POST request to your Slack webhook. You should receive a Slack message notifying you of this.
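For a sense of what the script is doing, here's a hypothetical Python equivalent of the payload construction; the exact message wording in login-notify.sh may differ, and PAM_USER/PAM_RHOST are the variables PAM exports to session hooks run via pam_exec:

```python
import json

def login_notification(env):
    """Build a JSON payload like the one login-notify.sh POSTs to the webhook."""
    return json.dumps({
        "text": "SSH login: user %s from %s" % (
            env.get("PAM_USER", "unknown"), env.get("PAM_RHOST", "unknown")),
    })

payload = login_notification({"PAM_USER": "alice", "PAM_RHOST": "testhost"})
print(payload)
# To deliver it, POST the payload to the webhook URL with a
# Content-Type of application/json (e.g., via urllib.request).
```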

We'll be adding a PAM hook to run whenever someone SSHes into the machine. Verify that SSH is using PAM by making sure there's a line "UsePAM yes" in the /etc/ssh/sshd_config file.

sudo nano /etc/ssh/sshd_config

We can now set up the PAM hook. The install.sh script creates a /etc/slack directory and copies the login-notify.sh script and slack-hook configuration there.

It configures /etc/pam.d/sshd to run the script whenever someone SSHes into the machine by adding the line "session optional pam_exec.so seteuid /etc/slack/login-notify.sh".

Keep this SSH window open in case something goes wrong, and verify that you can log in from another SSH terminal. You should receive another notification on Slack, this time with the real remote host IP address.

Building a bot and running it in Google Container Engine

If you want to run a Slack bot, one of the easiest ways to do it is to use Beep Boop, which will take care of running your bot on Cloud Platform for you, so you can focus on making the bot the best you can.

A Slack bot connects to the Slack Real Time Messaging API using WebSockets; it runs as a long-running process, listening to and sending messages. Google Container Engine provides a nice balance of control for running a bot. It uses Kubernetes to keep your bot running and manage your secret tokens. It's also one of the easiest ways to run a server that uses WebSockets on Cloud Platform. We'll walk you through running a Node.js Botkit Slack bot on Container Engine, using Google Container Registry to store our Docker image.

First, set up your development environment for Google Container Engine. Clone the repository and change to the bot sample directory.

git clone https://github.com/GoogleCloudPlatform/slack-samples.git
cd slack-samples/bot

Next, create a cluster, if you don't already have one:

gcloud container clusters create my-cluster

Create a Slack bot user and get an authentication token. We'll be loading this token in our bot using the Kubernetes Secrets API. Replace MY-SLACK-TOKEN with the one for your bot user. The generate-secret.sh script creates the secret configuration for you by doing a simple text substitution in a template.

./generate-secret.sh MY-SLACK-TOKEN
kubectl create -f slack-token-secret.yaml
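As an aside, the substitution that generate-secret.sh performs is simple enough to sketch. The following is a hypothetical Python equivalent, not the actual script: Kubernetes expects values under a Secret's data field to be base64-encoded, and the secret name and template text below are illustrative:

```python
import base64

TEMPLATE = """\
apiVersion: v1
kind: Secret
metadata:
  name: slack-token
type: Opaque
data:
  slack-token: {token}
"""

def generate_secret(slack_token):
    encoded = base64.b64encode(slack_token.encode()).decode("ascii")
    return TEMPLATE.format(token=encoded)

print(generate_secret("MY-SLACK-TOKEN"))
```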

Next, build the Docker container. Replace my-cloud-project-id below with your Cloud Platform project ID. This tags the container so that the gcloud command line tool can upload it to your private Container Registry.

export PROJECT_ID=my-cloud-project-id
docker build -t gcr.io/${PROJECT_ID}/slack-bot .

Once the build completes, upload it.

gcloud docker push gcr.io/${PROJECT_ID}/slack-bot

Then create a replication controller configuration, populated with your project ID, so that Kubernetes knows where to load the Docker image from. Like generate-secret.sh, the generate-rc.sh script creates the replication controller configuration for you by doing a simple text substitution in a template.

./generate-rc.sh $PROJECT_ID

Now, tell Kubernetes to create the replication controller to start running the bot.

kubectl create -f slack-bot-rc.yaml

You can check the status of your bot with:

kubectl get pods

Now your bot should be online and respond to "Hello."

Shut down and clean up

To shut down your bot, tell Kubernetes to delete the replication controller.

kubectl delete -f slack-bot-rc.yaml

If you've created a container cluster, you may still get charged for the Compute Engine resources it's using, even if they're idle. To delete the cluster, run:

gcloud container clusters delete my-cluster

This deletes the Compute Engine instances that are running the cluster.

Building a Slash command on Google App Engine

App Engine is a great platform for building Slack slash commands. Slash commands require that the server support SSL with a valid certificate. App Engine supports HTTPS without any configuration for apps using the provided *.appspot.com domain, and it supports SSL for custom domains. App Engine also provides great auto-scaling: you automatically get more instances with more usage and fewer (as few as zero, or a configurable minimum) when demand goes down, and there's a free tier to make it easy to get started.

We'll be using Go on App Engine, but you can use any language supported by the runtime, including Python, Java[1], and PHP.
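Whatever the runtime, the handler's job is the same: accept Slack's form-encoded POST, check the command token, and reply with a JSON body. A minimal Python sketch of that contract (the token value, quote list, and response wording are placeholders, not part of the sample):

```python
from urllib.parse import parse_qs

VERIFICATION_TOKEN = "YOUR-SLASH-COMMAND-TOKEN"
QUOTES = ["Talk is cheap. Show me the code."]

def handle_slash_command(body):
    """Return an (HTTP status, JSON-serializable reply) pair."""
    params = parse_qs(body)
    if params.get("token", [None])[0] != VERIFICATION_TOKEN:
        return 401, {"text": "invalid token"}
    return 200, {"response_type": "in_channel", "text": QUOTES[0]}

status, reply = handle_slash_command(
    "token=YOUR-SLASH-COMMAND-TOKEN&command=%2Fquotes&text=random")
print(status, reply["text"])
```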

Clone the repository. We'll be using the slash command sample for this integration.

git clone https://github.com/GoogleCloudPlatform/slack-samples.git
cd slack-samples/command/1-custom-integration

If you can reach your development machine from the internet, you should be able to test locally. Create a Slash Command and point it at http://your-machine:8080/quotes/random and run:

goapp serve --host=0.0.0.0

Now that we see it's working, we can deploy it. Replace your-project-id with your Cloud Platform project ID in the following command and run:

goapp deploy -application your-project-id ./

Update your Slash Command configuration and try it out!
If you want to publish your command to be used by more than one team, you'll need to create a Slack App. This will give you an OAuth Client ID and Client secret. Plug these values into the config.go file of the App sample and deploy in the same way to get an "Add to Slack" button.

- Posted by Tim Swast, Developer Programs Engineer

[1] Java is a registered trademark of Oracle and/or its affiliates.

TensorFlow machine learning with financial data on Google Cloud Platform

If you knew what happened in the London markets, how accurately could you predict what will happen in New York? It turns out, this is a great scenario to be tackled by machine learning!

The premise for this problem is that by following the sun and using data from markets that close earlier, such as London, which closes 4.5 hours ahead of New York, you could predict New York market behavior correctly about 7 times out of 10.
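To make the premise concrete, here is a toy illustration with entirely made-up data: if the direction of London's close carries information about New York's, even the naive rule "predict that New York moves the same way London did" beats a coin flip. (The real solution builds proper TensorFlow models; this is just the intuition.)

```python
# +1 = market closed up, -1 = market closed down, over ten toy trading days
london  = [+1, -1, +1, +1, -1, +1, -1, +1, +1, -1]
newyork = [+1, -1, +1, -1, -1, +1, -1, +1, -1, +1]

# Naive follow-the-sun rule: predict New York's direction = London's direction.
hits = sum(1 for l, n in zip(london, newyork) if l == n)
print("accuracy: %d out of %d" % (hits, len(london)))  # accuracy: 7 out of 10
```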

We’ve published a new solution, TensorFlow Machine Learning with Financial Data on Google Cloud Platform, that looks at this problem. We hope you’ll enjoy exploring it with us interactively in the Google Cloud Datalab notebook we provide.

As you go through the solution, you’ll query six years of time series data for eight different markets using Google BigQuery, explore that data using Cloud Datalab, then produce two powerful TensorFlow models on Cloud Platform.

TensorFlow is Google’s next generation machine learning library, allowing you to build high performance, state-of-the-art, scalable deep learning models. Cloud Platform provides the compute and storage on demand required to build, train and test those models. The two together are a marriage made in heaven and can provide a tremendous force multiplier for your business.

This solution is intended to illustrate the capabilities of Cloud Platform and TensorFlow for fast, interactive, iterative data analysis and machine learning. It does not offer any advice on financial markets or trading strategies. The scenario presented in the tutorial is an example. Don't use this code to make investment decisions.

- Posted by Corrie Elston, Solutions Architect, Google Cloud Platform

How to build your own recommendation engine using machine learning on Google Compute Engine

You might like this blog post . . . if you like recommendation engines. If that sentence has a familiar ring, you've probably browsed many websites that use a recommendation engine.

Recommendation engines are the technology behind content discovery networks and the suggestion features of most ecommerce websites. They improve a visitor's experience by offering relevant items at the right time and on the right page. Adding that intelligence makes your application more attractive, enhances the customer experience and increases their satisfaction. Digital Trends reports that 73% of customers prefer a personalized shopping experience.

There are various components to a recommendation engine, ranging from data ingestion and analytics to machine learning algorithms. In order to provide relevant recommendations, the system must be scalable and able to handle the demands that come with processing Big Data and must provide an easy way to improve the algorithms.

Recommendation engines, particularly the scalable ones that produce great suggestions, are highly compute-intensive workloads that are well suited to Google Cloud Platform.

Customers building recommendation engines are jumping on board. Antvoice uses Google Cloud Platform to deploy their self-learning, multi-channel, predictive recommendation platform.

This new solution article provides an introduction to implementing product recommendations. It shows you how you can use open source technologies to set up a basic recommendation engine on Cloud Platform. It uses the example of a house-renting website that suggests houses that the user might be interested in based on their previous behavior through a technique known as collaborative filtering.

To provide recommendations, whether in real time while customers browse your site, or through email later on, several things need to happen. At first, while you know little about your users' tastes and preferences, you might base recommendations on item attributes alone. But your system needs to be able to learn from your users, collecting data about their tastes and preferences.

Over time and with enough data, you can use machine learning algorithms to perform useful analysis and deliver meaningful recommendations. Input from other users can also improve the results by periodically retraining the system. This solution deals with a recommendations system that already has enough data to benefit from machine learning algorithms.

A recommendation engine typically processes data through four phases: collecting, storing, analyzing and delivering recommendations.

The following diagram represents the architecture of such a system:

Each component of this architecture can be deployed using various easy-to-implement technologies to get you started:

  • Front-End: By deploying a simple application on Google App Engine, a user can see a page with top recommendations. You can take it from there, easily building a strong and scalable web platform that can serve from one to several million users with minimal operations.
  • Storage: The solution uses Google Cloud SQL, our managed MySQL option. A commonly used database in the ecommerce domain, this database integrates well with MLlib, a machine learning library.
  • Machine learning: Using Google Cloud Dataproc or bdutil, two options that simplify deployment and management of Hadoop/Spark clusters, you'll deploy and run MLlib-based scripts.

The solution also discusses considerations for how to analyze the data, including:

  • Timeliness concerns, such as real-time, near-real-time, and batch data analysis. This information can help you understand your options for how quickly you can present recommendations to the user and what it takes to implement each option. The sample solution focuses mainly on a near-real-time approach.
  • Filtering methods, such as content-based, cluster and collaborative filtering. You'll need to decide exactly what information goes into making a recommendation, and these filtering methods are the common ones in use today. The sample solution focuses mainly on collaborative filtering, but a helpful appendix provides more information about the other options.
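To give a feel for collaborative filtering, here is a toy user-based variant for the house-renting example; the sample solution itself uses MLlib's matrix-factorization approach, and the ratings below are invented:

```python
import math

# Made-up user ratings of houses (1-5 scale).
ratings = {
    "anne":  {"house1": 5, "house2": 4, "house3": 1},
    "bob":   {"house1": 5, "house2": 5, "house4": 4},
    "carol": {"house3": 5, "house4": 2, "house5": 5},
}

def cosine(u, v):
    """Cosine similarity over the houses two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[h] * v[h] for h in shared)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Score unseen houses by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for house, r in theirs.items():
            if house not in ratings[user]:
                scores[house] = scores.get(house, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("anne"))  # house4 ranks first: bob, who rates like anne, liked it
```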

We hope that this solution will give you the nuts and bolts you need to build an intelligent and ever-improving application that makes the most of the information that your users give you. Happy reading!

If you liked this blog post . . . you can get started today following these steps:
  1. Sign up for a free trial
  2. Download and follow the instructions on the Google Cloud Platform GitHub page
  3. “Recommend” this solution to your friends.

- Posted by Matthieu Mayran, Cloud Solutions Architect

Google seeks new disks for data centers

Today, during my keynote at the 2016 USENIX conference on File and Storage Technologies (FAST 2016), I’ll be talking about our goal to work with industry and academia to develop new lines of disks that are a better fit for data centers supporting cloud-based storage services. We're also releasing a white paper on the evolution of disk drives that we hope will help continue the decades of remarkable innovation achieved by the industry to date.

But why now? It's a fun but apocryphal story that the width of Roman chariots drove the spacing of modern train tracks. However, it is true that the modern disk drive owes its dimensions to the 3½” floppy disk used in PCs. It's very unlikely that's the optimal design, and now that we're firmly in the era of cloud-based storage, it's time to reevaluate broadly the design of modern disk drives.

The rise of cloud-based storage means that most (spinning) hard disks will be deployed primarily as part of large storage services housed in data centers. Such services are already the fastest growing market for disks and will be the majority market in the near future. For example, for YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires more than one petabyte (1M GB) of new storage every day or about 100x the Library of Congress. As shown in the graph, this continues to grow exponentially, with a 10x increase every five years.

At the heart of the paper is the idea that we need to optimize the collection of disks, rather than a single disk in a server. This shift has a range of interesting consequences including the counter-intuitive goal of having disks that are actually a little more likely to lose data, as we already have to have that data somewhere else anyway. It’s not that we want the disk to lose data, but rather that we can better focus the cost and effort spent trying to avoid data loss for other gains such as capacity or system performance.

We explore physical changes, such as taller drives and grouping of disks, as well as a range of shorter-term firmware-only changes. Our goals include higher capacity and more I/O operations per second, in addition to a better overall total cost of ownership. We hope this is the beginning of both a new chapter for disks and a broad and healthy discussion, including vendors, academia and other customers, about what “data center” disks should be in the era of cloud.

- Posted by Eric Brewer, VP Infrastructure, Google

What it looks like to process 3.5 million books in Google’s cloud

Today’s guest blog comes from Kalev Leetaru, founder of The GDELT Project, which monitors the world’s news media in nearly every country in over 100 languages to identify the events and narratives driving our global society.

This past September I published into Google BigQuery a massive new public dataset of metadata from 3.5 million digitized English-language books dating back more than two centuries (1800-2015), along with the full text of 1 million of these books. The archive, which draws from the English-language public domain book collections of the Internet Archive and HathiTrust, includes full publication details for every book, along with a wide array of computed content-based data. The entire archive is available as two public BigQuery datasets, and there’s a growing collection of sample queries to help users get started with the collection. You can even map two centuries of books with a single line of SQL.

What did it look like to process 3.5 million books? Data-mining and creating a public archive of 3.5 million books is an example of an application perfectly suited to the cloud, in which a large amount of specialized processing power is needed for only a brief period of time. Here are the five main steps that I took to make the invaluable learnings of millions of books more easily and speedily accessible in the cloud:
  1. The project began with a single 8-core Google Compute Engine (GCE) instance with a 2TB SSD persistent disk that was used to download the 3.5 million books. I downloaded the books to the instance’s local disk, unzipped them, converted them into a standardized file format, and then uploaded them to Google Cloud Storage (GCS) in large batches, using the composite objects and parallel upload capability of GCS. Unlike traditional UNIX file systems, GCS performance does not degrade with large numbers of small files in a single directory, so I could upload all 3.5 million files into a common set of directories.
    Figure 1: Visualization of two centuries of books
  2. Once all books had been downloaded and stored into GCS, I launched ten 16-core High Mem (100GB RAM) GCE instances (160 cores total) to process the books, each with a 50GB persistent SSD root disk to achieve faster IO over traditional persistent disks. To launch all ten instances quickly, I launched the first instance and configured that with all of the necessary software libraries and tools, then created and used a disk snapshot to rapidly clone the other nine with just a few clicks. Each of the ten compute instances would download a batch of 100 books at a time to process from GCS.
  3. Once the books had been processed, I uploaded back into GCS all of the computed metadata. In this way, GCS served as a central storage fabric connecting the compute nodes. Remarkably, even in worst-case scenarios when all 160 processors were either downloading new batches of books from GCS or uploading output files back to GCS in parallel, there was no measurable performance degradation.
  4. With the books processed, I deleted the ten compute instances and launched a single 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks. I used this to reassemble the 3.5 million per-book output files into single output files, tab-delimited with data available for each year, merging in publication metadata and other information about each book. Disk IO of more than 750MB/s was observed on this machine.
  5. I then uploaded the final per-year output files to a public GCS directory with web downloading enabled, allowing the public to download the files.
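The parallel upload in step 1 can be sketched with a simple thread pool. The bucket name and the `upload_to_gcs` helper below are hypothetical stand-ins for gsutil or the google-cloud-storage client, just to show the shape of the fan-out:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_to_gcs(local_path):
    """Stub standing in for a real GCS upload (e.g. gsutil -m cp or the
    google-cloud-storage client); returns the destination object name."""
    return "gs://books-bucket/" + local_path  # bucket name is illustrative

def upload_all(paths, workers=16):
    """Upload many small files in parallel; completion order is irrelevant
    because GCS has no per-directory performance penalty."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_to_gcs, paths))

results = upload_all(["book_0001.txt", "book_0002.txt", "book_0003.txt"])
```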
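The reassembly in step 4, merging 3.5 million per-book output files into per-year tab-delimited files, could look roughly like the following. The field names and row layout are illustrative only, not the actual GDELT schema:

```python
from collections import defaultdict

def merge_by_year(per_book_rows):
    """Group per-book metadata rows by publication year and emit one
    tab-delimited block per year, rows sorted by book id."""
    by_year = defaultdict(list)
    for book_id, year, value in per_book_rows:
        by_year[year].append((book_id, value))
    merged = {}
    for year, rows in by_year.items():
        merged[year] = "\n".join("%s\t%s" % (bid, val)
                                 for bid, val in sorted(rows))
    return merged

# toy input: (book_id, publication_year, computed_metadata)
rows = [("b2", 1850, "10"), ("b1", 1850, "7"), ("b3", 1900, "4")]
out = merge_by_year(rows)
```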
Since very few researchers have the bandwidth, local storage or computing power to process even just the metadata of 3.5 million books, the entire collection was uploaded into Google BigQuery as a public dataset. Using standard SQL queries, you can explore the entire collection in tens of seconds at speeds of up to 45.5GB/s and perform complex analyses entirely in-database.

The entire project, from start to finish, took less than two weeks, a good portion of which consisted of human verification for issues with the publication metadata. This is significant because previous attempts to process even a subset of the collection on a modern HPC supercluster had taken over one month and completed only a fraction of the number of books examined here. The limiting factor was always the movement of data: transferring terabytes of books and their computed metadata across hundreds of processors.

This is where Google’s cloud offerings shine, seemingly purpose-built for data-first computing. In just two weeks, I was able to process 3.5 million books, spinning up a cluster of 160 cores and 1TB of RAM, followed by a single machine with 32 cores, 200GB of RAM, 10TB of SSD disk and 1TB of direct-attached scratch SSD disk. I was able to make the final results publicly accessible through BigQuery at query speeds of over 45.5GB/s.

You can access the entire collection today in BigQuery, explore sample queries, and read more technical detail about the processing pipeline on the GDELT Blog.

I’d like to thank Google, Clemson University, the Internet Archive, HathiTrust, and OCLC for making this project possible, along with all of the contributing libraries and digitization sponsors that have made these digitized books available.

- Posted by Kalev Leetaru, founder of The GDELT Project

How to build mobile apps on Google Cloud Platform

At some point in development, nearly every mobile app needs a backend service. With Google’s services you can rapidly build backend services that:

  • Scale automatically to meet demand
  • Automatically synchronize data across devices
  • Handle the offline case gracefully
  • Send notifications and messages

The following are design patterns you’ll find in Build mobile apps using Google Cloud Platform, which provides a side-by-side comparison of Google services, as well as links to tutorials and sample code.

Real-time data synchronization with Firebase

Firebase is a fully managed platform for building iOS, Android and web apps that provides automatic data synchronization and authentication services.

To understand how using Firebase can simplify app development, consider a chat app. By storing the data in Firebase, you get the benefits of automatic synchronization of data across devices, minimal on-device storage, and an authentication service. All without having to write a backend service.
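To see what that automatic synchronization buys you, here is a deliberately toy, in-memory mock of the idea behind Firebase's realtime database. This is not the Firebase API, just the push-to-all-listeners model that a chat app relies on:

```python
class MiniSyncDB:
    """Toy in-memory stand-in for a realtime database: every connected
    client registers a listener and is notified of each new child."""
    def __init__(self):
        self.messages = []
        self.listeners = []

    def on_child_added(self, callback):
        """Register a client callback, akin to a Firebase child listener."""
        self.listeners.append(callback)

    def push(self, message):
        """Store a message and fan it out to every registered client."""
        self.messages.append(message)
        for notify in self.listeners:
            notify(message)

db = MiniSyncDB()
seen_by_phone, seen_by_laptop = [], []
db.on_child_added(seen_by_phone.append)   # two "devices" subscribe
db.on_child_added(seen_by_laptop.append)
db.push({"from": "alice", "text": "hi"})  # both devices see the message
```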

Add managed computation to Firebase apps with Google App Engine

If your app needs backend computation to process user data or orchestrate events, extending Firebase with App Engine gives you the benefit of automatic real-time data synchronization and an application platform that monitors, updates and scales the hosting environment.

An example of how you can use Firebase with App Engine is an app that implements a to-do list. Using Firebase to store the data ensures that the list is updated across devices. Connecting to your Firebase data from a backend service running on App Engine gives you the ability to process or act on that data; in the case of the to-do app, to send daily reminder emails.
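For the daily reminder emails, App Engine's cron service can invoke a handler on a schedule. A minimal `cron.yaml` sketch, where the `/tasks/send_reminders` URL is a hypothetical handler in your own app:

```yaml
cron:
- description: send daily to-do reminder emails
  url: /tasks/send_reminders
  schedule: every 24 hours
```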

Add flexible computation to Firebase with App Engine Managed VMs

If your mobile backend service needs to call native binaries, write to the file system and make other system calls, extending Firebase with App Engine Managed VMs gives you the benefit of automatic real-time data synchronization and an application platform, with the flexibility to run code outside of the standard App Engine runtime.

Using Firebase and App Engine Managed VMs is similar to using Firebase with App Engine and adds additional options. For example, consider an app that converts chat messages into haikus using a pre-existing native binary. You can use Firebase to store and synchronize the data and connect to that data from a backend service running on App Engine Managed VMs. Your backend service can then detect new messages, call the native binaries to translate them into poetry, and push the new versions back to Firebase.

Automatically generate client libraries with App Engine and Google Cloud Endpoints

Using Cloud Endpoints means you don’t have to write wrappers to handle communication with App Engine. With the client libraries generated by Cloud Endpoints, you can simply make direct API calls from your mobile app.

If you're building an app that does not require real-time data synchronization, or if messaging and synchronization are already part of your backend service, using App Engine with Cloud Endpoints speeds development time by automatically generating client libraries. An example of an app where real-time synchronization is not needed is one that looks up information about retail products and finds nearby store locations.

Have full control with Compute Engine and REST or gRPC

With Google Compute Engine, you create and run virtual machines on Google infrastructure. You have administrator rights to the server and full control over its configuration.

If you have an existing backend service running on a physical or virtual machine, and that service requires a custom server configuration, moving your service to Compute Engine is the fastest way to get your code running on Cloud Platform. Keep in mind that you will be responsible for maintaining and updating your virtual machine.

An example of an app you might run on Compute Engine is a game with a backend service that uses third-party libraries and a custom server configuration to render in-game graphics.
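As a minimal sketch of the "full control" style, here is a tiny REST health endpoint built on Python's standard library, the kind of hand-rolled service you might run on a Compute Engine VM. A real game backend would use a proper framework, routing and authentication; the handler and path here are illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ScoreHandler(BaseHTTPRequestHandler):
    """Tiny REST handler; responds to GET /health with a JSON body."""
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # silence per-request logging in this example

def make_server(port=0):
    """Bind to localhost; port 0 picks a free ephemeral port."""
    return HTTPServer(("127.0.0.1", port), ScoreHandler)
```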

For more information about these designs, as well as information about building your service, testing and monitoring your service and connecting to your service from your mobile app — including sending push notifications — see How to build backend services for mobile apps.

- Posted by Syne Mitchell, Technical Writer, Google Cloud Platform

Build a mobile gaming analytics platform

Popular mobile games can attract millions of players and generate terabytes of game-related data in a short burst of time. This places extraordinary pressure on the infrastructure powering these games and requires scalable data analytics services to provide timely, actionable insights in a cost-effective way.

To address these needs, a growing number of successful gaming companies use Google’s web-scale analytics services to create personalized experiences for their players. They use telemetry and smart instrumentation to gain insight into how players engage with the game and to answer questions like: At what game level are players stuck? What virtual goods did they buy? And what's the best way to tailor the game to appeal to both casual and hardcore players?

A new reference architecture describes how you can collect, archive and analyze vast amounts of gaming telemetry data using Google Cloud Platform’s data analytics products. The architecture demonstrates two patterns for analyzing mobile game events:

  • Batch processing: This pattern helps you process game logs and other large files in a fast, parallelized manner. For example, leading mobile gaming company DeNA moved to BigQuery from Hadoop to get faster query responses for their log file analytics pipeline. In this GDC Lightning Talk video they explain the speed benefits of Google’s analytics tools and how the team was able to process large gaming datasets without the need to manage any infrastructure.
  • Real-time processing: Use this pattern when you want to understand what's happening in the game right now. Cloud Pub/Sub and Cloud Dataflow provide a fully managed way to perform a number of data-processing tasks like data cleansing and fraud detection in real-time. For example, you can highlight a player with maximum hit-points outside the valid range. Real-time processing is also a great way to continuously update dashboards of key game metrics, like how many active users are currently logged in or which in-game items are most popular.
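The hit-points check mentioned above amounts to a simple range filter over the event stream; in the real architecture it would run as a step inside a Cloud Dataflow pipeline. A minimal sketch, with an assumed valid range of 0 to 1000:

```python
def flag_suspicious(events, max_hp=1000):
    """Return players whose reported hit-points fall outside the valid
    range [0, max_hp] -- candidates for fraud review."""
    return [e["player"] for e in events if not (0 <= e["hp"] <= max_hp)]

events = [
    {"player": "alice", "hp": 850},
    {"player": "mallory", "hp": 99999},  # outside the valid range
    {"player": "bob", "hp": 0},
]
cheaters = flag_suspicious(events)
```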

Some Cloud Dataflow features are especially useful in a mobile context, where messages may arrive late because of spotty mobile Internet connections or batteries running out. Cloud Dataflow's built-in session windowing and triggers aggregate events based on the time they actually occurred (event time) rather than the time they're processed, so you can still group events together by user session even if some of them arrive with a delay.
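The gap-based sessionization idea can be illustrated in a few lines. This is a rough, single-user mimic of Dataflow's session windows, not its actual API; the 30-minute gap is an assumed session timeout:

```python
def sessionize(event_times, gap=30 * 60):
    """Group one user's event timestamps (seconds) into sessions by event
    time: a new session starts whenever the gap since the previous event
    exceeds `gap` seconds."""
    sessions, current = [], []
    for ts in sorted(event_times):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# the long pause between 1200s and 9000s splits this into two sessions
stamps = [0, 600, 1200, 9000, 9300]
sessions = sessionize(stamps)
```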

But why choose one pattern over the other? A key benefit of this architecture is that you can write your data pipeline processing once and execute it in either batch or streaming mode without modifying your codebase. So if you start processing your logs in batch mode, you can easily move to real-time processing later. This is an advantage of the high-level Cloud Dataflow model, which Google released as open source.

Cloud Dataflow loads the processed data into one or more BigQuery tables. BigQuery is built for very large scale, and allows you to run aggregation queries against petabyte-scale datasets with fast response times. This is great for interactive analysis and data exploration, like the example screenshot above, where a simple BigQuery SQL query dynamically creates a Daily Active Users (DAU) graph using Google Cloud Datalab.
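The DAU number itself is just a count of distinct users per day; in BigQuery that would be a GROUP BY on the event date with COUNT(DISTINCT user_id). A small in-memory equivalent, with an assumed (unix_timestamp, user_id) event shape:

```python
from collections import defaultdict
from datetime import datetime, timezone

def daily_active_users(events):
    """Count distinct users per UTC calendar day from
    (unix_timestamp, user_id) event tuples."""
    users_by_day = defaultdict(set)
    for ts, user in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}

events = [(0, "a"), (3600, "b"), (3600, "a"), (86400, "a")]
dau = daily_active_users(events)
```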

And what about player engagement and in-game dynamics? The BigQuery example above shows a bar chart of the ten toughest game bosses. It looks like boss10 killed players more than 75% of the time, much more than the next toughest. Perhaps it would make sense to lower the strength of this boss? Or maybe give the player some more powerful weapons? The choice is yours, but with this reference architecture you'll see the results of your changes straight away. Review the new reference architecture to jumpstart your data-driven quest to engage your players and make your games more successful, contact us, or sign up for a free trial of Google Cloud Platform to get started.


- Posted by Oyvind Roti, Solutions Architect