
Last month today: July on GCP

The month of July saw our Google Cloud Next ‘18 conference come and go, and there was plenty of exciting news, updates and demos to share from the show. Here’s a look at some of the most-read blog posts from July.

What caught your attention this month: Creating the open cloud
  • One of the most-read posts this month covered the launch of our Cloud Services Platform, which allows you to build a true hybrid cloud infrastructure. Some of the key components of Cloud Services Platform include the managed Istio service mesh, Google Kubernetes Engine (GKE) On-Prem and GKE Policy Management, Cloud Build for fully managed CI/CD, and several serverless offerings (more on that below). Combined, these technologies can help you gain consistency, security, speed and flexibility of the cloud in your local data center, along with the freedom of workload portability to the environment of your choice.
  • Another popular read was a rundown of Google Cloud’s new serverless offerings. These include core serverless compute announcements such as new App Engine runtimes, Cloud Functions general availability and more. It also included serverless containers, so you can run serverless workloads in a fully managed container environment; GKE Serverless add-on to easily run serverless workloads on Kubernetes Engine; and Knative, the open-source project on which that add-on is built. There are even more features included in this post, too, like Cloud Build, Stackdriver monitoring and Cloud Firestore integration with GCP. 
Bringing detailed metrics and Kubernetes apps to the forefront
  • Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
  • It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.

Building the cloud back-end
  • For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
  • And on that topic, this best practices for building containers post was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips in this blog post cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use. 
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!

Cloud-native architecture with serverless microservices — the Smart Parking story

By Brian Granatir, SmartCloud Engineering Team Lead, Smart Parking

Editor’s note: When it comes to microservices, a lot of developers ask why they would want to manage many services rather than a single, big, monolithic application? Serverless frameworks make doing microservices much easier because they remove a lot of the service management overhead around scaling, updating and reliability. In this first installment of a three-part series, Google Cloud Platform customer Smart Parking gives us their take on event-driven architecture using serverless microservices on GCP. Then read on for parts two and three, where they walk through how they built a high-volume, real-world smart city platform on GCP—with code samples!

Part 1


When "the cloud" first appeared, it was met with skepticism and doubt. “Why would anyone pay for virtual servers?” developers asked. “How do you control your environment?” You can't blame us; we're engineers. We resist change (I still use vim), and believe that proof is always better than a promise. But, eventually we found out that this "cloud thing" made our lives easier. Resistance was futile.

The same resistance to change happened with git (“svn isn't broken”) and docker (“it's just VMs”). Not surprising — for every success story, for every promise of a simpler developer life, there are a hundred failures (Ruby on Rails: shots fired). You can't blame any developer for being skeptical when some random "bloke with a blog" says they found the next great thing.

But here I am, telling you that serverless is the next great thing. Am I just a bloke? Is this a blog? HECK YES! So why should you read on (other than for the jokes, obviously)? Because you might learn a thing or two about serverless computing and how it can be used to solve non-trivial problems.

We developed this enthusiasm for serverless computing while building a smart city platform. What is a smart city platform, you ask? Imagine you connect all the devices and events that occur in a city to improve resource efficiency and quality of citizen life. The platform detects a surge in parking events and changes traffic lights to help the flow of cars leaving downtown. It identifies a severe rainstorm and turns on street lights in the middle of the day. Public trash cans alert sanitation when they are full. Nathan Fillion is spotted on 12th street and it swarm-texts local citizens. A smart city is a vast network of distributed devices (IoT City 2000!) streaming data, plus the methods to easily correlate these events and react to them. In other words, it's a hard problem with a massive scale—perfect for serverless computing!
In-ground vehicle detection sensor


What the heck is serverless?


But before we go into a lot more depth about the platform, let’s define our terms. In this first article, we give a brief overview of the main concepts used in our smart city platform and how they match up with GCP services. Then, in the second article, we'll dive deeper into the architecture and how each specific challenge was met using various serverless solutions. Finally, we'll get extra technical and look at some code snippets and how you can maximize functionality and efficiency. In the meantime, if you have any questions or suggestions, please don't hesitate to leave a comment or email me directly ([email protected]).

First up, domain-driven design (DDD). What is domain-driven design? It's a methodology for designing software with an emphasis on expertise and language. In other words, we recognize that engineering, of any kind, is a human endeavour whose success relies largely on proper communication. A tiny miscommunication [wait, we're using inches?] can lead to massive delays or customer dissatisfaction. Developing a domain helps ensure that everyone (not just the development team) is using the same terminology.

A quick example: imagine you’re working on a job board. A client calls customer support because a job they just posted never appeared online. The support representative contacts the development team to investigate. Unfortunately, they reach your manager, who promptly tells the team, “Hey! There’s an issue with a job in our system.” But the code base refers to job listings as "postings" and the daily database tasks as "jobs." So naturally, you look at the database "jobs" and discover that last night’s materialization failed. You restart the task and let support know that the issue should be resolved soon. Sadly, the customer’s issue was never fixed, because you never looked at the "postings" error.

Of course, there are more potent examples of when language differences between various aspects of the business can lead to problems. Consider the words "output," "yield," and "spike" for software monitoring a nuclear reactor. Or, consider "sympathy" and "miss" for systems used by Klingons [hint: they don’t have words for both]. Is it too extreme to say domain-driven design could save your life? Ask a Klingon if he’ll miss you!

In some ways, domain-driven design is what this article is doing right now! We're establishing a strong, ubiquitous vocabulary for this series so everyone is on the same page. In part two, we'll apply DDD to our example smart city service.

Next, let's discuss event-driven architecture. Event-driven architecture (EDA) means constructing your system as a series of commands and/or events. A user submits an online form to make a purchase: that's a command. The items in stock are reserved: that's an event. A confirmation is sent to the user: that's an event. The concept is very simple. Everything in our system is either a command or an event. Commands lead to events and events may lead to new commands and so on.

Of course, defining events at the start of a project requires a good understanding of the domain. This is why it's common to see DDD and EDA together. That said, the elegance of a true event-driven architecture can be difficult to implement. If everything is a command or an event, where are the objects? I got that customer order, but where do I store the "order" and how do I access it? We'll investigate this in much more detail in part two of this series. For now, all you need to understand is that our example smart city project will be defining everything as commands and events!

Now, onto serverless. Serverless computing simply means using existing, auto-scaling cloud services to achieve system behaviours. In other words, I don't manage any servers or docker containers. I don't set up networks or manage operation (ops). I merely provide the serverless solution my recipe and it handles creation of any needed assets and performs the required computational process. A perfect example is Google BigQuery. If you haven't tried it out, please go do that. It's beyond cool (some kids may even say it's "dank": whatever that means). For many of us, it’s our first chance to interact with a nearly-infinite global compute service. We're talking about running SQL queries against terabytes of data in seconds! Seriously, if you can't appreciate what BigQuery does, then you better turn in your nerd card right now (mine says "I code in Jawa").

Why does serverless computing matter? It matters because I hate being woken up at night because something broke on production! Because it lets us auto-scale properly (instead of the cheating we all did to save money *cough* docker *cough*). Because it works wonderfully with event-driven architectures and microservices, as we'll see throughout parts 2 & 3 of this series.

Finally, what are microservices? Microservices is a philosophy, a methodology, and a swear word. Basically, it means building our system in the same way we try to write code, where each component does one thing and one thing only. No side effects. Easy to scale. Easy to test. Easier said than done. Where a traditional service may be one database with separate read/write modules, an equivalent microservices architecture may consist of sixteen databases each with individual access management.

Microservices are a lot like eating your vegetables. We all know it sounds right, but doing it consistently is a challenge. In fact, before serverless computing and the miracles of Google's cloud queuing and database services, trying to get microservices 100% right was nearly impossible (especially for a small team on a budget). However, as we'll see throughout this series, serverless computing has made microservices an easy (and affordable) reality. Potatoes are now vegetables!

With these four concepts, we’ve built a serverless sandwich, where:
  • Domain-driven design is the peanut butter, defining the language and context of our project 
  • Event-driven architecture is the jelly, limiting the scope of our domain to events 
  • Microservices are the bread, limiting our architecture to tiny components that react to single event streams
And finally, serverless is having someone else make the sandwich for you (and cutting off the crust), running components on auto-scaling, auto-maintained compute services.

As you may have guessed, we're going to have a microservice that reacts to every command and event in our architecture. Sounds crazy, but as you'll see, it's super simple, incredibly easy to maintain, and cheap. In other words, it's fun. Honestly, remember when coding was fun? Time to recapture that magic!

To repeat, serverless computing is the next big thing! It's the peanut butter and jelly sandwich of software development. It’s an uninterrupted night’s sleep. It's the reason I fell back in love with web services. We hope you’ll come back for part two where we take all these ideas and outline an actual architecture.

What we learned doing serverless — the Smart Parking story



Part 3 

You made it through all the fluff and huff! Welcome to the main event. Time for us to explore some key concepts in depth. Of course, we won't have time to cover everything. If you have any further questions, or recommendations for a follow-up (or a prequel . . . who does a prequel to a tech blog?), please don't hesitate to email me.

"Indexing" in Google Cloud Bigtable


In parts one & two, we mentioned Cloud Bigtable a lot. It's an incredible, serverless database with immense power. However, like all great software systems, it's designed to deal with a very specific set of problems. Therefore, it has constraints on how it can be used. There are no traditional indexes in Bigtable. You can't say "index the email column" and then query it later. Wait. No indexes? Sounds useless, right? Yet, this is the storage mechanism Google uses to run the sites our lives depend on: Gmail, YouTube, Google Maps, etc. But I can search in those. How do they do it without traditional indexes? I'm glad you asked!!

The answer has two parts: (1) using rowkeys and (2) data mitosis. Let's take a look at both. But, before we do that, let's make one very important point: Never assume you have expertise in anything just because you read a blog about it!!!

I know, it feels like reading this blog [with its overflowing abundance of awesome] might be the exception. Unfortunately, it's not. To master anything, you need to study the deepest parts of its implementation and practice. In other words, to master Bigtable you need to understand "what" it is and "why" it is. Fortunately, Bigtable implements the HBase API. This means you can learn heaps about Bigtable and this amazing data storage and access model by reading the plentiful documentation on HBase, and its sister project Hadoop. In fact, if you want to understand how to build any system for scale, you need to have at least a basic understanding of MapReduce and Hadoooooooooop (little known fact, "Hadoop" can be spelt with as many o's as desired; reduce it later).

If you just followed the concepts covered in this blog, you'd walk away with an incomplete and potentially dangerous view of Bigtable and what it can do. Bigtable will change your development life, so at least take it out to dinner a few times before you get down on one knee!

Ok, got the disclaimer out of the way, now onto rowkeys!

Rowkeys are the only form of indexing provided in Bigtable. A rowkey is the ID used to distinguish individual rows. For example, if I was storing a single row per user, I might have the rowkeys be the unique usernames. For example:
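Something like this (the usernames here are made up for illustration):

  alice
  brian
  carol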
We can then scan and select rows by using these keys. Sounds simple enough. However, we can make these rowkeys be compound indexes. That means that we carry multiple pieces of information within a single rowkey. For example, what if we had three kinds of users: admin, customer and employee. We can put this information in the rowkey. For example:
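For instance, prefix each rowkey with the user type (again, purely illustrative values):

  admin#brian
  customer#alice
  employee#carol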


(Note: We're using # to delineate parts of our rowkey, but you can use any special character you want.)

Now we can query for user type easily. In other words, I can easily fetch all "admin" user rows by doing a prefix search (i.e., find all rows that start with "admin#"). We can get really fancy with our rowkeys too. For example, we can store user messages using something like:
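One plausible (purely illustrative) scheme keys each message by recipient, then timestamp, then sender:

  alice#20180701T120000#brian
  alice#20180701T120110#carol
  brian#20180701T120300#alice

A prefix scan on "alice#" now returns every message sent to alice, in time order.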
However, we cannot search for the latest 10 messages by Brian using rowkeys. Also, there's no easy way to get a series of related messages in order. Maybe I need a unique conversation ID that I put at the start of each rowkey? Maybe.

Determining the right rowkeys is paramount to using Bigtable effectively. However, you'll need to watch out for hotspotting (a topic not covered in this blog post). Also, any Bigtablians out there will be very upset with me because my examples don't show column families. Yeah, Bigtable must have the best holidays, because everything is about families.

So, we can efficiently search our rows using rowkeys, but this may seem very limited. Who could design a single rowkey that covers every possible query? You can't. This is where the second major concept comes in: data mitosis.

What is data mitosis? It's replication of data into multiple tables that are optimized for specific queries. What? I'm replicating data just to overcome indexing limits? This is madness. NO! THIS. IS. SERVERLESS!

While it might sound insane, storage is cheap. In fact, storage is so cheap, we'd be naive not to abuse it. This means that we shouldn't be afraid to store our data as many times as we want simply to improve overall access. Bigtable works efficiently with billions of rows. So go ahead and have billions of rows. Don't worry about capacity or maintaining a monstrous data cluster; Google does that for you. This is the power of serverless. I can do things that weren't possible before. I can take a single record and store it ten (or even a hundred) times just to make data sets optimized for specific usages (i.e., for specific queries).

To be honest, this is the true power of serverless. To quote myself, storage is magic!

So, if I needed to access all messages in my system for analytics, why not make another view of the same data:
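For instance (an illustrative second table, not a prescribed schema), the same messages could be written again with the date first:

  20180701#alice#brian
  20180701#brian#carol
  20180702#carol#alice

Same data, different rowkeys: a prefix scan on a date now returns every message in the system for that day, which is exactly what an analytics job wants.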
Of course, data mitosis means you have insanely fast access to data but it isn't without cost. You need to be careful in how you update data. Imagine the bookkeeping nightmare of trying to manage synchronized updates across dozens of data replicants. In most cases, the solution is never updating rows (only adding them). This is why event-driven architectures are ideal for Bigtable. That said, no database is perfect for all problems. That's why it's great that I can have SQL, noSQL, and HBASE databases all running for minimal costs (with no maintenance) using serverless! Why use only one database? Use them all!

Exporting to BigQuery


In the previous section we learned about the modern data storage model: store everything in the right database and store it multiple times. It sounds wonderful, but how do we run queries that transcend this eclectic set of data sources? The answer is. . . we cheat. BigQuery is cheating. I cannot think of any other way of describing the service. It's simply unfair. You know that room in your house (or maybe in your garage)? That place where you store EVERYTHING—all the stuff you never touch but don't want to toss out? Imagine if you had a service that could search through all the crap and instantly find what you're looking for. Wouldn't that be nice? That's BigQuery. If it existed IRL . . . it would save marriages. It's that good.

By using BigQuery, we can scale our searches across massive data sets and get results in seconds. Seriously. All we need to do is make our data accessible. Fortunately, BigQuery already has a bunch of onramps available (including pulling your existing data from Google Cloud Storage, CSVs, JSON, or Bigtable), but let's assume we need something custom. How do you do this? By streaming the data directly into BigQuery! Again, we're going to replicate our data into another place just for convenience. I would've never considered this until serverless made it cheap and easy.

In our architecture, this is almost too easy. We simply add a Cloud Function that listens to all our events and streams them into BigQuery. Just subscribe to the Pub/Sub topics and push. It’s so simple. Here's the code:

exports.exportStuffToBigQuery = function exportStuffToBigQuery( event, callback ) {
    // parseEvent() is part of our boilerplate (more on that below): it decodes the Pub/Sub payload.
    return parseEvent(event)
    .then(( eventData ) => {
      const BigQuery = require('@google-cloud/bigquery')();
      // Stream the event row straight into the target table.
      return BigQuery.dataset('myData').table('stuff').insert(eventData);
    })
    .then(( ) => callback())
    .catch(( err ) => callback(err))  // surface failures so the function doesn't silently hang
  };

That's it! Did you think it was going to be a lot of code? These are Cloud Functions. They should be under 100 lines of code. In fact, they should be under 40. With a bit of boilerplate, we can make this one line of code:

exports.exportStuffToBigQuery = function exportStuffToBigQuery( event, callback ) {
    myFunction.run(event, callback, ( data ) => myFunction.sendOutput(data));
  };

Ok, but what is the boilerplate code? More on that in the next section. This section is short, as it should be. Honestly, getting data into BigQuery is easy. Google has provided a lot of input hooks and keeps adding more. Once you have the data in there (regardless of size), you can just run the standard SQL queries you all know and love (or loathe). Up, up, down, down, left, right, left, right, B, A!


Cloud Function boilerplate


Cloud Functions use Node? Cloud Functions are JavaScript? Gag! Yes, that was my initial reaction. Now (9 months later), I never want to write anything else in my career. Why? Because Cloud Functions are simple. They are tiny. You don't need a big robust programming language when all you're doing is one thing. In fact, this is a case where less is more. Keep it simple! If your Cloud Function is too complex, break it apart.

Of course, there are a sequence of steps that we do in every Cloud Function:

  1) Parse trigger
  2) Do stuff
  3) Send output
  4) Issue callback
  5) Catch errors

The only thing we should be writing is step 2 (and sometimes step 5). This is where boilerplate code comes in. I like my code like I like my wine: DRY! [DRY = Don't Repeat Yourself, btw].

So write the code to parse your triggers and send your outputs once. There are more steps! The actual sequence of steps for a Cloud Function is:


  1) Filter unneeded events
  2) Log start
  3) Parse trigger
  4) Do stuff
  5) Send output(s)
  6) Issue retry for timeout errors
  7) Catch and log all fatal errors (no retry)
  8) Issue callback
  9) Do it all asynchronously
  10) Test above code
  11) Create environment resources (if needed)
  12) Deploy code

Ugh. So our simple Cloud Functions just became a giant list of steps. It sounds painful, but it can all be overcome with some boilerplate code and an understanding of how Cloud Functions work at a larger level.

How do we do this? By adding a common configuration for each Cloud Function that can be used to drive testing, deployment and common behaviour. All our Cloud Functions start with a block like this:

const options = {
    functionName: 'doStuff',                     // name used for logging and deployment
    trigger: 'stuff-commands',                   // the Pub/Sub topic that triggers this function
    triggerType: 'pubsub',                       // how to parse the incoming trigger
    aggregateType: 'devices',                    // which aggregate this function works against
    aggregateSource: 'bigtable',                 // where that aggregate is stored
    targets: [                                   // outputs: resources created at deploy if needed
      { type: 'bigtable', name: 'stuff' },
      { type: 'pubsub', name: 'stuff-events'}
    ],
    filter: [ 'UpdateStuff' ]                    // only handle these event/command subtypes
  };

It may seem basic, but this understanding of Cloud Functions allows us to create a harness that can perform all of the above steps. We can deploy a Cloud Function if we know its trigger and its type. Since everything is inside GCP, we can easily create resources if we know our output targets and their types. We can perform efficient logging and track data through our system by knowing the start and end point (triggers and targets) for each function. The filter allows us to limit which events arriving in a Pub/Sub topic are handled.
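To make that concrete, here's a minimal sketch of what such a harness might look like. This is not Smart Parking's actual boilerplate; the helper names (parseTrigger, sendToTargets) and the module shape are assumptions for illustration only.

// Illustrative boilerplate only; helper names and error handling are assumptions,
// not Smart Parking's actual code.
const PubSub = require('@google-cloud/pubsub')();

function parseTrigger( triggerType, event ) {
  // Pub/Sub triggers deliver the payload base64-encoded in event.data.data
  if (triggerType === 'pubsub') {
    return Promise.resolve(JSON.parse(Buffer.from(event.data.data, 'base64').toString()));
  }
  return Promise.resolve(event.data);
}

function sendToTargets( targets, output ) {
  if (!output) { return Promise.resolve(); }
  return Promise.all(targets.map(( target ) => {
    if (target.type === 'pubsub') {
      return PubSub.topic(target.name).publish(output);   // fan out to the next queue
    }
    return Promise.resolve();   // other target types (bigtable, etc.) would be handled here
  }));
}

module.exports = function makeFunction( options ) {
  return {
    run: function run( event, callback, doStuff ) {
      console.log(options.functionName + ': start');                    // log start
      return parseTrigger(options.triggerType, event)                   // parse trigger
        .then(( data ) => {
          // filter unneeded events by fact subtype
          if (options.filter && data.fact &&
              options.filter.indexOf(data.fact.subtype) === -1) { return null; }
          return Promise.resolve(doStuff(data))                         // do stuff
            .then(( output ) => sendToTargets(options.targets, output)); // send output(s)
        })
        .then(() => callback())                                         // issue callback
        .catch(( err ) => {
          if (/deadline|timeout/i.test(String(err))) { return callback(err); }  // retry timeouts
          console.error(options.functionName + ': fatal', err);         // log fatal errors, no retry
          return callback();
        });
    }
  };
};

With something like that in place, each deployed function only supplies its options block and its "do stuff" handler, which is how the one-line export shown earlier stays one line.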

So, what's the takeaway for this section? Make sure you understand Cloud Functions fully. See them as tiny connectors between a given input and target output (preferably only one). Use this to make boilerplate code. Each Cloud Function should contain a configuration and only the lines of code that make it unique. It may seem like a lot of work, but making a generic methodology for handling Cloud Functions will liberate you and your code. You'll get addicted and find yourself sheepishly saying, "Yeah, I kinda like JavaScript, and you know . . .Node" (imagine that!)

Testing


We can't end this blog without a quick talk on testing. Now let me be completely honest. I HATED testing for most of my career. I'm flawless, so why would I want to write tests? I know, I know . . . even a diamond needs to be polished every once in a while.

That said, now I love testing. Why? Because testing Cloud Functions is super easy. Seriously. Just use Ava and Sinon and "BAM". . . sorted. It really couldn't be simpler. In fact, I wouldn't mind writing another series of posts on just testing Cloud Functions (a blog on testing, who'd read that?).

Of course, you don't need to follow my example. Those amazing engineers at Google already have examples for almost every possible subsystem. Just take a look at their Node examples on GitHub for Cloud Functions: https://github.com/GoogleCloudPlatform/nodejs-docs-samples/tree/master/functions [hint: look in the test folders].

For many of you, this will be very familiar. What might be new is integration testing across microservices. Again, this could be an entire series of articles, but I can provide a few quick tips here.

First, use Google's emulators. They have them for just about everything (Pub/Sub, Datastore, Bigtable, Cloud Functions). Getting them set up is easy. Getting them to all work together isn't super simple, but not too hard. Again, we can leverage our Cloud Function configuration (seen in the previous section), to help drive emulation.

Second, use monitoring to help design integration testing. What is good monitoring if not a constant integration test? Think about how you would monitor your distributed microservices and how you'd look at various data points to look for slowness or errors. For example, I'd probably like to monitor the average time it takes for a single input to propagate across my architecture and send alerts if we slip beyond standard deviation. How do I do this? By having a common ID carried from the start to end of a process.

Take our architecture as an example. Everything is a chain of commands and events. Something like this:
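In our smart city terms, the chain looks roughly like this: an UpdateReadings command produces a ReadingsUpdated event, which triggers a frame-building command, which produces a FrameUpdated event, which may in turn produce a new command (say, IssueTicket).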
If we have a single ID that flows through this chain, it'll be easy for us to monitor (and perform integration testing). This is why it's great to have a common parent for both commands and events. This is typically referred to as a "fact." So everything in our system is a "fact." The JSON might look something like this:

{
    fact: {
      id: "19fg-3fsf-gg49",
      type: "Command",
      subtype: "UpdateReadings"
    },
    readings: {}
  }

As we move through our chain of commands and events, we change the fact type and subtype, but never the ID. This means that we can log and monitor the flow of each of our initial inputs as it migrates through the system.

Of course, as with all things monitoring (and integration testing), life isn't so simple. This stuff is hard. You simply cannot perfect your monitoring or integration testing. If you did, you would've solved the Halting Problem. Seriously, if I could give any one piece of advice to aspiring computer scientists, it would be to fully understand the Halting Problem and the CAP theorem.

Pitfalls of serverless


Serverless has no flaws! You think I'm joking. I'm not. The pitfalls in serverless have nothing to do with the services themselves. They all have to do with you. Yep. True serverless systems are extremely powerful and cost-efficient. The only problem: developers have a tendency to use these services wrong. They don't take the time to truly understand the design and rationale of the underlying technologies. Google uses these exact same services to run the world's largest and most performant web applications. Yet, I hear a lot of developers complaining that these services are just "too slow" or "missing a lot."

Frankly, you're wrong. Serverless is not generic. It isn't compute instances that let you install whatever you want. That’s not what we want! Serverless is a compute service that does a specific task. Now, those tasks may seem very generic (like databases or functions), but they're not. Each offering has a specific compute model in mind. Understanding that model is key to getting the maximum value.

So what is the biggest pitfall? Abuse. Serverless lets you do a LOT for very cheap. That means the mistakes are going to come from your design and implementation. With serverless, you have to really embrace the design process (more than ever). Boiling your problem down to its most fundamental elements will let you build a system that doesn't need to be replaced every three to four years. To get where we needed to be, my team rebuilt the entire kernel of our service three times in one year. This may seem like madness and it was. We were our own worst enemy. We took old (and outdated) notions of software and web services and kept baking them into the new world. We didn't believe in serverless. We didn't embrace data mitosis. We resisted streaming. We didn't put data first. All mistakes. All because we didn’t fully understand the intent of our tools.

Now, we have an amazing platform (with a code base reduced by 80%) that will last for a very long time. It's optimized for monitoring and analytics, but we didn't even need to try. By embracing data and design, we got so much for free. It's actually really easy, if you get out of your own way.

Conclusion 


As development teams begin to transition to a world of IoT and serverless, they will encounter a unique set of challenges. The goal of this series was to provide an overview of recommended techniques and technologies used by one team to ship an IoT/serverless product. A quick summary of each part is as follows:

Part 1 - Getting the most out of serverless computing requires a cutting-edge approach to software design. With the ability to rapidly prototype and release software, it’s important to form a flexible architecture that can expand at the speed of inspiration. Sounds cheesy, but who doesn’t love cheese? Our team utilized domain-driven design (DDD) and event-driven architecture (EDA) to efficiently define a smart city platform. To implement this platform, we built microservices deployed on serverless compute services.

Biggest takeaway: serverless now makes event-driven architecture and microservices not only a reality, but almost a necessity. Viewing your system as a series of events will allow for resilient design and efficient expansion.

Part 2 - Implementation of an IoT architecture on serverless services is now easier than ever. On Google Cloud Platform (GCP), powerful serverless tools are available for:

  • IoT fleet management and security -> IoT Core 
  • Data streaming and windowing -> Dataflow 
  • High-throughput data storage -> Bigtable 
  • Easy transaction data storage -> Datastore 
  • Message distribution -> Pub/Sub 
  • Custom compute logic -> Cloud Functions 
  • On-demand analytics of disparate data sources -> BigQuery

Combining these services allows any development team to produce a robust, cost-efficient and extremely performant product. Our team uses all of these and was able to adopt each new service within a single one-week sprint.

Biggest takeaway: DevOps is dead. Serverless systems (with proper non-destructive, deterministic data management and testing) mean that we’re just developers again! No calls at 2am because some server got stuck? Sign me up for serverless!

Part 3 - To be truly serverless, a service must offer a limited set of computational actions. In other words, to be truly auto-scaling and self-maintaining, the service can’t do everything. Understanding the intent and design of the serverless services you use will greatly improve the quality of your code. Take the time to understand the use-cases designed for the service so that you extract the most. Using a serverless offering incorrectly can lead to greatly reduced performance.

For example, Pub/Sub is designed to guarantee rapid, at-least-once delivery. This means messages may arrive multiple times or out-of-order. That may sound scary, but it’s not. Pub/Sub is used by Google to manage distribution of data for their services across the globe. They make it work. So can you. Hint: consider deterministic code. Hint, hint: If order is essential at time of data inception, use windowing (see Dataflow).
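As a tiny, hypothetical illustration of what deterministic code buys you: if the rowkey you write is derived purely from the message itself, an at-least-once redelivery simply overwrites the same row instead of creating a duplicate.

// Illustrative only: same message in, same rowkey out, so duplicate deliveries are harmless.
function rowkeyFor( fact, reading ) {
  return reading.deviceId + '#' + fact.id;
}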

Biggest takeaway: Don’t try to use a hammer to clean your windows. Research serverless services and pick the ones that suit your problem best. In other words, not all serverless offerings are created equal. They may offer the same essential API, but the implementation and goals can be vastly different.

Finally, before we part, let me say, “Thank you.” Thanks for following through all my ramblings to reach this point. There's a lot of information, and I hope that it gives you a preview of the journey that lies ahead. We're entering a new era of web development. It's a landscape full of treasure, opportunity, dungeons and dragons. Serverless computing lets us discard the burden of DevOps and return to the adventure of pure coding. Remember when all you did was code (not maintenance calls at 2am)? It's time to get back there. I haven't felt this happy in my coding life in a long time. I want to share that feeling with all of you!

Please, send feedback, requests, and dogs (although, I already have 7). Software development is a never-ending story. Each chapter depends on the last. Serverless is just one more step on our shared quest for holodecks. Yeah, once we have holodecks, this party is over! But until then, code as one.

Implementing an event-driven architecture on serverless — the Smart Parking story



Part 2 


In this article, we’re going to explore how to build an event-driven architecture on serverless services to solve a complex, real-world problem. In this case, we’re building a smart city platform. An overview of the domain can be found in part one. If you haven’t read part one, please go take a look now. Initial reviews are in, and critics are saying “this may be the most brilliantly composed look at modern software development; where’s dinner?” (please note: in this case the "critics" are my dogs, Dax and Kiki).

Throughout this part, we’ll be slowly building up an architecture. In part three, we’ll dive deeper into some of the components and review some actual code. So let’s get to it. Where do we start? Obviously, with our input!

Zero step: defining our domain 


Before we begin, let’s define the domain of a smart city. As we learned in the previous post, defining the domain means establishing a clear language and terminology for referencing objects and processes in our software system. Of course, creating this design is typically more methodical, iterative, and far more in-depth. It would take a genius to just snap and put an accurate domain at the end of a blog post (it’s strange that I never learned to snap my fingers, right?).

Our basic flow for this project: a network of distributed IoT (Internet of Things) devices sends periodic readings that are used to define the frames of larger correlated events throughout a city.
  • Sensor - electronic device that's capable of capturing and reporting one or more specialized readings 
  • Gateway - an internet-connected hub that's capable of receiving readings from one or more sensors and sending these packages to our smart cloud platform 
  • Device - the logical combination of a sensor and its reporting gateway (used to define a clear split between the onramp and processing) 
  • Readings - key-value pairs (e.g., { temperature: 35, battery: "low" } ) sent by sensors 
  • UpdateReadings - the command to update readings for a specific device 
  • ReadingsUpdated - the event that occurs in our system when new readings are received from a device (response to an UpdateReadings command) 
  • Frame - a collection of correlated / collocated events (typically ReadingsUpdated) used to drive business logic through temporal reasoning [lots more on this later] 
  • Device Report - an analytic view of devices and their health metrics (typically used by technicians) 
  • Event Report - an analytic view of frames (typically used by business managers) 
If we connect all of these parts together in a diagram, and add some serverless glue (parts in bold), we get a nice overview of our architecture:

Of course, there's a fair bit of missing glue in the above diagram. For example, how do we take an UpdateReadings command and get it into Bigtable? This is where my favorite serverless service comes into play: Cloud Functions! How do we install devices? Cloud Functions. How do we create organizations? Cloud Functions. How do we access data through an API? Cloud Functions. How do we conquer the world? Cloud Functions. Yep, I’m in love!

Alright, now we have our baseline, let’s spend the rest of this post exploring just how we go about implementing each part of our architecture and dataflows.

First step: inbound data


Our smart city platform is nothing more than a distributed network of internet-connected (IoT) devices. These devices are composed of one or more sensors that capture readings and their designated gateways that help package this data and send it through to our cloud.

For example, we may have an in-ground sensor used to detect a parked car. This sensor reports IR and magnetic readings that are transferred through RF (radio frequencies) to a nearby gateway. Another example is a smart trash can that monitors capacity and broadcasts when the bin is full.

The challenge of IoT-based systems has always been collecting data, updating in-field devices, and security. We could write an entire series of articles on how to deal with these challenges. In fact, the burden of this task is the reason we haven’t seen many sophisticated, generic IoT platforms. But not anymore! The problem has been solved for us by those wonderful engineers at Google. Cloud IoT Core is a serverless service offered by Google Cloud Platform (GCP) that helps you skip all the annoying steps. It’s like jumping on top of the bricks!

Wait . . . does anyone get that reference anymore? Mario Brothers. The video game for the NES. You could jump on top of the ceiling of bricks to reach a secret pipe that let you skip a bunch of levels. It was a pipe because you were a plumber . . .  fighting a turtle dragon to save a princess. And you could throw fireballs by eating flowers. Trust me, it made perfect sense!

Anyway! Cloud IoT Core is the secret passage that lets you skip a bunch of levels and get to the good stuff. It scales automatically and is simple to use. Seriously, don’t spend any time managing your devices and securing your streams. Let Google do that for you.

So, sensors are observing life in our city and streaming this data to IoT Core. Where does it end up after IoT Core? In Cloud Pub/Sub, Google’s serverless queuing service. Think of it as a globally distributed subscription queue with guaranteed delivery. The result: our vast network of data streams has been converted to a series of queues that our services can subscribe to. This is our inbound pipeline. It scales nearly infinitely and requires no operation support. Think about that. We haven’t even written any code yet and we already have an incredibly robust architecture. Trust me, it took my team only a week to move our existing device onramp over to IoT Core—it’s that straightforward. And how many problems have we had? How many calls at 3 AM to fix the inbound data? Zero. They should call it opsless rather than serverless!

Anyway, we got our data streaming in. So far, our architecture looks like this:
While we’re exploring a smart city platform made from IoT devices, you can use this pipeline with almost any architecture. Just replace the IoT Core box with your onboarding service and output to Pub/Sub. If you still want that joy of serverless (and no calls at 3 AM), then consider using Google Cloud Dataflow as your onramp!

What is Dataflow? It's a serverless implementation of a Hadoop-like pipeline used for the transformation and enriching of streaming or batch data. Sounds really fancy, and it actually is. If you want to know just how fancy, grab any data engineer and ask for their war stories on setting up and maintaining a Hadoop cluster (it might take a while; bring popcorn). In our architecture, it can be used to onramp data from an external source, to help with efficient formation of aggregates (i.e., to MapReduce a large number of events), and to help with windowing for streaming data. This is huge. If you know anything about streaming data, then you’ll know the value of a powerful, flexible windowing service.

Ok, now that we got streaming data, let’s do something with it!

Second step: normalizing streaming data

How is a trash can like a street lamp? How is a parking sensor like a barometer? How is a phaser like a lightsaber? These are questions about normalization. We have a lot of streaming data, but how do we find a common way of correlating it all?

For IoT this is a complex topic and more information can be found in this whitepaper. Of course, the only whitepaper in most developers' lives comes on a roll. So here is a quick summary:

How do we normalize the data streaming from our distributed devices? By converting them all to geolocated events. If we know the time and location of a sensor reading, we can start to colocate and correlate events that can lead to action. In other words, we use location and time to help us build a common reference point for everything going on in our city.

Fortunately, many (if not all devices) will already need some form of decoding / translation. For example, consider our in-ground parking sensor. Since it's transmitting data over radio frequencies, it must optimize and encode data. Decoding could happen in the gateway, but we prefer a gateway to contain no knowledge of the devices it services. It should just act as the doorway to the world wide web (for all the Generation Z folks out there, that’s what the "www." in urls stands for).

Ideally, devices would all natively speak "smart city" and no decoding or normalization would be needed. Until then, we still need to create this step. Fortunately, it is super simple with Cloud Functions.

Cloud Functions is a serverless compute offering from Google. It allows us to run a chunk of code whenever a trigger occurs. We simply supply the recipe and identify the trigger and Google handles all the scaling and compute resource allocation. In other words, all I need to do is write the 20-50 lines of code that makes my service unique and never worry about ops. Pretty sweet, huh?

So, what’s our trigger? A Pub/Sub topic. What’s our code? Something like this:

function decodeParkingReadings( triggerInput ) {
    return parsePubSubMessage(triggerInput)
    .then(decode)
    .then(convertToCommand)
    .then(( command ) => PubSub.topic('DeviceCommands').publish(command))
  }

If you’re not familiar with promises and async coding in JavaScript, the above code simply does the following:
  1. Parse the message from Pub/Sub 
  2. When this is done, decode the payload byte string sent by the sensor 
  3. When this is done, wrap the decoded readings with our normalized UpdateReadings command data struct 
  4. When this is done, send the normalized event to the Device Readings Pub/Sub 
Of course, you’ll need to write the code for the "decode" and "convertToCommand" functions. If there's no timestamp provided by the device, then it would need to be added in one of these two steps. We’ll get more in-depth into code examples in part three.
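To make that concrete, here's a hedged sketch of what a convertToCommand function might look like; the field names follow the "fact" structure described elsewhere in this series and are assumptions, not the production code.

const uuidv4 = require('uuid/v4');

// Illustrative only: wrap decoded sensor readings in a normalized UpdateReadings command.
function convertToCommand( decodedReadings ) {
  return {
    fact: {
      id: uuidv4(),               // the common ID that follows this input through the system
      type: 'Command',
      subtype: 'UpdateReadings'
    },
    readings: Object.assign(
      { timestamp: decodedReadings.timestamp || new Date().toISOString() },  // add one if missing
      decodedReadings
    )
  };
}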

So, in summary, the second step is to normalize all our streams by converting them into commands. In this case, all sensors are sending in a command to UpdateReadings for their associated device. Why didn’t we just create the event? Why bother making a command? Remember, this is an event-driven architecture. This means that events can only be created as a result of a command. Is it nitpicky? Very. But is it necessary? Yes. By not breaking the command -> event -> command chain, we make a system that's easy to expand and test. Without it, you can easily get lost trying to track data through the system (yes, a lot more on tracking data flows later).

So our architecture now looks like this:
Data streams coming into our platform are decoded using bespoke Cloud Functions that output a normalized, timestamped command. So far, we’ve only had to write about 30 - 40 lines of code, and the best part . . . we’re almost halfway complete with our entire platform.

Now that we have commands, we move onto the real magic. . . storage. Wait, storage is magic?

Third step: storage and indexing events


Now that we've converted all our inbound data into a sequence of commands, we’re 100% into event-driven architecture. This means that now we need to address the challenges of this paradigm. What makes event-driven architecture so great? It makes sense and is super easy to extend. What makes event-driven architecture painful? Doing it right has been a pain. Why? Because you only have commands and events in your system. If you want something more meaningful you need to aggregate these events. What does that mean? Let’s consider a simple example.

Let’s say you’ve got an event-driven architecture for a website that sells t-shirts. The orders come in as commands from a user-submitted web form. Updates also come in as commands. On the backend, we store only the events. So consider the following event sequence for a single online order:

1 - (Order Created) 
     orderNumber: 123foo
     items: [ item: redShirt, size: XL, quantity: 2 ]
     shippingAddress: 123 Bar Lane
2 - (Address Changed)
     orderNumber: 123foo
     shippingAddress: 456 Infinite Loop
3 - (Quantity Changed)
      orderNumber: 123foo
      items: [ item: redShirt, size: XL, quantity: 1 ]

You cannot get an accurate view of the current order by looking at only one event. If you only looked at #2 (Address Changed), you wouldn’t know the item quantities. If you only looked at #3 (Quantity Change), you wouldn’t have the address.

To get an accurate view, you need to "replay" all the events for the order. In event-driven architecture, this process is often referred to as "hydrating." Alternatively, you can maintain an aggregate view of the order (the current state of the order) in the database and update it whenever a new command arrives. Both of these methods are correct. In fact, many event-driven architectures use both hydration and aggregates.
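As a rough sketch of hydration (not the shop's actual code, and the event shape is assumed), replaying is just a fold over the order's events in time order:

// Illustrative only: merge each event's payload over the running aggregate, oldest first.
function hydrateOrder( events ) {
  return events
    .slice()
    .sort(( a, b ) => a.timestamp - b.timestamp)
    .reduce(( order, event ) => Object.assign({}, order, event.payload), {});
}

// Replaying the three t-shirt events above yields the current state of order 123foo:
// { orderNumber: '123foo', items: [ { item: 'redShirt', size: 'XL', quantity: 1 } ],
//   shippingAddress: '456 Infinite Loop' }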

Unfortunately, implementing consistent hydration and/or aggregation isn’t easy. There are libraries and even databases designed to handle this, but that was before the wondrous powers of serverless computing. Enter Google Cloud Bigtable and BigQuery.

Bigtable is the database service that Google uses to index and search the entire internet. Let that sink in. When you do a Google search, it's Bigtable that gets you the data you need in a blink of an eye. What does that mean? Unmatched power! We’re talking about a database optimized to handle billions of rows with millions of columns. Why is that so important? Because it lets us do event-driven architecture right!

For every command we receive, we create a corresponding event that we store in Bigtable.

Wow.
This blogger bloke just told us that we store data in a database.
Genius.

Why thank you! But honestly, it's the aspects of the database that matters. This isn’t just any database. Bigtable lets us optimize without optimizing. What does that mean? We can store everything and anything and access it with speed. We don’t need to write code to optimize our storage or build clever abstractions. We just store the data and retrieve it so fast that we can aggregate and interpret at access.

Huh?

Let me give you an example that might help explain the sheer joy of having a database that you cannot outrun.

These days, processors are fast. So fast that the slowest part of computing is loading data from the disk (even SSD). Therefore, most of the world’s most performant systems use aggressive compression when interacting with storage. This means that we'll compress all writes going to disk to reduce read-time. Why? Because we have so much excess processing power, it's faster to decompress data rather than read more bytes from disk. Go back 20 years and tell developers that we would "waste time" by compressing everything going to disk and they would tell you that you’re mad. You’ll never get good performance if you have to decompress everything coming off the disk!

In fact, most platforms go a step further and use the excessive processing power to also encrypt all data going to disk. Google does this. That’s why all your data is secure at rest in their cloud. Everything written to disk is compressed AND encrypted. That goes for their data too, so you know this isn’t impacting performance.

Bigtable is very much the same thing for web services. Querying data is so fast that we can perform processing post-query. Previously, we would optimize our data models and index the heck out of our tables just to reduce query time inside the database. When the database can query across billions of rows in 6 ms, that changes everything. Now we just store and store and store and process later.

This is why storing your data in a database is amazing, if it's the right database!

So how do we deal with aggregation and hydration? We don’t. At least not initially. We simply accept our command, load any auxiliary lookups needed (often the sensor / device won’t be smart enough to know its owner or location), and then save it to Bigtable. Again, we’re getting a lot of power with very little code and effort.

But wait, there’s more! I also mentioned BigQuery. This is the service that allows us to run complex SQL queries across massive datasets (even datasets stored in a non-SQL database). In other words, now that I’ve stored all this data, how do I get meaning from it? You could write a custom service, or just use BigQuery. It will let you perform queries and aggregations across terabytes of data in seconds.

So yes, for most of you, this could very well be the end of the architecture:
Seriously, that’s it. You could build any modern web service (social media, music streaming, email) using this architecture. It will scale infinitely and have a max response time for queries of around 3 - 4 seconds. You would only need to write the initial normalization and the required SQL queries for BigQuery. If you wanted even faster responses, you could target queries directly at Bigtable, but that requires a little more effort.

This is why storage is magic. Pick the right storage and you can literally build your entire web service in two weeks and never worry about scale, ops, or performance. Bigtable is my Patronus!!

Now we could stop here. Literally. We could make a meaningful and useful city platform with just this architecture. We’d be able to make meaningful reports and views on events happening throughout our city. However, we want more! Our goal is to make a smart city, one that automatically reacts to events.

Fourth step: temporal reasoning


Ok, this is where things start to get a little more complex. We have events—a lot of events. They are stored, timestamped and geolocated. We can query this data easily and efficiently. However, we want to make our system react.

This is where temporal reasoning comes in. The fundamental idea: we build business rules and insights based on the temporal relationship between grouped events. These collections of related events are commonly referred to as "intervals" or "frames." For example, if my interval is a lecture, the lecture itself can contain many smaller events:
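For instance, a single lecture interval might contain smaller events like: students arrive, the lecture starts, questions are asked, the lecture ends.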


We can also take the lecture in the context of an even larger interval, such as a work day:
And of course, these days can be held in the context of an even larger frame, like a work week.

Once we've built these frames (these collection of events), we can start asking meaningful questions. For example, "Has the average temperature been above the danger threshold for more than 5 minutes?", "Was maintenance scheduled before the spike in traffic?", "Did the vehicle depart before the parking time limit?"

For many of you, this process may sound familiar. This approach of applying business rules for streaming data has many similarities to a Complex Event Processing (CEP) service. In fact, a wonderful implementation of a CEP that uses temporal reasoning is the Drools Fusion module. Amazing stuff! Why not just use a CEP? Unfortunately, business rule management systems (BRMS) and CEPs haven't yet fully embraced the smaller, bite-size methodologies of microservices. Most of these systems require a single monolithic instance that demands absolute data access. What we need is a distributed collection of rules that can be easily referenced and applied by a distributed set of autoscaling workers.

Fortunately, writing the rules and applying the logic is easy once you have the grouped events. Creating and extending these intervals is the tricky part. For our smart city platform, this means having modules that define specific types of intervals and then add any and all related events.

For example, consider a parking module. This would take the readings from sensors that detect the arrival and departure of a vehicle and create a larger parking interval. An example of a parking interval might be:
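Something along these lines (the field names are illustrative, not the platform's actual schema):

{
    frame: {
      id: "park-42",
      type: "ParkingInterval",
      bay: "12th-street-007"
    },
    events: [
      { subtype: "VehicleDetected", timestamp: "2018-07-01T09:02:11Z" },
      { subtype: "PaymentMade", timestamp: "2018-07-01T09:05:40Z" },
      { subtype: "VehicleDeparted", timestamp: "2018-07-01T10:47:03Z" }
    ]
  }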
We simply build a microservice that listens to ReadingsUpdated events and manages creation and extension of parking intervals. Then, we're free to make a service that reacts to the FrameUpdated events and runs temporal reasoning rules to see if a new command should be created. For example, "If there's no departure 60 minutes after arrival, broadcast an IssueTicket command."
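A hedged sketch of that rule, using the illustrative interval shape above (none of these names come from the real platform):

// Illustrative only: inspect a parking interval and decide whether to issue a new command.
function checkParkingRules( parkingInterval ) {
  const arrival   = parkingInterval.events.find(( e ) => e.subtype === 'VehicleDetected');
  const departure = parkingInterval.events.find(( e ) => e.subtype === 'VehicleDeparted');
  if (!arrival || departure) { return null; }                       // nothing to react to
  const minutesParked = (Date.now() - Date.parse(arrival.timestamp)) / 60000;
  if (minutesParked > 60) {
    // No departure 60 minutes after arrival: broadcast an IssueTicket command.
    return { fact: { type: 'Command', subtype: 'IssueTicket' }, bay: parkingInterval.frame.bay };
  }
  return null;                                                      // still within the limit
}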

Of course, we may need to correlate events into an interval that are outside the scope of the initial sensor. In the parking example, we see "payment made." Payment is clearly not collected by the parking sensor. How do we manage this? By creating links between the interval and all known associated entities. Then, whenever a new event enters our system, we can add it to all related intervals (if the associated producer or its assigned groups are related). This sounds complex, but it's actually rather easy to maintain a complex set of linkages in Bigtable. Google does this on a significant scale (like the entire internet). Of course, this would be a lot simpler if someone provided a serverless graph database!

So, without diving too much into the complexities, we have the final piece of our architecture. We collect events into common groups (intervals), maintain a list of links to related entities (for updates), and apply simple temporal reasoning (business rules) to drive system behavior. Again, this would be a nightmare without using an event-driven architecture built on serverless computing. In fact, once we get a serverless graph database and distributed BRMS, we've solved the internet (spoiler alert: we'll all change into data engineers, AI trainers and UI guys).

[BTW, for more information, please consult the work of the godfather of computer-based temporal reasoning, James F. Allen. More specifically, his whitepapers An Interval-Based Representation of Temporal Knowledge and Maintaining Knowledge about Temporal Events]

Fifth step: the extras


While everything sounds easy, there are a few details I may have glossed over. I hope you found some! You’re an engineer, of course you did. Sorry, but this part is a little technical. You can skip it if you like! Or, just email me your doubts and I’ll reply!

A quick example: how do I query a subset of devices? We have everything stored in Bigtable, but how do I look at only one group? For example, what if I only wanted to look at devices or events downtown?

This is where grouping comes in. It’s actually really easy with Bigtable. Since Bigtable is NoSQL, we can have sparse columns. In other words, we can have a column family called "groups" and, per row, any custom set of column qualifiers in this family. That lets an event belong to any number of groups: we look up the device's current groups when the command is received and add the appropriate columns. This will hopefully make more sense when we go deeper in part three.
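
To make that concrete, here's roughly what tagging a row with its groups looks like with the Python Bigtable client (the project, instance, table and group names are all made up; treat this as a sketch rather than our actual code):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")            # hypothetical project
    table = client.instance("smart-city").table("events")     # hypothetical instance/table

    def tag_event_with_groups(row_key, groups):
        """Add one sparse column per group under the 'groups' column family."""
        row = table.direct_row(row_key)
        for group in groups:                      # e.g. ["downtown", "zone-7"]
            row.set_cell("groups", group, b"1")   # qualifier = group name; value is just a marker
        row.commit()

    tag_event_with_groups(b"sensor-042#2018-07-30T09:15:00Z", ["downtown", "zone-7"])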

Another area worth a passing mention is extension and testing. Why are serverless, event-driven architectures so easy to test and extend? The ease of testing comes from the microservices. Each component does one thing and does it well. It accepts either a command or an event and produces a simple output. For example, each of our event Pub/Subs has a Cloud Function that simply takes the events and stores them in Google Cloud Storage for archival purposes. This function is only 20 lines of code (mostly boilerplate) and has absolutely no impact on the performance of other parts of the system. Why no impact? It's serverless, meaning that it autoscales only for its own needs. Also, thanks to the Pub/Sub queues, our microservice receives a non-destructive copy of the input (i.e., each microservice gets its own copy of the message without putting a burden on any other part of our architecture).
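
For a sense of how small that archival function really is, here's a sketch of a Pub/Sub-triggered background Cloud Function in Python (the bucket name is hypothetical and error handling is omitted):

    import base64

    from google.cloud import storage

    BUCKET = "smart-city-event-archive"  # hypothetical archive bucket

    def archive_event(event, context):
        """Pub/Sub-triggered Cloud Function: write the raw message to Cloud Storage."""
        payload = base64.b64decode(event["data"])  # the published event, as bytes
        blob_name = f"{context.timestamp}-{context.event_id}.json"
        storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(payload)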

This zero impact is also why extending our architecture is easy. If we want to build an entirely new subsystem, we simply branch off one of the Pub/Subs. This means a developer can rebuild the entire system if they want with zero impact and zero downtime for the existing system. I've done this [transitioned an entire architecture from Datastore to Bigtable[1]], and it's liberating. Finally, we can rebuild and refactor our services without having to toss out the core of our architecture—the events. In fact, since the heart of our system is events published through serverless queues, we can branch our system just like many developers branch their code in modern version control systems (i.e., Git). We simply create new ways to react to commands and events. This is perfect for introducing new team members. These noobs [technical term for a new starter] can branch off a Pub/Sub and deploy code to the production environment on their first day with zero risk of disrupting the existing system. That's powerful stuff! No-fear coding? Some dreams do come true.

BUT—and this is a big one (fortunately, I like big buts)—what about integration testing? Building and testing microservices is easy. Hosting them on serverless is easy. But how do we monitor them and, more importantly, how do we perform integration testing on this chain of independent functions? Fortunately, that's what Part Three is for. We'll cover this all in great detail there.

Conclusion


In this post, we went deep into how we can make an event-driven architecture work on serverless through the context of a smart city platform. Phew. That was a lot. Hope it all made sense (if not, drop me an email or leave a comment). In summary, modern serverless cloud services allow us to easily build powerful systems. By leveraging autoscaling storage, compute and queuing services, we can make a system that outpaces any demand and provides world-class performance. Furthermore, these systems (if designed correctly) can be easy to create, maintain and extend. Once you go serverless, you'll never go back! Why? Because it's just fun!

In the next part, we'll go even deeper and look at the code required to make all this stuff work.


[1] A little more context on the refactor, for those who care to know. Google Datastore is a brilliant and extremely cost-efficient database. It is NoSQL, like Bigtable, but offers traditional-style indexing. For most teams, Datastore will be a magic bullet (solving all your scalability and throughput needs). It’s also ridiculously easy to use. However, as your data sets (especially for streaming) start to grow, you’ll find that the raw power of Bigtable cannot be denied. In fact, Datastore is built on Bigtable. Still, for most of us, Datastore will be everything we could want in a database (fast, easy and cheap, with infanite[2] scaling).

[2] Did I put a footnote in a footnote? Yes. Does that make it a toenote? Definitely. Is ‘infanite’ a word? Sort of. In·fa·nite (adjective) - practically infinite. Google’s serverless offerings are infanite, meaning that you’ll never hit the limit until your service goes galactic.

Oro: How GCP smoothed our path to PCI DSS compliance



Editor’s note: We recently made a bunch of security announcements, and today we’re sharing a story from Oro, Inc., which runs its OroCommerce e-commerce service on Google Cloud Platform, and was pleasantly surprised by the ease and speed with which they were able to demonstrate PCI DSS compliance. Read on for Oro’s information security officer’s take on achieving PCI DSS compliance in the cloud.

Building and running an e-commerce website poses many challenges. You want your website to be easy to use, have an attractive design and an intuitive user interface. It must scale during peak seasons like Black Friday and Cyber Monday. But equally important, if not more so, is information security. E-commerce websites are frequent targets because they handle financial transactions and payment card industry (PCI) information such as credit and debit card numbers. They also connect to many other systems, so they must meet many strict infosec industry standards.

If you have an e-commerce website, achieving PCI DSS compliance is critical. As a Chief Information Security Officer (CISO), Chief Information Officer (CIO), Chief Technology Officer (CTO) or other infosec specialist, you may be concerned about PCI compliance on cloud infrastructure. Here at Oro, the company behind the OroCommerce B2B eCommerce platform, we addressed our PCI DSS compliance requirements by using Google Cloud Platform (GCP) as our Infrastructure-as-a-Service (IaaS) platform, and we pass the benefits on to our OroCommerce customers. Achieving PCI DSS compliance may not be as easy as googling the closest pizza shops or gas stations, but Google Cloud’s IaaS platform certainly simplifies the process, ensuring you have everything needed to be compliant.

Using cloud and IaaS wasn’t always our top choice for building a PCI DSS-compliant website. Initially, our customers were reluctant to put their precious data into another party’s hands and store it somewhere in a foggy cloud. But nowadays, attitudes have changed. GCP provided us with strong support and a variety of tools to help build a PCI DSS compliant solution.
We had an excellent experience partnering and working with Google to complete the PCI DSS certification on our platform-as-a-service (PaaS) that hosts customized OroCommerce sites for Oro customers. We're proud to partner with Google Cloud to offer our customers a secure environment.


Building PCI DSS compliant infrastructure

At its core, building a PCI DSS compliant infrastructure requires:

  • The platform used to build your service must be PCI DSS compliant. This is a direct compliance requirement. 
  • Your platform must provide all the tools and methods used to build secure networks.

Google helped with both of these. The first point was easy, since all GCP services are PCI DSS compliant. In addition, Google provided us with a Shared Responsibility document that lists all PCI DSS requirements. This document explains the details of how Google achieves compliance and what Google customers need to do above and beyond that to support a compliant environment. The document not only has legal value; if used as a checklist, it can also be a useful tool when going for PCI DSS certification.

For example, Google supports PCI DSS requirement #9, which mandates the physical security of a hosting environment including the need for guards, hard disk drive shredders, surveillance, etc. Hearing that Google takes the responsibility to protect both hardware and data from physical theft or damage was very reassuring. We rely on GCP tools to protect against inappropriate access and ensure day-to-day information security.
Another key requirement of a secure network architecture (and PCI DSS) is to hide all internal nodes from external access, control all incoming and outgoing traffic, and use network segregation for different application tiers. OroCommerce fulfills these requirements by using Google’s Virtual Private Cloud, firewall rules, advanced load balancers and Cloud Identity and Access Management (IAM) for authentication control. Google Site Reliability Engineers (SRE) have secure connections into production nodes inside the isolated production network using Google’s 2-step authentication mechanisms.

Also, we found that we can safely use Google-provided Compute Engine images based on up-to-date and secure Linux distributions. This frees the sysadmin from hardening the OS, so they can pay more attention to vulnerability management and other important tasks.

While the importance of a secure infrastructure, access control, and network configuration is well-known, it’s also important to build and maintain a reliable logging and monitoring system. The PCI DSS standard puts an emphasis on audit trails and logs. To be compliant, you must closely monitor environments for suspicious activity and collect all needed data for a predetermined length of time to investigate any incidents. We found the combination of Stackdriver Monitoring and Logging, plus big data services such as BigQuery, helped us meet our monitoring, storing and log analysis needs. With Stackdriver, we monitor our production systems and detect anomalies in a thorough and timely manner, spending less time on configuration and support. We use BigQuery to analyze our logs so engineers can easily figure out what happened during a particular period of time.

Back in 2017 when we started to work on getting PCI DSS compliance for OroCommerce, we expected to spend a huge amount of time and resources on this process. But as we moved forward, we figured out how much GCP helped us to meet our goal. Having achieved PCI DSS compliance, it’s clear that choosing GCP for our infrastructure was the right decision.

How we used Cloud Spanner to build our email personalization system—from “Soup” to nuts



[Editor’s note: Today we hear from Tokyo-based Recruit Technologies Co., Ltd., whose email marketing platform is built on top of GCP. The company recently migrated its database to Cloud Spanner for lower cost and operational requirements, and higher availability compared to their previous HBase-based database system. Cloud Spanner also allows Recruit to calculate metrics (KPIs) in real time without having to transfer the data first. Read on to learn more about their workload and how Cloud Spanner fits into the picture.]

There are just under 8 billion people on Earth, depending on the source. Here at Recruit, our work is to develop and maintain an email marketing system that sends personalized emails to tens of millions of customers of hundreds of web services, all with the goal of providing the best, most relevant customer experience.

When Recruit was founded in 1960, the company was focused on helping match graduates to jobs. Over the years, we’ve expanded to help providers that deal with almost every personal moment and event a person encounters in their life. From travel plans to real estate, restaurant choices to haircuts, we offer software and services to help providers deliver on virtually everything and connect to their end-customers.

Recruit depends on email as a key marketing vehicle to end-customers and to provide a communications channel to clients and advertisers across its services. To maximize the impact of these emails, we customize each email we send. To help power this business objective, we developed a proprietary system named “Soup” that we host on Google Cloud Platform (GCP). Making use of Google Cloud Spanner, Soup is the connective tissue that manages the complex customization data needed for this system.

Of course, getting from idea to functioning product is easier said than done. We have massive datasets, so requirements like high availability and serving data in real time are particularly tricky. Add in a complex existing on-premises environment, some of which we had to maintain on our journey to the cloud (creating a hybrid environment), and the project became even more challenging.

A Soup primer


First, why the name “Soup”? The name of the app is actually “dashi-wake” in Japanese, from “dashi,” a type of soup. In theory, Soup is a fairly simple application: its API returns recommendation results, looked up by user ID, based on the data we compute about each user. Soup ingests pre-computed recommendations and then serves those recommendations to the email generation engine and tracks metrics. While Soup doesn’t actually send the customer emails, it manages the entire volume of personalization and customization data for tens of millions of users. It also manages the computed metrics associated with these email sends, such as opens, clicks and other metadata.

Soup leverages other GCP services such as App Engine Flex (Node.js), BigQuery, Data Studio, and Stackdriver in addition to Cloud Spanner.

Soup requirements


High availability

If the system is unavailable when a user decides to open an email, they see a white screen with no content at all. Not only is that lost revenue for that particular email, it makes customers less likely to open future emails from us.

Low latency

Given a user ID, the system needs to search all its prediction data, generate the appropriate content (an HTML file, an image, multiple images, or other content) and deliver it, all very quickly.

Real-time log ingestion and fast JOINs

In today’s marketing environment, tracking user activity and being able to make dynamic recommendations based on it is a must-have. We live in an increasingly real-time world. In the past, it might have been OK to take a week or longer to adapt content based on customer behavior. Now? A delay of even a day can make the difference between a conversion and a lost opportunity.

The problem


Pushing out billions of personalized emails to tens of millions of customers comes with some unique challenges. Our previous on-premises system was based on Apache HBase, the open-source NoSQL database, and Hive data warehouse software. This setup presented three major obstacles:

Cluster sizing

Email marketing is a bursty workload. You typically send a large batch of emails, which requires a lot of compute, and then there’s a quiet period. For our email workloads, we pre-compute a large set of recommendations and then serve those recommendations dynamically upon email open. On-premises, there wasn’t much flexibility and we had to resize clusters manually. We were plagued by errors whenever a surge of email opens, and the resulting requests to the system, outpaced what our fixed-size HBase/Hive cluster could handle.

Performance

The next issue was optimizing the schema model for performance. Soup has a couple of main functions: services write customer tracking data to it, and downstream “customers” read that data from it to create the personalized emails. On the write side, after the data is written to Soup, the writes need to be aggregated. We initially did this on-premises, which was quite difficult and time consuming because HBase doesn’t offer aggregation queries, and because it was hard to scale in response to traffic bursts.

Transfer delays

Finally, every time we needed to generate a recommendation model for a personalized email blast, we needed to transfer the necessary data from HBase to Hive to create the model, then back to HBase. These complex data transfers were taking two-to-three days. Needless to say, this didn’t allow for the type of agility that we need to provide the best service to our customers.

Cloud Spanner allows us to store all our data in one place, and simply join the data tables and do aggregates; there’s no need for a time-intensive data transfer. Using this model, we believe we can cut the recommendation generation time from days to under a minute, bringing real-time back into the equation.
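
To give a flavor of "join and aggregate in place," here's a minimal sketch using the Cloud Spanner Python client; the instance, database, table and column names are illustrative, not Soup's real schema:

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("soup-instance").database("soup-db")  # hypothetical names

    # Join tracking events to recommendations and aggregate opens per campaign,
    # directly where the data lives -- no HBase-to-Hive transfer step.
    sql = """
        SELECT r.campaign_id, COUNT(*) AS opens
        FROM EmailEvents AS e
        JOIN Recommendations AS r ON e.user_id = r.user_id
        WHERE e.event_type = 'open'
        GROUP BY r.campaign_id
    """

    with database.snapshot() as snapshot:
        for campaign_id, opens in snapshot.execute_sql(sql):
            print(campaign_id, opens)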

Why Cloud Spanner?

Compared to the previous application running on-premises, Cloud Spanner offers lower cost, lower operations requirements and higher availability. Most critically, we wanted to calculate metrics (KPIs) in real time without data transfer. Cloud Spanner allows us to do this by pumping SQL queries into a custom dashboard that monitors KPIs in real time.

Soup now runs on GCP, although the recommendations themselves are still generated in an on-premises Hadoop cluster. The computed recommendations are stored in Cloud Spanner for the reasons mentioned above. After moving to GCP and architecting for the cloud, we see an error rate of 0.005% per second vs. a previous rate of 4% per second, a reduction to 1/800th of what it was. This means that for an email blast sent to all users in Japan, roughly one user won't be able to see one image in one email. Since these emails often contain 10 images or more, this error rate is acceptable.

Cloud Spanner also solved our scaling problem. In the future, Soup will have to support one million concurrent users in different geographical areas. Likewise, Soup has to perform 5,000 queries per second (QPS) at peak times on the read side, and will expand this requirement to 20,000 to 30,000 QPS in the near future. Cloud Spanner can handle all the different, complex transactions Soup has to run, while scaling horizontally with ease.

Takeaways


In migrating our database to Cloud Spanner, we learned many things that are worth taking note of, whether you have 10 or 10 million users.

Be prepared to scale

We took scaling into account from Day One, sketching out specific requirements for speed, high availability, and other metrics. Only by having these requirements specifically laid out were we able to choose—and build—a solution that could meet them. We knew we needed elastic scale.

With Cloud Spanner, we didn’t have to make any of the common trade-offs between the relational database structure we wanted, and the scalability and availability needed to keep up with the business requirements. Likewise, with a growing company, you don’t want to place any artificial limits on growth, and Cloud Spanner’s ability to scale to “arbitrarily large” database sizes eliminates this cap, as well as the need to rewrite or migrate in the future as our data needs grow.

Be realistic about downtime

For us, any downtime can result in literally thousands of lost opportunities. That meant that we had to demand virtually zero downtime from any solution, to avoid serving up errors to our users. This was an important realization. Google Cloud provides an SLA guarantee for Cloud Spanner. This solution is more available and resistant to outages than anything we would build on our own.

Don’t waste time on management overhead

When you’re worrying about millions of users and billions of emails, the last thing you have time to do is all the maintenance and administrative tasks required to keep a database system healthy and running. Of course, this is true for the smallest installations, as well. Nobody has a lot of extra time to do things that should be taken care of automatically.

Don’t be afraid of hybrid

We found a hybrid architecture that leverages the cloud for fast data access while still using our existing on-premises investments for batch processing to be effective. In the future, we may move the entire workload to the cloud, but data has gravity, and we currently have lots of data stored on-premises.

Aim for real-time

At this time, we can only move data in and out of Cloud Spanner in small volumes. This prevents us from making real-time changes to recommendations. Once Cloud Spanner supports batch and streaming connections, we'll be able to serve recommendations closer to real time, delivering even more relevant results and outcomes.

Overall, we’re extremely happy with Cloud Spanner and GCP. Google Cloud has been a great partner in our move to the cloud, and the unique services provided enable us to offer the best service to our customers and stay competitive.

8 DevOps tools that smoothed our migration from AWS to GCP: Tamr



Editor’s note: If you recently migrated from one cloud provider to another—or are thinking about making the move—you understand the value of avoiding vendor lock-in by using third-party tools. Tamr, a data unification provider, recently made the switch from AWS to Google Cloud Platform, bringing with them a variety of DevOps tools to help with the migration and day-to-day operations. Check out their recommendations for everything from configuration management to storage to user management.

Here at Tamr, we recently migrated from AWS to Google Cloud Platform (GCP), for a wide variety of reasons, including more consistent compute performance, cheaper machines, preemptible machines and better committed usage stories, to name a few. The larger story of our migration itself is worth its own blog post, which will be coming in the future, but today, we’d like to walk through the tools that we used internally that allowed us to make the switch in the first place. Because of these tools, we migrated with no downtime and were able to re-use almost all of the automation/management code we’d developed internally over the past couple of years.

We attribute a big part of our success to having been a DevOps shop for the past few years. When we first built out our DevOps department, we knew that we needed to be as flexible as possible. From day one, we had a set of goals that would drive our decisions as a team, and which technologies we would use. Those goals have proved themselves as they have held up over time, and more recently allowed us to seamlessly migrate our platform from AWS to GCP and Google Compute Engine.

Here were our goals. Some you’ll recognize as common DevOps mantras, others were more specific to our organization:

  • Automate everything, and its corollary, "everything is code"
  • Treat servers as cattle, not pets
  • Scale our devops team sublinearly in relation to the number of servers and services we support 
  • Don’t be tied into one vendor/cloud ecosystem. Flexibility matters, as we also ship our entire stack and install it on-prem at our customers’ sites
Our first goal was well defined and simple. We wanted all operation tasks to be fully automated. Full stop. Though we would have to build our own tooling in some cases, for the most part there's a very rich set of open source tools out there that can solve 95% of our automation problems with very little effort. And by defining everything as code, we could easily review each change and version everything in git.

Treating servers as cattle, not pets is core to the DevOps philosophy. Server "pets" have names like postgres-master, and require you to maintain them by hand. That is, you run commands on them via a shell and upgrade settings and packages yourself. Instead, we wanted to focus on primitives like the amount of cores and RAM that our services need to run. We also wanted to be able to kill any server in the cluster at any time without having to notify anyone. This makes maintenance much easier and more streamlined, as we would be able to do rolling restarts of every server in our fleet. It also ties into our first goal of automating everything.

We also wanted to keep our DevOps team in check. We knew from the get-go that to be successful, we would be running our platform across large fleets of servers. Doing things by hand requires us to hire and train a large number of operators just to run through set runbooks. By automating everything and investing in tooling we can scale the number of systems we maintain without having to hire as many people.

Finally, we didn’t want to get tied into one single vendor cloud ecosystem, for both business reasons—we deploy our stack at customer sites—and because we didn’t want to be held hostage by any one cloud provider. To avoid getting locked into a cloud’s proprietary services, we would have to run most things ourselves on our own set of servers. While you may choose to use the equivalent services from your cloud provider, we like the independence of this go-it-alone approach.

Our DevOps toolbox


1. Server/Configuration management: Ansible 

Picking a configuration management system should be the very first thing you do when building out your DevOps toolbox, because you’ll be using it on every server that you have. For configuration management, we chose to use Ansible; it’s one of the simpler tools to get started with, and you can use it on just about any Linux machine.

You can use Ansible in many different ways: as a scripting language, as a parallel ssh client, and as a traditional configuration management tool. We opted to use it as a configuration management tool and set up our code base following Ansible best practices. In addition to the best practices laid out in the documentation, we went one step further and made all of our Ansible code fully idempotent—that is, we expect to be able to run Ansible at any time, and as long as everything is already up-to-date, for it to not have to make any changes. We also try to make sure that any package upgrades in Ansible have the correct handlers to ensure a zero-downtime deployment.

We were able to use our entire Ansible code base in both the AWS and GCP environments without having to change any of our actual code. The only things that we needed to change were our dynamic inventory scripts, which are just Python scripts that Ansible executes to find the machines in your environment. Ansible playbooks allow you to use multiple of these dynamic inventory scripts simultaneously, allowing us to run Ansible across both clouds at once.
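
Those inventory scripts are small. A stripped-down sketch of a Compute Engine one (shelling out to gcloud for brevity instead of using our real script) might look something like this:

    #!/usr/bin/env python
    """Toy Ansible dynamic inventory: list Compute Engine instances via gcloud."""
    import json
    import subprocess
    import sys

    def gce_instances():
        out = subprocess.check_output(
            ["gcloud", "compute", "instances", "list", "--format=json"])
        return json.loads(out)

    def build_inventory():
        inventory = {"gce": {"hosts": []}, "_meta": {"hostvars": {}}}
        for inst in gce_instances():
            name = inst["name"]
            nic = inst.get("networkInterfaces", [{}])[0]
            nat_ip = (nic.get("accessConfigs") or [{}])[0].get("natIP", "")
            inventory["gce"]["hosts"].append(name)
            inventory["_meta"]["hostvars"][name] = {
                "ansible_host": nat_ip or nic.get("networkIP", ""),
                "gce_zone": inst["zone"].rsplit("/", 1)[-1],
            }
        return inventory

    if __name__ == "__main__":
        # Ansible calls the script with --list; --host <name> can return {} because
        # we already populate _meta.hostvars above.
        print(json.dumps(build_inventory() if "--list" in sys.argv else {}))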

That said, Ansible might not be the right fit for everyone. It can be rather slow for some things and isn't always ideal in an autoscaling environment, as it's a push-based system, not pull-based (like Puppet and Chef). Some alternatives to Ansible are the aforementioned Puppet and Chef, as well as Salt. They all solve the same general problem (automatic configuration of servers) but are optimized for specific use cases.

2. Infrastructure configuration: Terraform

When it comes to setting up infrastructure such as VPCs, DNS and load balancers, administrators sometimes set up cloud services by hand, then forget they are there, or how they configured them. (I’m guilty of this myself.) The story goes like this: we need a couple of machines to test an integration with a vendor. The vendor wants shell access to the machines to walk us through problems and requests an isolated environment. A month or two goes by and everything is running smoothly, and it’s time to set up a production environment based on the development environment. Do you remember what you did to set it up? What settings you customized? That is where infrastructure-as-code configuration tools can be a lifesaver.

Terraform allows you to codify the settings and infrastructure in your cloud environments using its domain-specific language (DSL). It handles everything for you (cloud integrations, and ordering of operations for creating resources) and allows you to provision resources across multiple cloud platforms. For example, in Terraform, you can create DNS records in Google DNS that reference a resource in AWS. This allows you to easily link resources across multiple environments and provision complex networking environments as code. Most cloud providers have a tool for managing resources as code: AWS has CloudFormation, Google has Cloud Deployment Manager, and OpenStack has Heat Orchestration Templates. Terraform effectively acts as a superset of all these tools and provides a universal format across all platforms.

3. Server imaging: Packer 

One of the basic building blocks of a cloud environment is a Virtual Machine (VM) image. In AWS, there’s a marketplace with AMI images for just about anything, but we often needed to install tools onto our servers beyond the basic services included in the AMI. For example, think Threatstack agents that monitor the activity on the server and scan packages on the server for CVEs. As a result, it was often easier to just build our own images. We also build custom images for our customers and need to share them into their various cloud accounts. These images need to be available to different regions, as do our own base images that we use internally as the basis for our VMs. Having a consistent way to build images independent of a specific cloud provider and a region is a huge benefit.

We use Packer, in conjunction with our Ansible code base, to build all of our images. Packer provides the framework to spin up a machine, run our Ansible code against it, then save a snapshot of that machine into our account. Because Packer is integrated with configuration management tools, it allowed us to define everything in the AMIs as source code. This allows us to easily version images and have confidence that we know exactly what’s in our images. It made reproducing problems that customers had with our images trivial, and allowed us to easily generate changelogs for images.

The bigger benefit that we experienced was that when we switched to Compute Engine, we were able to reuse everything we had in AWS. All we needed to change was a couple of lines in Packer to tell it to use Compute Engine instead of AWS. We didn’t have to change anything in the base images that developers use day-to-day or the base images that we use in our compute clusters.

4. Containers: Docker

When we first started building out our infrastructure at Tamr, we knew that we wanted to use containers as I had used them at my previous company and seen how powerful and useful they can be at scale. Internally we have standardized on Docker as our primary container format. It allows us to build a single shippable artifact for a service that we can run on any Linux system. This gives us portability between Linux operating systems without significant effort. In fact, we’ve been able to Dockerize most of our system dependencies throughout the stack, to simplify bootstrapping from a vanilla Linux system.

5 and 6. Container and service orchestration: Mesos + Marathon

Containers in and of themselves don’t provide scale or high availability. Docker itself is just a piece of the puzzle. To fully leverage containers you need something to manage them and provide management hooks. This is where container orchestration comes in. It allows you to link together your containers and use them to build up services in a consistent, fault-tolerant way.

For our stack we use Apache Mesos as the basis of our compute clusters. Mesos is basically a distributed kernel for scheduling tasks on servers. It acts as a broker for requests from frameworks to the resources (CPU, memory, disk, GPUs) available on machines in the Mesos cluster. One of the most common frameworks for Mesos is Marathon, which ships as part of Mesosphere’s commercial DC/OS (Data Center Operating System) and is the main interface for launching tasks onto a Mesos cluster. Internally we deploy all of our services and dependencies on top of a custom Mesos cluster. We spent a fair amount of time building our own deployment/packaging tool on top of Marathon for shipping releases and handling deployments. (Down the road we hope to open source this tool, in addition to writing a few blog posts about it.)

The Mesos + Marathon approach for hosting services is so flexible that during our migration from AWS to GCP, we were able to span our primary cluster across both clouds. As a result, we were able to slowly switch services running on the cluster from one cloud to another using Marathon constraints. As we were switching over, we simply spun up more Compute Engine machines and then deprecated machines on the AWS side. After a couple of days, all of our services were running on Compute Engine machines, and off of AWS.

However, if we were building our infrastructure from scratch today, we would heavily consider building on top of Kubernetes rather than Mesos. Kubernetes has come a long way since we started building out our infrastructure, but it just wasn’t ready at the time. I highly recommend Google Kubernetes Engine as a starting point for organizations starting to dip their toes into the container orchestration waters. Even though it's a managed service, the fact that it's based on open-source Kubernetes minimizes the risk of cloud lock-in.

7. User management: JumpCloud 

One of the first problems we dealt with in our AWS environment was how to provide ssh access to our servers for our development team. Before we automated server provisioning, developers often created a new root key every time they spun up an instance. We soon consolidated to one shared key. Then we upgraded to running an internal LDAP instance. As the organization grew, managing that LDAP server became a pain—we were definitely treating it as a pet. So we went looking for a hosted LDAP/Active Directory offering, which led us to JumpCloud. After working with them, we ended up using their agent on our servers instead of an LDAP connector, even though they have a hosted LDAP endpoint that we do use for other things. The JumpCloud agent syncs with JumpCloud and automatically provisions users, groups and ssh keys onto the server for us. JumpCloud also provides a self-service portal where developers can update their ssh keys. This means that we now spend almost no time actually managing access to our servers; it’s all fully automated.

It’s worth noting that access to machines on Compute Engine is completely different than on AWS. With GCP, users can use the gcloud command line interface (CLI) to gain access to a machine. The CLI generates an ssh key, provisions it onto the server and creates a user account on the machine (for example: `gcloud compute --project "gce-project" ssh --zone "us-east1-b" "my-machine-name"`). In addition, users can upload their ssh key/user pairs in the console and new machines will have those user accounts set up on launch. In other words, the problem of how to provide ssh access to developers that we ran into on AWS doesn’t exist on Compute Engine.

JumpCloud solved a specific problem with AWS, but provides a portable solution across both GCP and AWS. Using it with GCP works great; however, if you're 100% on GCP, you don’t need to rely on an additional external service such as JumpCloud to manage your users.

8. Storage: RexRay 

Given that we run a large number of services on top of a Mesos cluster, we needed a way to provide persistent storage to the Docker containers running there. Since we treat servers as cattle, not pets (we expect to be able to kill any one server at any time), using Mesos local persistent storage wasn’t an option for us. We ended up using RexRay as an interface for provisioning/mounting disks into containers. RexRay acts as the bridge on a server between disks and a remote storage provider. Its main interface is a Docker storage driver plugin that can make API calls to a wide variety of sources (AWS, GCP, EMC, Digital Ocean and many more) and mount the provisioned storage into a Docker container. In our case, we were using EBS volumes on AWS and persistent disks on Compute Engine. Because RexRay is implemented as a Docker plugin, the only thing we had to change between the environments was the config file with the Compute Engine vs. AWS settings. We didn’t have to change any of our upstream invocations for disk resources.


DevOps = Freedom 


From the viewpoint of our DevOps team, these tools enabled a smooth migration, without much manual effort. Most things only required updating a couple of config files to be able to talk to Compute Engine APIs. At the top layers in our stack that our developers use, we were able to switch to Compute Engine with no development workflow changes, and zero downtime. Going forward, we see being able to span across and between clouds at will as a competitive advantage, and this would not be possible without the investment we made into our tooling.

Love our list of essential DevOps tools? Hate it? Leave us a note in the comments—we’d love to hear from you. To learn more about Tamr and our data unification service, visit our website.

Why we used Elastifile Cloud File System on GCP to power drug discovery



[Editor’s note: Last year, Silicon Therapeutics talked about how they used Google Cloud Platform (GCP) to perform massive drug discovery virtual screening. In this guest post, they discuss the performance and management benefits they realized from using the Elastifile Cloud File System and CloudConnect. If you’re looking for a high-performance file system that integrates with GCP, read on to learn more about the environment they built.]

Here, at Silicon Therapeutics, we’ve seen the benefits of GCP as a platform for delivering massive scale-out compute, and have used it as an important component of our drug discovery workload. For example, in our past post we highlighted the use of GCP for screening millions of compounds against a conformational ensemble of a flexible protein target to identify putative drug molecules.

However, like a lot of high-performance computing workflows, we encounter data challenges. It turns out, there are a lot of data management and storage considerations involved with running one of our core applications, molecular dynamics (MD) simulations, which involve the propagation of atoms in a molecular system over time. The time-evolution of atoms is determined by numerically solving Newton's equations of motion, where forces between the atoms are calculated using molecular mechanics force fields. These calculations typically generate thousands of snapshots containing the atomic coordinates, each with tens of thousands of atoms, resulting in relatively large trajectory files. As such, running MD on a large dataset (e.g. the entirety of the ~100,000 structures in the Protein Data Bank (PDB)) could generate a lot of data (over a petabyte).

In scientific computing, decreasing the overall time-to-result and increasing accuracy are crucial in helping to discover treatments for illnesses and diseases. In practice, doing so is extremely difficult due to the ever-increasing volume of data and the need for scalable, high-performance, shared data access and complex workflows. Infrastructure challenges, particularly around file storage, often consume valuable time that could be better spent on core research, thus slowing the progress of critical science.

Our physics-based workflows create parallel processes that generate massive amounts of data, quickly. Supporting these workflows requires flexible, high-performance IT infrastructure. Furthermore, analyzing the simulation results to find patterns and discover new druggable targets means sifting through all that data—in the case of this run, over one petabyte. That kind of infrastructure would be prohibitively expensive to build internally.

The public cloud is a natural fit for our workflows, since in the cloud, we can easily apply thousands of parallel compute nodes to a simulation or analytics job. However, while cloud is synonymous with scalable, high-performance compute, delivering complementary scalable, high-performance storage in the cloud can be problematic. We’re always searching for simpler, more efficient ways to store, manage, and process data at scale, and found that the combination of GCP and the Elastifile cross-cloud data fabric could help us resolve our data challenges, thus accelerating the pace of research.
Our HPC architecture used Google Compute Engine CPUs and GPUs, Elastifile for distributed file storage, and Google Cloud Storage plus Elastifile to manage inactive data.

Why high-performance, scale-out file storage is crucial


To effectively support our bursty molecular simulation and analysis workflows, we needed a cloud storage solution that could satisfy three key requirements:

  • File-native primary storage - Like many scientific computing applications, the analysis software for our molecular simulations was written to generate and ingest data in file format from a file system that ensures strict consistency. These applications won’t be refactored to interface directly with object storage systems like Google Cloud Storage any time soon—hence the need for a cloud-based, POSIX-compliant file system. 
  • Scalable global namespace - Stitching together file servers on discrete cloud instances may suffice for simple analyses on small data sets. However, the do-it-yourself method comes up short as datasets grow and when you need to share data across applications (e.g., in multi-stage workflows). We needed a modern, fully-distributed, shared file system to deliver the scalable, unified namespace that our workflows require. 
  • Cost-effectiveness - Finally, when managing bursty workloads at scale, rigid storage infrastructure can be prohibitively expensive. Instead, we needed a solution that could be rapidly deployed/destroyed, to keep our infrastructure costs aligned to demand. And ideally, for maximum flexibility, we also wanted a solution that could facilitate data portability, both 1) between sites and clouds, and 2) between formats—file format for “active” processing and object format for cost-optimized “inactive” storage/archival/backup.


Solving the file storage problem


To meet our storage needs and support the evolving requirements of our research, we worked with Elastifile, whose cross-cloud data fabric was the backbone of our complex molecular dynamics workflow.

The heart of the solution is the Elastifile Cloud File System (ECFS), a software-only, distributed file system designed for performance and scalability in cloud and hybrid-cloud environments. Built to support the noisy, heterogeneous environments encountered at cloud-scale, ECFS is well-suited to primary storage for data-intensive scientific computing workflows. To facilitate data portability and policy-based controls, Elastifile file systems are exposed to applications via Elastifile “data containers.” Each file system can span any number of cloud instances within a single namespace, while maintaining the strict consistency required to support parallel, transactional applications in complex workflows.

By deploying ECFS on GCP, we were able to simplify and optimize a molecular dynamics workflow. We then applied it to 500 unique proteins as a proof of concept for the aforementioned PDB-wide screen. For this computation, we leveraged a SLURM cluster running on GCP. The compute nodes were 16 n1-highcpu-32 instances, with 8 GPUs attached to every instance for a total of 120 K80 GPUs and 512 CPUs. The storage capacity was provided by a 6 TB Elastifile data container mounted on all the compute nodes.
Defining SLURM configuration to allocate compute and storage resources
Before Elastifile, provisioning and managing storage for such workflows was a complex, manual process. We partitioned the input datasets manually and created several different clusters, each with their own disks. This was because a single large disk often led to NFS issues, specifically with large metadata. In the old world, once the outputs of each cluster were completed, we stored the disks as snapshots. For access, we spun up an instance and shared the credentials for data access. This access pattern was error-prone as well as insecure. Also, at scale, manual processes such as these are time-consuming and introduce risk of critical errors and/or data loss.

With Elastifile, however, deploying and managing storage resources was quick and easy. We simply specified the desired storage capacity, and the ECFS cluster was automatically deployed, configured and made instantly available to the SLURM-managed compute resources... all in a matter of minutes. Also, if we want, we can expand the cluster later for additional capacity, with the push of a button. This future-proofs the infrastructure to be able to handle dynamically changing workflow requirements and data scale. By simplifying and automating the deployment process for a cloud-based file system, Elastifile reduced the complexity and risk associated with manual storage provisioning.
Specifying desired file system attributes and policies via Elastifile's unified management console
In addition, by leveraging Elastifile’s CloudConnect service, we were able to seamlessly promote and demote data between ECFS and Cloud Storage, minimizing infrastructure costs. Elastifile CloudConnect makes it easy to move the data to Google buckets from Elastifile’s data container, and once the data has moved, we can tear down the Elastifile infrastructure, reducing unnecessary costs.
Leveraging Elastifile's CloudConnect UI to monitor progress of data "check in" and "check out" operations between file and object storage
This data movement is essential to our operations, since we need to visualize and analyze subsets of this data on our local desktops. Moving forward, leveraging Elastifile’s combination of data performance, parallelism, scalability, shareability and portability will help us perform more—and larger-scale—molecular analyses in shorter periods of time. This will ultimately help us find better drug candidates, faster.
Visualizing the protein structure, based on the results of the molecular dynamics analyses
As a next step, we’ll work to scale the workflow to all of the unique protein structures in the PDB and perform deep-learning analysis on the resulting data to find patterns associated with protein dynamics, druggability and tight-binding ligands.

To learn more about how Elastifile supports highly-parallel, on-cloud molecular analysis on GCP, check out this demo video and be sure to visit them at www.elastifile.com.

How we built a serverless digital archive with machine learning APIs, Cloud Pub/Sub and Cloud Functions



[Editor’s note: Today we hear from Incentro, a digital service provider and Google partner, which recently built a digital asset management solution on top of GCP. It combines machine learning services like Cloud Vision and Speech APIs to easily find and tag digital assets, plus Cloud Pub/Sub and Cloud Functions for an automated, serverless solution. Read on to learn how they did it.]

Here at Incentro, we have a large customer base among media and publishing companies. Recently, we noticed that our customers struggle with storing and searching for digital media assets in their archives. It’s a cumbersome process that involves a lot of manual tagging. As a result, the videos are often stored without being properly tagged, making it nearly impossible to find and reuse these assets afterwards.

To eliminate this sort of manual labour and to generate more business value from these expensive video assets, we sought to create a solution that would take care of mundane tasks like tagging photos and videos. This solution, called Segona Media (https://segona.io/media), lets our customers store, tag and index their digital assets automatically.

Segona Media management features


Segona Media currently supports images, video and audio assets. For each of these asset types, Google Cloud provides specific managed APIs to extract relevant content from the asset without customers having to tag them manually or transcribe them.

  • For images, Cloud Vision API extracts most of the content we need: labels, landmarks, text and image properties are all extracted and can be used to find an image.
  • For audio, Cloud Speech API showed us tremendous results in transcribing an audio track. After extracting the audio into speech, we also use Google Cloud Natural Language API to discover sentiment and categories in the transcription. This way, users can search for spoken text, but also search for categories of text and even sentiment.
  • For video, we typically use a combination of audio and image analysis. Cloud Video Intelligence API extracts labels and timeframes, and detects explicit content. On top of that, we process the audio track from a video the same way we process audio assets (see above). This way users can search content from the video as well as from spoken text in the video.

Segona Media architecture


The traditional way of developing a solution like this involves getting hardware running, then determining and installing application servers, databases, storage nodes, etc. After developing the solution and getting it into production, you may then come across a variety of familiar challenges: the operating system needs to be updated or upgraded, or databases don't scale to cope with unexpected production data. We didn't want any of this, so after careful consideration, we decided on a completely managed solution and serverless architecture. That way we’d have no servers to maintain, we could leverage Google’s ongoing API improvements and our solution could scale to handle the largest archives we could find.

We wanted Segona Media to also be able to easily connect to common tools in the media and publishing industries. Adobe InDesign, Premiere, Photoshop and Digital Asset Management solutions must all be able to easily store and retrieve assets from Segona Media. We solved this by using the GCP APIs that were already in place for storing assets in Google Cloud Storage and taking it from there. We retrieve assets using the managed Elasticsearch API that runs on GCP.

Each action that Segona Media performs is a separate Google Cloud Function, usually triggered by a Cloud Pub/Sub queue. Using a Pub/Sub queue to trigger a Cloud Function is an easy and scalable way to publish new actions.

Here’s a high-level architecture view of Segona Media:

High Level Architecture

And here's how the assets flow through Segona Media:

  1. An asset is uploaded/stored to a Cloud Storage bucket
  2. This event triggers a Cloud Function, which generates a unique ID, extracts metadata from the file object, moves it to the appropriate bucket (with lifecycle management) and creates the asset in the Elasticsearch index (we run Elastic Cloud hosted on GCP); a simplified sketch of this function follows this list.
  3. This queues up multiple asset processors in Google Cloud Pub/Sub that are specific for an asset type and that extract relevant content from assets using Google APIs.
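
A simplified sketch of that ingest function (step 2) might look like the following; the Elasticsearch endpoint, project, topic and index names are all assumptions, and error handling is omitted:

    import uuid

    from elasticsearch import Elasticsearch
    from google.cloud import pubsub_v1, storage

    es = Elasticsearch(["https://my-elastic-cloud-endpoint:9243"])  # hypothetical endpoint
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "asset-processors")  # hypothetical topic

    def ingest_asset(data, context):
        """Triggered when an object is finalized in the upload bucket."""
        asset_id = str(uuid.uuid4())
        blob = storage.Client().bucket(data["bucket"]).get_blob(data["name"])

        # Index the basic file metadata so the asset is immediately searchable.
        es.index(index="assets", id=asset_id, body={
            "name": data["name"],
            "content_type": blob.content_type,
            "size": blob.size,
        })

        # Queue up the type-specific processors (image/audio/video) via Pub/Sub.
        publisher.publish(topic, data=asset_id.encode("utf-8"), asset_name=data["name"])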


Media asset management


Now, let's see how Segona Media handles different types of media assets.

Images

Images have a lot of features on which you can search, which we do via a dedicated microservice processor.

  • We use ImageMagick to extract metadata from the image object itself, pulling out all the XMP and EXIF metadata embedded in the file. This information is then added to the Elastic index and makes the image searchable by, for example, copyright info or resolution.
  • Cloud Vision API extracts labels in the image, landmarks, text and image properties (see the short sketch after this list). This takes away manual tagging of objects in the image and makes a picture searchable by its contents.
  • Segona Media also lets customers create custom labels. For example, a television manufacturer might want to know the specific model of a television set in a picture. We’ve implemented custom predictions by building our own TensorFlow models trained on custom data, and we train and run predictions on Cloud ML Engine.
  • For easier serving of all assets, we also create a low resolution thumbnail of every image.
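
As an illustration of that Cloud Vision call, label detection on an image already sitting in Cloud Storage takes only a few lines with the Python client (the URI is made up, and the exact import style varies a little between client library versions):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image(source=vision.ImageSource(image_uri="gs://my-assets/uploads/photo.jpg"))

    response = client.label_detection(image=image)
    labels = {label.description: label.score for label in response.label_annotations}
    # e.g. {"television": 0.97, "living room": 0.84, ...} -- these then go into the Elastic index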

Audio


Processing audio is pretty straightforward. We want to be able to search for spoken text in audio files, and we use Cloud Speech API to extract text from the audio. We then feed the transcription into the Elasticsearch index, making the audio file searchable by every word.
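
A bare-bones sketch of that transcription step with the Python Speech client could look like this (the file URI and language code are assumptions; long recordings go through the asynchronous long_running_recognize call):

    from google.cloud import speech

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri="gs://my-assets/audio/interview.flac")  # hypothetical file
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="en-US",
    )

    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=900)

    transcript = " ".join(result.alternatives[0].transcript for result in response.results)
    # The transcript is then added to the Elasticsearch index alongside the asset.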

Video


Video is basically the combination of everything we do with images and audio files. There are some minor differences though, so let's see what microservices we invoke for these assets:

  • First of all, we create a thumbnail so we can serve up a low-res image of the video. We take a thumbnail at 50% of the video. We do this by combining FFmpeg and FFprobe in Cloud Functions, and store this thumbnail alongside the video asset. Creating a thumbnail with Cloud Functions and FFmpeg is easy! Check out this code snippet: https://bitbucket.org/snippets/keesvanbemmel/keAkqe 
  • Using the same FFmpeg architecture, we extract the audio stream from the video. This audio stream is then processed like any other audio file: We extract the text from spoken text in the audio stream and add it to the Elastic index so the video itself can also be found by searching for every word that's spoken. We extract the audio stream from the video in a single channel FLAC format as this gives us the best results. 
  • We also extract relevant information from the video contents itself using Cloud Video Intelligence. We extract labels that are in the video as well as the timestamps for when the labels were created. This way, we know which objects are at what point in the video. Knowing the timestamps for a given label is a fantastic way to point a user to not only a video, but the exact moment in the video that contains the object they're looking for.

There you have it—a summary of how to do smart media tagging in a completely serverless fashion, without all the OS updates when scaling up or out, or of course, the infrastructure maintenance and support! This way we can focus on what we care about: bringing an innovative, scalable solution to our end customers. Any questions? Let us know! We love to talk about this stuff ;) Leave a comment below, email me at [email protected], or find us on Twitter at @incentro_.

Analyzing your BigQuery usage with Ocado Technology’s GCP Census



[Editor’s note: Today we hear from Google Cloud customer Ocado Technology, which created (and open sourced!) a program to give them at-a-glance insights about their data warehouse usage, by reading BigQuery metadata. Read on to learn about how they architected the tool, what kinds of questions it can answer, and whether it might be useful in your own environment.]

Here at Ocado Technology, we use a wide range of Google Cloud Platform (GCP) big data products for data-driven decisions and machine learning. Notably, we use Google BigQuery as the main storage solution for data analytics in the Ocado Smart Platform, our proprietary solution for operating online retail businesses.

Because BigQuery is so central to the Ocado platform, we wanted an easy way to get a bird’s eye view of the data stored in it. So we created GCP Census: a tool that collects metadata about BigQuery tables and stores it back into BigQuery for analysis. To have a better overview of all the data stored in BigQuery, we wanted to ask:
  • Which datasets/tables are the largest or the most expensive?
  • How many tables/partitions do we have?
  • How often are tables/partitions updated over time?
  • How are our datasets/tables/partitions growing over time?
  • Which tables/datasets are stored in a specific location?
If you also need better visibility into your organization’s BigQuery usage, read on to learn about how we architected GCP Census and what it can do. Then go ahead and download it for your own use—we recently open sourced it!

Our BigQuery domain


We store petabytes of data in BigQuery, divided into multiple GCP projects and hundreds of thousands of tables. BigQuery has many useful features for enterprise cloud data warehouses, especially in terms of speed, scalability and reliability. One example is partitioned tables rather than daily tables, which we recently adopted for their numerous benefits. At the same time, partitioned tables increased the complexity and scale of our BigQuery environment, and BigQuery offers limited ways of analysing metadata:
  • overall data size per project (from billing data) 
  • single table size (from BigQuery UI or REST API) 
  • __TABLES_SUMMARY__ and __PARTITIONS_SUMMARY__ provide only basic information, like list of tables/partitions and last update time
These constraints inspired us to build an additional layer to give us a bird’s eye view of our data.

GCP Census architecture

The resulting tool, GCP Census, is a Google App Engine app written in Python that regularly collects metadata about BigQuery tables and stores it in BigQuery.


Here's how it works:
  1. App Engine cron triggers a daily run
  2. GCP Census crawls metadata from all projects/datasets/tables to which it has access
  3. It creates a task for each table and schedules it for execution in App Engine Task Queue
  4. A task worker then retrieves table metadata using the REST API and streams it into the metadata tables (a simplified sketch of this step follows this list). In the case of partitioned tables, GCP Census also retrieves the partitions’ summary by querying the partitioned table and stores the metadata in the partition_metadata table
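
The per-table work in step 4 boils down to something like the following sketch with the BigQuery Python client; this is illustrative rather than the actual GCP Census source, and the target table name is made up:

    from google.cloud import bigquery

    client = bigquery.Client()
    METADATA_TABLE = "census-project.bigquery_census.table_metadata"  # hypothetical target table

    def collect_table_metadata(project_id, dataset_id, table_id):
        """Fetch one table's metadata and stream it into the metadata table."""
        table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")
        row = {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
            "numBytes": table.num_bytes,
            "numRows": table.num_rows,
            "lastModifiedTime": table.modified.isoformat(),
        }
        errors = client.insert_rows_json(METADATA_TABLE, [row])  # streaming insert
        assert not errors, errors
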
GCP Census is highly scalable as it can easily scan millions of table/partitions. It’s also easy to set up: before GCP Census scans the resources to which it has IAM access, it automatically creates the relevant tables and views. Finally, it’s a secure cloud-native solution with App Engine Firewall and fine-grained access control, plus App Engine’s scalability and reliability!

Using GCP Census


There are several benefits to using GCP Census. Now you can get answers to all of the questions above by querying BigQuery from the UI or the API.

You can find below a few examples of how you can query GCP Census metadata.
  • Count all data to which GCP Census has access
    SELECT sum(numBytes) FROM
    `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
  • Count all tables and partitions
    SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`;
    SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
  • Select top 100 largest datasets
    SELECT projectId, datasetId, sum(numBytes) as totalNumBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    GROUP BY projectId, datasetId ORDER BY totalNumBytes DESC LIMIT 100
  • Select top 100 largest tables
    SELECT projectId, datasetId, tableId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
  • Select top 100 largest partitions
    SELECT projectId, datasetId, tableId, partitionId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
Optionally, you can create a Data Studio dashboard based on the metadata. We used Data Studio because of the ease and simplicity in creating dashboards with the BigQuery connector. Splitting data by project, dataset or label and diving into the storage costs is now a breeze, and we have multiple Data Studio dashboards that help us quickly dive into the largest project, dataset or table.

Below is a screenshot of one of our dashboards (all real data has been redacted).
With GCP Census, we’ve learned some of the characteristics of our data; for example, we now know which data is modified daily or which historical partitions have been modified recently. We were also able to identify potential cost optimization areas—huge temporary tables that no one uses but that were incurring significant storage costs. All in all, we’ve learned a lot about our operations, and saved a bunch of money!

You can find the source code for GCP Census at Github at https://github.com/ocadotechnology/gcp-census, plus the required steps needed for installation and setup. We look forward to your ideas and contributions!