Tag Archives: Google Cloud Platform

An example escalation policy — CRE life lessons



In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.
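To make those paging and ticketing thresholds concrete, here is a minimal, hypothetical sketch of the burn-rate arithmetic they imply, written in Python. The SLO value and request counters are assumptions for illustration, not this team's actual alerting configuration.

# Illustrative only: "burn rate" here is (observed error ratio) / (allowed error ratio).
# A burn rate of 9 sustained for an hour consumes nine hours' worth of error budget.
SLO = 0.999                    # hypothetical availability target
ALLOWED_ERROR_RATIO = 1 - SLO  # fraction of requests allowed to fail

def burn_rate(bad_requests, total_requests):
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ALLOWED_ERROR_RATIO

def should_page(bad_last_hour, total_last_hour):
    # Page: the past hour burned nine hours of error budget.
    return burn_rate(bad_last_hour, total_last_hour) >= 9

def should_ticket(bad_last_week, total_last_week):
    # Ticket: the past week burned a full week of error budget.
    return burn_rate(bad_last_week, total_last_week) >= 1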

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble


Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely prescriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:
  • finding the root cause and fixing the relevant class of issue, or
  • automating remediation such that ongoing manual intervention is no longer necessary, or 
  • simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.


Escalation policy thresholds


Threshold 1 - wherein SRE is notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers
  • If,
    • SRE have concluded they cannot bring the service into SLO without help, and
    • SRE and dev agree that the SLO represents desired user experience
  • Then,
    • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
    • SRE escalates to dev leads for visibility and additional assistance if necessary
    • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • When the service is brought back into SLO,
    • SRE will revert any alerting changes
    • SRE may create a postmortem
    • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO

Threshold 3 - wherein SRE pauses feature releases and focuses on reliability
  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 30-day error budget is exhausted
  • Then during the following week,
    • Only cherry-picked fixes for diagnosed root causes may be pushed to production
    • SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work
    • Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).
  • When the service is brought back into SLO,
    • Normal binary releases resume
    • SRE creates a postmortem
    • Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Then,
    • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
    • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response


SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them a choice: preserve the availability goals by rolling back the new feature, or temporarily relax those goals so that the feature remains available to users, if that is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don't need to ask the question over and over again, though that's a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases


Blocking releases is an appropriate course of action for three main reasons:
  1. Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly
  2. It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release
  3. While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt, and trusting that there will be no further service degradation before the SLO is back in compliance over a 30-day window, strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.

Summary


We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and that there is a mechanism, with SRE's informed consent, for releases to resume before the SLO has returned to compliance. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.

Whitepaper: Embark on a journey from monoliths to microservices



Today we introduced the next in a series of white papers about migration entitled “Taking the Cloud-Native Approach with Microservices.” This paper switches gears from “lift-and-shift,” and introduces the idea of “move-and-improve.” If you missed the first white paper, you can read the blog and download a copy.

The white paper provides context on monolithic software application architecture, as well as microservices architecture. You’ll also learn about the shortcomings of monoliths: They can be challenging to scale properly, and their faults are harder to isolate. Deploying monoliths can also be cumbersome and time consuming, and they generally require a long-term commitment to a particular technology stack. Alternatively, microservices are thought to be more agile, fault-resilient and scalable, because the application is modularized into a system of small services with well-defined, narrowly scoped functions and APIs.

PetShop is an eCommerce website reference implementation that is well known within both the Java and Microsoft .NET development communities, and the white paper uses it to step through the process of deconstructing a monolith into microservices. Specifically, the paper considers three different layers that may or may not be deployed in different physical tiers: the presentation, business logic and data access layers.

In addition, you’ll be introduced to the concept of domain-driven design (DDD), which advocates modeling based on a business’s practical use cases. In its simplest form, DDD consists of decomposing a business domain into smaller functional chunks, at either the business function or business process level, so that the complexity of both a business and problem domain can be better understood and resolved through your application.

Download your copy of the white paper, and GitHub repositories; then, take a look at how you can deconstruct the PetShop reference implementation and build a microservice-based version. You’ll be well on your way to deconstructing and rebuilding your own monoliths!

Analyzing your BigQuery usage with Ocado Technology’s GCP Census



[Editor’s note: Today we hear from Google Cloud customer Ocado Technology, which created (and open sourced!) a program to give them at-a-glance insights about their data warehouse usage, by reading BigQuery metadata. Read on to learn about how they architected the tool, what kinds of questions it can answer, and whether it might be useful in your own environment.]

Here at Ocado Technology, we use a wide range of Google Cloud Platform (GCP) big data products for data-driven decisions and machine learning. Notably, we use Google BigQuery as the main storage solution for data analytics in the Ocado Smart Platform, our proprietary solution for operating online retail businesses.

Because BigQuery is so central to the Ocado platform, we wanted an easy way to get a bird’s eye view of the data stored in it. So we created GCP Census, a tool that collects metadata about BigQuery tables and stores it back into BigQuery for analysis. To have a better overview of all the data stored in BigQuery, we wanted to ask:
  • Which datasets/tables are the largest or the most expensive?
  • How many tables/partitions do we have?
  • How often are tables/partitions updated over time?
  • How are our datasets/tables/partitions growing over time?
  • Which tables/datasets are stored in a specific location?
If you also need better visibility into your organization’s BigQuery usage, read on to learn about how we architected GCP Census and what it can do. Then go ahead and download it for your own use—we recently open sourced it!

Our BigQuery domain


We store petabytes of data in BigQuery, divided into multiple GCP projects and hundreds of thousands of tables. BigQuery has many useful features for enterprise cloud data warehouses, especially in terms of speed, scalability and reliability. One example is partitioned tables rather than daily tables, which we recently adopted for their numerous benefits. At the same time, partitioned tables increased the complexity and scale of our BigQuery environment, and BigQuery offers limited ways of analysing metadata:
  • overall data size per project (from billing data) 
  • single table size (from BigQuery UI or REST API) 
  • __TABLES_SUMMARY__ and __PARTITIONS_SUMMARY__ provide only basic information, like list of tables/partitions and last update time
These constraints inspired us to build an additional layer to give us a bird’s eye view of our data.

GCP Census architecture

The resulting tool, GCP Census, is a Google App Engine app written in Python that regularly collects metadata about BigQuery tables and stores it in BigQuery.


Here's how it works:
  1. App Engine cron triggers a daily run
  2. GCP Census crawls metadata from all projects/datasets/tables to which it has access
  3. It creates a task for each table and schedules it for execution in App Engine Task Queue
  4. A task worker then retrieves table metadata using the REST API and streams it into the metadata tables. In the case of partitioned tables, GCP Census also retrieves the partitions’ summary by querying the partitioned table and stores the metadata in the partition_metadata table
GCP Census is highly scalable as it can easily scan millions of tables/partitions. It’s also easy to set up: before GCP Census scans the resources to which it has IAM access, it automatically creates the relevant tables and views. Finally, it’s a secure cloud-native solution with App Engine Firewall and fine-grained access control, plus App Engine’s scalability and reliability!
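
To make step 4 above concrete, here is a simplified sketch of the crawl-and-store idea using the google-cloud-bigquery client library. The project ID and destination table name are hypothetical, and the real GCP Census spreads this work across App Engine Task Queue workers rather than a single loop.

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination; GCP Census creates its own metadata tables and views.
METADATA_TABLE = "my-census-project.bigquery_census.table_metadata"

def collect_table_metadata(project_id):
    """Crawl every table in a project and stream its metadata back into BigQuery."""
    rows = []
    for dataset in client.list_datasets(project=project_id):
        for table_item in client.list_tables(dataset.reference):
            table = client.get_table(table_item.reference)  # full metadata via the REST API
            rows.append({
                "projectId": table.project,
                "datasetId": table.dataset_id,
                "tableId": table.table_id,
                "numBytes": table.num_bytes,
                "numRows": table.num_rows,
                "lastModifiedTime": table.modified.isoformat() if table.modified else None,
            })
    errors = client.insert_rows_json(METADATA_TABLE, rows)  # streaming insert
    if errors:
        raise RuntimeError("Streaming insert failed: %s" % errors)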

Using GCP Census


There are several benefits to using GCP Census. Now you can get answers to all of the questions above by querying BigQuery from the UI or the API.

You can find below a few examples of how you can query GCP Census metadata.
  • Count all data to which GCP Census has access
    SELECT sum(numBytes) FROM
    `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
  • Count all tables and partitions
    SELECT count(*)
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
  • Select top 100 largest datasets
    SELECT projectId, datasetId, sum(numBytes) as totalNumBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    GROUP BY projectId, datasetId ORDER BY totalNumBytes DESC LIMIT 100
  • Select top 100 largest tables
    SELECT projectId, datasetId, tableId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
  • Select top 100 largest partitions
    SELECT projectId, datasetId, tableId, partitionId, numBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
    ORDER BY numBytes DESC LIMIT 100
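The examples above can also be run programmatically. Here is a hedged sketch of running the "largest datasets" query through the BigQuery Python client; replace YOUR-PROJECT-ID as in the queries above.

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT projectId, datasetId, SUM(numBytes) AS totalNumBytes
    FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
    GROUP BY projectId, datasetId
    ORDER BY totalNumBytes DESC
    LIMIT 100
"""
for row in client.query(query):  # iterating waits for the query to finish
    print(row.projectId, row.datasetId, row.totalNumBytes)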
Optionally, you can create a Data Studio dashboard based on the metadata. We used Data Studio because of the ease and simplicity in creating dashboards with the BigQuery connector. Splitting data by project, dataset or label and diving into the storage costs is now a breeze, and we have multiple Data Studio dashboards that help us quickly dive into the largest project, dataset or table.

Below you can find a screenshot with one of our dashboards (all real data has been redacted).
With GCP Census, we’ve learned some of the characteristics of our data; for example, we now know which data is modified daily or which historical partitions have been modified recently. We were also able to identify potential cost optimization areas—huge temporary tables that no one uses but that were incurring significant storage costs. All in all, we’ve learned a lot about our operations, and saved a bunch of money!

You can find the source code for GCP Census on GitHub at https://github.com/ocadotechnology/gcp-census, plus the steps needed for installation and setup. We look forward to your ideas and contributions!

Running dedicated game servers in Kubernetes Engine: tutorial



Packaging server applications as container images is quickly gaining traction across tech organizations, game companies among them. They want to use containers to improve VM utilization, as well as take advantage of the isolated run-time paradigm. Despite their interest, many game companies don't know where to start.

Using the orchestration framework Kubernetes to deploy production-scale fleets of dedicated game servers in containers is an excellent choice. We recommend Google Kubernetes Engine as the easiest way to start a Kubernetes cluster for game servers on Google Cloud Platform (GCP) without manual setup steps. Kubernetes helps simplify your configuration management and automatically selects a VM with adequate resources to spin up a match for your players.

We recently put together a tutorial that shows you how to integrate dedicated game servers with Kubernetes Engine, and how to automatically scale the number of VMs up and down according to player demand. It also offers some key storage strategies, including how to manage your game server assets without having to manually distribute them with each container image. Check it out, and let us know what other Google Cloud tools you’d like to learn how to use in your game operations. You can reach me on Twitter at @gcpjoe.

Get latest Kubernetes version 1.9 on Google’s managed offering



We're excited to announce that Kubernetes version 1.9 will be available on Google Kubernetes Engine next week in our early access program. This release includes greater support for stateful and stateless applications, hardware accelerator support for machine learning workloads and storage enhancements. Overall, this release achieves a big milestone in making it easy to run a wide variety of production-ready applications on Kubernetes without having to worry about the underlying infrastructure. Google is the leading contributor to open-source Kubernetes releases and now you can access the latest Kubernetes release on our fully-managed Kubernetes Engine, and let us take care of managing, scaling, upgrading, backing up and helping to secure your clusters. Further, we recently simplified our pricing by removing the fee for cluster management, resulting in real dollar savings for your environment.

We're committed to providing the latest technological innovation to Kubernetes users with one new release every quarter. Let’s take a closer look at the key enhancements in Kubernetes 1.9.

Workloads APIs move to GA


The core Workloads APIs (DaemonSet, Deployment, ReplicaSet and StatefulSet), which let you run stateful and stateless workloads in Kubernetes 1.9, move to general availability (GA) in this release, delivering production-grade quality, support and long-term backwards compatibility.

Hardware accelerator enhancements


Google Cloud Platform (GCP) provides a great environment for running machine learning and data analytics workloads in containers. With this release, we’ve improved support for hardware accelerators such as NVIDIA Tesla P100 and K80 GPUs. Compute-intensive workloads will benefit greatly from cost-effective and high performance GPUs for many use cases ranging from genomics and computational finance to recommendation systems and simulations.

Local storage enhancements for stateful applications


Improvements to the Kubernetes scheduler in this release make it easier to use local storage in Kubernetes. The local persistent storage feature (alpha) enables easy access to local SSD on GCP through Kubernetes’ standard PVC (Persistent Volume Claim) interface in a simple and portable way. This allows you to take an existing Helm Chart, or StatefulSet spec using remote PVCs, and easily switch to local storage by just changing the StorageClass name. Local SSD offers superior performance, including high input/output operations per second (IOPS) and low latency, and is ideal for high-performance workloads, distributed databases, distributed file systems and other stateful workloads.

Storage interoperability through CSI


This Kubernetes release introduces an alpha implementation of Container Storage Interface (CSI). We've been working with the Kubernetes community to provide a single and consistent interface for different storage providers. CSI makes it easy to add different storage volume plugins in Kubernetes without requiring changes to the core codebase. CSI underscores our commitment to being open, flexible and collaborative while providing maximum value—and options—to our users.

Try it now!


In a few days, you can access the latest Kubernetes Engine release in your alpha clusters by joining our early access program.

Three ways to configure robust firewall rules



If you administer firewall rules for Google Cloud VPCs, you want to ensure that the firewall rules you create can only be associated with the correct VM instances by developers in your organization. Without that assurance, it's difficult to control access to sensitive content hosted on VMs in your VPCs or to control those instances' access to the internet, and you must carefully audit and monitor the instances to ensure that unintended access is not granted through the use of tags. With Google VPC, there are now multiple ways to achieve the required level of control, which we’ll describe here in detail.

As an example, imagine you want to create a firewall rule to restrict access to sensitive user billing information in a data store running on a set of VMs in your VPC. Further, you’d like to ensure that developers who can create VMs for applications other than the billing frontend cannot enable these VMs to be governed by firewall rules created to allow access to billing data.
Example topology of a VPC setup requiring secure firewall access.
The traditional approach here is to attach tags to VMs and create a firewall rule that allows access to specific tags, e.g., in the above example you could create a firewall rule that allows all VMs with the billing-frontend tag access to all VMs with the tag billing-data. The drawback of this approach is that any developer with Compute InstanceAdmin role for the project can now attach billing-frontend as a tag to their VM, and thus unintentionally gain access to sensitive data.

Configuring Firewall rules with Service Accounts


With the general availability of firewall rules using service accounts, instead of using tags, you can block developers from enabling a firewall rule on their instances unless they have access to the appropriate centrally managed service accounts. Service accounts are special Google accounts that belong to your application or service running on a VM and can be used to authenticate the application or service for resources it needs to access. In the above example, you can create a firewall rule to allow access to the billing-data@ service account only if the originating source service account of the traffic is billing-frontend@.
Firewall setup using source and target service accounts. (Service account names are abbreviated for simplicity.)
You can create this firewall rule using the following gcloud command:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-service-accounts billing-frontend@web.iam.gserviceaccount.com \
    --target-service-accounts billing-data@web.iam.gserviceaccount.com
If, in the above example, the billing frontend and billing data applications are autoscaled, you can specify the service accounts for the corresponding applications in the InstanceTemplate configured for creating the VMs.

The advantage of using this approach is that once you set it up, the firewall rules may remain unchanged despite changes in underlying IAM permissions. However, you can currently associate only one service account with a VM, and to change this service account, the instance must be in a stopped state.

Creating custom IAM role for InstanceAdmin


If you want the flexibility of tags but the limitations of service accounts are a concern, you can create a custom role with more restricted permissions that disables the ability to set tags on VMs; do this by removing the compute.instances.setTag permission. This custom role can have other permissions present in the InstanceAdmin role and can then be assigned to developers in the organization. With this custom role, you can create your firewall rules using tags:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-tags billing-frontend \
    --target-tags billing-data
Note, however, that permissions assigned to a custom role are static in nature and must be updated with any new permissions that might be added to the InstanceAdmin role, as and when required.

Using subnetworks to partition workloads


You can also create firewall rules using source and destination IP CIDR ranges if the workloads can be partitioned into subnetworks of distinct ranges as shown in the example diagram below.
Firewall setup using source and destination ranges.
In order to restrict developers’ ability to create VMs in these subnetworks, you can grant Compute Network User role selectively to developers on specific subnetworks or use Shared VPC.

Here’s how to configure a firewall rule with source and destination ranges using gcloud:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-ranges 10.20.0.0/16 \
    --destination-ranges 10.30.0.0/16
This method allows for better scalability with large VPCs and allows for changes in the underlying VMs as long as the network topology remains unchanged. Note, however, that if a VM instance has can_ip_forward enabled, it may send traffic using the above source range and thus gain access to sensitive workloads.

As you can see, there’s a lot to consider when configuring firewall rules for your VPCs. We hope these tips help you configure firewall rules in a more secure and efficient manner. To learn more about configuring firewall rules, check out the documentation.

Why you should pick strong consistency, whenever possible



Do you like complex application logic? We don’t either. One of the things we’ve learned here at Google is that application code is simpler and development schedules are shorter when developers can rely on underlying data stores to handle complex transaction processing and keeping data ordered. To quote the original Spanner paper, “we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”1

Put another way, data stores that provide transactions and consistency across the entire dataset by default lead to fewer bugs, fewer headaches and easier-to-maintain application code.

Defining database consistency


But to have an interesting discussion about consistency, it’s important to first define our terms. A quick look at different databases on the market shows that not all consistency models are created equal, and that some of the related terms can intimidate even the bravest database developer. Below is a short primer on consistency:



Consistency
  • Definition: Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules.2
  • What Cloud Spanner supports: Cloud Spanner provides external consistency, which is strong consistency plus additional properties (including serializability and linearizability). All transactions across a Cloud Spanner database satisfy this consistency property, not just those within a replica or region.

Serializability
  • Definition: Serializability is an isolation property of transactions, where every transaction may read and write multiple objects. It guarantees that transactions behave the same as if they had executed in some serial order. It's okay for that serial order to be different from the order in which transactions were actually run.3
  • What Cloud Spanner supports: Cloud Spanner provides external consistency, which is a stronger property than serializability: all transactions appear as if they executed in a serial order, even if some of the reads, writes and other operations of distinct transactions actually occurred in parallel.

Linearizability
  • Definition: Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures such as materializing conflicts.4
  • What Cloud Spanner supports: Cloud Spanner provides external consistency, which is a stronger property than linearizability, because linearizability does not say anything about the behavior of transactions.

Strong Consistency
  • Definition: All accesses are seen by all parallel processes (or nodes, processors, etc.) in the same order (sequentially).5 In some definitions, a replication protocol exhibits "strong consistency" if the replicated objects are linearizable.
  • What Cloud Spanner supports: The default mode for reads in Cloud Spanner is "strong," which guarantees that they observe the effects of all transactions that committed before the start of the operation, independent of which replica receives the read.

Eventual Consistency
  • Definition: Eventual consistency means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value.6
  • What Cloud Spanner supports: Cloud Spanner supports bounded stale reads, which offer similar performance benefits as eventual consistency but with much stronger consistency guarantees.


Cloud Spanner, in particular, provides external consistency, which provides all the benefits of strong consistency plus serializability. All transactions (across rows, regions and continents) in a Cloud Spanner database satisfy the external consistency property, not just those within a replica. External consistency states that Cloud Spanner executes transactions in a manner that's indistinguishable from a system in which the transactions are executed serially, and furthermore, that the serial order is consistent with the order in which transactions can be observed to commit. External consistency is a stronger property than both linearizability and serializability.

Consistency in the wild


There are lots of use cases that call for external consistency. For example, a financial application might need to show users' account balances. When users make a deposit, they want to see the result of this deposit reflected immediately when they view their balance (otherwise they may fear their money has been lost!). There should never appear to be more or less money in aggregate in the bank than there really is. Another example might be a mail or messaging app: You click "send" on your message, then immediately view "sent messages" because you want to double check what you wrote. Without external consistency, the app’s request to retrieve your sent messages may go to a different replica that's behind on getting all state changes, and have no record of your message, resulting in a confusing and reduced user experience.

But what does it really mean from a technical standpoint to have external consistency? When performing read operations, external consistency means that you're reading the latest copy of your data in global order. It provides the ability to read the latest change to your data across rows, regions and continents. From a developer’s perspective, it means you can read a consistent view of the state of the entire database (not just a row or object) at any point in time. Anything less introduces tradeoffs and complexity in the application design. That in turn can lead to brittle, hard-to-maintain software and can cause innumerable maintenance headaches for developers and operators. Multi-master architectures and multiple levels of consistency are workarounds for not being able to provide the external consistency that Cloud Spanner does.

What’s the problem with using something less than external consistency? When you choose a relaxed/eventual consistency mode, you have to understand which consistency mode you need to use for each use case and have to hard code rigid transactional logic into your apps to guarantee the correctness and ordering of operations. To take advantage of "transactions" in database systems that have limited or no strong consistency across documents/objects/rows, you have to design your application schema such that you never need to make a change that involves multiple "things" at the same time. That’s a huge restriction and workarounds at the application layer are painful, complex, and often buggy.

Further, these workarounds have to be carried everywhere in the system. For example, take the case of adding a button to set your color scheme in an admin preferences panel. Even a simple feature like this is expected to be carried over immediately across the app and other devices and sessions. It needs a synchronous, strongly consistent update—or a makeshift way to obtain the same result. Using a workaround to achieve strong consistency at the application level adds a velocity-tax to every subsequent new feature—no matter how small. It also makes it really hard to scale the application dev team, because everyone needs to be an expert in these edge cases. With this example, a unit test that passes on a developer workstation does not imply it will work in production at scale, especially in high concurrency applications. Adding workarounds to an eventually consistent data store often introduces bugs that go unnoticed until they bite a real customer and corrupt data. In fact, you may not even recognize the workaround is needed in the first place.

Lots of application developers are under the impression that the performance hit of external or strong consistency is too high. And in some systems, that might be true. Additionally, we're firm believers that having choice is a good thing—as long as the database does not introduce unnecessary complexity or introduce potential bugs in the application. Inside Google, we aim to give application developers the performance they need while avoiding unnecessary complexity in their application code. To that end, we’ve been researching advanced distributed database systems for many years and have built a wide variety of data stores to get strong consistency just right. Some examples are Cloud Bigtable, which is strongly consistent within a row; Cloud Datastore, which is strongly consistent within a document or object; and Cloud Spanner, which offers strong consistency across rows, regions and continents with serializability. [Note: In fact, Cloud Spanner offers a stronger guarantee of external consistency (strong consistency + serializability), but we tend to talk about Cloud Spanner having strong consistency because it's a more broadly accepted term.]


Strongly consistent reads and Cloud Spanner


Cloud Spanner was designed from the ground up to serve strong reads (i.e., strongly consistent reads) by default with low latency and high throughput. Thanks to the unique power of TrueTime, Spanner provides strong reads for arbitrary queries without complex multi-phase consensus protocols and without locks of any kind. Cloud Spanner’s use of TrueTime also provides the added benefit of being able to do global bounded-staleness reads.

Better yet, Cloud Spanner offers strong consistency for multi-region and regional configurations. Other globally distributed databases present a dilemma to developers: If they want to read the data from geographically distributed regions, they forfeit the ability to do strongly consistent reads. In these other systems, if a customer opts to have strongly consistent reads, then they forfeit the ability to do reads from remote regions.

To take maximum advantage of the external consistency guarantees that Cloud Spanner provides and to maximize your application's performance, we offer the following two recommendations:
  1. Always use strong reads, whenever possible. Strong reads, which provide strong consistency, ensure that you are reading the latest copy of your data. Strong consistency makes application code simpler and applications more trustworthy.
  2. If latency makes strong reads infeasible in some situations, then use reads with bounded-staleness to improve performance, in places where strong reads with the latest data are not necessary. Bounded-staleness semantics ensures you read a guaranteed prefix of the data (for example, within a specified period of time) that is consistent, as opposed to eventual consistency where you have no guarantees and your app can read almost anything forwards or back in time from when you queried it.
Foregoing strong consistency has some real risks. Strong reads across a database ensure that you're reading the latest copy of your data and that it maintains the referential integrity of the entire dataset, making it easier to reason about concurrent requests. Using weaker consistency models introduces the risk of software bugs and can be a waste of developer hours—and potentially—customer trust.
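
As a concrete illustration of the two recommendations above, here is a hedged sketch using the Cloud Spanner Python client; the instance, database, table and column names are hypothetical.

import datetime

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# 1. Strong read (the default): observes every transaction committed before the read starts.
with database.snapshot() as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))

# 2. Bounded-staleness read: may serve data up to 15 seconds old, trading a little
#    freshness for lower latency where the very latest data isn't required.
with database.snapshot(max_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = list(snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"))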

What about writes?


Strong consistency is even more important for write operations—especially read-modify-write transactions. Systems that don't provide strong consistency in such situations create a burden for application developers, as there's always a risk of putting your data into an inconsistent state.

Perhaps the most insidious type of problem is write skew. In write skew, two transactions read a set of objects and make changes to some of those objects. However, the modifications that each transaction makes affect what the other transaction should have read. For example, consider a database for an airline based in San Francisco. It’s the airline’s policy to always have a free plane in San Francisco, in the event that this spare plane is needed to replace another plane with maintenance problems or for some other need. Imagine two transactions that are both reserving planes for upcoming flights out of San Francisco:

Begin Transaction
  SELECT * FROM Airplanes WHERE location = "San Francisco" AND Availability = "Free";
  If number of airplanes is > 1:  # to enforce "one free plane" rule
    Pick 1 airplane
    Set its Availability to "InUse"
    Commit
  Else:
    Rollback


Without strong consistency (and, in particular, serializable isolation for these transactions), both transactions could successfully commit, thus potentially breaking our one free plane rule. There are many more situations where write skew can cause problems7.
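
For reference, here is a hedged sketch of how this reservation could be written as a single Cloud Spanner read-write transaction in Python; the instance, database and schema names are hypothetical. Because Spanner transactions are serializable, two concurrent reservations cannot both observe two free planes and both commit.

from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

client = spanner.Client()
database = client.instance("airline-instance").database("airline-db")

def reserve_plane(transaction):
    free_planes = list(transaction.execute_sql(
        "SELECT AirplaneId FROM Airplanes "
        "WHERE Location = 'San Francisco' AND Availability = 'Free'"))
    if len(free_planes) > 1:  # enforce the "one free plane" rule
        transaction.execute_update(
            "UPDATE Airplanes SET Availability = 'InUse' WHERE AirplaneId = @id",
            params={"id": free_planes[0][0]},
            param_types={"id": param_types.INT64})
    # Returning normally commits; raising an exception rolls the transaction back.

database.run_in_transaction(reserve_plane)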

Because Cloud Spanner was built from the ground up to be a relational database with strong, transactional consistency—even for complex multi-row and multi-table transactions—it can be used in many situations where a NoSQL database would cause headaches for application developers. Cloud Spanner protects applications from problems like write skew, which makes it appropriate for mission-critical applications in many domains including finance, logistics, gaming and merchandising.

How does Cloud Spanner differ from multi-master replication?


One topic that's often combined with scalability and consistency discussions is multi-master replication. At its core, multi-master replication is a strategy used to reduce mean time to recovery for vertically scalable database systems. In other words, it’s a disaster recovery solution, and not a solution for global, strong consistency. With a multi-master system, each machine contains the entire dataset, and changes are replicated to other machines for read-scaling and disaster recovery.

In contrast, Cloud Spanner is a truly distributed system, where data is distributed across multiple machines within a replica, and also replicated across multiple machines and multiple data centers. The primary distinction between Cloud Spanner and multi-master replication is that Cloud Spanner uses Paxos to synchronously replicate writes out of region, while still making progress in the face of single server/cluster/region failures. Synchronous out-of-region replication means that consistency can be maintained, and strongly consistent data can be served without downtime, even when a region is unavailable—no acknowledged writes are delayed/lost due to the unavailable region. Cloud Spanner’s Paxos implementation elects a leader so that it's not necessary to do time-intensive quorum reads to obtain strong consistency. Additionally, Cloud Spanner shards data horizontally across servers, so individual machine failures impact less data. While a node is recovering, replicated nodes on other clusters that contain that dataset can assume mastership easily, and serve strong reads without any visible downtime to the user.

A strongly consistent solution for your mission-critical data


For storing critical, transactional data in the cloud, Cloud Spanner offers a unique combination of external, strong consistency, relational semantics, high availability and horizontal scale. Stringent consistency guarantees are critical to delivering trustworthy services. Cloud Spanner was built from the ground up to provide those guarantees in a high-performance, intuitive way. We invite you to try it out and learn more.

See more on Cloud Spanner and external consistency.

1 https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
2 https://en.wikipedia.org/wiki/Consistency_(database_systems)
3 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 329.
4 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 329.
5 https://en.wikipedia.org/wiki/Strong_consistency
6 Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly, 2017, p. 322.
7 Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly, 2017, p. 246.

Trash talk: How moving Apigee Sense to GCP reduced our “data litter” and decreased our costs



In the year-plus since Apigee joined the Google Cloud family, we’ve had the opportunity to deploy several of our services to Google Cloud Platform (GCP). Most recently, we completely moved Apigee Sense to GCP to use its advanced machine learning capabilities. Along the way, we also experienced some important performance improvements as judged by a drop in what we call “data litter.” In this post, we explain what data litter is, and our perspective on how various GCP services keep it at bay. Through this account, you may come to recognize your own application, and come to see data litter as an important metric to consider.

What is data litter?


First, let’s take a look at Apigee Sense and its application characteristics. At its core, Apigee Sense protects APIs running on Apigee Edge from attacks and unwanted exploitation. Those attacks are usually performed by automated processes, or "bots," which run without the permission of the API owner. Sense is built around a four-element "CAVA" cycle: collect, analyze, visualize and act. It enhances human vigilance with statistical machine learning algorithms.

We collect a lot of traffic data as a by-product of billions of API calls that pass through Apigee Edge daily. The output end of each of the four elements in the CAVA cycle is stored in a database system. Therefore, the costs, performance and scalability of data management and data analysis toolchains are of great interest to us.

When optimizing an analytics application, there are several things that demand particular attention: latency, quality, throughput and cost.
  • Latency is the delay between when something happens and when we become aware of it. In the case of security, we define it as the delay between when a bot attacks and when we notice the attack.
  • The quality of our algorithmic smarts is measured by true and false positives and negatives.
  • Throughput measures the average rate at which data arrives into the analytics application.
  • Cost, of course, measures the average rate at which dollars (or other currency) leave your wallet.
To this mix I like to add a fifth metric: "data litter," which in many ways measures the interplay between the four traditional metrics. Fundamentally, all analytics systems are GIGO (garbage in / garbage out). That is, if the data entering the system is garbage, it doesn’t matter how quickly it is processed, how smart our algorithms are, or how much data we can process every second. The money we spend does matter, but only because of questions about the wisdom of continuing to spend it.

Sources of data litter


Generally speaking, there are three main sources of data litter in an analytical application like Apigee Sense.
  1. Timeliness of analysis: It’s the nature of a data-driven analysis engine like Sense to attempt to make a decision with all the data available to it when the decision needs to be made. A delayed decision is of little value in foiling an ongoing bot attack. Therefore, when there's little data available to make decisions, the engine makes a less-informed decision and moves on. Any data that arrives subsequently is discarded because it is no longer useful, as the decision has already been made. The result? Data litter. 
  2. Elasticity of data processing: If data arrives too quickly for the analysis engine to consume, it piles up and causes "data back pressure." There are two remedies: increase the size (and cost) of the analysis engine, or drop some data to relieve the pressure. Because you can’t scale up an analysis engine instantly, or because doing so is cost-prohibitive, we build a pressure release valve into the pipeline, causing data litter.
  3. Scalability of the consumption chain: If the target database is down, or unable to consume the results at the rate at which they're produced, you might as well stop the pipeline and discard the incoming data. It's pointless to analyze data when there's no way to use or store the results of the analysis. This too causes data litter.
Therefore, data litter is a holistic measure of the quality of the analysis system. It will be low only when the pipeline, analysis engine and target database are all well-tuned and constantly performing to expectations.


The easiest way to deal with the first kind of data litter is to slow down the pipeline by increasing latency. The easiest way to address the second kind is to throw money at the problem and run the analysis engine on a larger cluster. And the final problem is best addressed by adding more or bigger hardware to the database. Whichever path we take, we either increase latency and lose relevance, or lose money.

Moving to GCP


At Apigee, we track data litter with the data coverage metric, which is, roughly speaking, the inverse measure of how much of the data gets dropped or otherwise doesn’t contribute to the analysis. When we moved the Sense analytics chain to GCP, the data coverage metric went from below 80% to roughly 99.8% for one of our toughest customer use cases. Put another way, our data litter decreased from over 20%, or one in five, to approximately one in five hundred. That’s a decrease of a factor of approximately 100, or two orders of magnitude!

The chart below shows the fraction of data available and used for decision making before and after our move to GCP. The chart shows the numbers for four different APIs, representing a subset of Sense customers.
These improvements were measured even as the cost of the deployed system and its pipeline latency were simultaneously tightened: our throughput and algorithms stayed the same, while latencies and cost both dropped. Since the release a couple of months ago, these savings, along with the availability and performance benefits of the system, have persisted, while our customer base and the processed traffic have grown. So we're getting more reliable answers more quickly than we did before and paying less than we did for almost the exact same use case. Wow!


Where did the data litter go?


There were two problems that accounted for the bulk of the data litter in the Sense pipeline. These were the elasticity of data processing and the scalability of the transactional store.

To alert customers of an attack as quickly as possible, we designed our system with adequate latency to avoid systematic data litter. In our environment, two features of the GCP platform contributed most significantly to the reduction of unplanned data litter:
  1. System elasticity. The data rate coming into the system is never uniform, and is especially high when there is an ongoing attack. The system is most under pressure when it is of highest value and needs to have enough elasticity to be able to deal with spikes without being provisioned significantly above the median data rates. Without this, the pressure release valve needs to be constantly engaged. 
  2. Transactional processing power. The transactional load on the database at the end of the chain peaks during an attack. It also determines the performance characteristics of the user experience and of protective enforcement, both of which add to the workload when an API is under attack. Therefore, transactional loads need to be able to comfortably scale to meet the demands of the system near its limits. 
As part of this transition, we moved our analysis chain to Cloud Dataproc, which provided significantly more nimble and cost-controlled elasticity. Because the cost of the analytics pipeline represented our most significant constraint, we were able to size our processing capacity limits more aggressively. This gave us the additional elastic capacity needed to meet peak demands without increasing our cost.

We also moved our target database to BigQuery. BigQuery distributes and scales cost-effectively and without hiccups well beyond our needs, and indeed, beyond most reasonable IT budgets. This completely eliminated the back pressure issue from the end of the chain.

Because two of the three sources of data litter are now gone, our team is able to focus on improving the timeliness of our analysis—ensuring that we move data from where it's gathered through the analysis engine and make more intelligent and more relevant decisions with lower latency. This is what Sense was intended to do.

By moving Apigee Sense to GCP, we feel that we’ve taken back the control of our destiny. I'm sure that our customers will notice the benefits not just in terms of a more reliable service, but also in the velocity with which we are able to ship new capabilities to them.

Whitepaper: Lift and shift to Google Cloud Platform



Today we're announcing the availability of a new white paper entitled “How to Lift-and-Shift a Line of Business Application onto Google Cloud Platform.” This is the first in a series of four white papers focused on application migration and modernization. Stay tuned to the GCP blog as we release the next installments in the coming weeks.

The “Lift-and-Shift” white paper walks you through migrating a Microsoft Windows-based, two-tier expense-reporting web application that currently resides on-premises in your data center. The white paper provides background information and a three-phased project methodology, as well as pointers to application code on GitHub. You'll be able to replicate the scenario on-premises, and walk through migrating your application to Google Cloud Platform (GCP).

The phased project includes implementation of initial GCP resources, including GCP networking, a site-to-site VPN and virtual machines (VMs), as well as setting up Microsoft SQL Server availability groups, and configuring Microsoft Active Directory (AD) replication in your new hybrid environment.

Want to learn more about how to lift and shift your own application by reading through (or following along with) the steps in the white paper? If you're ready to get started, you can download your copy of the white paper and start your migration today.

Simplify Cloud VPC firewall management with service accounts



Firewalls provide the first line of network defense for any infrastructure. On Google Cloud Platform (GCP), Google Cloud VPC firewalls do just that—controlling network access to and between all the instances in your VPC. Firewall rules determine who's allowed to talk to whom and more importantly who isn’t. Today, configuring and maintaining IP-based firewall rules is a complex and manual process that can lead to unauthorized access if done incorrectly. That’s why we’re excited to announce a powerful new management feature for Cloud VPC firewall management: support for service accounts.

If you run a complex application on GCP, you’re probably already familiar with service accounts in Cloud Identity and Access Management (IAM) that provide an identity to applications running on virtual machine instances. Service accounts simplify the application management lifecycle by providing mechanisms to manage authentication and authorization of applications. They provide a flexible yet secure mechanism to group virtual machine instances with similar applications and functions with a common identity. Security and access control can subsequently be enforced at the service account level.


Using service accounts, when a cloud-based application scales up or down, new VMs are automatically created from an instance template and assigned the correct service account identity. This way, when a VM boots up, it gets the right set of permissions within the relevant subnet, so firewall rules are automatically configured and applied.

Further, the ability to use Cloud IAM ACLs with service accounts allows application managers to express their firewall rules in the form of intent, for example, allow my “application x” servers to access my “database y.” This removes the need to manually manage server IP address lists while simultaneously reducing the likelihood of human error.
This process is leaps-and-bounds simpler and more manageable than maintaining IP address-based firewall rules, which can neither be automated nor templated for transient VMs with any semblance of ease.

Here at Google Cloud, we want you to deploy applications with the right access controls and permissions, right out of the gate. Click here to learn how to enable service accounts. And to learn more about Cloud IAM and service accounts, visit our documentation for using service accounts with firewalls.