Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

How SREs find the landmines in a service – CRE life lessons



In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.

Onboarding review


Q: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?

That’s a great approach if your goal is failure. At first, your developer team’s assessment of the application’s importance, and of whether it’s in a supportable state, is likely to be rather different from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it: you haven’t yet convinced them that the service is a good use of their time, and human nature is such that people don’t enthusiastically embrace doing something they don’t really believe in, so they're unlikely to be active participants in making the service materially more reliable.

At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree to a process for the SRE team to understand and assess the service, and identify critical issues to be resolved upfront. (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program.) We typically split this into two phases:

  • SRE entrance review: where an SRE team assesses whether a developer-supported service should be onboarded by SRE, and what the onboarding preconditions should be.
  • SRE onboarding/takeover: where a dev and SRE team agree in principle that the SRE team should take on primary operational responsibility for a service, and start negotiating the exact conditions for takeover (how and when the SREs will onboard the service).

It’s important to remember the motivations of the various parties in this process:

  • Developers want someone else to pick up support for the service, and make it run as well as possible. They want users to feel that the service is working properly, otherwise they'll move to a service run by someone else.
  • The SRE team wants to be sure that they're not being “sold a pup” with a hard-to-support service, and have a vision for making the production service lower in toil and more robust.
  • Meanwhile, company management wants to reduce the number of embarrassing service outages, as long as it doesn’t cost too much engineer time.

The SRE entrance review

During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:

  1. Assess how the service would benefit from SRE ownership
  2. Identify service design, implementation and operational deficiencies that could be a barrier to SRE takeover
  3. If SRE ownership is determined to be beneficial, identify the bug fixes, process changes and service behavior changes needed before onboarding the service

An SRE team typically designates a single person or a small subset of the team to familiarize themselves with the service, and evaluate it for fitness for takeover.

The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” These might be visible problems, such as too many pages per day, or potential problems, such as a dependency on a single machine that will inevitably fail some day.

A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts are loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.
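
To make that relationship concrete, here's a minimal sketch (illustrative only, with hypothetical numbers rather than a real Google alerting configuration) of paging only when the measured SLI threatens the SLO:

```python
# Minimal sketch: page the on-call only when the SLI threatens the SLO.
# The SLO target and request counts below are hypothetical.

SLO_TARGET = 0.999  # hypothetical availability SLO: 99.9% of requests succeed

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully over the measurement window."""
    return good_requests / total_requests if total_requests else 1.0

def should_page(good_requests: int, total_requests: int) -> bool:
    """Page only when the service is in danger of falling out of SLO.
    Anything less urgent should become a ticket or bug rather than a page."""
    return availability_sli(good_requests, total_requests) < SLO_TARGET

# Example: 100,000 requests with 150 failures -> SLI = 0.9985 < 0.999, so this pages.
assert should_page(good_requests=99_850, total_requests=100_000)
```

In a sketch like this, anything that doesn't threaten the SLO never becomes a page, which is what keeps paging alerts rare while the service is healthy.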

In the review phase, SREs aim to understand:

  • what the service does
  • day-to-day service operation (traffic variation, releases, experiment management, config pushes)
  • how the service tends to break and how this manifests in alerts
  • rough edges in monitoring and alerting
  • where the service configuration diverges from the SRE team’s practices
  • major operational risks for the service


The SRE team also considers:

  • whether the service follows SRE team best practices, and if not, how to retrofit it
  • how to integrate the service with the SRE team’s existing tools and processes
  • the desired engagement model and separation of responsibilities between the SRE team and the developer team (for example, when debugging a critical production problem, at what point should the SRE on-call page the developer on-call?)


The SRE takeover


The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).

There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.

Extant bugs
The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.

Reliability
Some reliability-related change requests might not directly block SRE takeover, as many reliability improvements relate to design, significant code changes, a change in back-end integrations or migration off a deprecated infrastructure component, and target the longer-term evolution of the system towards a desired reliability increase.

The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.

Automation
This is a key concern for SREs considering takeover of a service: how much manual work needs to be done to “operate” the service on a week-to-week basis, including configuration pushes, binary releases and similar time-sinks.

The best way to find out what would be most useful to automate is for SREs to get practical experience of the developers’ world. This means that SREs should shadow the developer team’s typical week and get a feel for what routine manual work their on-call actually involves.

If there’s excessive manual work involved in supporting a service, automation usually solves the problem.

Monitoring/alerting
The dominant concern with most services undergoing SRE takeover is the paging rate: how many times the service wakes up the on-call staff. At Google, we adhere to the “Treynor Maximum” of an average of two incidents per 12-hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.

Generally, excessive paging rates are the result of one of three things:

  1. Paging on something that’s not intrinsically important e.g., task restart or hitting 80% capacity of disk. Instead, downgrade the page to a bug (if it’s not urgent) or eliminate it entirely. Moving to symptom-based monitoring (“users are actually seeing problems”) can help improve this situation.
  2. Page storms where one small incident/outage generates many pages. Try to group related pages for an incident into a single outage, to get a clearer picture of the system’s outage metrics.
  3. A system that’s having too many genuine problems. In this case SRE takeover in the near future is unlikely, but SREs may be able to help diagnose and resolve the root causes of the problems.
SREs generally want to see several weeks of low paging levels before agreeing to take over a service.

More general ways to improve the service might include:

  • integrating the service with standard SRE tools and practices e.g., load shedding, release processes and configuration pushes
  • extending and improving playbook entries to rely less on the developer team’s tribal knowledge
  • aligning the service’s configurations with the SRE team’s common languages and infrastructure

Ultimately, an SRE entrance review should produce guidance that's useful to developers even if the SRE team declines to onboard the service; the review should still help them make the service easier to operate and more reliable.

Smoothing the path


SREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.

One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”

Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.

Last but not least, the SRE entrance review document should also state clearly whether or not the service merits SRE takeover, and why.

At this point, the developer team and SRE team both understand what needs to be done to make a service suitable for SRE takeover, if it is indeed feasible at all. In Part 3 of this blog post, we’ll look at how to proceed with a service takeover so that both teams benefit from the process.

Google App Engine standard now supports Java 8



Java 8 support has been one of the top requests from the App Engine developer community. Today, we're excited to announce the beta availability of Java 8 on App Engine standard environment. Supporting Java 8 on App Engine standard environment is a significant milestone. In addition to support for an updated JDK and Jetty 9 with Servlet 3.1 specs, this launch enables enhanced application performance. Further, this release improves the developer experience with full gRPC and Google Cloud Java Library support, and we have finally removed the class whitelist.

App Engine standard now fully supports off-the-shelf frameworks such as Spring Boot and alternative languages like Kotlin or Apache Groovy. At the same time, the new runtime environment still provides all the great benefits developers have come to depend on and love about App Engine standard, including rapid deployments in seconds, near instantaneous scale up and scale down (including to zero instances when no traffic is detected), native microservices and versioning support, traffic splitting between any two languages (including Java 7 and Java 8), local development tooling and App Engine APIs.

Developer tooling is critical to the Java community. The new runtime supports Stackdriver, Cloud SDK, Maven, Gradle, IntelliJ and Eclipse plugins. In particular, the IntelliJ and Eclipse plugins provide a modern experience optimized for developer flow. Watch the Google Cloud Next 2017 session “Power your Java Workloads on Google Cloud Platform” to learn more about the new IntelliJ plugin, Stackdriver Debugger, traffic splitting, auto scaling and other App Engine features.

As always, developers can choose between App Engine standard and flexible environments: deploy your application to one environment now, and another environment later. Or deploy to both simultaneously, mixing and matching environments as well as languages. (Here’s a guide on how to choose between App Engine flexible and standard environments.)

Below is a one-minute video that demonstrates how easy it is to deploy your first application to App Engine.



To get started with Java 8 for App Engine standard, follow this quickstart. Or, if you’re a current App Engine standard Java 7 user, upgrade to the new runtime by adding java8 to your appengine-web.xml file, as described in the video above. Be sure to deploy a new version of your service, direct a small portion of your traffic to it and monitor for errors.
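
For reference, the change is a one-line runtime setting; here's a minimal appengine-web.xml sketch (your real file will contain more configuration than this):

```xml
<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <!-- Switch the service from the Java 7 runtime to Java 8 -->
  <runtime>java8</runtime>
  <threadsafe>true</threadsafe>
</appengine-web-app>
```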

You can find samples of all the code in the documentation here. For sample applications running Kotlin, Spring-Boot and SparkJava, check out this repository.

We've been investing heavily in language and infrastructure updates for both App Engine environments (we recently announced the general availability of Java 8 on App Engine flexible and Python upgrades), with many more to come. We’d love to hear from you during the Java 8 beta period and beyond. Submit your feedback on the Maven, Gradle, IntelliJ and Eclipse plugins, as well as the Google Cloud Java Libraries on their respective GitHub repositories.

Happy Coding!

Enterprise identity made easy in Google Cloud Platform with Cloud Identity



As an organization, you want to be able to control how your users access Google’s products and other services online. Millions of G Suite customers already rely on Google Cloud’s identity services to secure their online identities, perform single sign on and enforce multi-factor authentication. We're excited to announce that the same identity management features used for years in G Suite will be made available for free to Google Cloud Platform (GCP) customers to manage their developers online with Cloud Identity.

Introducing Cloud Identity support in GCP

Starting today, we’re rolling out native support for Cloud Identity right into GCP. Cloud Identity makes it easy to provision and manage users and groups directly from the Google Admin Console. Once you sign up for Cloud Identity, you'll also get access to the Cloud Resource Manager to administer your new GCP organization. Cloud Resource Manager allows you to centrally manage all of your organization's GCP projects and IAM roles. With Cloud Identity and Cloud Resource Manager, you now have full control over how your organization uses Google Cloud.

Try it today


To start using Cloud Identity, head to the Cloud Console to find the new “Identity” section under Cloud IAM. Here you'll be able to find the Cloud Identity sign up flow, where you'll create your new Cloud Identity admin account and Cloud Identity organization. For more information, check out our Getting Started Guide.

Versioning APIs at Google



Versioning APIs is difficult, and everyone in the API space has opinions about how to do it properly. It’s also almost impossible to avoid. As teams build new software, occasionally they need to get rid of a feature (or provide that feature in a different way). Versioning gives your API users a reliable way to understand semantic changes in the API. While some companies will go to great lengths to never change a version, we don’t have that luxury: with the number of APIs we operate, the number of teams developing them here and the number of developers relying on them, we version APIs so developers know what to expect from them.

Versioning APIs should be done according to a consistent and comprehensive policy. At Google, we follow the general principles of semantic versioning for our APIs. The principles behind semantic versioning are simple: each release gets a number X and a number Y, where X indicates a major version and Y indicates a minor version. A new major version indicates a backward-incompatible change, while a new minor version indicates a backward-compatible change.

Our major versions are reflected in the path of our APIs, immediately following the domain. Why? Because it means that for any API URL you call, we will never rename or drop the fields you rely on. If you're doing a GET on coolcloudapi.googleapis.com/v1/coolthings/12301221312132, you can rely on the fact that the JSON returned will never have fields renamed or removed.

There are pros and cons to this approach, of course, and many smart people have heated debates over the “right” way to version. Some people prefer encoding a version request in a header, others “keep track” of the version that any individual API consumer is used to getting. We’ve seen and heard them all, and collectively we’ve decided that, for our broad purposes, encoding the major version in the URL makes the most sense most of the time.

Note that the minor version is not encoded in the URL. That means that if we enhance the Cool Cloud API by adding a new field, you may one day be surprised when a call to coolcloudapi.googleapis.com/v1/coolthings/12301221312132 starts returning some additional data. But we’ll never "break" your app by removing fields.

When we release a new major version, we generally write a single backend that can handle both versions. All requests (regardless of version) are sent to the backend, and it uses the version in the path to decide which surface to return.
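
As an illustration only (a minimal sketch, not Google's serving code; the "coolthings" resource follows the hypothetical Cool Cloud API above), a single backend that routes on the major version in the path might look like this:

```python
# Minimal sketch of path-based major-version routing. Handler names and
# response fields are invented purely to illustrate the idea.

def handle_v1(resource_id: str) -> dict:
    # v1 surface: fields returned here are never renamed or removed.
    return {"id": resource_id, "coolness": 11}

def handle_v2(resource_id: str) -> dict:
    # v2 surface: a backward-incompatible rename is allowed in a new major version.
    return {"id": resource_id, "coolnessLevel": 11}

HANDLERS = {"v1": handle_v1, "v2": handle_v2}

def route(path: str) -> dict:
    # e.g. "/v1/coolthings/12301221312132" -> version "v1", id "12301221312132"
    _, version, _collection, resource_id = path.split("/", 3)
    return HANDLERS[version](resource_id)

print(route("/v1/coolthings/12301221312132"))  # {'id': '12301221312132', 'coolness': 11}
```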

For customers using Cloud Endpoints, our API gateway, we’re starting to release the features that will enable you to follow these same versioning practices.

First, our proxy can now serve multiple versions of your API and report which version each request used. This lets you see how much traffic different versions of your API receive, and in practice, how much of your traffic has migrated to a new version.

Second, it can give you a strategy for deprecating and turning down an API, by finding out who's still using the old version. But that’s a topic for another day.

Versioning is the thorn on the rose of making better APIs. We believe in the approach we’ve adopted internally, and are happy to share the best practices we’ve developed with the community. To get started with Cloud Endpoints, check out our 10-minute quickstart or in-depth tutorials, or reach out to us on our Google Group at [email protected]; we’d love to hear from you!

Why should your app get SRE support? – CRE life lessons



Editor’s note: When you start to run many applications or services in your company, you'll start to bump up against the limit of what your primary SRE (or Ops) team can support. In this installment of CRE Life Lessons, we look at how you can make good, principled and defensible decisions about which of your company’s applications and services you should give to your SREs to support, and how to decide when that subset needs to change.

At Google, we're fortunate to have Site Reliability Engineering (SRE) teams supporting both our horizontal infrastructure, such as storage, networking and load balancing, and our major applications, such as Search, Maps and Photos. Nevertheless, the combination of software engineering and systems engineering skills required for the role makes SREs hard to find and recruit, and demand for them steadily outstrips supply.

Over time we’ve found some practical limits to the number of applications that an SRE team can support, and learned the characteristics of applications that are more trouble to support than others. If your company runs many production applications, your SRE team is unlikely to be able to support them all.

Q: How will I know when my company’s SRE team is at its limit? How do I choose the best subset of applications to support? When should the SRE team drop support for an application?

Good questions all; let’s explore them in more detail.

Practical limitations on SRE support


At Google, the rule of thumb for the minimum SRE team needed to staff a pager rotation without burn-out is six engineers; for a 24/7 pager rotation with a target response time under 30 minutes, we don’t want any engineer to be on-call for more than 12 continuous hours because we don’t want paging alerts interrupting their sleep. This implies two groups of six engineers each, with a wide geographic spread so that each team can handle pages mostly in their daytime.

At any one time, there's usually a designated primary who responds to pages, and a secondary who catches fall-through pages e.g., if the primary is temporarily out of contact, or is in the middle of managing an incident. The primary and secondary handle normal ops work, freeing the rest of the team for project work such as improving reliability, building better monitoring or increasing automation of ops tasks. Therefore every engineer has two weeks out of six focused on operational work -- one as primary, one as secondary.
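
A quick back-of-the-envelope sketch of that staffing math (illustrative numbers matching the rule of thumb above):

```python
# Minimal sketch of the on-call staffing arithmetic described above.
engineers_per_site = 6     # minimum team size to staff a rotation without burn-out
sites = 2                  # two sites so nobody is on-call more than ~12 daytime hours

# In a weekly rotation, each engineer is primary one week and secondary one week
# out of every six, so roughly a third of their time goes to operational work.
ops_weeks_per_cycle = 2
cycle_weeks = engineers_per_site
ops_fraction = ops_weeks_per_cycle / cycle_weeks

print(f"{engineers_per_site * sites} engineers, "
      f"{ops_fraction:.0%} of each engineer's time on ops work")  # 12 engineers, 33% ...
```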

Q: Surely 12 to 16 engineers can handle support for all the applications your development team can feasibly write?

Actually, no. Our experience is that there's a definite cognitive limit to how many different applications or services an SRE team can effectively manage; any single engineer needs to be sufficiently familiar with each app to troubleshoot, diagnose and resolve most production problems. If you want to make it easy to support many apps at once, make them as similar as possible: design them to use common patterns and back-end services, standardize on common tools for operational tasks like rollout, monitoring and alerting, and deploy them on similar schedules. This reduces the per-app cognitive load, but doesn’t eliminate it.

If you do have enough SREs then you might consider making two teams (again, subject to the 2 x 6 minimum staffing limit) and give them separate responsibilities. At Google, it’s not unusual for a single SRE team to split into front-end and back-end shards, each taking responsibility for supporting only that half of the system, as it grows in size. (We call this team mitosis.)

Your SRE team’s maximum number of supported services will be strongly influenced by factors such as:

  • the regular operational tasks needed to keep the services running well, for example releases, bug fixes, non-urgent alerts/bugs. These can be reduced (but not eliminated) by automation;
  • “interrupts” -- unscheduled, non-critical human requests. We’ve found these stubbornly resistant to efforts to reduce them; the most effective strategy has been self-service tools that address the 50% to 70% of queries that recur;
  • emergency alert response, incident management and follow-up. The best way to spend less time on these is to make the service more reliable, and to have better-tuned alerts (i.e., that are actionable and which, if they fire, strongly indicate real problems with the service).


Q: What about the four weeks out of six during which an SRE isn’t doing operational work? Couldn’t we use that time to increase our SRE team’s supported service capacity?

You could do this but at Google we view this as “eating your seed corn.” The goal is to have the machines do all the things that are possible for machines to do, and for that to happen you need to leave breathing room for your SREs to do project work such as producing new automation for your service. In our experience, once a team crosses the 50% ops work threshold, it quickly descends a slippery slope to 100% ops. In that condition you’re losing the engineering effort that will give you medium-to-long term operational benefits such as reducing the frequency, duration and impact of future incidents. When you move your SRE team into nearly full-time ops work, you lose the benefit of its engineering design and development skills.

Note in particular that SRE engineering project work can reduce operational load by addressing many of the factors we described above, which were limiting how many services an SRE team could support.

Given the above, you may well find yourself in a position where you want your SRE team to onboard a new service but in practice they're not able to support it on a sustainable basis.

You’re out of SRE support capacity - now what?

At Google, our working principle is that any service that’s not explicitly supported by SRE must be supported by its developer team; if you have enough developers to write a new application, you probably have enough developers to support it. Our developers tend to use the same monitoring, rollout and incident management tools as the SREs they work with, so the operational support workload is similar. In any case, we like the developers who wrote an application to support it directly for a little while so they can get a good feel for how customers are experiencing it. The things they learn doing so help SREs onboard the service later.

Q: What about the next application we want the developers to write? Won’t they be too busy supporting the current application?

This may be true: the current application may be generating a high operational workload due to excessive alerts or a lack of automation. However, this gives the developer team a practical incentive to spend time making the application easier to support: tuning alerts, spending developer time on automation and reducing the velocity of functional changes.

When developers are overloaded with operational work, SREs might be able to lend operational expertise and development effort to reduce the developers’ workloads to a manageable level. However, SREs still shouldn’t take on operational responsibility for the service, as this won’t solve the fundamental problem.

When one team develops an application and another team bears the brunt of the operational work for it, moral hazard thrives. Developers want high development velocity; it’s not in their interest to spend days running down and eliminating every odd bug that occasionally causes their server to run out of memory and need a restart. Meanwhile, the operational team is getting paged to do those restarts several times per day; it’s very much in their interest to get that bug fixed, since it’s their sleep that is being interrupted. Not surprisingly, when developers bear the operational load for their own system, they too are incentivized to spend time making it easier to support. This also turns out to be important for persuading an SRE team to support their application, as we shall see later.

Choosing which applications to support


The easiest way to prioritize the applications for SRE to support is by revenue or other business criticality, i.e., how important it will be if the service goes down. After all, having an SRE team supporting your service should improve its reliability and availability.

Q: Sounds good to me; surely prioritizing by business impact is always the right choice?

Not always. There are services which actually don’t need much support work; a good example is a simple infrastructure service (say, a distributed key-value store) that has reached maturity and is updated only infrequently. Since nothing is really changing in the service, it’s unlikely to break spontaneously. Even if it’s a critical dependency of several user-facing applications, it might not make sense to dedicate SRE support; rather, let its developers hold the pager and handle the low volume of operational work.

At Google, we consider that SRE teams have seven areas of focus that developers typically don’t:

  • Monitoring and metrics. For example, detecting response latency, error or unanswered query rate, and peak utilization of resources
  • Emergency response. Running on-call rotations, traffic-dip detection, primary/secondary/escalation, writing playbooks, running Wheels of Misfortune
  • Capacity planning. Doing quarterly projections, handling a sudden sustained load spike, running utilization-improvement projects
  • Service turn-up and turn-down. For services which run in many locations (e.g., to reduce end-user latency), planning location turn-up/down schedules and automating the process to reduce risks and operational load
  • Change management. Canarying, 1% experiments, rolling upgrades, quick-fail rollbacks, and measuring error budgets
  • Performance. Stress and load testing, resource-usage efficiency monitoring and optimization.
  • Data Integrity. Ensuring that non-reconstructible data is stored resiliently and highly available for reads, including the ability to rapidly restore it from backups


With the possible exception of “emergency response” and “data integrity,” our key-value store wouldn’t benefit substantially from any of these areas of expertise, and the marginal benefit of having SREs rather than developers support it is low. On the other hand, the opportunity cost of spending SRE support capacity on it is high; there are likely to be other applications which could benefit from more of SREs’ expertise.

One other reason that SREs might take on responsibility for an infrastructure service that doesn’t need SRE expertise is that it's a crucial dependency of services they already run. In that case, there could be a significant benefit to them of having visibility into, and control of, changes to that service.

In part 2 of this blog post, we’ll take a look at how our SRE team could determine how (and indeed, whether) to onboard a business-critical service once it has been identified as able to benefit from SRE support.

Google Compute Engine ranked #1 in price-performance by Cloud Spectator



Cloud Spectator, an independent benchmarking and consulting agency, has released a new comparative benchmarking study that ranks Google Cloud #1 for price-performance and block storage performance against AWS, Microsoft Azure and IBM SoftLayer.

In January 2017, Cloud Spectator tested the overall price-performance, VM performance and block storage performance of four major cloud service providers: Google Compute Engine, Amazon Web Services, Microsoft Azure, and IBM SoftLayer. The result is a rare apples-to-apples comparison among major Cloud Service Providers (CSPs), whose distinct pricing models can make them difficult to compare.

According to Cloud Spectator, “A lack of transparency in the public cloud IaaS marketplace for performance often leads to misinformation or false assumptions.” Indeed, RightScale estimates that up to 45% of cloud spending is wasted on resources that never end up being used — a serious hit to any company’s IT budget.

The report can be distilled into three key insights, which upend common misconceptions about cloud pricing and performance:
  • Insight #1: VM performance varies across cloud providers. In testing, Cloud Spectator observed differences of up to 1.4X in VM performance and 6.1X in block storage performance.
  • Insight #2: You don’t always get what you pay for. Cloud Spectator’s study found no correlation between price and performance.
  • Insight #3: Resource contention (the “Noisy Neighbor Effect”) can affect performance — but CSPs can limit those effects. Cloud Spectator points out that noisy neighbors are a real problem with some cloud vendors. To try to handle the problem, some vendors throttle down their customers' access to resources (like disks) in an attempt to compensate for other VMs (the so-called noisy neighbors) on the same host machine.

You can download the full report here, or keep reading for key findings.

Key finding: Google leads for overall price-performance

Value, defined as the ratio of performance to price, varies by 2.4x across the compared IaaS providers, with Google achieving the highest CloudSpecs Score (see Methodology, below) among the four cloud IaaS providers. This is due to strong disk performance and the least expensive packaged pricing found in the study.


To learn more, download “2017 Best Hyperscale Cloud Providers: AWS vs. Azure vs. Google vs. SoftLayer,” a report by Cloud Spectator.


Methodology

Cloud Spectator’s price-performance calculation, the CloudSpecs Score™, provides information on how much performance the user receives for each unit of cost. The CloudSpecs Score™ is an indexed, comparable score ranging from 0 to 100, indicative of value based on a combination of cost and performance. It is calculated as:

  • price-performance_value = [VM performance score] / [VM cost]
  • best_VM_value = max{price-performance_values}
  • CloudSpecs Score™ = 100 * price-performance_value / best_VM_value
The overall storage CloudSpecs Score™ was calculated by averaging the block storage and vCPU-memory price-performance scores together, so that they carry equal weight for each VM size; the resulting per-size scores were then averaged together.
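
To make the arithmetic concrete, here's a minimal sketch of the formula above using invented performance and price numbers (not figures from the report):

```python
# Minimal sketch of the CloudSpecs-style price-performance index,
# using invented numbers purely to illustrate the formula above.

providers = {
    # provider: (VM performance score, hourly VM cost in USD) -- hypothetical values
    "provider_a": (1000.0, 0.10),
    "provider_b": (1200.0, 0.20),
}

# price-performance_value = performance / cost
values = {name: perf / cost for name, (perf, cost) in providers.items()}

# score = 100 * value / best value, so the best-value provider scores 100
best = max(values.values())
scores = {name: 100 * v / best for name, v in values.items()}

print(scores)  # {'provider_a': 100.0, 'provider_b': 60.0}
```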


Google Cloud Platform expands to Australia with new Sydney region – open now



Starting today, developers can choose to run applications and store data in Australia using the new Google Cloud Platform (GCP) region in Sydney. This is our first GCP region in Australia and the fourth in Asia Pacific, joining Taiwan, Tokyo and the recently launched Singapore.

GCP customers down under will see significant reductions in latency when they run their applications in Sydney. Our performance testing shows 80% to 95% reductions in round-trip time (RTT) latency when serving customers from New Zealand and Australian cities such as Sydney, Auckland, Wellington, Melbourne, Brisbane, Perth and Adelaide, compared to using regions in Singapore or Taiwan.

The Sydney GCP region is launching with three zones and several GCP services; App Engine and Datastore will be available shortly.

Google Cloud customers benefit from our commitment to large-scale infrastructure investments. With the addition of each new region, developers have more choice on how to run applications closest to their customers. Google’s networking backbone, meanwhile, transforms compute and storage infrastructure into a global-scale computer, giving developers around the world access to the same cloud infrastructure that Google engineers use every day.

In Asia-Pacific, we’re already building another region in Mumbai, as well as new network infrastructure to tie them all together, including the SJC and Indigo subsea fiber optic cable systems.

What customers are saying

Here’s what the new region means to a few of our customers and partners.
"The regional expansion of Google Cloud Platform to Australia will help enable PwC's rapidly growing need to experiment and innovate and will further extend our work with Google Cloud.

It not only provides a reliable and resilient platform that can support our firm's core technology needs, it also makes available to us, GCP's market leading technologies and capabilities to support the unprecedented demand of our diverse and evolving business."


—Hilda Clune, Chief Information Officer, PwC Australia
"Monash University has one of the most ambitious digital transformation agendas in tertiary education. We're executing our strategy at pace and needed a platform which would give us the scale, flexibility and functionality to respond rapidly to our development and processing needs. Google Cloud Platform (GCP) and in particular App Engine have been a great combination for us, and we're very excited at the results we're getting. Having Google Cloud Platform hosted now in Australia is a big bonus." 
—Trevor Woods, Chief Information Officer, Monash University
"Modern geophysical technologies place a huge demand on supercomputing resources. Woodside utilises Google Cloud as an on-demand solution for our large computing requirements. This has allowed us to push technological boundaries and dramatically reduce turnaround time."
— Sean Salter, VP Technology, Woodside Energy Ltd.

Next steps

We want to help you build what’s next for you. If you’re looking for help to understand how to deploy GCP, please contact local partners: Shine Solutions, Servian, 3WKS, Axalon, Onigroup, PwC, Deloitte, Glintech, Fronde or Megaport.

For more details on Australia’s first region, please visit our Sydney region page where you’ll get access to free resources, whitepapers, an on-demand training video series called "Cloud On-Air" and more. These will help you get started on GCP. Give us a shout to request early access to new regions and help us prioritize what we build next.

New Singapore GCP region – open now



The Singapore region is now open as asia-southeast1. This is our first Google Cloud Platform (GCP) region in Southeast Asia (and our third region in Asia), and it promises to significantly improve latency for GCP customers and end users in the area.

Customers are loving GCP in Southeast Asia; the total number of paid GCP customers in Singapore has increased by 100% over the last 12 months.

And the experience for GCP customers in Southeast Asia is better than ever too; performance testing shows 51% to 98% reductions in round-trip time (RTT) latency when serving customers in Singapore, Jakarta, Kuala Lumpur and Bangkok compared to using other GCP regions in Taiwan or Tokyo.

Customers with a global footprint like BBM Messenger, Carousell and Go-Jek have been looking forward to the launch of the Singapore region.
"We are excited to be able to deploy into the GCP Singapore region, as it will allow us to offer our services closer to BBM Messenger key markets. Coupled with Google's global load balancers and extensive global network, we expect to be able to provide a low latency, high-speed experience for our users globally. During our POCs, we found that GCP outperformed most vendors on key metrics such as disk I/O and network performance on like-for-like benchmarks. With sustained usage discounts and continuous support from Google's PSO and account team, we are excited to make GCP the foundation for the next generation of BBM consumer services. Matthew Talbot, CEO of Creative Media Works, the company that runs BBM Messenger Consumer globally.
"As one of the largest and fastest growing mobile classifieds marketplaces in the world, Carousell needed a platform that was agile enough for a startup, but could scale quickly as we expand. We found all these qualities in the Google Cloud Platform (GCP), which gives us a level of control over our systems and environment that we didn't find elsewhere, along with access to cutting edge technologies. We're thrilled that GCP is launching in Singapore, and look forward to being inspired by the way Google does things at scale."  — Jordan Dea-Mattson, Vice President Engineering, Carousell

"We are extremely pleased with the performance of GCP, and we are excited about the opportunities opening in Indonesia and other markets, and making use of the Singapore Cloud Region. The outcomes we’ve achieved in scaling, stability and other areas have proven how fantastic it is to have Google and GCP among our key service partners." — Ajey Gore, CTO, Go-Jek
We’ve launched Singapore with two zones and an initial set of GCP services. In addition, you can combine any of the services you deploy in Singapore with other GCP services around the world, such as DLP, Spanner and BigQuery.

Singapore Multi-Tier Cloud Security certification

Google Cloud is pleased to announce that, having completed the required assessment, it has been recommended by an approved certification body for Level 3 certification under Singapore's Multi-Tier Cloud Security (MTCS) standard (SS 584:2015+C1:2016). Customers can expect formal approval of Google Cloud's certification in the coming months. As a result, organizations that require compliance with the strictest levels of the MTCS standard can confidently adopt Google Cloud services and host their data on Google Cloud's infrastructure.

Next steps

If you’re looking for help to understand how to deploy GCP, please contact local partners Sakura Sky, CloudCover, Cloud Comrade and Powerupcloud.

For more details on the Singapore region, please visit our Singapore region portal, where you’ll get access to free resources, whitepapers, an on-demand video series called "Cloud On-Air" and more. These will help you get started on GCP. Our locations page provides updates on other regions coming online soon. Give us a shout to request early access to new regions and help us prioritize what we build next.

Best practices for App Engine startup time: Google Cloud Performance Atlas



[Editor’s note: In the past couple of months, Colt McAnlis of Android Developers fame joined the Google Cloud developer advocate team. He jumped right in and started blogging (and vlogging) for the new Google Cloud Performance Atlas series, focused on extracting the best performance from your GCP assets. Check out this synopsis of his first video, where he tackles the problem of cold-boot performance in App Engine standard environment. Vroom vroom!]

One of the fantastic features of App Engine standard environment is that it has load balancing built in, and can spin instances up or down based on traffic demands. This is great in situations where your content goes viral, or for the daily ebbs and flows of traffic, since you don’t have to spend time thinking about provisioning at all.

As a baseline, it’s easy to establish that App Engine startup time is really fast. The following graph charts instance type vs. startup time for a basic Hello World application:


250 ms is pretty fast to boot up an App Engine F2 instance class. That’s faster than fetching a JavaScript file from most CDNs on a 4G connection, and shows that App Engine responds quickly to requests to create new instances.

There are great resources that detail how App Engine manages instances, but for our purposes, there’s one main concept we’re concerned with: loading requests.

A loading request triggers App Engine’s load balancer to spin up a new instance. This is important to note, since the response time for a loading request will be significantly higher than average: the request must wait for the instance to boot up before it's serviced.

As such, the key to responding to rapid load balancing while keeping the user experience high is to optimize the cold-boot performance of your App Engine application. Below, we’ve gathered a few suggestions for addressing the most common cold-boot performance problems.

Leverage resident instances

Resident instances are instances that stick around regardless of the type of load your app is handling; even when you’ve scaled to zero, these instances will still be alive.

When spikes do occur, resident instances service requests that cannot be serviced in the time it would take to spin up a new instance; requests are routed to them while a new instance spins up. Once the new instance is up, traffic is routed to it and the resident instance goes back to being idle.


The point here is that resident instances are the key to rapid scale and not shooting users’ perception of latency through the roof. In effect, resident instances hide instance startup time from the user, which is a good thing!
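
In App Engine standard with automatic scaling, resident instances correspond to idle instances. Here's a minimal app.yaml sketch, assuming a Python standard-environment app; the value itself is hypothetical and should be tuned to the size of your traffic spikes:

```yaml
# Minimal app.yaml sketch: keep a couple of resident (idle) instances warm
# so spikes are absorbed while new instances spin up. Values are illustrative.
runtime: python27
api_version: 1
threadsafe: true

automatic_scaling:
  min_idle_instances: 2   # instances that stay up even when traffic drops to zero

handlers:
- url: /.*
  script: main.app
```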

For more information, check out our Cloud Performance Atlas article on how resident instances helped a developer reduce their startup time.

Be careful with initializing global variables during parallel requests

While using global variables is a common programming practice, they can create a performance pitfall in certain scenarios relating to cold-boot performance. If your global variable is initialized during the loading request AND you’ve got parallel requests enabled, your application can fall into a bit of a trap, where multiple parallel requests end up blocking, waiting on the first loading request to finish initializing your global variable. You can see this effect in the logging snapshot below:
The very first request is our loading request, and the next batch is a set of blocked parallel requests, waiting for a global variable to initialize. You can see that these blocked requests can easily end up with 2x higher response latency, which is less than ideal.
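
One way to sidestep the worst of this (a sketch of the general idea, not the exact fix from the linked article; the names below are made up) is to keep module load cheap and build the expensive global lazily behind a lock:

```python
# Sketch of the trap described above and one possible mitigation.
import threading

# Anti-pattern: expensive work at module load time. Every parallel request that
# arrives while the loading request is importing this module blocks on it.
# BIG_LOOKUP_TABLE = build_expensive_lookup_table()

_big_lookup_table = None
_init_lock = threading.Lock()

def get_lookup_table():
    """Lazily build the expensive global on first use, under a lock, so the
    loading request itself returns quickly and only the first caller pays."""
    global _big_lookup_table
    if _big_lookup_table is None:
        with _init_lock:
            if _big_lookup_table is None:  # double-checked so later calls skip the lock
                _big_lookup_table = build_expensive_lookup_table()
    return _big_lookup_table

def build_expensive_lookup_table():
    # Stand-in for slow work (loading data, warming caches, parsing config).
    return {i: i * i for i in range(1_000_000)}
```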

For more info, check out our Cloud Performance Atlas article on how global variables caused one developer a lot of headaches.

Be careful with dependencies

During cold-boot time, your application code is busy scanning and importing dependencies. The longer this takes, the longer it will take for your first line of code to execute. Some languages can optimize this process to be exceptionally fast; others are slower but provide more flexibility.

And to be fair, most of the time, a standard application importing a few modules should have a negligible impact on performance. However, when third-party libraries get big enough, we start to see them do weird things with import semantics, which can mess up your boot time significantly.

Addressing dependency issues is no small feat. You might have to use warm-up requests, lazy-load your imports or, in the most extreme case, prune your dependency tree.
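
As a small sketch of the lazy-loading option (here the standard-library csv and io modules stand in for a much heavier third-party dependency), move the import from module load time into the handler that actually needs it:

```python
# Sketch: defer an import out of the cold-boot path. csv/io stand in for a
# heavyweight dependency that would otherwise slow down the loading request.
def handle_report_request(rows):
    import csv, io   # lazy import: paid by the first request that needs it,
                     # after the instance is already up and serving traffic
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

print(handle_report_request([["instance", "F2"], ["startup_ms", 250]]))
```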

For more info, check out our Cloud Performance Atlas article on how the developer of a platypus-based calculator tracked down a dependency problem.


Every millisecond counts

In the end, optimizing cold-boot performance for App Engine instances is critical for scaling quickly and keeping user perception of latency in a good place. If you’d like to know more about ways to optimize your Google Cloud applications, check out the rest of the Google Cloud Performance Atlas blog posts and videos. Because when it comes to performance, every millisecond counts.