Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Cloud Speech API improves longform audio recognition and adds 30 new language variants



Since its launch in 2016, businesses have used the Google Cloud Speech API to improve speech recognition for everything from voice-activated commands to call center routing to data analytics. And since then, we’ve gotten a lot of feedback that our users would like even more functionality and control. That’s why today we’re announcing Cloud Speech API features that expand support for long-form audio and further extend our language support to help even more customers inject AI into their businesses.

Here’s more on what the updated Cloud Speech API can do:


Word-level timestamps

Our most requested feature has been timestamp information for each word in the transcript. Word-level timestamps let users jump to the moment in the audio where the text was spoken, or display the relevant text while the audio is playing. You can find more information on timestamps here.
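For illustration, a recognition request that asks for word-level timestamps might look like the sketch below. It assumes the google-cloud-speech Python client and a short FLAC clip in a Cloud Storage bucket of your own; exact field and method names can vary between client library versions.

```python
# Minimal sketch: requesting word-level timestamps with the Cloud Speech API
# Python client. The bucket path, encoding and sample rate are placeholders.
from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(uri="gs://my-bucket/interview-snippet.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # ask for per-word timestamps
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    for word_info in result.alternatives[0].words:
        # start_time and end_time let you jump to the moment each word was spoken.
        print(word_info.word, word_info.start_time, word_info.end_time)
```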

Happy Scribe uses Cloud Speech API to power its easy-to-use and affordable voice-to-text transcription service, helping professionals such as reporters and researchers transcribe interviews.
“Having the ability to map the audio to the text with timestamps significantly reduces the time spent proofreading transcripts.”  
 Happy Scribe Co-founder, André Bastie
VoxImplant enables companies to build voice and video applications, including IVR and speech analytics applications.
“Now with Google Cloud Speech API timestamps, we can accurately analyze phone call conversations between two individuals with real-time speech-to-text transcription, helping our customers drive business impact. The ability to easily find the place in a call when something was said using timestamps makes Cloud Speech API much more useful and will save our customers’ time.”
 VoxImplant CEO, Alexey Aylarov

Support for files up to 3 hours long

To help our users with long-form audio needs, we’re increasing the length of supported files from 80 minutes to up to 3 hours. Additionally, files longer than 3 hours can be supported on a case-by-case basis by applying for a quota extension through Cloud Support.
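For long files, asynchronous recognition is the path to use. Here’s a hedged sketch with the same Python client, where the Cloud Storage path and polling timeout are placeholder assumptions:

```python
# Sketch: asynchronous (long-running) recognition for long-form audio.
from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(uri="gs://my-bucket/full-interview.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Returns a long-running operation; block until the transcript is ready.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3 * 60 * 60)

transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```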

Expanded language coverage

Cloud Speech API already supports 89 language varieties. Today, coinciding with the broader announcement this morning, we’re adding 30 additional language varieties, from Bengali to Latvian to Swahili, covering more than one billion additional speakers. Our expanded language support helps Cloud Speech API customers reach more users in more countries, for nearly global reach. It also enables users in more countries to use speech to access products and services that were previously unavailable to them.

You can find our complete list of supported languages here.

We hope these updates will help our users do more with Cloud Speech API. To learn more, visit Cloud.google.com/speech/.

CRE life lessons: The practicalities of dark launching



In the first part of this series, we introduced you to the concept of dark launches. In a dark launch, you take a copy of your incoming traffic and send it to the new service, then throw away the result. Dark launches are useful when you want to launch a new version of an existing service, but don’t want nasty surprises when you turn it on.

This isn’t always as straightforward as it sounds, however. In this blog post, we’ll look at some of the circumstances that can make things difficult for you, and teach you how to work around them.

Finding a traffic source

Do you actually have existing traffic for your service? If you’re launching a new web service that doesn’t more or less directly replace an existing service, you may not.

As an example, say you’re an online catalog company that lets users browse items from your physical store’s inventory. The system is working well, but now you want to give users the ability to purchase one of those items. How would you do a dark launch of this feature? How can you approximate real usage when no user is even seeing the option to purchase an item?

One approach is to fire off a dark-launch query to your new component for every user query to the original component. In our example, we might send a background “purchase” request for an item whenever the user sends a “view” request for that item. Realistically, not every user who views an item will go on to purchase it, so we might randomize the dark launch by only sending a “purchase” request for one in every five views.
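As a concrete illustration, the sampling can be as simple as the sketch below; the handler and the two helper functions are hypothetical stand-ins, and the 1-in-5 ratio is just the example above.

```python
import random
import threading

DARK_LAUNCH_RATE = 0.2  # roughly one dark "purchase" per five "view" requests

def render_item_page(user_id, item_id):
    # Stand-in for the existing "view" path.
    return f"item page for {item_id}"

def dark_purchase(user_id, item_id):
    # Stand-in for a call into the new purchase component; the result is discarded.
    pass

def handle_view(user_id, item_id):
    response = render_item_page(user_id, item_id)
    # With probability 0.2, exercise the new purchase component in the background
    # so the user never sees or waits for it.
    if random.random() < DARK_LAUNCH_RATE:
        threading.Thread(target=dark_purchase, args=(user_id, item_id),
                         daemon=True).start()
    return response
```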

This approach will hopefully give you an approximation of live traffic in terms of volume and pattern. Note that it can’t be expected to match live traffic exactly once the service is launched, but it’s better than nothing.

Dark launching mutating services

Generally, a read-only service is fairly easy to dark-launch. A service with queries that mutate backend storage is far less easy. There are still strong reasons for doing the dark launch in this situation, because it gives you some degree of testing that you can’t reasonably get elsewhere, but you’ll need to invest significant effort to get the most from dark-launching.

Unless you’re doing a storage migration, you’ll need to make significant effort/payoff tradeoffs doing dark launches for mutating queries. The easiest option is to disable the mutates for the dark-launch traffic, returning a dummy response after the mutate is prepared but before it’s sent. This is safe, but it does mean that you’re not getting a full measurement of the dark launched service — what if it has a bug that causes 10% of the mutate requests to be incorrectly specified?
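Here is a hedged sketch of the "disable the mutates" option, with hypothetical names: the dark-launch path prepares and validates the mutation exactly as the real path would, then returns a dummy response instead of committing it.

```python
class NewPurchaseBackend:
    """Illustrative only; class and field names are assumptions."""

    def __init__(self, dark_launch: bool):
        self.dark_launch = dark_launch

    def handle_purchase(self, user_id: str, item_id: str, quantity: int):
        # Fully prepare and validate the mutation so that bugs in request
        # construction still show up in dark-launch logs and diffs.
        mutation = {"user_id": user_id, "item_id": item_id, "quantity": quantity}
        self._validate(mutation)

        if self.dark_launch:
            # Don't touch backend storage; return a dummy response instead.
            return {"status": "DARK_LAUNCH_OK", "mutation": mutation}
        return self._commit(mutation)

    def _validate(self, mutation):
        if mutation["quantity"] <= 0:
            raise ValueError("quantity must be positive")

    def _commit(self, mutation):
        # The real write to backend storage would go here.
        return {"status": "OK", "mutation": mutation}
```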

Alternatively, you might choose to send the mutation to a temporary duplicate of your existing storage. This is much better for the fidelity of your test, but great care will be needed to avoid sending real users the response from your temporary duplicate. It would also be very unfortunate for everyone if, at the end of your dark launch, you end up making the new service live when it’s still sending mutations to the temporary duplicate storage.

Storage migration

If you’re doing a storage migration — moving an existing system’s stored data from one storage system to another (for instance, MySQL to MongoDB because you’ve decided that you don’t really need SQL after all) — you’ll find that dark launches will be crucial in this migration, but you’ll have to be particularly careful about how you handle mutation-inducing queries. Eventually you’ll need mutations to take effect in both your old and new storage systems, and then you’ll need to make the new storage system the canonical storage for all user queries.

A good principle is that, during this migration, you should always make sure that you can revert to the old storage system if something goes wrong with the new one. You should know which of your systems (old and new) is the master for a given set of queries, and hence holds the canonical state. The mastership generally needs to be easily mutable and able to revert responsibility to the original storage system without losing data.

The universal requirement for a storage migration is a detailed written plan reviewed by not just your system stakeholders but also by your technical experts from the involved systems. Inevitably, your plan will miss things and will have to adapt as you move through the migration. Moving between storage systems can be an awfully big adventure — expect us to address this in a future blog post.

Duplicate traffic costs

The great thing about a well-implemented dark launch is that it exercises the full service in processing a query, for both the original and new service. The problem this brings is that each query costs twice as much to process. That means you should do the following:


  • Make sure your backends are appropriately provisioned for 2x the current traffic. If you have quota in other teams’ backends, make sure it’s temporarily increased to cover the dark launch as well.
  • If you’re connection-sensitive, ensure that your frontends have sufficient slack to accommodate a 2x connection count.
  • You should already be monitoring latency from your existing frontends, but keep a close eye on this monitoring stat and consider tightening your existing alerting thresholds. As service latency increases, service memory likely also increases, so you’ll want to be alert for either of these stats breaching established limits.


In some cases, the service traffic is so large that a 100% dark launch is not practical. In these instances, we suggest that you determine the largest percentage launch that is practical and plan accordingly, aiming to get the most representative selection of traffic in the dark launch. Within Google, we tend to launch a new service to Googlers first before making the service public. However, experience has taught us that Googlers are often not representative of the rest of the world in how they use a service.

An important consideration if your service makes substantial use of caching is that a sub-50% dark launch is unlikely to see material benefits from caching and hence will probably significantly overstate estimated load at 100%.

You may also choose to test-load your new service at over 100% of current traffic by duplicating some traffic — say, firing off two queries to the new service for every original query. This is fine, but you should scale your quota increases accordingly. If your service is cache-sensitive, then this approach will probably not be useful as your cache hit rate will be artificially high.

Because of the load impact of duplicate traffic, you should carefully consider how to use load shedding in this experiment. In particular, all dark launch traffic should be marked “sheddable” and hence be the first requests to be dropped by your system when under load.
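One hedged way to express "sheddable" in code, with hypothetical names: tag dark-launch requests with the lowest priority and drop them first once the server crosses its load threshold.

```python
from dataclasses import dataclass

PRIORITY_LIVE = 0         # lower number = more important
PRIORITY_SHEDDABLE = 100  # dark-launch traffic gets the lowest priority

@dataclass
class Request:
    payload: str
    priority: int = PRIORITY_LIVE

class LoadShedder:
    """Illustrative only: shed sheddable requests first when under load."""

    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight
        self.inflight = 0

    def admit(self, request: Request) -> bool:
        under_pressure = self.inflight >= self.max_inflight
        if under_pressure and request.priority >= PRIORITY_SHEDDABLE:
            return False  # drop dark-launch traffic before touching live traffic
        self.inflight += 1
        return True

    def done(self):
        self.inflight -= 1
```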

In any case, if your service on-call sees an unexpected increase in CPU/memory/latency, they should drop the dark launch to 0% and see if that helps.

Summary

If you’re thinking about a dark launch for a new service, consider writing a dark launch plan. In that plan, make sure you answer the following questions:


  • Do you have existing traffic which you can fork and send to your new service?
  • Where will you fork the traffic: the application frontend, or somewhere else?
  • Will you fire off the message to the new backend asynchronously, or will you wait for it and impose a timeout?
  • What will you do with requests that generate mutations?
  • How and where will you log the responses from the original and new services, and how will you compare them?
    • Are you logging the following things: response code, backend latency, and response message size?
    • Will you be diffing responses? Are there fields that cannot meaningfully be diffed which you should skip in your comparison?
  • Have you made sure that your backends can handle 2x the current peak traffic, and have you given them temporary quota for it?
    • If not, at what percentage traffic will you stop the dark launch?
  • How are you going to select traffic for participation in the dark launch percentage: randomly, or by hashing on a key such as user ID? (A hashing sketch follows this list.)
  • Which teams need to know that this dark launch is happening? Do they know how to escalate concerns?
  • What’s your rollback plan after you make your new service live?
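On the traffic-selection question, here’s a minimal sketch of deterministic selection by hashing a user ID, so the same users stay in the dark-launch cohort as the percentage ramps up; the function name and bucket count are illustrative.

```python
import hashlib

def in_dark_launch(user_id: str, percent: float) -> bool:
    """Deterministically place `percent`% of users in the dark-launch cohort.

    Hashing on user ID keeps each user's assignment stable as the launch
    ramps from, say, 1% to 100%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100  # e.g. percent=25.0 -> buckets 0..2499

# A user enrolled at 5% stays enrolled as the launch widens to 50%.
print(in_dark_launch("user-12345", 5.0), in_dark_launch("user-12345", 50.0))
```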


It may be that you don’t have enough surprises or excitement in your life; in that case, you don’t need to worry about dark launches. But if you feel that your service gives you enough adrenaline rushes already, dark launching is a great technique to make service launches really, really boring.

Cloud SQL for PostgreSQL updated with new extensions



Among relational databases, PostgreSQL is the open-source solution of choice for a wide range of workloads. Back in March, we added support for PostgreSQL in Cloud SQL, our managed database service, with a limited set of features and extensions. Since then, we’ve been amazed by your interest, with many of you taking the time to suggest desired PostgreSQL extensions on the Issue Tracker and the Cloud SQL discussion group. This feedback has resulted in us adding the following 19 extensions, across four categories:
  • PostGIS: better support for geographic applications
  • Data type: a variety of new data types
  • Language: enhanced functionality with new processing languages
  • Miscellaneous: text search, cryptographic capabilities and integer aggregators, to name but a few
An extension is a piece of software that adds functionality, often data types and procedural languages, to PostgreSQL itself. If you already have a Cloud SQL for PostgreSQL database instance running, you can enable one or more of these extensions.
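For illustration, enabling an extension is a single SQL statement run against your instance. The sketch below uses the psycopg2 Python driver with placeholder connection details (the same statements work from psql or through the Cloud SQL proxy); the extension names are examples from the categories above.

```python
# Sketch: enabling PostgreSQL extensions on an existing Cloud SQL instance.
# Host, credentials and the extension list are placeholder assumptions.
import psycopg2

conn = psycopg2.connect(
    host="10.0.0.5",      # instance IP, or 127.0.0.1 via the Cloud SQL proxy
    dbname="mydb",
    user="postgres",
    password="my-password",
)
conn.autocommit = True

with conn.cursor() as cur:
    for extension in ("postgis", "hstore", "pgcrypto"):
        cur.execute(f"CREATE EXTENSION IF NOT EXISTS {extension};")

conn.close()
```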

We're continuing our journey with PostgreSQL on Cloud SQL. As we prepare for general availability, we’re working on automatic failover for high availability, read replicas, additional extensions and precise restores with point-in-time recovery. Stay tuned!

Thanks for your feedback and please keep it coming on the Issue Tracker and in the Cloud SQL discussion group! Your input helps shape the future of Cloud SQL and all Google Cloud products.

Demystifying container vs VM-based security: Security in plaintext



Containerized workloads have gained in popularity over the past few years for enterprises and startups alike. Containers can significantly improve development speed, lower costs by improving resource utilization, and improve production consistency; however, their unique security implications in comparison to traditional VM-based applications are often not well understood. At Google, we’ve been running container-based production infrastructure for more than a decade and want to share our perspective on how container security compares to traditional applications.

Containerized workloads differ from traditional applications in several major ways. They also provide a number of advantages:

  • Modularized applications (monolithic applications vs. microservices)
  • Lower release overhead (convenient packaging format and well defined CI/CD practices)
  • Shorter lifetimes, so less risk of outdated packages (months to years vs. days to hours)
  • Less drift from original state during runtime (less direct access for maintenance, since workload is short-lived and can easily be rebuilt and re-pushed)
Now let’s examine how these differences can affect various aspects of security.

Understanding the container security boundary

The most common misconception about container security is that containers should act as security boundaries just like VMs, and that because they can’t provide such a guarantee, they're a less secure deployment option. However, containers should be viewed as a convenient mechanism for packaging and delivering applications, rather than as mini VMs.

In the same way that traditional applications are not perfectly isolated from one another within a VM, an attacker or rogue program could break out of a running container and gain control of other containers running on the same VM. However, with a properly secured cluster, a container breakout would require an unpatched vulnerability in the kernel, in the common container infrastructure (e.g., Docker), or in other services exposed to the workload from the VM. To help reduce the risk of these attacks, Google Container Engine provides fully managed nodes, actively monitors for vulnerabilities and outdated packages in the VM (including third-party add-ons), and performs auto-update and auto-repair when necessary. This helps minimize the attack window for a container breakout when a new vulnerability is discovered.

A properly secured and updated VM provides process-level isolation that applies to regular applications as well as container workloads, and customers can use Linux security modules to further restrict a container’s attack surface. For example, Kubernetes, an open source production-grade container orchestration system, supports native integration with AppArmor, Seccomp and SELinux to impose restrictions on syscalls that are exposed to containers. Kubernetes also provides additional tooling to further support container isolation. PodSecurityPolicy allows customers to impose restrictions on what a workload can do or access at the Node level. For particularly sensitive workloads that require VM-level isolation, customers can use taints and tolerations to help ensure only workloads that trust each other are scheduled on the same VM.
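As one hedged illustration of the taint-and-toleration pattern, the sketch below uses the official Kubernetes Python client to create a pod that tolerates a hypothetical dedicated=sensitive:NoSchedule taint, so only workloads carrying that toleration can land on the nodes reserved for that sensitivity level; the image, namespace and taint values are assumptions.

```python
# Sketch: tolerating a node taint with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="sensitive-workload"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(name="app", image="gcr.io/my-project/app:latest"),
        ],
        # Nodes tainted with dedicated=sensitive:NoSchedule reject any pod
        # that doesn't carry a matching toleration.
        tolerations=[
            client.V1Toleration(key="dedicated", operator="Equal",
                                value="sensitive", effect="NoSchedule"),
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```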

Ultimately, in the case of applications running in both VMs and containers, the VM provides the final security barrier. Just like you wouldn’t run programs with mixed security levels on the same VM, you shouldn’t run pods with mixed security levels on the same node due to the lack of guaranteed security boundaries between pods.


Minimizing outdated packages

One of the most common attack vectors for applications running in a VM is vulnerabilities in outdated packages. In fact, 99.9% of exploited vulnerabilities are compromised more than a year after the CVE was published (Verizon Data Breach Investigation Report, 2015). With monolithic applications, application maintainers often patch OSes and applications manually, and VM-based workloads often run for an extended period of time before they're refreshed.

In the container world, microservices and well defined CI/CD pipelines make it easier to release more frequently. Workloads are typically short-lived (days or even hours), drastically reducing the attack surface for outdated application packages. Container Engine’s host OS is hardened and updated automatically. Further, for customers who adopt fully managed nodes, the guest OS and system containers are also patched and updated automatically, which helps to further reduce the risk from known vulnerabilities.

In short, containers go hand in hand with CI/CD pipelines that allow for very regular releases and update the containers with the latest patches as frequently as possible.


Towards centralized governance

One of the downsides of running traditional applications on VMs is that it’s nearly impossible to understand exactly what software is running in your production environment, let alone control exactly what software is being deployed. This comes down to three root causes:
  1. The VM is an opaque application packaging format, and it's hard to establish a streamlined workflow to examine and catalog its content prior to deployment
  2. VM image management is not standardized or widely adopted, and it’s often hard to track down every image that has ever been deployed to a project
  3. Due to VM workloads’ long lifespans, administrators must frequently manipulate running workloads to update and maintain both the applications and the OS, which can cause significant drift from the application’s original state when it was deployed
And because it’s hard to determine the accurate state of traditional applications at scale, typical security controls approximate it by focusing on anomaly detection in application and OS behaviors and settings.

In contrast, containers provide a more transparent, easy-to-inspect and immutable format for packaging applications, making it easy to establish a workflow to inspect and catalog container content prior to deployment. Containers also come with a standardized image management mechanism (a centralized image repository that keeps track of all versions of a given container). And because containers are typically short-lived and can easily be rebuilt and re-pushed, there's typically less drift of a running container from its deploy-time state.

These properties help turn container dev and deploy workflows into key security controls. By making sure that only the right containers built by the right process with the right content are deployed, organizations can gain control and knowledge of exactly what’s running in their production environment.


Shared security ownership

In some ways, traditional VM-based applications offer a simpler security model than containerized apps. Their runtime environment is typically created and maintained by a single owner, and IT maintains total control over the code they deploy to production. Infrequent and drawn-out releases also mean that centralized security teams can examine every production push in detail.

Containers, meanwhile, enable agile release practices that allow faster and more frequent pushes to production, leaving less time for centralized security reviews, and shifting the responsibility for security back to developers.

To mitigate the risks introduced by faster development and decentralized security ownership, organizations adopting containers should also adopt the best practices highlighted in the previous section: a private registry to centrally control external dependencies in a production deployment (e.g., open-source base images), image scanning as part of the CI/CD process to identify vulnerabilities and problematic dependencies, and deploy-time controls to help ensure that only known-good software gets deployed to production.

Overall, an automated and streamlined secure software supply chain that ensures software quality and provenance can provide significant security advantages and can still incorporate periodic manual review.

Summary


While many of the security limitations of VM-based applications hold true for containers (for now), using containers for application packaging and deployment creates opportunities for more accurate and streamlined security controls.

Watch this space for future posts that dig deep on containers, security and effective software development teams.

Visit our webpage to learn more about the Google Cloud Platform (GCP) security model.

Announcing price cuts on Local SSDs for on-demand and preemptible instances



Starting today, you'll pay up to 63% less for Local solid-state disks (SSDs) attached to on-demand Google Compute Engine virtual machines. That’s $0.080 per GB per month in most US regions. We’re also introducing even lower prices for Local SSDs used with Preemptible VM instances: up to 71% cheaper than before. That’s $0.064 per GB per month in most US regions.

At Google, we're always looking to reduce total cost of ownership for our customers, pass along price reductions achieved through technology advancements, and adjust our pricing so you can take advantage of technology that helps you innovate, all in a manner that's simple for our users.

Local SSD is our high-performance, physically attached block storage offering that persists as long as your instance exists. Supporting both NVMe and SCSI interfaces, Local SSD provides the high IOPS and bandwidth that the world’s most demanding workloads require. Local SSD is often the preferred option for scratch disks, caching layers and scale-out NoSQL databases.

A key feature of Local SSDs is that you can attach any amount of Local SSD storage to any machine shape. You aren’t locked into a fixed ratio of Local SSD capacity to a VM’s vCPU count and memory. Also, Local SSDs are available on the same instances as GPUs, giving you flexibility in building the most high-performance systems.

In addition to dropping prices on Local SSDs attached to regular, on-demand instances, we’re lowering the price for Local SSDs attached to Preemptible VMs. Preemptible VMs are just like any other Compute Engine VM, with the caveat that they cannot run for more than 24 hours and that we can preempt (shut down) the VM earlier if we need the capacity for other purposes. This allows us to use our data center capacity more efficiently and share the savings with you. You may request special Local SSD quota for use with Preemptible instances, though your current Local SSD quota works as well (learn more).

Google Cloud Platform (GCP) customers use Preemptible VMs to greatly reduce their compute costs, and have come up with lots of interesting use cases along the way. Our customers are using Preemptible VMs with Local SSDs to analyze financial markets, process data, render movies, analyze genomic data, transcode media and complete a variety of business and engineering tasks, using thousands of Preemptible VM cores in a single job.

We hope that the price reduction on Local SSDs for on-demand and Preemptible VMs will unlock new opportunities and help you solve more interesting business, engineering and scientific problems.

For more details, check out our documentation for Local SSDs and Preemptible VMs. For more pricing information, take a look at the Compute Engine Local SSD pricing page or try out our pricing calculator. If you have questions or feedback, go to the Getting Help page.

We’re excited to see what you build with our products. If you want to share stories and demos of the cool things you've built with Compute Engine, reach out on Twitter, Facebook or G+.

Introducing automated deployment to Kubernetes and Google Container Engine with Codefresh



Editor’s Note: Today we hear from our partner Codefresh, which just launched a deep integration with Google Container Engine to make it easier to deploy containers to Kubernetes. Read on for more details about the integration and how to automate deployments to Container Engine in just a few minutes.

Codefresh is an opinionated toolchain for delivering containers. Our customers use it to handle both the automated and manual tasks associated with building, testing, debugging and deploying containers. Container-based applications running on Kubernetes are more scalable and reliable, and we want to streamline the process for getting containers deployed. That’s why we’re proud to announce Codefresh’s 10-minute setup for deploying to Kubernetes.

We’ve tested this integration with new and advanced users. Novice Kubernetes users tell us that Codefresh makes it incredibly easy to get their applications deployed to Kubernetes. Advanced users tell us that they like how they can easily access the full features of Kubernetes and configure them for their applications.

How to start deploying to Kubernetes in four steps

In just a few steps, you can get up and running with Codefresh and start deploying containers to Kubernetes. Here’s a short video that shows how it’s done.

Alternatively, here’s an overview:

Step 1: Create a cluster on Google Cloud
From the Google Cloud Console, navigate to Container Engine and click "Create a container cluster."
Step 2: Connect Codefresh to Google Cloud Platform (GCP)
Log in to Codefresh (it’s free), go to Admin -> Integrations and log in with Google.
Step 3: Add a cluster
Once you’ve added a cluster, it’s available in automated pipelines and manual image deployments.
Step 4: Start deploying!
Set ports, replicas, expose services or just let the defaults be your guide.
Step 5 (optional): Tweak the generated YAML files
Codefresh’s configuration screens also generate deployment.yml and pod.yml files, which you can then edit directly. Advanced users can use their own YAML files and let Codefresh handle the authentication, deployment, etc.

Connecting the build, test, deploy pipeline

Once you’ve configured Codefresh and GCP, you can automate deployment with testing, approval workflows and certification. With Codefresh, developers and DevOps teams can agree upfront on rules and criteria for when images should go to manual testing, onto canary clusters or deployment in production.
Further, this mix of infrastructure and automation allows teams to iterate faster and ultimately provide higher-quality code changes.

Join us for a webinar co-hosted by GCP and Codefresh

Want to learn more? Google Container Engine Product Manager William Denniss will join Codefresh Full-Stack Developer Dan Garfield to show how development velocity speeds up when connected to a Kubernetes-native pipeline. Register here for the August 30th webinar.

Want to get started deploying to Kubernetes? Codefresh is offering 200 builds per month for free and $500 in GCP credits for new accounts1. Try it out.



1 Terms and conditions apply


Independent research firm names Google Cloud the Insight PaaS Leader



Forrester Research, a leading analyst firm, just named Google Cloud Platform (GCP) the leader in The Forrester Wave™: Insight Platforms-As-A-Service, Q3 2017, its analysis of cloud providers offering Platform as a Service. According to the report, an insight PaaS makes it easier to:

  • Manage and access large, complex data sets
  • Update and evolve applications that deliver insight at the moment of action
  • Update and upgrade technology
  • Integrate and coordinate team member activities

For this Wave, Forrester evaluated eight vendors against 36 evaluation criteria spanning three broad buckets: current offering, strategy and market presence.

Of the eight vendors, Google Cloud’s insight PaaS scored highest for both current offering and strategy.
“Google was the only vendor in our evaluation to offer insight execution features like full machine learning automation with hyperparameter tuning, container management and API management. Google will appeal to firms that want flexibility and extreme scalability for highly competent data scientists and cloud application development teams used to building solutions on PaaS.”  The Forrester Wave: Insight Platforms-As-A-Service, Q3 2017
Our presence in the Insight Platform as a Service market goes way back. We started with a vision for serverless computing back in 2008 with Google App Engine and added serverless data processing in 2010 with Google BigQuery. In 2016 we added machine learning (Cloud Machine Learning Engine) to GCP to help bring the power of TensorFlow (Google’s open source machine learning framework) to everyone. We continue to be amazed by what companies like Snap and The Telegraph are doing with these technologies and look forward to building on these insight services to help you build the amazing applications of tomorrow.

Sign up here to get a complimentary copy of the report.

CRE life lessons: What is a dark launch, and what does it do for me?



Say you’re about to launch a new service. You want to make sure it’s ready for the traffic you expect, but you also don’t want to impact real users with any hiccups along the way. How can you find your problem areas before your service goes live? Consider a dark launch.

A dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user. (Note: We’ve also seen dark launches referred to as “feature toggles,” but this doesn’t generally capture the “dark” or hidden traffic aspect of the launch.)

Dark launches allow you to do two things:

  1. Verify that your new service handles realistic user queries in the same way as the existing service, so you don’t introduce a regression.
  2. Measure how your service performs under realistic load.
Dark launches typically transition gradually from a small percentage of the original traffic to a full (100%) dark launch where all traffic is copied to the new backend, discovering and resolving correctness and scaling issues along the way. If you already have a source of traffic for your new site — for instance, when you’re migrating from an existing frontend to a new frontend — then you’re an excellent candidate for a dark launch.


Where to fork traffic: clients vs. servers

When considering a dark launch, one key question is where the traffic copying/forking should happen. Normally this is the application frontend, i.e. the first service, which (after load balancing) receives the HTTP request from your user and calculates the response. This is the ideal place to do the fork, since it has the lowest friction of change — specifically, in varying the percentage of external traffic sent to the new backend. Being able to quickly push a configuration change to your application frontend that drops the dark launch traffic fraction back down to 0% is an important — though not crucial — requirement of a dark launch process.
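To make the forking point concrete, here’s a minimal, hypothetical sketch of an application frontend handler that copies a configurable fraction of requests to the new backend and discards the dark responses; the two backend calls are stand-ins, and the fraction is the knob you’d push back to 0% in an emergency.

```python
import random
import threading

DARK_LAUNCH_FRACTION = 0.05  # pushed via config; 0.0 turns the dark launch off

def call_original_backend(request):
    # Stand-in for the existing backend call.
    return {"status": 200, "body": "original response"}

def call_new_backend(request):
    # Stand-in for the new backend call; its result is logged, never returned.
    return {"status": 200, "body": "new response"}

def handle_request(request):
    response = call_original_backend(request)

    if random.random() < DARK_LAUNCH_FRACTION:
        # Fire the dark request in the background so user latency is unaffected.
        threading.Thread(target=call_new_backend, args=(request,),
                         daemon=True).start()

    return response  # the user only ever sees the original backend's response
```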

If you don’t want to alter the existing application frontend, you could replace it with a new proxy service which does the traffic forking to both your original and a new version of the application frontend and handles the response diffing. However, this increases the dark launch’s complexity, since you’ll have to juggle load balancing configurations to insert the proxy before the dark launch and remove it afterwards. Your proxy almost certainly needs to have its own monitoring and alerting — all your user traffic will be going through it, and it’s completely new code. What if it breaks?

One alternative is to send traffic at the client level to two different URLs, one for the original service, and the other for the new service. This may be the only practical solution if you’re dark launching an entirely new app frontend and it’s not practical to forward traffic from the existing app frontend — for instance, if you’re planning to move a website from being served by an open-source binary to your own custom application. However, this approach comes with its own set of challenges.



The main risk in client changes is the lack of control over the client’s behavior. If you need to turn down the traffic to the new application, then you’ll at least need to push a configuration update to every affected mobile application. Most mobile applications don’t have a built-in framework for dynamically propagating configuration changes, so in this case you’ll need to make a new release of your mobile app. It also potentially doubles the traffic from mobile apps, which may increase user data consumption.

Another client change risk is that the destination change gets noticed, especially for mobile apps whose teardowns are a regular source of external publicity. Response diffing and logging results is also substantially easier within an application frontend than within a client.


How to measure a dark launch

It’s little use running a dark launch if you’re not actually measuring its effect. Once you’ve got your traffic forked, how do you tell if your new service is actually working? How will you measure its performance under load?

The easiest way is to monitor the load on the new service as the fraction of dark launch traffic ramps up. In effect, it’s a very realistic load test, using live traffic rather than canned traffic. Once you’re at 100% dark launch and have run over a typical load cycle — generally, at least one day — you can be reasonably confident that your server won’t actually fall over when the launch goes live.

If you’re planning a publicity push for your service, you should try to maximize the additional load you put on your service and adjust your launch estimate based on a conservative multiplier. For example, say that you can generate 3 dark launch queries for every live user query without affecting end-user latency. That lets you test how your dark-launched service handles three times the peak traffic. Do note, however, that increasing traffic flow through the system by this amount carries operational risks. There is a danger that your “dark” launch suddenly generates a lot of “light” — specifically, a flickering yellow-orange light which comes from the fire currently burning down your service. If you’re not already talking to your SREs, you need to open a channel to them right now to tell them what you’re planning.

Different services have different peak times. A service that serves worldwide traffic and directly faces users will often peak Monday through Thursday during US morning hours, since US users normally dominate traffic. By contrast, a service like a photo upload receiver is likely to peak on weekends when users take more photos, and will get huge spikes on major holidays like New Year’s. Your dark launch should try to cover the heaviest live traffic that it’s reasonable to wait for.

We believe that you should always measure service load during a dark launch as it is very representative data for your service and requires near-zero effort to do.

Load is not the only thing you should be looking at, however, as the following measurements should also be considered.


Logging needs

The point where incoming requests are forked to the original and new backends — generally, the application front end — is typically also the point where the responses come back. This is, therefore, a great place to record the responses for later analysis. The new backend results aren’t being returned to the user, so they’re not normally visible directly in monitoring at the application frontend. Instead, the application will want to log these responses internally.

Typically the application will want to log response code (e.g. 20x/40x/50x), latency of the query to the backend, and perhaps the response size, too. It should log this information for both the old and new backends so that the analysis can be a proper comparison. For instance, if the old backend is returning a 40x response for a given request, the new backend should be expected to return the same response, and the logs should enable developers to make this comparison easily and spot discrepancies.
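A hedged sketch of what that logging might look like; the backend call signature and log format are assumptions.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dark_launch")

def timed_call(backend_fn, request):
    """Call a backend (assumed to return (code, body)) and time it."""
    start = time.monotonic()
    code, body = backend_fn(request)
    latency_ms = (time.monotonic() - start) * 1000.0
    return code, body, latency_ms

def log_comparison(request_id, old_result, new_result):
    old_code, old_body, old_latency = old_result
    new_code, new_body, new_latency = new_result
    log.info(json.dumps({
        "request_id": request_id,
        "old": {"code": old_code, "latency_ms": round(old_latency, 1),
                "size": len(old_body)},
        "new": {"code": new_code, "latency_ms": round(new_latency, 1),
                "size": len(new_body)},
        "code_mismatch": old_code != new_code,  # e.g. old 404 but new 200
    }))
```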

We also strongly recommend that responses from original and new services are logged and compared throughout dark launches. This tells you whether your new service is behaving as you expect with real traffic. If your logging volume is very high, and you choose to use sampling to reduce the impact on performance and cost, make sure that you account in some way for the undetected errors in your traffic that were not included in the logs sample.

Timeouts as a protection

It’s quite possible that the new backend is slower than the original — for some or all traffic. (It may also be quicker, of course, but that’s less interesting.) This slowness can be problematic if the application or client is waiting for both original and new backends to return a response before returning to the client.

The usual approaches are either to make the new backend call asynchronous, or to enforce an appropriately short timeout for the new backend call after which the request is dropped and a timeout logged. The asynchronous approach is preferred, since the latter can negatively impact average and percentile latency for live traffic.
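Here’s a hedged asyncio sketch of that preferred shape, with stand-in backend calls: the dark request runs asynchronously and, if it exceeds a short budget, is recorded as a timeout rather than delaying the live response.

```python
import asyncio

DARK_TIMEOUT_SECONDS = 0.2  # placeholder latency budget for the new backend

async def call_original_backend(request):
    await asyncio.sleep(0.05)   # stand-in for the real RPC
    return "original response"

async def call_new_backend(request):
    await asyncio.sleep(0.5)    # deliberately slow to demonstrate the timeout
    return "new response"

async def dark_call(request):
    try:
        await asyncio.wait_for(call_new_backend(request), DARK_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        print("dark launch: new backend timed out")  # log it, don't affect users

async def handle_request(request):
    # Kick off the dark request in the background; return as soon as the
    # original backend answers, so live latency is unchanged.
    asyncio.create_task(dark_call(request))
    return await call_original_backend(request)

async def main():
    print(await handle_request({"path": "/item/42"}))
    await asyncio.sleep(DARK_TIMEOUT_SECONDS + 0.1)  # let the dark call finish

asyncio.run(main())
```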

You must set an appropriate timeout for calls to your new service, and you should also make those calls asynchronous from the main user path, as this minimizes the effect of the dark launch on live traffic.


Diffing: What’s changed, and does it matter?

Dark launches where the responses from the old and new services can be explicitly diff’ed produce the most confidence in a new service. This is often not possible with mutations, because you can’t sensibly apply the same mutation twice in parallel; it’s a recipe for conflicts and confusion.

Diffing is nearly the perfect way to ensure that your new backend is drop-in compatible with the original. At Google, it’s generally done at the level of protocol buffer fields. There may be fields where it’s acceptable to tolerate differences, e.g. ordering changes in lists. There’s a trade-off between the additional development work required for a precise meaningful comparison and the reduced launch risk this comparison brings. Alternatively, if you expect a small number of responses to differ, you might give your new service a “diff error budget” within which it must fit before being ready to launch for real.

You should explicitly diff original and new results, particularly those with complex contents, as this can give you confidence that the new service is a drop-in replacement for the old one. In the case of complex responses, we strongly recommend either setting a diff “error budget” (accept up to 1% of responses differing, for instance) or excluding low-information, hard-to-diff fields from comparison.
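Here’s a minimal, hypothetical sketch of that kind of comparison: fields known to differ harmlessly are stripped before diffing, and the launch gate checks the observed diff rate against an error budget.

```python
IGNORED_FIELDS = {"timestamp", "server_id"}  # assumed low-information fields
DIFF_ERROR_BUDGET = 0.01                     # accept up to 1% differing responses

def strip_ignored(response: dict) -> dict:
    return {k: v for k, v in response.items() if k not in IGNORED_FIELDS}

def responses_differ(old: dict, new: dict) -> bool:
    return strip_ignored(old) != strip_ignored(new)

def within_error_budget(pairs) -> bool:
    """pairs: iterable of (old_response, new_response) dicts from the logs."""
    pairs = list(pairs)
    diffs = sum(responses_differ(old, new) for old, new in pairs)
    return diffs / max(len(pairs), 1) <= DIFF_ERROR_BUDGET

# Responses identical apart from an ignored field count as "no diff".
old = {"items": [1, 2, 3], "timestamp": 1000}
new = {"items": [1, 2, 3], "timestamp": 2000}
print(responses_differ(old, new))  # False
```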

This is all well and good, but what’s the best way to do this diffing? While you can do the diffing inline in your service, export some stats, and log diffs, this isn't always the best option. It may be better to offload diffing and reporting out of the service that issues the dark launch requests.

Within Google, we have a number of diffing services. Some run batch comparisons, some process data in live streams, others provide a UI for viewing diffs in live traffic. For your own service, work out what you need from your diffing and implement something appropriate.

Going live

In theory, once you’ve dark-launched 100% of your traffic to the new service, making it go “live” is almost trivial. At the point where the traffic is forked to the original and new service, you’ll return the new service response instead of the original service response. If you have an enforced timeout on the new service, you’ll change that to be a timeout on the old service. Job done! Now you can disable monitoring of your original service, turn it off, reclaim its compute resources, and delete it from your source code repository. (A team meal celebrating the turn-down is optional, but strongly recommended.) Every service running in production is a tax on support and reliability, and reducing the service count by turning off a service is at least as important as adding a new service.

Unfortunately, life is seldom that simple. (As id Software’s John Cash once noted, “I want to move to ‘theory,’ everything works there.”) At the very least, you’ll need to keep your old service running and receiving traffic for several weeks in case you run across a bug in the new service. If things start to break in your new service, your reflexive action should be to make the original service the definitive request handler because you know it works. Then you can debug the problem with your new service under less time pressure.

The process of switching services may also be more complex than we’ve suggested above. In our next blog post, we’ll dig into some of the plumbing issues that increase the transition complexity and risk.

Summary

Hopefully you’ll agree that dark launching is a valuable tool to have when launching a new service on existing traffic, and that managing it doesn’t have to be hard. In the second part of this series, we’ll look at some of the cases that make dark launching a little more difficult to arrange, and teach you how to work around them.

Google Cloud Platform at SIGGRAPH 2017



For decades, the SIGGRAPH conference has brought together pioneers in the field of computer graphics. This year at SIGGRAPH 2017, we're excited to announce several updates and product releases that reinforce Google Cloud Platform (GCP)’s leadership in cloud-based media and entertainment solutions.

As part of our ongoing collaboration with Autodesk, our hosted ZYNC Render service now supports its 3ds Max 3D modeling, animation and rendering software. 3ds Max is widely used in the media and entertainment, architecture and visualization industries, and artists using ZYNC Render can scale their rendering to tens of thousands of cores on demand to meet the ever-increasing need for high-resolution, large-format imagery. Support for 3ds Max builds on our success with Autodesk; since we announced Autodesk Maya support in April 2016, users have logged nearly 27 million core hours on that platform, and we look forward to what 3ds Max users will create.
ZYNC Render for Autodesk 3ds Max
At the launch of 3ds Max support, we’ll also offer support for leading renderers such as Arnold, an Autodesk product, and V-Ray from Chaos Group.

In addition, we’re showing a technology preview of V-Ray GPU for Autodesk Maya on ZYNC Render. Utilizing NVIDIA GPUs running on GCP, V-Ray GPU provides highly scalable, GPU-enhanced rendering performance.

We’re also previewing support for Foundry’s VR toolset CaraVR on ZYNC Render. Running on ZYNC Render, CaraVR can now leverage the massive scalability of Google Compute Engine to render large VR datasets.

We’re also presenting remote desktop workflows that leverage Google Cloud GPUs such as the new NVIDIA P100, which can perform both display and compute tasks. As a result, we're taking full advantage of V-Ray 3.6 Hybrid Rendering technology, as well as NVIDIA's NVLink to share data across multiple NVIDIA P100 cards. We're also showing how to deploy and manage a “farm” of hundreds of GPUs in the cloud.

Google Cloud’s suite of media and entertainment offerings is expansive, from content ingestion and creation to graphics rendering to distribution. Combined with our online video platform Anvato, core infrastructure offerings around compute, GPU and storage, cutting-edge machine learning and Hollywood studio-specific security engagements, Google Cloud provides comprehensive and end-to-end solutions for creative professionals to build media solutions of their choosing.

To learn more about Google Cloud in the media and entertainment field, visit our Google Cloud Media Solutions page. And to experience the power of GCP for yourself, sign up for a free trial.