Tag Archives: networking

Google Cloud using P4Runtime to build smart networks



Data networks are difficult to design, build and manage, and often don’t work as well as we would like. Here at Google, we deploy and use a lot of network capacity in and between data centers to deliver our portfolio of services, and the costs and burdens of deploying and managing these networks have only grown with their scale and complexity. Almost ten years ago, we took steps to address this by adopting software-defined networking (SDN) as the basis for our network architecture. SDN allowed us to program our networks with software running on standard servers and became a fundamental component of our largest systems. In that time, we’ve continued to develop and improve our SDN technology, and now it’s time to take the next step with P4Runtime.

We are excited to announce our collaboration with the Open Networking Foundation (ONF) on Stratum, an open source project to implement an open reference platform for a truly "software-defined" data plane, designed and built around P4Runtime from the beginning. P4Runtime allows the SDN control plane to establish a contract with the data plane about forwarding behavior, and then to program that behavior through simple RPCs. As part of the project, we’re working with network vendors to make this functionality available in networking products across the industry. As a small-but-complete SDN embedded software solution, Stratum will help bring P4Runtime to a variety of network devices.

But just what is it about P4Runtime that helps with the challenges of building large-scale and reliable networks? Network hardware is typically closed, runs proprietary software and is complex, thanks to the need to operate autonomously and run legacy protocols. Modern data centers and wide-area networks are large, must be fast and simple and are often built using commodity network switch chips interconnected into a large fabric. And despite high-quality whitebox switches and open SDN technology such as OpenFlow, there still aren’t a lot of good, portable options on the market to build these networks.

At Google, we designed our own hardware switches and switch software, but our goal has always been to leverage industry SDN solutions that interoperate with our data centers and wide-area networks. P4Runtime is a new way for control plane software to program the forwarding path of a switch and provides a well-defined API to specify the switch forwarding pipelines, as well as to configure these pipelines via simple RPCs. P4Runtime can be used to control any forwarding plane, from a fixed-function ASIC to a fully programmable network switch.

Google Cloud is looking to P4Runtime as the foundation for our next generation of data centers and wide area network control-plane programming, to drive industry adoption and to enable others to benefit from it. With P4Runtime we’ll be able to continue to build the larger, higher performance and smarter networks that you’ve come to expect.

Three ways to configure robust firewall rules



If you administer firewall rules for Google Cloud VPCs, you want to ensure that the firewall rules you create can be associated only with the correct VM instances by developers in your organization. Without that assurance, it's difficult to manage access to sensitive content hosted on VMs in your VPCs, or to allow those instances access to the internet, and you must carefully audit and monitor the instances to ensure that unintentional access is not granted through the use of tags. With Google VPC, there are now multiple ways to achieve the required level of control, which we’ll describe here in detail.

As an example, imagine you want to create a firewall rule to restrict access to sensitive user billing information in a data store running on a set of VMs in your VPC. Further, you’d like to ensure that developers who can create VMs for applications other than the billing frontend cannot enable these VMs to be governed by firewall rules created to allow access to billing data.
Example topology of a VPC setup requiring secure firewall access.
The traditional approach here is to attach tags to VMs and create a firewall rule that allows access to specific tags, e.g., in the above example you could create a firewall rule that allows all VMs with the billing-frontend tag access to all VMs with the tag billing-data. The drawback of this approach is that any developer with Compute InstanceAdmin role for the project can now attach billing-frontend as a tag to their VM, and thus unintentionally gain access to sensitive data.

Configuring Firewall rules with Service Accounts


With the general availability of firewall rules using service accounts, instead of using tags, you can block developers from enabling a firewall rule on their instances unless they have access to the appropriate centrally managed service accounts. Service accounts are special Google accounts that belong to your application or service running on a VM and can be used to authenticate the application or service for resources it needs to access. In the above example, you can create a firewall rule to allow access to the billing-data@ service account only if the originating source service account of the traffic is billing-frontend@.
Firewall setup using source and target service accounts. (Service accounts names are abbreviated for simplicity.)
You can create this firewall rule using the following gcloud command:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-service-accounts billing-frontend@web.iam.gserviceaccount.com \
    --target-service-accounts billing-data@web.iam.gserviceaccount.com
If, in the above example, the billing frontend and billing data applications are autoscaled, you can specify the service accounts for the corresponding applications in the InstanceTemplate configured for creating the VMs.
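For instance, an instance template that stamps every autoscaled billing-frontend VM with the right service account might look like the following sketch (the template name, machine type and image are illustrative; the service account name comes from the example above):

```shell
# Sketch: an instance template whose VMs all run as the billing-frontend
# service account, so the firewall rule above applies to them automatically.
gcloud compute instance-templates create billing-frontend-template \
    --machine-type n1-standard-1 \
    --image-family debian-9 --image-project debian-cloud \
    --network web-network \
    --service-account billing-frontend@web.iam.gserviceaccount.com \
    --scopes cloud-platform
```

A managed instance group built from this template then scales up and down without any firewall changes.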

The advantage of this approach is that once you set it up, the firewall rules can remain unchanged despite changes in underlying IAM permissions. However, you can currently associate only one service account with a VM, and to change that service account, the instance must be in a stopped state.

Creating custom IAM role for InstanceAdmin


If you want the flexibility of tags but the limitations of service accounts are a concern, you can create a custom role with more restricted permissions that disables the ability to set tags on VMs; do this by omitting the compute.instances.setTags permission. This custom role can have the other permissions present in the InstanceAdmin role and can then be assigned to developers in the organization. With this custom role, you can create your firewall rules using tags:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-tags billing-frontend \
    --target-tags billing-data
Note, however, that permissions assigned to a custom role are static in nature and must be updated with any new permissions that might be added to the InstanceAdmin role, as and when required.
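Creating such a role from the command line might look like this sketch (the project ID and role name are placeholders, and the permissions list is abridged; in practice you would copy the full permission list from the curated InstanceAdmin role, minus the tag-setting permission):

```shell
# Sketch: a restricted instance-admin role that cannot set tags on VMs.
# The permissions shown are a small illustrative subset.
gcloud iam roles create restrictedInstanceAdmin \
    --project my-project \
    --title "Instance Admin (no tags)" \
    --permissions compute.instances.create,compute.instances.delete,compute.instances.start,compute.instances.stop
```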

Using subnetworks to partition workloads


You can also create firewall rules using source and destination IP CIDR ranges if the workloads can be partitioned into subnetworks of distinct ranges as shown in the example diagram below.
Firewall setup using source and destination ranges.
In order to restrict developers’ ability to create VMs in these subnetworks, you can grant Compute Network User role selectively to developers on specific subnetworks or use Shared VPC.
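Granting the role on a single subnetwork can be done from the command line; at the time of writing this was a beta command, and the subnet, region and member names below are examples:

```shell
# Sketch: give one developer the Network User role on just the billing subnet,
# so they can create VMs there but not in other subnets.
gcloud beta compute networks subnets add-iam-policy-binding billing-subnet \
    --region us-central1 \
    --member user:developer@example.com \
    --role roles/compute.networkUser
```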

Here’s how to configure a firewall rule with source and destination ranges using gcloud:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-ranges 10.20.0.0/16 \
    --destination-ranges 10.30.0.0/16
This method allows for better scalability with large VPCs and allows for changes in the underlying VMs as long as the network topology remains unchanged. Note, however, that if a VM instance has can_ip_forward enabled, it may send traffic using the above source range and thus gain access to sensitive workloads.

As you can see, there’s a lot to consider when configuring firewall rules for your VPCs. We hope these tips help you configure firewall rules in a more secure and efficient manner. To learn more about configuring firewall rules, check out the documentation.

Simplify Cloud VPC firewall management with service accounts



Firewalls provide the first line of network defense for any infrastructure. On Google Cloud Platform (GCP), Google Cloud VPC firewalls do just that—controlling network access to and between all the instances in your VPC. Firewall rules determine who's allowed to talk to whom and more importantly who isn’t. Today, configuring and maintaining IP-based firewall rules is a complex and manual process that can lead to unauthorized access if done incorrectly. That’s why we’re excited to announce a powerful new management feature for Cloud VPC firewall management: support for service accounts.

If you run a complex application on GCP, you’re probably already familiar with service accounts in Cloud Identity and Access Management (IAM) that provide an identity to applications running on virtual machine instances. Service accounts simplify the application management lifecycle by providing mechanisms to manage authentication and authorization of applications. They provide a flexible yet secure mechanism to group virtual machine instances with similar applications and functions with a common identity. Security and access control can subsequently be enforced at the service account level.


Using service accounts, when a cloud-based application scales up or down, new VMs are automatically created from an instance template and assigned the correct service account identity. This way, when the VM boots up, it gets the right set of permissions within the relevant subnet, and the appropriate firewall rules are automatically applied.

Further, the ability to use Cloud IAM ACLs with service accounts allows application managers to express their firewall rules in the form of intent, for example, allow my “application x” servers to access my “database y.” This removes the need to manually manage server IP address lists while simultaneously reducing the likelihood of human error.
This process is leaps-and-bounds simpler and more manageable than maintaining IP address-based firewall rules, which can neither be automated nor templated for transient VMs with any semblance of ease.
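That intent translates directly into a rule, with no IP lists involved. A sketch, where the network, port and service account names are all hypothetical:

```shell
# Sketch: express "application x may talk to database y" as a firewall rule
# keyed on service account identity rather than IP addresses.
gcloud compute firewall-rules create app-x-to-db-y \
    --network my-network \
    --allow TCP:5432 \
    --source-service-accounts app-x@my-project.iam.gserviceaccount.com \
    --target-service-accounts db-y@my-project.iam.gserviceaccount.com
```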

Here at Google Cloud, we want you to deploy applications with the right access controls and permissions, right out of the gate. Click here to learn how to enable service accounts. And to learn more about Cloud IAM and service accounts, visit our documentation for using service accounts with firewalls.

What a year! Google Cloud Platform in 2017



The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:
  1. You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions.
  2. How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.
  3. Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.
  4. You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.

So, without further ado, we present to you the most-read stories of 2017.

Cutting-edge infrastructure

If you subscribe to the “bigger is always better” theory of cloud infrastructure, then you were a happy camper this year. Early in 2017, we announced that GCP would be the first cloud provider to offer Intel Skylake architecture, GPUs for Compute Engine and Cloud Machine Learning became generally available and Shazam talked about why cloud GPUs made sense for them. In the spring, you devoured a piece on the performance of TPUs, and another about the then-largest cloud-based compute cluster. We announced yet more new GPU models and topping it all off, Compute Engine began offering machine types with a whopping 96 vCPUs and 624GB of memory.

It wasn’t just our chip offerings that grabbed your attention — you were pretty jazzed about Google Cloud network infrastructure too. You read deep dives about Espresso, our peering-edge architecture, TCP BBR congestion control and improved Compute Engine latency with Andromeda 2.1. You also dug stories about new networking features: Dedicated Interconnect, Network Service Tiers and GCP’s unique take on sneakernet: Transfer Appliance.

What’s the use of great infrastructure without somewhere to put it? 2017 was also a year of major geographic expansion. We started out the year with six regions, and ended it with 13, adding Northern Virginia, Singapore, Sydney, London, Germany, São Paulo and Mumbai. This was also the year that we shed our Earthly shackles, and expanded to Mars ;)

Security above all


Google has historically gone to great lengths to secure our infrastructure, and this was the year we discussed some of those advanced techniques in our popular Security in plaintext series. Among them: 7 ways we harden our KVM hypervisor, Fuzzing PCI Express and Titan in depth.

You also grooved on new GCP security services: Cloud Key Management and managed SSL certificates for App Engine applications. Finally, you took heart in a white paper on how to implement BeyondCorp as a more secure alternative to VPN, and support for the European GDPR data protection laws across GCP.

Open, hybrid development


When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

But our open source efforts aren’t limited to Kubernetes — we also made significant contributions to Spinnaker 1.0, and helped launch the Istio and Grafeas projects. You ate up our "Partnering on open source" series, featuring the likes of HashiCorp, Chef, Ansible and Puppet. Availability-minded developers loved our Customer Reliability Engineering (CRE) team’s missive on release canaries, and with API design: Choosing between names and identifiers in URLs, our Apigee team showed them a nifty way to have their proverbial cake and eat it too.

Google innovation


In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

2017 was also the year machine learning became accessible. For those of you with large datasets, we showed you how to use Cloud Dataprep, Dataflow, and BigQuery to clean up and organize unstructured data. It turns out you don’t need a PhD to learn to use TensorFlow, and for visual learners, we explained how to visualize a variety of neural net architectures with TensorFlow Playground. One Google Developer Advocate even taught his middle-school son TensorFlow and basic linear algebra, as applied to a game of rock-paper-scissors.

Natural language processing also became a mainstay of machine learning-based applications; here, we highlighted it with a lighthearted and relatable example. We launched the Video Intelligence API and showed how Cloud Machine Learning Engine simplifies the process of training a custom object detector. And the makers among you really went for a post that shows you how to add machine learning to your IoT projects with Google AIY Voice Kit. Talk about accessible!

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful holiday season. And be sure to rest up and visit us again next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

One year of Cloud Performance Atlas



In March of this year, we kicked off a new content initiative called Cloud Performance Atlas, where we highlight best practices for GCP performance, and how to solve the most common performance issues that cloud developers come across.

Here’s the top topics from 2017 that developers found most useful.


5. The bandwidth delay problem


Every now and again, I’ll get a question from a company that recently upgraded its connection bandwidth from its on-premises systems to Google Cloud and, for some reason, isn’t getting any better performance as a result. The issue, as we’ve seen multiple times, usually resides in an area of TCP called “the bandwidth delay problem.”

The TCP algorithm works by transferring data in packets between two endpoints. A packet is sent, and then an acknowledgement packet is returned. To get maximum performance in this process, the connection between the two endpoints has to be optimized so that neither the sender nor the receiver is waiting around for acknowledgements from prior packets.

The most common way to address this problem is to adjust the window sizes for the packets to match the bandwidth of the connection. This allows both sides to continue sending data until an ACK arrives back from the client for an earlier packet, thereby creating no gaps and achieving maximum throughput. As such, a low window size will limit your connection throughput, regardless of the available or advertised bandwidth between instances.
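The window size you need is simply the bandwidth-delay product of the link. A rough sketch of the arithmetic, with hypothetical numbers:

```shell
# Bandwidth-delay product: the bytes that must be "in flight" to keep
# the pipe full. Example: a 1 Gbit/s link with a 10ms round-trip time.
BANDWIDTH_BITS_PER_SEC=1000000000
RTT_MS=10
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8 bits-per-byte
BDP_BYTES=$(( BANDWIDTH_BITS_PER_SEC / 1000 * RTT_MS / 8 ))
echo "TCP window needed: ${BDP_BYTES} bytes"   # -> TCP window needed: 1250000 bytes
```

Any window smaller than ~1.25MB in this example caps throughput below a gigabit, no matter how much bandwidth is available.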

Find out more by checking out the video, or article!

4. Improving CDN performance with custom keys


Google Cloud boasts an extremely powerful CDN that can leverage points-of-presence around the globe to get your data to users as fast as possible.

When setting up Cloud CDN for your site, one of the most important things is to ensure that you’re using the right custom cache keys to configure which assets get cached, and which ones don’t. In most cases, this isn’t an issue, but if you’re running a large site with content re-used across protocols (e.g., http and https) you can run into a problem where your cache fill costs increase more than expected.

You can see how we helped a sports website get their CDN keys just right in the video, and article.


3. Google Cloud Storage and the sequential filename challenge


Google Cloud Storage is a one-stop-shop for all your content serving needs. However, one developer repeatedly ran into slow upload speeds when pushing their content into the cloud.

The issue was that Cloud Storage uses the file path and name of the files being uploaded to segment and shard the connection to multiple frontends (improving performance). As we found out, if those file names are sequential then you could end up in a situation where multiple connections get squashed down to a single upload thread (thus hurting performance)!

As shown in the video and article, we were able to help a nursery camera company get past this issue with a few small fixes.
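The gist of the fix is to break up the sequential naming, for example by prepending a short hash of each filename so lexicographically adjacent names land on different upload shards. A sketch, where the bucket and file names are made up:

```shell
# Sketch: derive a pseudo-random prefix from each filename so that
# sequentially named uploads spread across multiple frontends.
for F in cam_0001.jpg cam_0002.jpg cam_0003.jpg; do
  PREFIX=$(printf '%s' "${F}" | md5sum | cut -c1-6)
  echo "gs://my-bucket/${PREFIX}_${F}"   # in practice: gsutil cp "${F}" to this path
done
```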

2. Improving Compute Engine boot time with custom images


Any cloud-based service needs to grow and shrink its resource allocations to respond to traffic load. Most of the time, this is a good thing, especially during the holiday season. ;) As traffic increases to your service/application, your backends will need to spin up more Compute Engine VMs to provide a consistent experience to your users.

However, if it takes too long for your VMs to start up, then the quality and performance for your users can be negatively impacted, especially if your VM needs to do a lot of things during its startup script, like compile code or install large packages.

As we showed in the video and article, you can bake a lot of that work into a custom boot-disk image. When your VMs start, they simply boot from the custom image (with everything already installed), rather than doing everything from scratch.

If you’re looking to improve your GCE boot performance, custom images are worth checking out!
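Creating and using one is straightforward once you’ve prepared a fully provisioned disk; the disk, image and zone names below are examples:

```shell
# Sketch: capture a fully provisioned boot disk as a reusable custom image.
gcloud compute images create my-prebaked-image \
    --source-disk my-configured-disk \
    --source-disk-zone us-central1-b

# New VMs can then boot from it directly, skipping the slow setup work.
gcloud compute instances create fast-boot-vm \
    --zone us-central1-b \
    --image my-prebaked-image
```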

1. App Engine boot time


Modern managed languages (Java, Python, JavaScript, etc.) typically have an initialization phase in which code is imported, dependencies are resolved and objects are instantiated.

Before execution can begin, any global data, functions or state information are also set up. Most of the time, these systems are global in scope, since they need to be used by so many subsystems (for example, a logging system).

In the case of App Engine, this global initialization work can end up delaying start-time, since it must complete before a request can be serviced. And as we showed in the video and article, as your application responds to spikes in workload, this type of global variable contention can put a hurt on your request response times.


See you soon!


For the rest of 2017, our Cloud Performance team is enjoying a few hot cups of tea, relaxing over the holidays and counting down the days until the new year. In 2018, we’ve got a lot of awesome new topics to cover, including increased networking performance, Cloud Functions and Cloud Spanner!

Until then, make sure you check out the Cloud Performance Atlas videos on YouTube or our article series on Medium.

Thanks again for a great year everyone, and remember, every millisecond counts!


5 steps to better GCP network performance



We’re admittedly a little biased, but we’re pretty proud of our networking technology. Jupiter, the Andromeda network virtualization stack and TCP-BBR all run in data centers around the world and across the intercontinental cables that connect them.

As a Google Cloud customer, your applications already have access to this fast, global network, giving your VM-to-VM communication top-tier performance. Furthermore, because Google peers its egress traffic directly with a number of companies (including Cloudflare), you can get content to your customers faster, with lower egress costs.

That said, it’s easy to make small configuration changes, location updates or architectural changes that inadvertently limit the networking performance of your system. Here are the top five things you can do to get the most out of Google Cloud.

1. Know your tools

Testing your networking performance is the first step to improving your environment. Here are the tools I use on a daily basis:
  • Iperf is a commonly used network testing tool that can create TCP/UDP data streams and measure the throughput of the network that carries them. 
  • Netperf is another good network testing tool, which is also used by the PerfKitBenchmark suite to test performance and benchmark the various cloud providers against one another. 
  • traceroute is a computer network diagnostic tool to measure and display packets’ routes across a network. It records the route’s history as the round-trip times of the packets received from each successive host in the route; the sum of the mean times in each hop is a measure of the total time spent to establish the connection.
These tools are battle-hardened, really well documented, and should be the cornerstone of your performance efforts.

2. Put instances in the right zones


One important thing to remember about network latency is that it’s a function of physics.

The speed of light traveling in a vacuum is 300,000 km/s, meaning that it takes about 10ms to travel a distance of ~3000km — about the distance of New York to Santa Fe. But because the internet is built on fiber-optic cable, which slows things down by a factor of ~1.52, data can only travel about 1974km one way in that same 10ms.

So, the farther away two machines are, the higher their latency will be. Thankfully, Google has datacenter locations all around the world, making it easy to put your compute close to your users.
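As a rule of thumb, you can estimate the one-way reach of a latency budget with a couple of lines of arithmetic (the ~1.52 slowdown factor is the fiber figure quoted above; the numbers are rough by design):

```shell
# Rough one-way distance data can cover in fiber within a latency budget.
C_VACUUM_KM_S=300000    # speed of light in vacuum, km/s
SLOWDOWN_X100=152       # fiber slows light by a factor of ~1.52 (scaled x100)
LATENCY_MS=10
REACH_KM=$(( C_VACUUM_KM_S * 100 / SLOWDOWN_X100 * LATENCY_MS / 1000 ))
echo "~${REACH_KM}km one way in ${LATENCY_MS}ms"
```

Double the latency for a round trip, and add switching and queuing delays on top; physics sets the floor, not the total.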


It’s worthwhile to take a regular look at where your instances are deployed, and see if there’s an opportunity to open up operations in a new region. Doing so will help reduce latency to the end user, and also help create a system of redundancy to help safeguard against various types of networking calamity.

3. Choose the right core-count for your networking needs


According to the Compute Engine documentation:

Outbound or egress traffic from a virtual machine is subject to maximum network egress throughput caps. These caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.

In other words, the more virtual CPUs in a guest, the more networking throughput you get. You can see this yourself by setting up a range of instance types and logging their iperf performance:
You can clearly see that as the core count goes up, so do the average and maximum throughput. Even with our simple test, we can see that hard 16 Gbps limit on the larger machines.
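A sketch of what such a sweep might look like (the instance names, zone and server IP below are placeholders; the awk one-liner just pulls the bandwidth figure out of iperf's report line):

```shell
# Hypothetical sweep: one client VM per machine type, each pointed at the
# same iperf server (all names and addresses are placeholders):
#   for mtype in n1-standard-1 n1-standard-4 n1-standard-8 n1-standard-16; do
#     gcloud compute instances create "perf-$mtype" \
#         --machine-type "$mtype" --zone us-central1-b
#   done
# On each client, extract the bandwidth column from iperf's report line;
# a canned sample line stands in for live output here:
sample='[  3]  0.0-10.0 sec  2.27 GBytes  1.95 Gbits/sec'
echo "$sample" | awk '/bits\/sec/ {print $(NF-1), $NF}'
# Prints: 1.95 Gbits/sec
```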

As such, it’s critically important to choose the right type of instance for your networking needs. Picking something too large can cause you to over-provision (and overpay!), while too few cores place a hard limit on your maximum throughput speeds.

4. Use internal over external IPs


Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. Below, you can see that for an n1 machine, the bandwidth measured through iperf to the external IP was only 884 Mbits/sec:

user@instance-2:~$ iperf -c 104.155.145.79
------------------------------------------------------------
Client connecting to 104.155.145.79, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.128.0.3 port 53504 connected with 104.155.145.79 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.03 GBytes   884 Mbits/sec

However, the internal IP between the two machines boasted 1.95 Gbits/sec:

user@instance-2:~$ iperf -c 10.128.0.2
------------------------------------------------------------
Client connecting to 10.128.0.2, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.128.0.3 port 38978 connected with 10.128.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.27 GBytes  1.95 Gbits/sec
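If you're not sure what a VM's internal address is, one way to look it up (the instance name and zone here are placeholders) is with gcloud:

```shell
# Print the internal IP of an instance, suitable for use with "iperf -c":
gcloud compute instances describe instance-1 \
    --zone us-central1-b \
    --format='get(networkInterfaces[0].networkIP)'
```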

5. Rightsize your TCP window


If you have ever wondered why a connection transmits at a fraction of the available bandwidth — even when both the client and the server are capable of higher rates — then it might be due to a window size mismatch.

The Transmission Control Protocol (aka TCP) works by sending windows of data over the internet, relying on a straightforward system of handshakes and acknowledgements to ensure the arrival and integrity of the data and, in some cases, to resend it. On the plus side, this results in a very stable internet. On the downside, it results in lots of extra traffic. And when the sender or receiver stops and waits for ACKs for previous windows/packets, gaps appear in the data flow, limiting the maximum throughput of the connection.

Imagine, for example, a saturated peer that is advertising a small receive window, bad network weather and high packet loss resetting the congestion window, or explicit traffic shaping limiting the throughput of your connection. To address this problem, window sizes should be just big enough such that either side can continue sending data until it receives an ACK for an earlier packet. Keeping windows small limits your connection throughput, regardless of the available or advertised bandwidth between instances.

For the best performance possible in your application, you should fine-tune window sizes based on your client connections, estimated egress and bandwidth constraints. The good news is that the TCP window sizes on standard GCP VMs are tuned for high-performance throughput, so be sure to test the defaults before you make any changes (sometimes, none are needed!).
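On a Linux guest, a reasonable first step is simply to inspect what the kernel is already doing before touching anything (these are standard Linux sysctls; exact values vary by image):

```shell
# Current TCP receive/send buffer autotuning limits (min default max, bytes):
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
# Window scaling must be on (1) for windows larger than 64 KB:
sysctl net.ipv4.tcp_window_scaling
```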


Every millisecond counts

Getting peak performance across a cloud-native architecture is rarely achieved by fixing just one problem. It’s usually a combination of issues, the “death by a thousand cuts” as it were, that chips away at your performance, piece by piece. By following these five steps, you’ll be able to isolate, identify and address some of the most common culprits of poor network performance, to help you take advantage of all the networking performance that’s available to you.

If you’d like to know more about ways to optimize your Google Cloud applications, check out the rest of the Google Cloud Performance Atlas blog posts and videos. Because, when it comes to performance, every millisecond counts.

DNSSEC now available in Cloud DNS



Today, we're excited to announce that Google is adding DNSSEC support (beta) to our fully managed Google Cloud DNS service. Now you and your users can take advantage of the protection provided by DNSSEC without having to maintain it once it's set up.

Why is DNSSEC an important add-on to DNS?

Domain Name System Security Extensions (DNSSEC) adds security to the Domain Name System (DNS) protocol by enabling DNS responses to be validated. A trustworthy DNS that translates a domain name like www.example.com into its associated IP address is an increasingly important building block of today’s web-based applications. Attackers can hijack this process of domain/IP lookup and redirect users to a malicious site through DNS hijacking and man-in-the-middle attacks. DNSSEC helps mitigate the risk of such attacks by cryptographically signing DNS records. As a result, it prevents attackers from issuing fake DNS responses that may misdirect browsers to nefarious websites.

Google Cloud DNS and DNSSEC

Cloud DNS is a fast, reliable and cost-effective Domain Name System that powers millions of domains on the internet. DNSSEC in Cloud DNS enables domain owners to take easy steps to protect their domains against DNS hijacking and man-in-the-middle attacks. Advanced users may choose to use different signing algorithms and denial-of-existence types. We support several sizes of RSA and ECDSA keys, as well as both NSEC and NSEC3. Enabling support for DNSSEC brings no additional charges or changes to the terms of service. 
To start using DNSSEC, simply turn the feature to "on" within your DNS zone; DNSSEC is then enabled automatically for that zone.
To learn more about getting started with DNSSEC for Cloud DNS, please refer to the documentation page.
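For example, with the gcloud CLI, enabling DNSSEC on an existing managed zone looks something like this ("my-zone" is a placeholder; see the documentation for current flag names):

```shell
# Enable DNSSEC signing on an existing Cloud DNS managed zone:
gcloud dns managed-zones update my-zone --dnssec-state on

# Confirm the zone is now signed:
gcloud dns managed-zones describe my-zone --format='get(dnssecConfig.state)'
```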

Andromeda 2.1 reduces GCP’s intra-zone latency by 40%



Google Cloud customers now enjoy significantly improved intra-zone network latency with the release of Andromeda 2.1, a software-defined network (SDN) stack that underpins all of Google Cloud Platform (GCP). The latest version of Andromeda reduces network latency between Compute Engine VMs by 40% over Andromeda 2.0 and by nearly a factor of 8 since we first launched Andromeda in 2014.

This kind of network performance is especially important as more applications move into the cloud and are accessed via web browsers. While the headline metric is often bandwidth, network latency is frequently the more important determiner of application performance. For example, low latency is essential for financial transactions, ad-tech, video, gaming and retail, as well as workloads such as HPC applications, memcache and in-memory databases. Likewise, HTTP-based microservices will see significant improvement in responsiveness with reduced latency.

Andromeda 2.1 latency improvements come from a form of hypervisor bypass that builds on virtio, the Linux paravirtualization standard for device drivers. Andromeda 2.1 enhancements enable the Compute Engine guest VM and the Andromeda software switch to communicate directly via shared memory network queues, bypassing the hypervisor completely for performance-sensitive per-packet operations.

In our previous approach, the hypervisor thread served as a bridge between the guest VM and the Andromeda software switch. Packets flowed from the VM to a hypervisor thread, to the local host’s Andromeda software switch, then over the physical network to another Andromeda software switch, and back up through the hypervisor to the VM. Further, any time the thread wasn’t bridging packets, it was descheduled, increasing tail latency for new packet processing. In many cases, a single network round-trip required four costly hypervisor thread wakeups!

Andromeda 2.1's optimized datapath using hypervisor bypass.


Andromeda 2.1 performance in action


The new Andromeda 2.1 stack delivers noteworthy reductions in VM-to-VM network latency. The figure below shows the factor by which latency has improved over time, relative to the median round-trip time of the original stack.
Factor by which latency has improved over time

This reduction in network round-trip times translates into real-world performance boosts for latency sensitive applications. Take Aerospike, a high-performance in-memory NoSQL database. The new Andromeda stack delivers both a reduction in request latency and improved request throughput for Aerospike, as shown below.



Because Andromeda SDN is a foundational building block of Google Cloud, you should see similar improvements in intra-zone latency, regardless of which applications you're running.

Andromeda SDN delivers flexibility and reliability 


Andromeda SDN enables more flexibility than other hardware-based stacks. With SDN, we can quickly develop and overhaul our entire virtual network infrastructure. We can roll out new cloud network services and features, apply security patches and gain significant performance improvements. Better yet, we can confidently deploy to Google Cloud with no downtime, reboots or even VM migrations, because the flexibility of SDN allows us to thoroughly test our code. Watch this space to learn about the new features and enhanced network performance made possible by our Andromeda SDN foundation.

Google Cloud Dedicated Interconnect gets global routing, more locations, and is GA



We have major updates to Dedicated Interconnect, which helps enable fast private connections to Google Cloud Platform (GCP) from numerous facilities across the globe, so you can extend your on-premises network to your GCP Virtual Private Cloud (VPC) network. With faster private connections offered by Dedicated Interconnect, you can build applications that span on-premises infrastructure and GCP without compromising privacy or performance.

Dedicated Interconnect is now GA and ready for production-grade workloads, and covered by a service level agreement. Dedicated Interconnect can be configured to offer a 99.9% or a 99.99% uptime SLA. Please see the Dedicated Interconnect documentation for details on how to achieve these SLAs.

Going global with the help of Cloud Router


Dedicated Interconnect now supports global routing for Cloud Router, a new feature that makes subnets in GCP accessible from any on-premises network through the Google network. This feature introduces a new flag in Cloud Router that allows the network to advertise all the subnets in a project. For example, a connection from your on-premises data center in Chicago to GCP’s Dedicated Interconnect location in Chicago now gives you access to all subnets running in all GCP regions around the globe, including those in the Americas, Asia and Europe. We believe this functionality is unique among leading cloud providers. This feature is generally available, and you can learn more about it in the Cloud Router documentation.
Using Cloud Router Global Routing to connect on-premises workloads via "Customer Peering Router" with GCP workloads in regions anywhere in the world.
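As a sketch, switching an existing VPC network's dynamic routing mode to global, so that its Cloud Routers advertise subnets from every region, looks something like this ("my-network" is a placeholder):

```shell
# Advertise subnets from all regions over BGP, not just the router's own:
gcloud compute networks update my-network --bgp-routing-mode global
```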

Dedicated Interconnect is your new neighbor


Dedicated Interconnect is also available from four new locations: Mumbai, Munich, Montreal and Atlanta. This means you can connect to Google’s network from almost anywhere in the world. For a full list of locations, visit the Dedicated Interconnect locations page. Please note, in the graphic below, many locations (blue dots) offer service from more than one facility.
In addition to those four new Google locations, we’re also working with Equinix to offer Dedicated Interconnect access in multiple markets across the globe, ensuring that no matter where you are, there's a Dedicated Interconnect connection close to you.
"By providing direct access to Google Cloud Dedicated Interconnect, we are helping enterprises leverage Google’s network  the largest in the world and accelerate their hybrid cloud strategies globally. Dedicated Interconnect offered in collaboration with Equinix enables customers to easily build the cloud of their choice with dedicated, low-latency connections and SLAs that enterprise customers have come to expect from hybrid cloud architectures." 
Ryan Mallory, Vice President, Global Solutions Enablement, Equinix

Here at Google Cloud, we’re really excited about Dedicated Interconnect, including the 99.99% uptime SLA, four new locations, and Cloud Router Global Routing. Dedicated Interconnect will make it easier for more businesses to connect to Google Cloud, and we can’t wait to see the next generation of enterprise workloads that Dedicated Interconnect makes possible.

If you’d like to learn which connection option is right for you, along with pricing details and a whole lot more, please take a look at the Interconnect product page.