
Outline: secure access to the open web

Censorship and surveillance are challenges that many journalists around the world face on a daily basis. Some of them use a virtual private network (VPN) to provide safer access to the open internet, but not all VPNs are equally reliable and trustworthy, and even fewer are open source.

That’s why Jigsaw created Outline, a new open source, independently audited platform that lets any organization easily create and operate their own VPN.

Outline’s most striking feature is arguably how easy it is to use. An organization starts by downloading the Outline Manager app, which lets them sign in to DigitalOcean, where they can host their own VPN, and set it up with just a few clicks. They can also easily use other cloud providers, provided they have shell access to run the installation script. Once an Outline server is set up, the server administrator can create access credentials and share them with their network of contacts, who can then use the Outline clients to connect to it.


A core element of any VPN’s security is the protocol that the server and clients use to communicate. When we looked at the existing protocols, we realized that many of them were easily identifiable by network adversaries looking to spot and block VPN traffic. To make Outline more resilient against this threat, we chose Shadowsocks, a secure, handshake-less, and open source protocol that is known for its strength and performance, and enjoys the support of many developers worldwide. Shadowsocks is essentially a simplified SOCKS5-like routing protocol running on top of an encrypted channel. We chose the AEAD_CHACHA20_POLY1305 cipher, which is an IETF standard and provides the security and performance users need.
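
Outline’s actual client and server code lives in its GitHub repositories; purely as an illustration of what an AEAD cipher such as AEAD_CHACHA20_POLY1305 provides (encryption and authentication in a single operation), here is a minimal Go sketch using the golang.org/x/crypto/chacha20poly1305 package. Note that Shadowsocks derives per-session keys and nonces according to its own spec; the random key and nonce below are just for the example.
package main

import (
  "crypto/rand"
  "fmt"

  "golang.org/x/crypto/chacha20poly1305"
)

func main() {
  // A 32-byte key. In Shadowsocks the key is derived from the access
  // credential; here we simply generate a random one.
  key := make([]byte, chacha20poly1305.KeySize)
  if _, err := rand.Read(key); err != nil {
    panic(err)
  }

  aead, err := chacha20poly1305.New(key)
  if err != nil {
    panic(err)
  }

  // Each message needs a unique nonce; reusing a nonce with the same
  // key breaks the cipher's security guarantees.
  nonce := make([]byte, aead.NonceSize())
  if _, err := rand.Read(nonce); err != nil {
    panic(err)
  }

  plaintext := []byte("GET / HTTP/1.1")

  // Seal encrypts and authenticates in one step (the "AEAD" property).
  ciphertext := aead.Seal(nil, nonce, plaintext, nil)

  // Open verifies the authentication tag before decrypting, so any
  // tampering with the ciphertext is detected.
  decrypted, err := aead.Open(nil, nonce, ciphertext, nil)
  if err != nil {
    panic(err)
  }
  fmt.Printf("%s\n", decrypted)
}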

Another important component to security is running up-to-date software. We package the server code as a Docker image, enabling us to run on multiple platforms, and allowing for automatic updates using Watchtower. On DigitalOcean installations, we also enable automatic security updates on the host machine.

If security is one of the most critical parts of creating a better VPN, usability is the other. We wanted Outline to offer a consistent, simple user experience across platforms, and for it to be easy for developers around the world to contribute to it. With that in mind, we use the cross-platform development framework Apache Cordova for Android, iOS, macOS and ChromeOS, and Electron for Windows. The application logic is a web application written in TypeScript, while the networking code had to be written in native code for each platform. This setup lets us reuse most of the code and create consistent user experiences across diverse platforms.

In order to encourage a robust developer community, we wanted to strike a balance between simplicity, reproducibility, and automation of future contributions. To that end, we use Travis for continuous builds and to generate the binaries that are ultimately uploaded to the app stores. Thanks to its cross-platform support, any team member can produce a macOS or Windows binary with a single click. We also use Docker to package the build tools for client platforms, and thanks to Electron, developers familiar with the server's Node.js code base can also contribute to the Outline Manager application.

You can find our code in the Outline GitHub repositories and more information on the Outline website. We hope that more developers join the project to build technology that helps people connect to the open web and stay safer online.

By Vinicius Fortuna, Jigsaw

Repairing network hardware at scale with SRE principles



To support our Google Cloud Platform (GCP) customers, we run a complex global network that depends on multiple providers and a lot of hardware. Google network engineering uses a diverse set of vendor equipment to route user traffic from an internet service provider to one of our serving front ends inside a GCP data center. This equipment is proprietary and made by external networking vendors such as Arista, Cisco and Juniper. Each vendor has distinct operational methods, configurations and operational consoles.

With hundreds of distinct components utilized across our global network, we routinely deal with hardware failures—for example, a failed power supply, line card or control plane card. The complexity of today’s cloud networks means that there are a huge number of places where failure can occur. When we first began building and operating our own data centers, Google had a team of engineers, network engineers and site reliability engineers (SREs) who performed fault detection, mitigation and repair work on these devices, using manual processes guided by a ticket system. Google’s SRE principles are prescriptive, and aim to guide developers and operations teams toward better systems reliability. As with DevOps, avoiding toil—the manual tasks that can eat up too much time—is an essential goal.

After becoming familiar with common hardware problems, we realized that any ticket type we encountered repeatedly, and that followed a predetermined sequence of steps, could easily be automated. Over time, our team created a list of playbooks that detailed the steps for dealing with each hardware failure scenario, taking into account relevant software and hardware bugs and typical steps to resolution. Each playbook is used when an alert is received. Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

Building the automation interface

“In the old way of doing things, we treat our servers like pets, for example, Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.”
- Randy Bias

The above quote captures a classic engineering analogy often applied within SRE: "Pets vs. cattle," a way of looking at data center hardware as either individual components or a herd of them. The two categories of equipment can be described as follows:

Pet:
  • An individual device you work on. You're familiar with all of its particular failure modes. 
  • When it gets sick, you come to the rescue.

Cattle:
  • A fleet of devices with a common interface.
  • You manage the "herd" of devices as a group.
  • The common interface lets you perform the same basic operations on any device, regardless of its manufacturer.
Before we moved to automating network hardware failure resolution, we were stuck handling our networking equipment like pets, with an eye toward what made it unique, rather than as cattle, with an eye toward what made it a commodity. We needed to stop custom-managing each of these networking devices. Our initial automation design aimed to turn our fleet into cattle by providing a common interface for interacting with networking equipment. Specifically, we used the underlying primitives to implement a higher-level interface for performing common operations—in this case, the basic operations of a line card in a network device, regardless of vendor: "Bring it online," "Take it offline" and "Check the status." We defined the following interface for a line card, using the Go programming language.


type Linecard interface {
  Online() error
  Offline() error 
  Status() error
}
The error return type in Go simply means that the function returns an error value if it fails. The underlying code implementing this interface for a Juniper line card varies significantly from the implementation for a Cisco line card, but the caller of the function is insulated from those details. The upper-level code imports the library, and when it operates on a line card, it can only perform one of the three actions we specified above.
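
As a purely illustrative sketch (the vendor behavior below is faked, not our production code), a vendor-specific implementation might look like the following; the calling code only ever sees the Linecard interface.
package main

import (
  "fmt"
  "log"
)

// Linecard is the common interface described above.
type Linecard interface {
  Online() error
  Offline() error
  Status() error
}

// fakeVendorLinecard stands in for a real vendor implementation. A real
// one would issue vendor-specific CLI commands or RPCs; this one only
// prints what it would do.
type fakeVendorLinecard struct {
  device string // hostname of the network device
  slot   int    // slot holding the line card
}

func (l *fakeVendorLinecard) Online() error {
  fmt.Printf("%s: bringing line card in slot %d online\n", l.device, l.slot)
  return nil
}

func (l *fakeVendorLinecard) Offline() error {
  fmt.Printf("%s: taking line card in slot %d offline\n", l.device, l.slot)
  return nil
}

func (l *fakeVendorLinecard) Status() error {
  fmt.Printf("%s: checking line card in slot %d\n", l.device, l.slot)
  return nil
}

func main() {
  // The caller works against the interface and is insulated from the
  // vendor-specific details.
  var lc Linecard = &fakeVendorLinecard{device: "router-01", slot: 2}
  if err := lc.Status(); err != nil {
    log.Fatal(err)
  }
}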

We then realized that we could apply the same interface to many hardware components—for example, a fan. For certain vendors, the Online() and Offline() functions did nothing, because those vendors didn't support turning a fan off, so we just used the interface to check the status.
type Fan interface {
  Online() error
  Offline() error 
  Status() error
}
Building upon this line of thought, we realized that we could generalize this interface to define a common interface for all hardware components within a device.
type Component interface {
  Online() error
  Offline() error 
  Status() error
}
By structuring the code this way, anyone can add a device from a new vendor. Moreover, anyone can add any type of new component as a library. Once the library implements this common interface, it can be registered as a handler for that specific vendor and component.
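
The registration mechanism itself is internal, but a hypothetical sketch of such a handler registry in Go might look like this (the names and structure are illustrative only):
package main

import "fmt"

// Component is the generalized interface described above.
type Component interface {
  Online() error
  Offline() error
  Status() error
}

// constructor builds a handler for a component on a given device.
type constructor func(device string) Component

// registry maps a vendor/component pair to the handler that knows how
// to drive it.
var registry = map[string]constructor{}

func register(vendor, component string, c constructor) {
  registry[vendor+"/"+component] = c
}

func lookup(vendor, component, device string) (Component, error) {
  c, ok := registry[vendor+"/"+component]
  if !ok {
    return nil, fmt.Errorf("no handler registered for %s/%s", vendor, component)
  }
  return c(device), nil
}

// stubFan is a placeholder fan handler for a fictional vendor. Its
// Online and Offline methods do nothing, as described above.
type stubFan struct{ device string }

func (f *stubFan) Online() error  { return nil }
func (f *stubFan) Offline() error { return nil }
func (f *stubFan) Status() error {
  fmt.Printf("checking fan status on %s\n", f.device)
  return nil
}

func main() {
  // A new library only needs to implement Component and register itself.
  register("examplevendor", "fan", func(device string) Component {
    return &stubFan{device: device}
  })

  fan, err := lookup("examplevendor", "fan", "switch-07")
  if err != nil {
    panic(err)
  }
  fan.Status()
}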

Deciding what to automate

The system needed to interact with humans at various stages of the automation. To decide what to automate, we drew a flow chart of the normal human-based repair sequence and drew boxes around stages we believed we could replace with automation. We used the task of replacing a vendor control plane board as an example. Many of the steps have self-explanatory names, but these are definitions of some of the more complex ones:
  • Determine control plane: Find the faulty control plane unit.
  • Determine state: Is it the master or the backup?
  • Copy image to control plane: Copy the appropriate software image to the master control plane.
  • Offline control plane: Take the backup control plane offline.
  • Toggle mastership: Make the replaced control plane the new master.
Figure 1: Manual workflow for replacing a vendor control plane board
When we needed to carry out this workflow, a Google network engineer performed each step in Figure 1, with the exception of pulling out and replacing the failed control plane, which was performed by someone on-site at a data center location.

Once we had defined this task, we created an automated workflow. The goal of the new system was to provide a UI for our hardware engineers in a data center that allowed them to perform one of those operations at a specific time, under specific conditions and with various automated safety checks, followed by a full device audit at the end of the operation. Previously, a human had performed all of these steps; now a human only needed to perform the “hardware gets replaced” step in Figure 2.
Figure 2: Automated workflow for replacing a vendor control plane board
Automation, before and after
Figure 3: High-level system view.
You can see in Figure 3 what the system looked like after automation. Before we automated this workflow, it involved a lot of manual work. When an alert came in, an engineer would stop traffic to the device and take the bad component offline by hand. Our network operations center (NOC) team would then work with the vendor—for example, Juniper or Cisco—to get a replacement part on-site. Next, we would file a change request in our change management system, noting the date of the operation.

On the day of the operation:
  • The data center technician clicks “start” on the change management system to begin the repair.
  • Our system picks up this change and is ready to begin the repair.
  • The technician clicks “start” on our UI.
  • An “offline” state machine begins stepping through the stages needed to take the component offline safely (a rough sketch of the idea follows this list).
  • The UI notifies the technician at each step of the way.
  • Once the state machine has completed, it notifies the technician, who can safely replace the component.
  • Once the component is replaced and re-cabled, the technician returns to the UI and starts the “online” state machine, which safely returns the component to production.
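
The state machines themselves are internal, but here is a rough, hypothetical sketch of the idea in Go: an "offline" sequence modeled as an ordered list of named steps, each running one action (typically through the Component interface described earlier) and stopping at the first failure so a human can take over.
package main

import (
  "fmt"
  "log"
)

// step is one stage of the hypothetical offline workflow.
type step struct {
  name string
  run  func() error
}

// runOffline walks the steps in order, reporting progress as it goes,
// and stops at the first failure so a human can investigate.
func runOffline(steps []step) error {
  for _, s := range steps {
    log.Printf("starting step: %s", s.name)
    if err := s.run(); err != nil {
      return fmt.Errorf("step %q failed: %v", s.name, err)
    }
    log.Printf("finished step: %s", s.name)
  }
  return nil
}

func main() {
  // Placeholder steps; the real workflow includes safety checks,
  // traffic draining and a full device audit.
  steps := []step{
    {"drain traffic from device", func() error { return nil }},
    {"take component offline", func() error { return nil }},
    {"verify component is offline", func() error { return nil }},
  }
  if err := runOffline(steps); err != nil {
    log.Fatal(err)
  }
  log.Print("component is safe to replace")
}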
When we reviewed our original automation design, we noticed there would be a lot of work involved in building the various systems needed to implement the automated workflow. To facilitate collaboration, we created ticket items for each component of the system, so multiple engineers could work on the project in parallel.

Automation lessons learned

We used an iterative approach in our planning and execution. We first focused on replacing the line card for one vendor, then moved on to multiple vendors and multiple components. Due to the modular design of the code base and the interacting systems, adding more modules and scaling the code horizontally was easy. 

For example, adding a new library to handle fan replacements simply meant writing the code and ensuring it implemented the interface above; the library then registered itself in the main function.

We had the option to extend or repurpose existing automation systems owned by our software management teams to meet our needs. We had to carefully consider whether to use those systems or build our own, potentially duplicating work if we chose the latter. Ultimately, we built our own automation because the other systems were understaffed. Trying to extend their tools would have disrupted other teams' project work and delayed our own project.

What worked well
Leveraging multiple engineers to automate our internal part of the workflow allowed us to take the project from design to implementation within a short period—about one year.

What didn’t
We haven't yet fully automated our hardware replacement workflow. Doing so involves troubleshooting hardware issues with vendors and persuading them that each individual failure merits a device or component replacement. We work around this gap in our automation by keeping spares on site for use with our repair automation, and handling the vendor workflow portion of the process separately and mostly manually through our NOC. We are currently working toward a fully automated vendor interaction with our vendor partners.

Measuring automation success
We can measure the hours our automation saves engineers using Google's production change logging service, which all internal tools use to record changes made to the production environment. The service logs changes made by tools manually invoked by engineers as well as tools that provide end-to-end automation without manual input. Thus we know how long each network repair action used to take when performed manually, and how many repair actions today's fully automated system carries out. These two data sets allow us to calculate the total time savings from automation. As shown in Figure 4, network hardware repair automation saves us hundreds of hours every month.
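
The exact accounting is internal, but the arithmetic itself is simple; a sketch with made-up numbers:
package main

import "fmt"

func main() {
  // Hypothetical figures, not our real measurements.
  const manualMinutesPerRepair = 45.0 // how long one repair took by hand
  const automatedRepairsPerMonth = 400

  hoursSaved := manualMinutesPerRepair * automatedRepairsPerMonth / 60
  fmt.Printf("approximate hours saved per month: %.0f\n", hoursSaved)
}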

Tips for reducing toil through automation

While strategies for eliminating toil must be tailored to your individual environment and use cases, some approaches are universal. Based upon our own experience eliminating toil by automating network repair tasks, we recommend the following: 
  • Measure your toil.  
  • Tackle the biggest sources of toil first, and don't try to solve all problems at once.  
  • Carefully consider whether to enhance existing tools or build new ones. Even if you can partially repurpose another team's work, would creating a tool from scratch actually make more sense cost- or resource-wise? 
  • Take a design-driven approach: start small and iterate on the design quickly. Don't try to design the perfect approach from the start.
  • Measure your time savings to determine your return on investment.
Automation has proved useful for our team of network site reliability engineers at GCP. Learn more about the practice of SRE and how you might apply its principles to your own network projects.

5 must-see network sessions at Google Cloud NEXT 2018



Whether you’re moving data to or from Google Cloud, or are knee-deep in plumbing your cloud network architecture, there’s a lot to learn at Google Cloud Next 2018 next week in San Francisco. Here’s our shortlist of the five must-see networking breakout sessions at the show, in chronological order from Wednesday to Thursday.
Operations engineer Rebekah Roediger delivering cloud network capacity one link at a time in our Netherlands cloud region (europe-west4).

GCP Network and Security Telemetry
Speakers: Ines Envid, Senior Product Manager, Yuri Solodkin, Staff Software Engineer and Vineet Bhan, Head of Security Partnerships
Network and security telemetry is fundamental to operating your deployments in public clouds with confidence, providing the visibility you need into the behavior of your network and access-control firewalls.
When: July 24th, 2018 12:35pm


A Year in GCP Networking
Speakers: Srinath Padmanabhan, Networking Product Marketing Manager, Google Cloud and Nick Jacques, Lead Cloud Engineer, Target
In this session, we will talk about the valuable advancements that have been made in GCP Networking over the last year. We will introduce you to the GCP Network team and will tell you about what you can do to extract the most value from your GCP Deployment.
When: July 24th, 2018 1:55pm


Cloud Load Balancing Deep Dive and Best Practices
Speakers: Prajakta Joshi, Sr. Product Manager and Mike Columbus, Networking Specialist Team Manager
Google Cloud Load Balancing lets enterprises and cloud-native companies deliver highly available, scalable, low-latency cloud services with a global footprint. You will see demos and learn how enterprise customers deploy Cloud Load Balancing and the best practices they use to deliver smart, secure, modern services across the globe.
When: July 25th, 2018 12:35pm


Hybrid Connectivity - Reliably Extending Your Enterprise Network to GCP
Speaker: John Veizades, Product Manager, Google Cloud
In this session, you will learn how to connect to GCP with highly reliable and secure networking to support extending your data center networks into the cloud. We will cover details of resilient routing techniques, access to Google APIs from on-premises networks, connection locations, and partners that support connectivity to GCP -- all designed to support mission-critical network connectivity to GCP.
When: July 26th, 2018 11:40am


VPC Deep Dive and Best Practices
Speakers: Emanuele Mazza, Networking Product Specialist, Google, Neha Pattan, Software Engineer, Google and Kamal Congevaram Muralidharan, Senior Member of Technical Staff, PayPal
This session will walk you through the unique operational advantages of GCP VPC for your enterprise cloud deployments. We’ll go through detailed use cases, how to seal and audit your VPC, how to extend your VPC to on-prem in hybrid scenarios, and how to deploy highly available services.
When: July 26th, 2018 9:00am


Be sure to reserve your spot in these sessions today—space is filling up!

Our Los Angeles cloud region is open for business



Hey, LA — the day has arrived! The Los Angeles Google Cloud Platform region is officially open for business. You can now store data and build highly available, performant applications in Southern California.

Los Angeles Mayor Eric Garcetti said it best: “Los Angeles is a global hub for fashion, music, entertainment, aerospace, and more—and technology is essential to strengthening our status as a center of invention and creativity. We are excited that Google Cloud has chosen Los Angeles to provide infrastructure and technology solutions to our businesses and entrepreneurs.”

The LA cloud region, us-west2, is our seventeenth overall and our fifth in the United States.

Hosting applications in the new region can significantly improve latency for end users in Southern California, and by up to 80% across Northern California and the Southwest, compared to hosting them in the previously closest region, Oregon. You can visit www.gcping.com to see how fast the LA region is for you.

Services


The LA region has everything you need to build the next great application:

Of note, the LA region debuted with one of our newest products: Cloud Filestore (beta), our managed file storage service for applications that require a filesystem interface and a shared filesystem for data.

The region also has three zones, allowing you to distribute apps and storage across multiple zones to protect against service disruptions. You can also access our multi-regional services (such as BigQuery) in the United States and all the other GCP services via our Google Network, and combine any of the services you deploy in LA with other GCP services around the world. Please visit our Service Specific Terms for detailed information on our data storage capabilities.

Google Cloud Network

Google Cloud’s global networking infrastructure is the largest cloud network as measured by number of points of presence. This private network provides a high-bandwidth, highly reliable, low-latency link to each region across the world. With it, you can reach the LA region as easily as any region. In addition, the global Google Cloud Load Balancing makes it easy to deploy truly global applications.

Also, if you’d like to connect to the Los Angeles region privately, we offer Dedicated Interconnect at two locations: Equinix LA1 and CoreSite LA1.

LA region celebration

We celebrated the launch of the LA cloud region the best way we know how: with our customers. At the celebration, we announced new services to help content creators take advantage of the cloud: Filestore, Transfer Appliance and of course, the new region itself, in the heart of media and entertainment country. The region’s proximity to content creators is critical for cloud-based visual effects and animation workloads. With proximity comes low latency, which lets you treat the cloud as if it were part of your on-premises infrastructure—or even migrate your entire studio to the cloud.
Paul-Henri Ferrand, President of Global Customer Operations, officially announces the opening of our Los Angeles cloud region.


What customers are saying


“Google Cloud makes the City of Los Angeles run more smoothly and efficiently to better serve Angelenos city-wide. We are very excited to have a cloud region of our own that enables businesses, big or small, to leverage the latest cloud technology and foster innovation.”
- Ted Ross, General Manager and Chief Information Officer for City of LA Information Technology Agency, City of LA

“Using Google Cloud for visual effects rendering enables our team to be fast, flexible and to work on multiple large projects simultaneously without fear of resource starvation. Cloud is at the heart of our IT strategy and Google provides us with the rendering power to create Oscar-winning graphics in post-production work.”
- Steve MacPherson, Chief Technology Officer, Framestore

“A lot of our short form projects pop up unexpectedly, so having extra capacity in region can help us quickly capitalize on these opportunities. The extra speed the LA region gives us will help us free up our artists to do more creative work. We’re also expanding internationally, and hiring more artists abroad, and we’ve found that Google Cloud has the best combination of global reach, high performance and cost to help us achieve our ambitions.”
- Tom Taylor, Head of Engineering, The Mill

What SoCal partners are saying


Our partners are available to help design and support your deployment, migration and maintenance needs.

“Cloud and data are the new equalizers, transforming the way organizations are built, work and create value. Our premier partnership with Google Cloud Platform enables us to help our clients digitally transform through efforts like app modernization, data analytics, ML and AI. Google’s new LA cloud region will enhance the deliverability of these solutions and help us better service the LA and Orange County markets - a destination where Neudesic has chosen to place its corporate home.”
- Tim Marshall, CTO and Co-Founder, Neudesic

“Enterprises everywhere are on a journey to harness the power of cloud to accelerate business objectives, implement disruptive features, and drive down costs. The Taos and Google Cloud partnership helps companies innovate and scale, and we are excited for the new Google Cloud LA region. The data center will bring a whole new level of uptime and service to our Southern California team and clients.”
- Hamilton Yu, President and COO, Taos

“As a launch partner for Google Cloud and multi-year recipient of Google’s Partner of the Year award, we are thrilled to have Google’s new cloud region in Los Angeles, our home base and where we have a strong customer footprint. SADA Systems has a track record of delivering industry expertise and innovative technical services to customers nationwide. We are excited to leverage the scale and power of Google Cloud along with SADA’s expertise for our clients in the Los Angeles area to continue their cloud transformation journey.”
- Tony Safoian, CEO & President, SADA Systems

Getting started


For additional details on the LA region, please visit our LA region page where you’ll get access to free resources, whitepapers, the "Cloud On-Air" on-demand video series and more. Our locations page provides updates on the availability of additional services and regions. Contact us to request early access to new regions and help us prioritize where we build next.

GCP arrives in the Nordics with a new region in Finland



Click here for the Finnish version, thank you!

Our sixteenth Google Cloud Platform (GCP) region, located in Finland, is now open for you to build applications and store your data.

The new Finland region, europe-north1, joins the Netherlands, Belgium, London, and Frankfurt in Europe and makes it easier to build highly available, performant applications using resources across those geographies.

Hosting applications in the new region can improve latencies by up to 65% for end-users in the Nordics and by up to 88% for end-users in Eastern Europe, compared to hosting them in the previously closest region. You can visit www.gcping.com to see for yourself how fast the Finland region is from your location.

Services


The Nordic region has everything you need to build the next great application, and three zones that allow you to distribute applications and storage across multiple zones to protect against service disruptions.

You can also access our Multi-Regional services in Europe (such as BigQuery) and all the other GCP services via the Google Network, the largest cloud network as measured by number of points of presence. Please visit our Service Specific Terms to get detailed information on our data storage capabilities.

Build sustainably


The new region is located in our existing data center in Hamina. This facility is one of the most advanced and efficient data centers in the Google fleet. Our high-tech cooling system, which uses sea water from the Gulf of Finland, reduces energy use and is the first of its kind anywhere in the world. This means that when you use this region to run your compute workloads, store your data, and develop your applications, you are doing so sustainably.

Hear from our customers


“The road to emission-free and sustainable shipping is a long and challenging one, but thanks to exciting innovation and strong partnerships, Rolls-Royce is well-prepared for the journey. For us being able to train machine learning models to deliver autonomous vessels in the most effective manner is key to success. We see the Google Cloud for Finland launch as a great advantage to speed up our delivery of the project.”
– Karno Tenovuo, Senior Vice President Ship Intelligence, Rolls-Royce

“Being the world's largest producer of renewable diesel refined from waste and residues, as well as being a technologically advanced refiner of high-quality oil products, requires us to take advantage of leading-edge technological possibilities. We have worked together with Google Cloud to accelerate our journey into the digital future. We share the same vision to leave a healthier planet for our children. Running services on an efficient and sustainably operated cloud is important for us. And even better that it is now also available physically in Finland.”
– Tommi Touvila, Chief Information Officer, Neste

“We believe that technology can enhance and improve the lives of billions of people around the world. To do this, we have joined forces with visionary industry leaders such as Google Cloud to provide a platform for our future innovation and growth. We’re seeing tremendous growth in the market for our operations, and it’s essential to select the right platform. The Google Cloud Platform cloud region in Finland stands for innovation.”
– Anssi Rönnemaa, Chief Finance and Commercial Officer, HMD Global

“Digital services are key growth drivers for our renewal of a 108-year old healthcare company. 27% of our revenue is driven by digital channels, where modern technology is essential. We are moving to a container-based architecture running on GCP at Hamina. Google has a unique position to provide services within Finland. We also highly appreciate the security and environmental values of Google’s cloud operations.”
– Kalle Alppi, Chief Information Officer, Mehiläinen

Partners in the Nordics


Our partners in the Nordics are available to help design and support your deployment, migration and maintenance needs.


"Public cloud services like those provided by Google Cloud help businesses of all sizes be more agile in meeting the changing needs of the digital era—from deploying the latest innovations in machine learning to cost savings in their infrastructure. Google Cloud Platform's new Finland region enables this business optimization and acceleration with the help of cloud-native partners like Nordcloud and we believe Nordic companies will appreciate the opportunity to deploy the value to their best benefit.”
– Jan Kritz, Chief Executive Officer, Nordcloud

Nordic partners include: Accenture, Adapty, AppsPeople, Atea, Avalan Solutions, Berge, Cap10, Cloud2, Cloudpoint, Computas, Crayon, DataCenterFinland, DNA, Devoteam, Doberman, Deloitte, Enfo, Evry, Gapps, Greenbird, Human IT Cloud, IIH Nordic, KnowIT, Koivu Solutions, Lamia, Netlight, Nordcloud, Online Partners, Outfox Intelligence AB, Pilvia, Precis Digital, PwC, Quality of Service IT-Support, Qvik, Skye, Softhouse, Solita, Symfoni Next, Soprasteria, Tieto, Unifoss, Vincit, Wizkids, and Webstep.

If you want to learn more or wish to become a partner, visit our partners page.

Getting started


For additional details on the region, please visit our Finland region page where you’ll get access to free resources, whitepapers, the "Cloud On-Air" on-demand video series and more. Our locations page provides updates on the availability of additional services and regions. Contact us to request access to new regions and help us prioritize what we build next.

Behind the scenes with the Dragon Ball Legends GCP backend



Dragon Ball Legends, a new mobile game from Bandai Namco Entertainment (BNE), is based on its popular Dragon Ball Z franchise, and is rolling out to gamers around the world as we speak. But planning the cloud infrastructure to power the game dates back to February 2017, when BNE approached Google Cloud to talk about the interesting challenges they were facing, and how we could help.

Based on their anticipated demand, BNE had three ambitious requirements for their game:
  1. Extreme scalability. The game would be launched globally, so it needed a backend that could scale to millions of players and still perform well.
  2. Global network. Because the game allows real-time player versus player battles, it needs a reliable and low-latency network across regions.
  3. Real-time data analytics. The game is designed to evolve with players in real time, so it was critical to have a data analytics pipeline that streams data to a data warehouse. The operations team can then measure and evaluate how people are playing the game and adjust it on the fly.
All three of these are areas where we have a lot of experience. Google has multiple global services with more than a billion users, and we use the data those services generate to improve them over time. And because Google Cloud Platform (GCP) runs on the same infrastructure as these Google services, GCP customers can take advantage of the same enabling technologies.

Let’s take a look at how BNE worked with Google Cloud to build the infrastructure for Dragon Ball Legends.


Challenge #1: Extreme scalability

MySQL is extensively used by gaming companies in Japan because engineers are used to working with relational databases that offer schemas, SQL queries and strong consistency. This simplifies things on the application side, which doesn’t have to work around limitations like eventual consistency or missing schema enforcement. MySQL is also widely used outside gaming, and most backend engineers already have strong experience with it.

While MySQL offers many advantages, it has one big limitation: scalability. MySQL is a scale-up database: to increase its performance, you add more CPU, RAM or disk. And when a single instance of MySQL can’t handle the load anymore, you can divide the load by sharding—splitting users into groups and assigning them to multiple independent instances of MySQL. Sharding has a number of drawbacks, however. Most gaming developers calculate the number of shards they’ll need before the game launches, since resharding is labor-intensive and error-prone. As a result, gaming companies tend to overprovision the database to handle more players than they expect. If the game is as popular as expected, everything is fine. But what if the game is a runaway hit and exceeds the anticipated demand? What about the long tail representing a gradual decrease in active players? What if it’s an out-and-out flop? MySQL sharding is not dynamically scalable, and adjusting its size requires maintenance and carries risk.
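
To make the resharding problem concrete, here is a hypothetical sketch (not BNE's code) of the common pattern of hashing a user ID to pick a shard. Because the mapping bakes in the shard count, growing or shrinking the fleet means re-mapping and migrating existing rows:
package main

import (
  "fmt"
  "hash/fnv"
)

// shardFor maps a user to one of numShards MySQL instances. Purely
// illustrative: real sharding schemes vary, but the key point is that
// the shard count is baked into the mapping.
func shardFor(userID string, numShards uint32) uint32 {
  h := fnv.New32a()
  h.Write([]byte(userID))
  return h.Sum32() % numShards
}

func main() {
  // With 8 shards this user lives on one instance...
  fmt.Println(shardFor("player-12345", 8))
  // ...but with 16 shards the same user may map somewhere else, so
  // growing the fleet means migrating data between instances.
  fmt.Println(shardFor("player-12345", 16))
}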

In an ideal world, databases can scale in and out without downtime while offering the advantages of a relational database. When we first heard that BNE was considering MySQL sharding to handle the massive anticipated traffic for Dragon Ball Legends, we suggested they consider Cloud Spanner instead.


Why Cloud Spanner?

Cloud Spanner is a fully managed relational database that offers horizontal scalability and high availability while keeping strong consistency with a schema that is similar to MySQL’s. Better yet, as a managed service, it’s looked after by Google SREs, removing database maintenance and minimizing the risk of downtime. We thought Cloud Spanner would be able to help BNE make their game global.


Evaluation to implementation

Before adopting a new technology, engineers should always test it to confirm its expected performance in a real-world scenario. Before replacing MySQL, BNE created a new Cloud Spanner instance in GCP with a few tables whose schema was similar to the one they used in MySQL. Since their backend developers were writing in Scala, they chose the Java client library for Cloud Spanner and wrote some sample code to load-test Cloud Spanner and see if it could keep up with their queries per second (QPS) requirements for writes—around 30,000 QPS at peak. Working with our customer engineer and the Cloud Spanner engineering team, they met this goal easily. They even developed their own DML (Data Manipulation Language) wrapper to write SQL commands like INSERT, UPDATE and DELETE.
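
BNE's load-test client was written in Scala against the Java client library; as a rough illustration of the same kind of write path in Go (the database path, table and columns below are placeholders, not BNE's schema), a single insert through the cloud.google.com/go/spanner package looks like this:
package main

import (
  "context"
  "log"

  "cloud.google.com/go/spanner"
)

func main() {
  ctx := context.Background()

  // The database path is a placeholder.
  db := "projects/my-project/instances/my-instance/databases/game"
  client, err := spanner.NewClient(ctx, db)
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // Insert one row with a mutation; Apply commits it in a single
  // read-write transaction. Table and column names are hypothetical.
  m := spanner.Insert("Players",
    []string{"PlayerId", "Name", "Level"},
    []interface{}{int64(12345), "Goku", int64(1)})
  if _, err := client.Apply(ctx, []*spanner.Mutation{m}); err != nil {
    log.Fatal(err)
  }
}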


Game release

With the proof of concept behind them, they could start their implementation. Based on the expected daily active users (DAU), BNE calculated how many Cloud Spanner nodes they needed—enough for the 3 million pre-registered players they were expecting. To prepare the release, they organized two closed beta tests to validate their backend, and didn’t have a single issue with the database! In the end, over 3 million participants worldwide pre-registered for Dragon Ball Legends, and even with this huge number, the official game release went flawlessly.

Long story short, BNE can focus on improving the game rather than spending time operating their databases.


Challenge #2: Global network

Let’s now talk about BNE’s second challenge: building a global real-time player-vs-player (PvP) game. BNE’s goal for Dragon Ball Legends was to let all its players play against one another, anywhere in the world. If you know anything about networking, you understand the challenge around latency. Round-trip time (RTT) between Tokyo and San Francisco, for example, is on average around 100 ms. To address that, they decided to divide every game second into 250 ms intervals. So while the game looks like it’s real-time to users, it’s actually a really fast turn-based game at its core (you can read more about the architecture here). And while some might say that 250 ms offers plenty of room for latency, it’s extremely hard to predict the latency when communicating across the Internet.
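
BNE's actual game loop is described in the architecture write-up linked above; the following is only a toy Go sketch of the underlying idea of collecting player actions as they arrive and resolving them in fixed 250 ms turns:
package main

import (
  "fmt"
  "time"
)

func main() {
  // Player actions arrive asynchronously...
  actions := make(chan string, 64)
  go func() {
    actions <- "punch"
    actions <- "block"
  }()

  // ...and are resolved once per 250 ms turn, so the game feels
  // real-time even though it advances in fixed steps.
  ticker := time.NewTicker(250 * time.Millisecond)
  defer ticker.Stop()

  for turn := 1; turn <= 3; turn++ {
    <-ticker.C
    var batch []string
  drain:
    for {
      select {
      case a := <-actions:
        batch = append(batch, a)
      default:
        break drain
      }
    }
    fmt.Printf("turn %d: resolving %v\n", turn, batch)
  }
}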


Why Cloud Networking?

Here’s what it looks like for a game client to access the game server on GCP over the internet. Since the number of hops can vary from one connection to the next, playing PvP can sometimes feel fast and sometimes slow.

One of the main reasons BNE decided to use GCP for the Dragon Ball Legends backend was Google’s dedicated network. As you can see in the picture below, when using GCP, once the game client reaches one of the hundreds of GCP points of presence (POPs) around the world, it’s on the Google dedicated network. That means no unpredictable hops, and the lowest possible, most predictable latency.


Taking advantage of the Google Cloud Network

Gaming companies usually implement PvP by connecting two players directly or through a dedicated game server. Combat games that require low latency between players tend to prefer P2P communication. In general, when two players are geographically close, P2P works very well, but it’s often unreliable across regions (some carriers even block P2P protocols). So that two players on different continents can communicate through Google’s dedicated network, the players first try to communicate via P2P, and if that fails, they fail over to coturn, an open source STUN/TURN server implementation that acts as a relay between the two players. That way, cross-continent battles leverage the low-latency, reliable Google network as much as possible.


Challenge #3: Real-time data analytics

BNE’s last challenge was around real-time data analytics. BNE wanted to offer the best user experience to their fans, and one of the ways to do that is through live game operations, or LiveOps, in which operators make constant changes to the game so it always feels fresh. But to understand players’ needs, they needed data—usually logs of user actions. And if they could get this data in near real time, they could decide what changes to make to the game to increase user satisfaction and engagement.

To gather this data, BNE used a combination of Cloud Pub/Sub, Cloud Dataflow and BigQuery: Dataflow transforms user data in real time and inserts it into BigQuery.
  • Cloud Pub/Sub offers a globally reliable messaging system that buffers the logs until they can be handled by Cloud Dataflow.
  • Cloud Dataflow is a fully managed parallel processing service that lets you execute ETL in real-time and in parallel.
  • BigQuery is the fully managed data warehouse where all the game logs are stored. Since BigQuery offers petabyte-scale storage, scaling was not a concern. Thanks to heavy parallel processing when querying the logs, BNE can get a response to a query that scans terabytes of data in just a few seconds.
This system lets a game producer visualize player behavior in near real time and decide what new features to bring to the game, or what to change inside it, to satisfy their fans.
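
As a simple illustration of the first hop in that pipeline (the project ID, topic name and payload below are placeholders, not BNE's), publishing an action log to Cloud Pub/Sub from Go looks roughly like this; Cloud Dataflow then reads from the topic and writes the transformed rows to BigQuery:
package main

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

func main() {
  ctx := context.Background()

  // Project and topic IDs are placeholders.
  client, err := pubsub.NewClient(ctx, "my-project")
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  topic := client.Topic("player-action-logs")
  defer topic.Stop()

  // Publish a JSON-encoded action log; Pub/Sub buffers it until the
  // Dataflow pipeline picks it up.
  res := topic.Publish(ctx, &pubsub.Message{
    Data: []byte(`{"player_id":12345,"action":"battle_start"}`),
  })
  if _, err := res.Get(ctx); err != nil {
    log.Fatal(err)
  }
}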


Takeaways

Using Cloud Spanner, BNE could focus on developing an amazing game instead of spending time on database capacity planning and scaling. Operations-wise, by using a fully managed scalable database, they drastically reduced the risk of human error as well as operational overhead.

Using Cloud Networking, they leveraged Google’s dedicated network to offer the best user experience to their fans, even when fighting across regions.

And finally, using Google’s analytics stack (Cloud Pub/Sub, Cloud Dataflow and BigQuery), BNE was able to analyze players’ behaviors in near real-time and make decisions about how to adjust the game to make their fans even happier!

If you want to hear more details about how they evaluated and adopted Cloud Spanner for their game, please join them at their Google Cloud NEXT’18 session in San Francisco.

Introducing QUIC support for HTTPS load balancing



For four years now, Google has been using QUIC, a UDP-based encrypted transport protocol optimized for HTTPS, to deliver traffic for our products – from Google Web Search, to YouTube, to this very blog. If you’re reading this in Chrome, you’re probably using QUIC right now. QUIC makes the web faster, particularly for slow connections, and now your cloud services can enjoy that speed: today, we’re happy to be the first major public cloud to offer QUIC support for our HTTPS load balancers.

QUIC’s key features include establishing connections faster, stream-based multiplexing, improved loss recovery, and no head-of-line blocking. QUIC is designed with mobility in mind, and supports migrating connections from Wi-Fi to cellular and back.

Benefits of QUIC


If your service is sensitive to latency, QUIC will make it faster because of the way it establishes connections. When a web client uses TCP and TLS, it requires two to three round trips with a server to establish a secure connection before the browser can send a request. With QUIC, if a client has talked to a given server before, it can start sending data without any round trips, so your web pages will load faster. How much faster? On a well-optimized site like Google Search, connections are often pre-established, so QUIC’s faster connections can only speed up some requests—but QUIC still improves mean page load time by 8% globally, and up to 13% in regions where latency is higher.

Cedexis benchmarked our Cloud CDN performance using a Google Cloud project. Here’s what happened when we enabled QUIC.

Encryption is built into QUIC, using AEAD algorithms such as AES-GCM and ChaCha20 for both privacy and integrity. QUIC authenticates the parts of its headers that it doesn’t encrypt, so attackers can’t modify any part of a message.

Like HTTP/2, QUIC multiplexes multiple streams into one connection, so that a connection can serve several HTTP requests simultaneously. But HTTP/2 uses TCP as its transport, so all of its streams can be blocked when a single TCP packet is lost—a problem called head-of-line blocking. QUIC is different: Loss of a UDP packet within a QUIC connection only affects the streams contained within that packet. In other words, QUIC won’t let a problem with one request slow the others down, even on an unreliable connection.

Enabling QUIC

You can enable QUIC in your load balancer with a single setting in the GCP Console. Just edit the frontend configuration for your load balancer and enable QUIC negotiation for the IP and port you want to use, and you’re done.

You can also enable QUIC using gcloud:
gcloud compute target-https-proxies update proxy-name \
    --quic-override=ENABLE
Once you’ve enabled QUIC, your load balancer negotiates QUIC with clients that support it, like Google Chrome and Chromium. Clients that do not support QUIC continue to use HTTPS seamlessly. If you distribute your own mobile client, you can integrate Cronet to gain QUIC support. The load balancer translates QUIC to HTTP/1.1 for your backend servers, just like traffic with any other protocol, so you don’t need to make any changes to your backends—all you need to do is enable QUIC in your load balancer.

The Future of QUIC

We’re working to help QUIC become a standard for web communication, just as we did with HTTP/2. The IETF formed a QUIC working group in November 2016, which has seen intense engagement from IETF participants, and is scheduled to complete v1 drafts this November. QUIC v1 will support HTTP over QUIC, use TLS 1.3 as the cryptographic handshake, and support migration of client connections. At the working group’s most recent interop event, participants presented over ten independent implementations.

QUIC is designed to evolve over time. A client and server can negotiate which version of QUIC to use, and as the IETF QUIC specifications become more stable and members reach clear consensus on key decisions, we’ve used that version negotiation to keep pace with the current IETF drafts. Future planned versions will also include features such as partial reliability, multipath, and support for non-HTTP applications like WebRTC.

QUIC works across changing network connections. QUIC can migrate client connections between cellular and Wi-Fi networks, so requests don’t time out and fail when the current network degrades. This migration reduces the number of failed requests and decreases tail latency, and our developers are working on making it even better. QUIC client connection migration will soon be available in Cronet.

Try it out today

Read more about QUIC in the HTTPS load balancing documentation and enable it for your project(s) by editing your HTTP(S) load balancer settings. We look forward to your feedback!

Google Cloud using P4Runtime to build smart networks



Data networks are difficult to design, build and manage, and often don’t work as well as we would like. Here at Google, we deploy and use a lot of network capacity in and between data centers to deliver our portfolio of services, and the costs and burdens of deploying and managing these networks have only grown with their scale and complexity. Almost ten years ago, we took steps to address this by adopting software-defined networking (SDN) as the basis for our network architecture. SDN allowed us to program our networks with software running on standard servers and became a fundamental component of our largest systems. In that time, we’ve continued to develop and improve our SDN technology, and now it’s time to take the next step with P4Runtime.

We are excited to announce our collaboration with the Open Networking Foundation (ONF) on Stratum, an open source project to implement an open reference platform for a truly "software-defined" data plane, designed and built around P4Runtime from the beginning. P4Runtime allows the SDN control plane to establish a contract with the data plane about forwarding behavior, and then to program that behavior through simple RPCs. As part of the project, we’re working with network vendors to make this functionality available in networking products across the industry. As a small-but-complete SDN embedded software solution, Stratum will help bring P4Runtime to a variety of network devices.

But just what is it about P4Runtime that helps with the challenges of building large-scale and reliable networks? Network hardware is typically closed, runs proprietary software and is complex, thanks to the need to operate autonomously and run legacy protocols. Modern data centers and wide-area networks are large, must be fast and simple and are often built using commodity network switch chips interconnected into a large fabric. And despite high-quality whitebox switches and open SDN technology such as OpenFlow, there still aren’t a lot of good, portable options on the market to build these networks.

At Google, we designed our own hardware switches and switch software, but our goal has always been to leverage industry SDN solutions that interoperate with our data centers and wide-area networks. P4Runtime is a new way for control plane software to program the forwarding path of a switch and provides a well-defined API to specify the switch forwarding pipelines, as well as to configure these pipelines via simple RPCs. P4Runtime can be used to control any forwarding plane, from a fixed-function ASIC to a fully programmable network switch.

Google Cloud is looking to P4Runtime as the foundation for our next generation of data centers and wide area network control-plane programming, to drive industry adoption and to enable others to benefit from it. With P4Runtime we’ll be able to continue to build the larger, higher performance and smarter networks that you’ve come to expect.

Three ways to configure robust firewall rules



If you administer firewall rules for Google Cloud VPCs, you want to ensure that the firewall rules you create can only be associated with the correct VM instances by developers in your organization. Without that assurance, it is difficult to manage access to sensitive content hosted on VMs in your VPCs or to allow those instances access to the internet, and you must carefully audit and monitor the instances to ensure that unintended access is not granted through the use of tags. With Google VPC, there are now multiple ways to achieve the required level of control, which we’ll describe here in detail.

As an example, imagine you want to create a firewall rule to restrict access to sensitive user billing information in a data store running on a set of VMs in your VPC. Further, you’d like to ensure that developers who can create VMs for applications other than the billing frontend cannot enable these VMs to be governed by firewall rules created to allow access to billing data.
Example topology of a VPC setup requiring secure firewall access.
The traditional approach here is to attach tags to VMs and create a firewall rule that allows access to specific tags, e.g., in the above example you could create a firewall rule that allows all VMs with the billing-frontend tag access to all VMs with the billing-data tag. The drawback of this approach is that any developer with the Compute InstanceAdmin role for the project can attach billing-frontend as a tag to their VM, and thus unintentionally gain access to sensitive data.

Configuring Firewall rules with Service Accounts


With the general availability of firewall rules using service accounts, instead of using tags, you can block developers from enabling a firewall rule on their instances unless they have access to the appropriate centrally managed service accounts. Service accounts are special Google accounts that belong to your application or service running on a VM and can be used to authenticate the application or service for resources it needs to access. In the above example, you can create a firewall rule to allow access to the billing-data@ service account only if the originating source service account of the traffic is billing-frontend@.
Firewall setup using source and target service accounts. (Service account names are abbreviated for simplicity.)
You can create this firewall rule using the following gcloud command:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-service-accounts [email protected] \
    --target-service-accounts [email protected]
If, in the above example, the billing frontend and billing data applications are autoscaled, you can specify the service accounts for the corresponding applications in the InstanceTemplate configured for creating the VMs.

The advantage of using this approach is that once you set it up, the firewall rules may remain unchanged despite changes in underlying IAM permissions. However, you can currently only associate one service account with a VM and to change this service account, the instance must be in a stopped state.

Creating custom IAM role for InstanceAdmin


If you want the flexibility of tags but the limitations of service accounts are a concern, you can create a custom role with more restricted permissions that removes the ability to set tags on VMs; do this by omitting the compute.instances.setTags permission. This custom role can have the other permissions present in the InstanceAdmin role and can then be assigned to developers in the organization. With this custom role, you can create your firewall rules using tags:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-tags billing-frontend \
    --target-tags billing-data
Note, however, that permissions assigned to a custom role are static in nature and must be updated with any new permissions that might be added to the InstanceAdmin role, as and when required.

Using subnetworks to partition workloads


You can also create firewall rules using source and destination IP CIDR ranges if the workloads can be partitioned into subnetworks of distinct ranges as shown in the example diagram below.
Firewall setup using source and destination ranges.
In order to restrict developers’ ability to create VMs in these subnetworks, you can grant the Compute Network User role selectively to developers on specific subnetworks, or use Shared VPC.

Here’s how to configure a firewall rule with source and destination ranges using gcloud:
gcloud compute firewall-rules create secure-billing-data \
    --network web-network \
    --allow TCP:443 \
    --source-ranges 10.20.0.0/16 \
    --destination-ranges 10.30.0.0/16
This method allows for better scalability with large VPCs and allows for changes in the underlying VMs as long as the network topology remains unchanged. Note, however, that if a VM instance has can_ip_forward enabled, it may send traffic using the above source range and thus gain access to sensitive workloads.

As you can see, there’s a lot to consider when configuring firewall rules for your VPCs. We hope these tips help you configure firewall rules in a more secure and efficient manner. To learn more about configuring firewall rules, check out the documentation.

Simplify Cloud VPC firewall management with service accounts



Firewalls provide the first line of network defense for any infrastructure. On Google Cloud Platform (GCP), Google Cloud VPC firewalls do just that—controlling network access to and between all the instances in your VPC. Firewall rules determine who's allowed to talk to whom and more importantly who isn’t. Today, configuring and maintaining IP-based firewall rules is a complex and manual process that can lead to unauthorized access if done incorrectly. That’s why we’re excited to announce a powerful new management feature for Cloud VPC firewall management: support for service accounts.

If you run a complex application on GCP, you’re probably already familiar with service accounts in Cloud Identity and Access Management (IAM) that provide an identity to applications running on virtual machine instances. Service accounts simplify the application management lifecycle by providing mechanisms to manage authentication and authorization of applications. They provide a flexible yet secure mechanism to group virtual machine instances with similar applications and functions with a common identity. Security and access control can subsequently be enforced at the service account level.


With service accounts, when a cloud-based application scales up or down, new VMs are automatically created from an instance template and assigned the correct service account identity. This way, when a VM boots up, it gets the right set of permissions within the relevant subnet, and firewall rules are automatically configured and applied.

Further, the ability to use Cloud IAM ACLs with service accounts allows application managers to express their firewall rules in the form of intent, for example, allow my “application x” servers to access my “database y.” This removes the need to manually manage server IP address lists while simultaneously reducing the likelihood of human error.
This process is leaps-and-bounds simpler and more manageable than maintaining IP address-based firewall rules, which can neither be automated nor templated for transient VMs with any semblance of ease.

Here at Google Cloud, we want you to deploy applications with the right access controls and permissions, right out of the gate. Click here to learn how to enable service accounts. And to learn more about Cloud IAM and service accounts, visit our documentation for using service accounts with firewalls.