Tag Archives: Storage & Databases

Introducing Jib — build Java Docker images better



Containers are bringing Java developers closer than ever to a "write once, run anywhere" workflow, but containerizing a Java application is no simple task: You have to write a Dockerfile, run a Docker daemon as root, wait for builds to complete, and finally push the image to a remote registry. Not all Java developers are container experts; what happened to just building a JAR?

To address this challenge, we're excited to announce Jib, an open-source Java containerizer from Google that lets Java developers build containers using the Java tools they know. Jib is a fast and simple container image builder that handles all the steps of packaging your application into a container image. It does not require you to write a Dockerfile or have docker installed, and it is directly integrated into Maven and Gradle—just add the plugin to your build and you'll have your Java application containerized in no time.

Docker build flow:

Jib build flow:


How Jib makes development better:


Jib takes advantage of layering in Docker images and integrates with your build system to optimize Java container image builds in the following ways:
  1. Simple - Jib is implemented in Java and runs as part of your Maven or Gradle build. You do not need to maintain a Dockerfile, run a Docker daemon, or even worry about creating a fat JAR with all its dependencies. Since Jib tightly integrates with your Java build, it has access to all the necessary information to package your application. Any variations in your Java build are automatically picked up during subsequent container builds.
  2. Fast - Jib takes advantage of image layering and registry caching to achieve fast, incremental builds. It reads your build config, organizes your application into distinct layers (dependencies, resources, classes) and only rebuilds and pushes the layers that have changed. When iterating quickly on a project, Jib can save valuable time on each build by only pushing your changed layers to the registry instead of your whole application.
  3. Reproducible - Jib supports building container images declaratively from your Maven and Gradle build metadata, and as such can be configured to create reproducible build images as long as your inputs remain the same.

How to use Jib to containerize your application

Jib is available as plugins for Maven and Gradle and requires minimal configuration. Simply add the plugin to your build definition and configure the target image. If you are building to a private registry, make sure to configure Jib with credentials for your registry. The easiest way to do this is to use credential helpers like docker-credential-gcr. Jib also provides additional rules for building an image to a Docker daemon if you need it.

Jib on Maven
<plugin>
  <groupId>com.google.cloud.tools</groupId>
  <artifactId>jib-maven-plugin</artifactId>
  <version>0.9.0</version>
  <configuration>
    <to>
      <image>gcr.io/my-project/image-built-with-jib</image>
    </to>
  </configuration>
</plugin>
# Builds to a container image registry.
$ mvn compile jib:build
# Builds to a Docker daemon.
$ mvn compile jib:dockerBuild
Jib on Gradle
plugins {
  id 'com.google.cloud.tools.jib' version '0.9.0'
}
jib.to.image = 'gcr.io/my-project/image-built-with-jib'
# Builds to a container image registry.
$ gradle jib
# Builds to a Docker daemon.
$ gradle jibDockerBuild

We want everyone to use Jib to simplify and accelerate their Java development. Jib works with most cloud providers; try it out and let us know what you think at github.com/GoogleContainerTools/jib.

Announcing MongoDB Atlas free tier on GCP



Earlier this year, in response to strong customer demand, we announced that we were expanding region support for MongoDB Atlas. The MongoDB NoSQL database is hugely popular, and the MongoDB Atlas cloud version makes it easy to manage on Google Cloud Platform (GCP). We heard great feedback from users, so we’re further lowering the barrier to get started on MongoDB Atlas and GCP.

We’re pleased to announce that as of today, MongoDB will offer a free tier of MongoDB Atlas on GCP in three supported regions, strategically located in North America, Europe and Asia Pacific in recognition of our wide user install base.

The free tier gives developers a no-cost sandbox environment for MongoDB Atlas on GCP. You can test potential MongoDB workloads on the free tier and upgrade to a larger paid Atlas cluster once you have confidence in the cloud products and their performance.

As of today, these specific regions are supported by the Atlas free tier:
  1. Iowa (us-central1)
  2. Belgium (europe-west1)
  3. Singapore (asia-southeast1)
To get started, you’ll just need to log in to your MongoDB console, select “Build a New Cluster,” pick “Google Cloud Platform,” and look for the “Free Tier Available” message. The free tier utilizes MongoDB’s M0 instances. An M0 cluster is a sandbox MongoDB environment for prototyping and early development with 512MB of storage space. It also comes with strong enterprise features such as always-on authentication, end-to-end encryption and high availability, as well as monitoring. Happy experimenting!


Bust a move with Transfer Appliance, now generally available in the U.S.



As we celebrate the upcoming Los Angeles Google Cloud Platform (GCP) region in one of the creative centers of the world, we are excited to share news about a product that can help you get your data there as fast as possible. Google Transfer Appliance is now generally available in the U.S., with a few new features that will simplify moving data to Google Cloud Storage. Customers have been using Transfer Appliance for almost a year, and we’ve heard great feedback.

The Transfer Appliance is a high-capacity server that lets you transfer large amounts of data to GCP, quickly and securely. It’s recommended if you’re moving more than 20TB of data, or data that would take more than a week to upload.

You can now request a Transfer Appliance directly from your Google Cloud Platform console. Indicate the amount of data you’re looking to transfer, and our team will help you choose the version that is the best fit for your needs.

The service comes in two configurations: 100TB or 480TB of raw storage capacity. We see typical data compression rates of 2x the raw capacity. The 100TB model is priced at $300, plus express shipping (approximately $500); the 480TB model is priced at $1,800, plus shipping (approximately $900).

You can mount Transfer Appliance as an NFS volume, making it easy to drag and drop files, or rsync them, from your current NAS to the appliance. This feature simplifies the transfer of file-based content to Cloud Storage, and helps our migration partners expedite the move for customers.
"SADA Systems provides expert cloud consultation and technical services, helping customers get the most out of their Google Cloud investment. We found Transfer Appliance helps us transition the customer to the cloud faster and more efficiently by providing a secure data transfer strategy."
-Simon Margolis, Director of Cloud Platform, SADA Systems
Transfer Appliance can also help you transition your backup workflow to the cloud quickly. To do that, move the bulk of your current backup data offline using Transfer Appliance, and then incrementally back up to GCP over the network from there. Partners like Commvault can help you do this.

With this release, you’ll also find a more visible end-to-end integrity check, so you can be confident that every bit was transferred as is, and have peace of mind in deleting source data.

Transfer Appliance in action

In developing Transfer Appliance, we built a device designed for the data center, so it slides into a standard 19” rack. That has been a positive experience for our early customers, even those with floating data centers (yes, actually floating; see below for more).

We’ve seen our customers successfully use Transfer Appliance for the following use cases:
  • Migrate your data center (or parts of it) to the cloud.
  • Kick-start your ML or analytics project by transferring test data and staging it quickly.
  • Move large archives of content like creative libraries, videos, images, regulatory or backup data to Cloud Storage.
  • Collect data from research bodies or data providers and move it to Google Cloud for analysis.
We’ve heard about lots of innovative, interesting data projects powered by Transfer Appliance. Here are a few of them.

One early adopter, Schmidt Ocean Institute, is a private non-profit foundation that combines advanced science with state-of-the-art technology to achieve lasting results in ocean research. Their goals are to catalyze sharing of information and to communicate this knowledge to audiences around the world. For example, the Schmidt Ocean Institute owns and operates research vessel Falkor, the first oceanographic research vessel with a high-performance cloud computing system installed onboard. Scientists run models and software and can plan missions in near-real time while at sea. With the state-of-the-art technologies onboard, scientists contribute scientific data to the oceanographic community at large, very quickly. Schmidt Ocean Institute uses Transfer Appliance to safely get the data back to shore and publicly available to the research community as fast as possible.

“We needed a way to simplify the manual and complex process of copying, transporting and mailing hard drives of research data, as well as making it available to the scientific community as quickly as possible. We are able to mount the Transfer Appliance onboard to store the large amounts of data that result from our research expeditions and easily transfer it to Google Cloud Storage post-cruise. Once the data is in Google Cloud Storage, it’s easy to disseminate research data quickly to the community.”
-Allison Miller, Research Program Manager, Schmidt Ocean Institute

Beatport, a division of LiveStyle, serves an audience of electronic music DJs, producers and their fans. Google Transfer Appliance afforded Beatport the opportunity to rethink their storage architecture in the cloud without affecting their customer-facing network in the process.

“DJs, music producers and fans all rely on Beatport as the home for the world’s electronic music. By moving our library to Google Cloud Storage, we can access our audio data with the advanced tools that Google Cloud Platform has to offer. Managing tens of millions of lossless quality files poses unique challenges. Migrating to the highly performant Cloud Storage puts our wealth of audio data instantly at the fingertips of our technology team. Transfer Appliance made that move easier for our team.”
-Jonathan Steffen, CIO, Beatport
Eleven Inc. creates content, brand experiences and customer activation strategies for clients across the globe. Through years of work for their clients, Eleven built a large library of creative digital assets and wanted a way to cost-effectively store that data in the cloud. Facing ISP network constraints and a desire to free up space on their local asset server quickly, Eleven Inc. used Transfer Appliance to facilitate their migration.

“Working with Transfer Appliance was a smooth experience. Rack, capture and ship. And now that our creative library is in Google Cloud Storage, it's much easier to think about ways to more efficiently manage the data throughout its life-cycle.”
-Joe Mitchell, Director of Information Systems
amplified ai combines extensive IP industry experience with deep learning to offer instant patent intelligence to inventors and attorneys. This requires a lot of patent data for building models. Transfer Appliance helped amplified ai move terabytes of this specialized, essential data to the cloud quickly.

“My hands are already full building deep learning models on massive, disparate data without also needing to worry about physically moving data around. Transfer Appliance was easy to understand, easy to install, and made it easy to capture and transfer data. It just did what it was supposed to do and saved me time which, for a busy startup, is the most valuable asset.”
-Chris Grainger, Founder & CTO, amplified ai
Airbus Defence and Space Geo Inc. uses its exclusive access to radar and optical satellites to offer a stunning library of Earth observation imagery. As part of a major cloud migration effort, Airbus moved hundreds of TBs of this data to the cloud with Transfer Appliance so it can better serve images to clients from Cloud Storage. Using Transfer Appliance also gave Airbus the opportunity to improve data quality as part of the migration.

“We needed to liberate. To flex on demand and scale in the cloud, and unleash our creativity. Transfer Appliance was a catalyst for that. In addition to migrating an amount of data that would not have been possible over the network, this transfer gave us the opportunity to improve our storage in the process—to clean out the clutter.”
-Dave Wright, CTO, Airbus Defence and Space Geo Inc.


National Collegiate Sports Archives (NCSA) is the creator and owner of the VAULT, which contains years’ worth of college sports footage. NCSA digitizes archival sports footage from leading schools and delivers it via mobile, advertising and social media platforms. With a lot of precious footage to deliver to college sports fans around the globe, NCSA needed a way to move data into Google Cloud Platform quickly and with zero disruption for their users.

“With a huge archive of collegiate sports moments, we wanted to get that content into the cloud and do it in a way that provides value to the business. I was looking for a solution that would cost-effectively, simply and safely execute the transfer and let our teams focus on improving the experience for our users. Transfer Appliance made it simple to capture data in our data center and ship it to Google Cloud. ”
-Jody Smith, Technology Lead, NCSA

Tackle your data migration needs with Transfer Appliance

To get detailed information on Transfer Appliance, check out our documentation. And visit our Data Transfer page to learn more about our other cloud data transfer options.

We’re looking forward to bringing Transfer Appliance to regions outside of the U.S. in the coming months. But we need your help: Where should we deploy first? If you are interested in offline data transfer but not located in the U.S., please let us know in the request form.

If you’re interested in learning more about cloud data migration strategies, check out this session at Next 2018 next month. For more information, and to register, visit the Next ‘18 website.

Building scalable web applications with Cloud Datastore — new solution



If you manage database systems for large web applications, your job can be quite challenging. When unforeseen situations arise, making configuration changes can be complex and risky due to the stateful nature of those database systems. And before launching a new application, you have to do a lot of capacity planning, such as estimating the number of virtual machines (VMs), the amount of disk storage and the optimal network configuration, all while contending with unknown factors such as the volume and frequency of open database connections and evolving usage patterns. You also need to do regular maintenance work to upgrade database software and scale resources to meet growing demand.

All of this planning and maintenance takes time, money, and attention away from developing new application features, so it is important to find a balance between provisioning enough resources to handle heavy loads and overspending on unused resources.

Cloud Datastore can help minimize these challenges by providing a scalable, highly available, high-performance, and fully-managed NoSQL database system.

We recently published an article that presents an overview of how to build large web applications with Cloud Datastore. The article includes scenarios of full-fledged web applications that use Cloud Datastore jointly with other products in the Google Cloud Platform (GCP) ecosystem.

Check out the article for all the details and next steps for building your own scalable solutions using Cloud Datastore!

What DBAs need to know about Cloud Spanner, part 1: Keys and indexes



Cloud Spanner is a relational, horizontally scalable database service built from a cloud/distributed design perspective. It brings efficiency and high availability to developers and database administrators (DBAs), and differs structurally from the typical databases you’re used to. In this blog series, we’ll explore some of the key differences that DBAs and developers will encounter as you migrate from traditional vertically scaling (scale-up) relational database management systems (RDBMS) to Cloud Spanner. We will discuss some of the dos and don'ts, best practices and why things are different in Cloud Spanner.

In this series, we will explore a range of topics, including:
  • Selection of keys and use of indexes
  • How to approach business logic
  • Importing and exporting data
  • Migrating from your existing RDBMS
  • Optimizing performance
  • Access control and logging
You’ll gain an understanding of how to best use Cloud Spanner and release its potential to achieve linearly scalable performance over massive databases. In this first installment, let’s start with a closer look at how the concepts of keys and indexes work in Cloud Spanner.

Choosing keys in Cloud Spanner

Just like in other databases, the choice of key is vitally important to optimize the performance of the database. It’s even more important in Cloud Spanner, due to the way its mechanisms distribute database load. Unlike traditional RDBMS, you’ll need to take care when choosing the primary keys for the tables and choosing which columns to index.

Using well-distributed keys results in a table whose size and performance can scale linearly with the number of Cloud Spanner nodes, while using poorly distributed keys can result in hotspots, where a single node is responsible for the majority of reads and writes to the table.

In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.

Cloud Spanner is a distributed system, with many nodes reading and writing to the database at any one time. However, to achieve scalability, along with global ACID transactions and strong consistency, only one node at any one time can have write responsibility for a given row.

Cloud Spanner distributes management of rows across multiple nodes by breaking up each table into several splits, using ranges of the lexicographically sorted primary key.

This enables Cloud Spanner to achieve high availability and scalability, but it also means that using any continuously increasing or decreasing sequence as a key is detrimental to performance. To explain why, let’s explore how Cloud Spanner creates and manages its table splits.

Table splits and key choice

Cloud Spanner manages splits using Paxos (you can learn more about how in this detailed documentation: Life of Cloud Spanner Reads & Writes and Spanner: Google's Globally Distributed Database). In a regional instance of Cloud Spanner, the responsibility for reading/writing each split is distributed across a group of three nodes, one in each of the three availability zones of the Cloud Spanner instance.

One node in this group of three is elected Split Leader and manages the writes and locks for all the rows in the split. All three nodes in the group can perform reads.

To create a visual example, consider a table with 600 rows that uses a simple, continuous, monotonically increasing integer key (as is common in traditional RDBMS), broken into six splits, running on a two-node (per zone) Cloud Spanner instance. In an ideal situation, the table will have six splits, the leaders of which will be the six individual nodes available in the instance.

This distribution would result in ideal performance when reading/updating the rows, provided that the reads and updates are evenly distributed across the key range.

Problems with hotspots


The problem arises when new rows are appended to the database. Every new row will have an incrementing ID and will be added to the last split, meaning that out of the six available nodes, only the one leading that last split will be handling all of the writes. That node then becomes a hotspot, limiting the overall write performance of the database. In addition, the row distribution becomes unbalanced, with the last split growing significantly larger, so it also ends up handling more row reads than the others.

Cloud Spanner does try to compensate for uneven load by adding and removing splits in the background according to read and write load, and by creating a new split once the split size crosses a set threshold. However, in a frequently appended table, this will not happen quickly enough to avoid creating a hotspot.

Along with monotonically increasing or decreasing keys, this issue also affects tables that are indexed by any deterministic key—for example, an increasing timestamp in an event log table. Timestamp-keyed tables are also more likely to have a read hotspot because, in most cases, recently timestamped rows are accessed more frequently than the others. (Check out Cloud Spanner — Choosing the right primary keys for detailed information on detecting and avoiding hotspots.)

Problems with sequence generators

The concept of sequence generators, or lack thereof, is an important area to explore further. Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.

One workaround could be to generate the key in the application (for example, by storing the next key value in a separate table in the database, or by getting the current maximum key value from the table). However, you’ll run into the same performance problems. Because the application is also likely to be distributed, there may be multiple database clients trying to append a row at the same time, with two potential results depending on how the new key is generated (see the sketch after this list):
  • If the SELECT for the existing key is performed in the transaction, one application instance trying to append would block all other application instances trying to append due to row locking.
  • If the SELECT for the existing key is done outside of the transaction, then there is a race between each of the application instances trying to append the new row. One will succeed, while others would have to retry (including generating a new key) after the append fails, since the key already exists.
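To make this concrete, here is a minimal sketch of the "current maximum key" approach described above (the Orders table and its columns are hypothetical, and the statements are written as DML purely for illustration):
-- Read the current maximum key...
SELECT MAX(Id) AS current_max FROM Orders;
-- ...then insert current_max + 1 (assume current_max was 1000).
-- Inside the transaction, this serializes all appenders on the same rows;
-- outside it, two clients can read the same maximum and collide on insert.
INSERT INTO Orders (Id, OrderInfo) VALUES (1001, 'example');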

What makes a good key

So if sequential keys will limit database performance in Cloud Spanner, what’s a good key to use? Ideally, the high-order bits should be evenly and semi-randomly distributed when keys are generated.

One simple way to generate such a key is to use random numbers, such as a random universally unique identifier (UUID). Note that there are several classes of UUID. Versions 1 and 2 use deterministic prefixes, such as timestamps or MAC addresses. Ensure that the UUID generation method you use is truly randomly distributed, i.e., v4, at least over the higher-order bytes. This will ensure that the keys are evenly distributed among the keyspace, and hence that the load is distributed evenly over the Cloud Spanner nodes.
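For illustration, a minimal sketch of a table keyed by an application-generated v4 UUID stored as a string might look like this (the table and column names are hypothetical):
CREATE TABLE Customers (
     CustomerId STRING(36) NOT NULL,  -- v4 UUID generated by the application
     Name STRING(MAX),
     CreatedAt TIMESTAMP NOT NULL
) PRIMARY KEY (CustomerId)
Because a v4 UUID is random in its high-order bytes, newly inserted rows land on different splits instead of always being appended to the end of the key range.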

Although another approach might be to use some real-world attributes of the data that are immutable and evenly distributed over the key range, this is quite a challenge since most uniformly distributed attributes are discrete and not continuous. For example, the random result of a dice roll is uniformly distributed and has six finite values. A continuous distribution could rely on an irrational number, for example π.

What if I really need an integer sequence as a key?

Though it’s not recommended, in some circumstances an integer sequence key is necessary, either for legacy or for external reasons, e.g., an employee ID.

To use an integer sequence key, you’ll first need a sequence generator that’s safe across a distributed system. One way of doing this is to have a table in Cloud Spanner contain a row for each required sequence that contains the next value in the sequence—so it would look something like this:
CREATE TABLE Sequences (
     Sequence_ID STRING(MAX) NOT NULL, -- The name of the sequence
     Next_Value INT64 NOT NULL
) PRIMARY KEY (Sequence_ID)
When a new ID value is required, the next value of the sequence is read, incremented and updated in the same transaction as the insert for the new row.
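As a rough sketch, the flow looks like the following (the Staff table, its columns and the 'staff_id' sequence name are hypothetical; the statements are shown as DML, and all three must run in the same read-write transaction):
-- 1. Read the next value for this sequence.
SELECT Next_Value FROM Sequences WHERE Sequence_ID = 'staff_id';
-- 2. Reserve it by incrementing the stored value.
UPDATE Sequences SET Next_Value = Next_Value + 1 WHERE Sequence_ID = 'staff_id';
-- 3. Insert the new row using the value read in step 1 (assume it was 1001).
--    In practice, pair this Id with a hash prefix as described below.
INSERT INTO Staff (Id, FullName) VALUES (1001, 'Jane Doe');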

Note that this will limit performance when many rows are inserted, as each insert will block all other inserts due to the update of the Sequences table that we created above.

This performance issue can be reduced—though at the cost of possible gaps in the sequence—if each application instance reserves a block of, for example, 100 sequence values at once by incrementing Next_Value by 100, and then manages issuing individual IDs from that block internally.

In the table using the sequence, the key cannot simply be the numeric sequence value itself, as that will cause the last split to become a hotspot (as explained previously). So the application must generate a complex key that randomly distributes the rows among the splits.

This is known as application-level sharding and is achieved by prefixing the sequential ID with an additional column containing a value that’s evenly distributed among the key space—e.g., a hash of the original ID, or bit-reversing the ID. That looks something like this:
CREATE TABLE Table1 (
     Hashed_Id INT64 NOT NULL, 
     Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Hashed_Id, Id)
Even a simple cyclic redundancy check (CRC32) checksum is good enough to provide a suitably pseudo-random Hashed_Id. It does not have to be cryptographically secure, just enough to randomize the row order of the sequentially numbered keys.

Note that whenever a row is read directly, both the ID and Hashed_Id must be specified to prevent a table scan, as in this example:
SELECT * FROM Table1 t1
WHERE t1.Hashed_Id = 0xDEADBEEF
      AND t1.Id = 1234
Similarly, whenever this table is joined with other tables in the query by Id, the join must also use both the ID and the Hashed_Id. Otherwise, you’ll lose performance, since a table scan will be required to find the row. This means that the table that references the ID must also include the Hashed_Id, like this:
CREATE TABLE Table2 (
     Id String(MAX),  -- UUID
     Table1_Hashed_Id INT64 NOT NULL, 
     Table1_Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Id)

SELECT * from Table2 t2 INNER JOIN Table1 t1 
     ON t1.Hashed_Id = t2.Table1_Hashed_Id
     AND t1.Id = t2.Table1_Id
WHERE ... -- some criteria

What if I really need to use a timestamp as a key?

In many cases, the row using the timestamp as a key also refers to some other table data. For example, the transactions on a bank account will refer to the source account. In this case, assuming that the source account number is already reasonably evenly distributed, you can use a complex key containing the account number first and then the timestamp:
CREATE TABLE Transactions (
     account_number INT64 NOT NULL,
     timestamp TIMESTAMP NOT NULL,
     transaction_info ...,
) PRIMARY KEY (account_number, timestamp DESC)
The splits will be made primarily using the account number and not the timestamp, thus distributing the newly added rows over various splits.

Note that in this table, the timestamp is sorted by descending order. That’s because in most cases you want to read the most recent transactions—which will be first in the table—so you won’t need to scan through the entire table to find the most recent rows.

If you do not have, or cannot use, an external reference or any other data in the key to distribute the ordering, you will need to perform application-level sharding, as shown in the integer sequence example above.

Note, however, that using a simple hash will make queries by timestamp range extremely slow, since retrieving a range of timestamps will require a full table scan to cover all the hashes. Instead, we recommend generating a ShardId from the timestamp. So, for example,
TimestampShardId = CRC32(Timestamp) % 100
will return a pseudo-random value between 0 and 99 from the timestamp. Then, you can use this ShardId in the table key so that sequential timestamps are distributed across multiple splits, like so:
CREATE TABLE Events (
     TimestampShardId INT64 NOT NULL,
     Timestamp TIMESTAMP NOT NULL,
     event_info...
) PRIMARY KEY (TimestampShardId, Timestamp DESC)
For example, rows with timestamps from the first 10 days of 2018 (which without the ShardId would be stored in strict date order) are instead ordered by TimestampShardId first, so sequential timestamps end up scattered across the key range and its splits.

When a query is made, you must use a BETWEEN clause to be able to select across all shards without performing a table scan:
Select * from Events
WHERE
   TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
Note that the ShardId is only a way of improving key distribution so that Cloud Spanner can use multiple splits to store sequential timestamps. It does not identify an actual database split, and rows in different tables with the same ShardId may well be in different splits.

Migration implications

When you’re migrating from an existing RDBMS that uses keys that are not optimal for Cloud Spanner, take the above considerations into account. If necessary, add key hashes to tables or change the key ordering.

Deciding on indexes in Cloud Spanner

In a traditional RDBMS, indexes are very efficient ways of looking up rows in a table by a value that is not the primary key. Under most circumstances, a row lookup via an index will take approximately the same time as a row lookup via its key. That’s because the table and the index are managed by a single node, so the index can point directly to the on-disk row of the table.

In Cloud Spanner, indexes are actually implemented using tables, which allows them to be distributed and enables the same degree of scalability and performance as normal tables.

However, because of this type of implementation, using indexes to read the data from the table row is less efficient than in a traditional RDBMS. It’s effectively an inner join with the original table, so reading from a table using an indexed key turns into this process:
  • Look up split for index key
  • Read index row from split to get table key
  • Look up split for table key
  • Read table row from split to get row values
  • Return row values
Note that there is no guarantee that the split for the index key is on the same node as the split for the table key, so a simple index query may require cross-node communication just to read one row.

Similarly, updating an indexed table will most likely require a multi-node write to update the table row and the index row. So using an index in Cloud Spanner is always a trade-off between improved read performance and reduced write performance.

Index keys and hotspots

Because indexes are implemented as tables in Cloud Spanner, you’ll encounter the same issues with the indexed columns as you did with the table keys: An index on a column with poorly distributed values (such as a timestamp) will lead to the creation of a hotspot, even if the underlying table is using well-distributed keys. That’s because when rows are appended to the table, the index will also have new rows appended, and writes for these new rows will always be sent to the same split.

Therefore, care must be taken when creating indexes, and we recommend that you create indexes only using columns which have a well-distributed set of values, just like when choosing a table key.

In some cases, you’ll need to do application-level sharding for the indexed columns in order to create a synthetic ShardId column, which can be used in the index to distribute values over the splits.

For example, the configuration below will create a hotspot when appending events because of the index, even if UserId is randomly distributed.
CREATE TABLE Events (
      UserId String(MAX),
      Timestamp TIMESTAMP,
      EventData)
PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (Timestamp DESC);
As with a table keyed by timestamp only, a synthetic ShardId column will need to be added to the table, and then used as the first indexed column to help the distribution of the index among the splits.

A simple ShardId generator could be:
TimestampShardId = CRC32(Timestamp) % 100
which will give a hash value between 0 and 99 from the timestamp. You’ll need to add this to the original table as a new column, then use it as the first index key, like this:
CREATE TABLE Events (
     UserId String(MAX),
     Timestamp TIMESTAMP,
     TimestampShardId INT64, 
     EventData)
PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (TimestampShardId,Timestamp);
This will remove the hotspot on index update, but will slow down range queries on timestamp, since you’ll have to run the query for each ShardId value (0-99) to get the timestamp range from all shards:
Select * from Events@{FORCE_INDEX=EventsByTimestamp}
WHERE
   TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
Using this type of index and sharding strategy requires striking a balance between the additional complexity at read time and the performance gain for indexed queries.

Other indexes you should know

When you’re migrating to Cloud Spanner, you’ll also want to understand how these other index types function and when you might need to use them:

NULL_FILTERED indexes

By default, Cloud Spanner indexes rows even when the indexed column value is NULL. A NULL is considered to be the smallest possible value, so these values will appear at the start of the index.

You can disable this behavior by using the CREATE NULL_FILTERED INDEX syntax, which creates an index that ignores rows whose indexed column values are NULL.

This index will be smaller than the complete index, as it will effectively be a materialized filtered view on the table, and will be faster to query than the full table when a table scan is necessary.
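For example, a null-filtered index that only contains rows where an optional column has a value might be declared like this (the Orders table and its nullable CouponCode column are hypothetical):
CREATE NULL_FILTERED INDEX OrdersByCouponCode
     ON Orders (CouponCode);
Rows where CouponCode is NULL simply do not appear in this index.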

UNIQUE indexes

You can use a UNIQUE index to enforce that a column of a table has unique values. This constraint will be applied at transaction commit time (and at index creation).
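For example, a minimal sketch (the Users table and its Email column are hypothetical):
CREATE UNIQUE INDEX UsersByEmail
     ON Users (Email);
Any insert or update that would produce a duplicate Email value will then fail when its transaction commits.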

Covering Indexes and STORING clause

To optimize performance when reading from indexes, Cloud Spanner can store copies of a table row’s column values in the index itself, removing the need to read the underlying table. This is known as a covering index, and it is achieved by using the STORING clause when defining the index. The stored column values can then be read directly from the index, so an indexed read performs as well as reading from the table. For example, this table contains employee data:
CREATE TABLE Employees (
      CompanyUUID INT64,
      EmployeeUUID INT64,
      FullName STRING(MAX)
            ...
) PRIMARY KEY (CompanyUUID,EmployeeUUID)
If you often need to look up an employee’s full name, for example, you can create an index on EmployeeUUID that stores the full name for rapid lookups:
CREATE INDEX EmployeesById 
      ON Employees (EmployeeUUID) 
      STORING (FullName);

Forcing index use

Cloud Spanner’s query engine will only automatically use indexes in rare circumstances (when the query is fully covered by the index), so it is important to use a FORCE_INDEX directive in the SQL SELECT statement to ensure that Cloud Spanner looks up values from the index. (You can find more details in the documentation.)
Select * 
from  Employees@{FORCE_INDEX=EmployeesById}
Where EmployeeUUID=xxx;
Note that when using the Cloud Spanner Read APIs, you can only perform fully covered queries—i.e., queries where the index stores all of the columns requested. To read other columns from the original table using an index, you must use a SQL query. See the Use a Secondary Index section of the Getting Started docs for examples.

Continuing your Cloud Spanner education

There are some big conceptual differences when you’re using a cloud-built, horizontally scalable database like Cloud Spanner in contrast with the RDBMS you’ve been using for years. Once you’re familiar with the way keys and indexes work, you can start to take advantage of the benefits of Cloud Spanner for faster scalability.

In the next episode of this series, we will look at how to deal with business logic that would previously be implemented by triggers and stored procedures, neither of which exist in Cloud Spanner.

Want to learn more about Cloud Spanner in person? We’ll be discussing data migration and more in this session at Next 2018 in July. For more information, and to register, visit the Next ‘18 website.


Related content:

How we used Cloud Spanner to build our email personalization system—from “Soup” to nuts

A closer look at the HANA ecosystem on Google Cloud Platform



Since we announced our partnership with SAP in early 2017, we’ve rapidly expanded our support for SAP HANA, SAP’s in-memory, column-oriented, relational database management system. From the beginning, we knew we’d need to build tools that integrate SAP HANA with Google Cloud Platform (GCP) that make it faster and easier for developers and administrators to take advantage of the platform.

In this blog post, we’ll walk you through the evolution of SAP HANA on GCP and take a deeper look at the ecosystem we’ve built to support our customers.

The evolution of SAP HANA on GCP


For many enterprise customers running SAP HANA databases, instances with large amounts of memory are essential. That’s why we’ve been working to make virtual machines with larger memory configurations available for SAP HANA workloads.

Our initial work with SAP on the certification process for SAP HANA began in early 2017. In the 15 months since, we’ve rapidly evolved from instances with 208GB of memory to 4TB of memory. This has allowed us to support larger single-node SAP HANA installations of up to 4TB.

Smart Data Access — Google BigQuery

Google BigQuery, our serverless data warehouse, enables low cost, high performance analytics at petabyte scale. We’ve worked with SAP to natively integrate SAP HANA with BigQuery through smart data access which allows you to extend SAP HANA’s capabilities and query data stored within BigQuery by means of virtual tables. This support has been available since SAP HANA 2.0 SPS 03, and you can try it out by following this step-by-step codelabs tutorial.

Fully automated deployment of SAP HANA

In many cases, the manual deployment process can be time consuming, error prone and cumbersome. It’s very important to reduce or eliminate the margin of error and make the deployment conform to SAP’s best practices and standards.


To address this, we’ve launched deployment templates that fully automate the deployment of single node and scale-out SAP HANA configurations on GCP. In addition, we’ve also launched a new deployment automation template that creates a high availability SAP HANA configuration with automatic failover.

With these templates, you have access to fully configured, ready-to-go SAP HANA environments in a matter of minutes. You can also see the resources you created, and a complete catalog of all your deployments, in one location through the GCP console. We’ve also made the deployment process fully visible by providing deployment logs through Google Stackdriver.

Monitoring SAP HANA with Stackdriver

Visibility into what’s happening inside your SAP HANA database can help you identify factors impacting your database, and prepare accordingly. For example, a time series view of how resource utilization or latency within the SAP HANA database changes over time can help administrators plan in advance, and in many cases successfully troubleshoot issues.

Stackdriver provides monitoring, logging, and diagnostics to better understand the health, performance, and availability of cloud-powered applications. Stackdriver’s integration with SAP HANA helps administrators monitor their SAP HANA databases, notifying and alerting them so they can proactively fix issues.

More information on this integration is available in our documentation.

TensorFlow support in SAP HANA

SAP has offered support for TensorFlow Serving beginning with SAP HANA 2.0. This lets you directly build inference into SAP HANA through custom machine learning models hosted in TensorFlow serving applications running on Google Compute Engine.

You can easily build a continuous training pipeline by exporting data in your SAP HANA database to Google Cloud Storage and then using Cloud TPUs to train deep learning models. These models can then be hosted with TensorFlow Serving to be used for inference within SAP HANA.

SAP Cloud Platform, SAP HANA as a Service

SAP HANA as a Service is now also deployable to the Google Cloud Platform. The fully managed cloud service makes it possible for developers to leverage the power of SAP HANA, without spending time on operational and administrative tasks. The service is especially well suited to customers who have the goal of rapid innovation, reducing time to value.

SAP HANA, express edition on Google Cloud Launcher

SAP HANA, express edition is meant for developers and technologists who prefer a hands-on learning experience. A free license is included, giving users access to a large catalog of tutorial content, online courses and samples to get started with SAP HANA. Google Cloud Launcher provides a fast and effective provisioning experience for SAP HANA, express edition.

Conclusion

These developments are all part of our continuing goal to make Google Cloud the best place to run SAP applications. We’ll continue listening to your feedback, and we’ll have more updates to share in the coming months. In the meantime, you can learn more about SAP HANA on GCP by visiting our website. And if you’d like to learn about all our announcements at SAPPHIRE NOW, read our Google Cloud blog post.

Last month today: GCP in May



When it comes to Google Cloud Platform (GCP), every month is chock full of news and information. We’re kicking off a monthly recap of key moments you may have missed.

What caught your attention this month:

Announcements about open source projects were some of the most-read this month.
  • Open-sourcing gVisor, a sandboxed container runtime was by far your favorite post in May. gVisor is a sandbox that lets you run containers in strongly isolated environments. It’s isolated like a virtual machine, but more lightweight and also more flexible, since it interfaces with the host OS just like another process.
  • Our introduction of Asylo, an open-source framework for confidential computing, also got your attention. As more and more sensitive workloads move to cloud, lots of businesses want to be able to verify that they’re properly isolated, inside a closed environment that’s only available to authorized users. Asylo democratizes trusted execution environments (TEEs) by allowing them to run on generic hardware. With Asylo, developers will be able to run their workloads encrypted in a highly secure environment, whether it’s on-premises or in the cloud.
  • Rounding out the open-source fun for the month was our introduction of the beta availability of Cloud Memorystore, a fully managed in-memory data store service for Redis. Cloud Memorystore gives you the caching power of Redis to reduce latency, without having to manage the details.


Hot topics: Kubernetes, DevOps and SRE

Google Kubernetes Engine 1.10 debuted in May, and we had a lot to say about the new features that this version enables—from security to brand-new monitoring functionality via Stackdriver Kubernetes Monitoring to networking. Start with this post to see what’s new and how customers like Spotify are using Kubernetes Engine on Google Cloud.

And one of our recent posts also struck a chord, as two of our site reliability engineering (SRE) experts delved into the differences—and similarities—between SRE and DevOps. They have similar goals, mostly around creating flexible, agile dev environments, but SRE generally gets much more specific and prescriptive than DevOps in accomplishing them.

Under the radar: GCP adds infrastructure options

As you look for new ways to use GCP to run your business, our engineers are adding features and new releases to give you more power, resources and coverage.

First, we introduced ultramem Google Compute Engine machine types, which offer more memory and compute resources than any other Compute Engine VM instance. These machine types are especially useful for those of you running enterprise workloads that need a lot of memory, like data analytics or high-performance applications.

We’ve also been busy on the back-end in other ways too, as we continue adding new regional cloud computing infrastructure. Our third zone of the Singapore region opened in May, and we’ll open a Zurich region next year.

Stay tuned in June for more on the technologies behind Google Cloud—we’ve got lots up our sleeve.

Troubleshooting tips: How to talk so your cloud provider will listen (and understand)



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and as a teaser, this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part one of two. Then, check out the second installment specifically on troubleshooting cloud provider communications.

Effective technology troubleshooting requires a systematic approach, as opposed to luck or experience. Troubleshooting can be a learned skill, as discussed in our site reliability engineering (SRE) troubleshooting primer.

But how does that change when you and your operations team are running systems and services on cloud infrastructure? Regardless of where your websites or apps live, you’re the one getting paged when the site goes down and you are the one under pressure to solve the problems and answer questions.

Cloud presents a new way of working for IT teams shifting away from legacy systems. You had full visibility and control of all aspects of your system when it was on-premises, but now you depend on off-site, cloud-based infrastructure, into which you may have limited visibility or understanding. That’s true no matter how top-notch your systems and processes are and how great a troubleshooter you are. You simply can’t see much into many cloud provider systems. The troubleshooting challenge is no longer limited to pure debugging; you also need to effectively communicate with your cloud provider. You’ll need to engage each provider’s support process and engineers to find where problems originated and fix them as soon as possible. This is an area of opportunity for IT teams as they gain new skills and adapt to the cloud-based technology model.

It’s definitely possible to do cloud troubleshooting right. When you’re choosing a provider, make sure to understand how they’ll work with you during any issues. Once you’re working with a provider, you have more control than you think over how you communicate issues and speed along their resolution. We propose this model, inspired by the SRE model of actionable improvements, for working with your cloud provider to troubleshoot more effectively and efficiently. (For more on cloud provider communications throughout the troubleshooting process, see the companion post.)

Understand your cloud provider's support workflow

Your goal here is to figure out the best way to provide the right information to your provider when an issue inevitably arises. This is a useful place to start, especially since you may depend on multiple cloud providers to run your infrastructure. Your interaction with cloud support will typically begin with questions related to migration. Your relationship then progresses into the domain of production integration, and finally, joint troubleshooting of production problems.

Keep in mind that different cloud providers have different philosophies when it comes to customer interaction. Some provide a large degree of freedom and little direct support. They expect you to find answers to most of your questions from online forums such as Stack Overflow. Other providers emphasize tight customer integration and joint troubleshooting. You have some homework to do before you begin serving real customer traffic from your cloud deployment. Talk to your cloud provider's salespeople, project managers and support engineers, to get a sense of how they approach support. Ask them the following questions:
  • What does the lifecycle of a typical support issue report look like?
  • What is your internal escalation process if an issue becomes complex or critical?
  • Do you have an internal SLO for <service name>? If so, what is that SLO?
  • What types of premium support are available?
This step is critical in reducing your frustration when you have to troubleshoot an issue that involves a cloud provider’s giant black box.

Communicate with cloud provider support efficiently

Once you have a sense of the provider’s support workflow, you can figure out the best way to get your information across. There are some best practices for filing the perfect issue report with cloud provider support teams, including what to say in your issue report and why. These guidelines follow the 80/20 rule: we try to give 20% of the details that will be useful in 80% of your issue reports. The same principles apply if you're filing bug reports in issue trackers or posting to user groups and forums.

Your guiding principle when communicating to cloud providers should be clarity: specify the appropriate level of technical detail and communicate expectations explicitly.

Provide basic information
It may seem like common sense, but these basics are essential to include in an issue report. Failing to provide any of these details leads to delays and a poor experience.

Include four critical details
Effective troubleshooting starts with time, product, location and specific identifiers.

1. Time
Here are a few examples of including time in an issue report:
  • Starting at 2017-09-08 15:13 PDT and ending 5 minutes later, we observed...
  • Observed intermittently, starting no earlier than 2017-09-10 and observed 2-5 times...
  • Ongoing since 2017-09-08 15:13 PDT...
  • From 2017-09-08 15:13 PDT to 2017-09-08 22:22 PDT...
Including the onset time and duration allows support teams to focus their time-series monitoring on the relevant period. Be explicit about whether the issue is ongoing or was observed only in the past; if it is not ongoing, provide an end time if possible.

Remember to always include the time zone. ISO 8601 format is a good choice because it is unambiguous and easy to sort. If you instead specify time in a relative way, whomever you're working with must convert local time into an absolute format they can input into time-series monitoring tools. This conversion is error-prone and costly: for example, sending an email about something that happened "earlier yesterday" means that your counterpart has to search for an email header and start doing mental math. This introduces cognitive load, which decreases the person’s mental energy available for solving the technical problem.

If an issue was intermittent over some period of time, state when it was first observed, or perhaps note a time in the past when it was surely not happening. Include the frequency of observations and note one or two specific examples.

Meanwhile, here are some antipatterns to avoid:
  • Earlier today: Not specific enough
  • Yesterday: Requires the recipient to figure out the implied date; can be confusing especially when work crosses the International Date Line
  • 9/8: Ambiguous, as the date might be interpreted as September 8 in the United States or August 9 in other locales. Use ISO 8601 format for clarity.
2. Product
Be as specific as possible in your issue report about the product you're using, including version information where applicable. These, for example, aren’t specific enough to locate the components or logs that can help with diagnosis:
  • REST API returned errors...
  • The data mining query interface is hanging...
Ideally, you should refer to specific APIs or URLs, or include screenshots. If the issue originates in a specific component or tool (for example, the CLI or Terraform), clearly identify that tool. If multiple products are involved, be specific about each one. Describe the behavior you're observing, and the behavior you expected to occur.

Antipatterns:
  • Can't create virtual machine: It's not clear how you're attempting to create those machines, nor does it say what the failure mode is.
  • The CLI command is giving an error:
    • Instead, provide the specific error, and the command syntax so others can run the command themselves.
    • Better: I ran 'mktool create my-instance --zone us-central1' and got the following error message...
3. Location
It's important to specify the region and zone because cloud providers often roll out changes to one region or zone at a time. Therefore, region or zone is a proxy for a cloud-internal software version number.
These are examples of how you might include location information in an issue report:
  • In us-east1-a... 
  • I tried regions eu-west-1 and eu-west-3...
Given this information, the support team can see if there's a rollout underway in a given location, or map your issue to an internal release ID for use in an internal bug report.

4. Specific identifiers
Project identifiers are included in many troubleshooting tools. Specify whether you observed the error in multiple projects, or in one project but not another.

These are examples of specific identifiers:
  • In project 123412341234 or my-project-id... 
  • Across multiple projects (including 123412341234)... 
  • Connecting to cloud external IP 218.239.8.9 from our corporate gateway 56.56.56.56... 
IP addresses are another form of unambiguous identifiers, though they also require additional details when used in an issue report. When specifying an IP, try to describe the context of how it's used. For example, specify whether the IP is connected to a VM instance, a load balancer or a custom route, or if it's an API endpoint. If the IP address isn't part of the cloud platform (for example, your home internet, a VPN endpoint or an external monitoring system), specify that information. Remember, 192.168.0.1 occurs many times in the world.

Antipatterns:
  • One of our instances is unreachable…: Overly vague 
  • We can't connect from the Internet...: Overly vague 
Note that other models, such as the Five Ws (popular in fields like journalism), can provide structure to your report.

Specify impact and response expectations
Basic information to include in your issue report to a cloud provider should also include how it’s affecting your business and when it needs to be resolved.

Priority expectations
Cloud provider support commonly uses the priority you specify to initially route the issue report and to determine the urgency of the issue. The priority rating drives the speed of response, potentially paging oncall personnel.

In addition to selecting the appropriate priority, it's useful to add a sentence describing the impact. Help avoid incorrect assumptions by being explicit about why you selected P1.

Think of the priority in terms of the impact to your business. Your cloud provider may have what appear to be strict definitions of priority (e.g., P1 signifies a total outage). Don't let these definitions slow progress on issues that are business critical; extrapolate the impact if your issue isn't addressed, or describe the worst-case scenario of an exposure related to the issue. For example, the following two descriptions essentially describe impending P1 issues:
  • Our backlog is increasing. Current impact is minor, but if this issue is not fixed in 12 hrs, we're effectively down. 
  • A key monitoring component has failed. While there is no current impact, this means we have a blind spot that will cause the next failure to become a user visible outage. 
Response time expectations
If you have specific needs related to response time, indicate them clearly. For example, you might specify "I need a response by 5pm because that's when my shift ends." If you have internal outage communication SLOs, make sure you request a response time from your provider that is within that interval so you can meet those SLOs. Cloud providers likely have 24/7 support, but if this isn't the case, or if your relevant personnel are in a particular time zone, communicate your time zone-specific needs to your provider.

Including these details in your issue report will ideally save you time later and speed up the overall resolution process with your provider. Check out part two for tips on communicating with your cloud provider throughout the actual troubleshooting process.

Related content:
SRE vs. DevOps
Incident management at Google —adventures in SRE-land
Applying the Escalation Policy — CRE life lessons
Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Troubleshooting tips: Help your cloud provider help you



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part two of two. Check out part one on writing better issue reports for cloud provider support.

Troubleshooting computer systems is an act as old as computers themselves. Some might even call it an art. The cloud computing paradigm entails a fundamental change to how IT teams conduct troubleshooting.

Successful IT troubleshooting doesn’t depend only on luck or experience, but is a deliberate process that can be taught. When you’re using cloud-based infrastructure, you’re often troubleshooting via a cloud provider’s help desk, adding another layer to helping users. Because of this shift away from the traditional IT team model, your communications with the provider are essential. (See part one for more on putting together an effective issue report to improve troubleshooting from the start.)

Once you’ve communicated the issue to your provider, you’ll be working with the provider’s support team to get the issue fixed.

The essentials of cloud troubleshooting

Those diagnosing a technical problem with cloud infrastructure are seeking possible explanations (hypotheses) and evidence that explains the problem. In the short term, they look for changes in the system that roughly correlate with the problem and consider rolling back as a first step to mitigate the problem and stop the bleeding. The longer-term goal is to identify and fix the root cause so the problem will not recur.

From the site reliability engineering (SRE) perspective, the general approach for troubleshooting is as follows:

  • Triage: Mitigate the impact if possible
  • Examine: Gather observations and share them
  • Diagnose: Create a hypothesis that explains the observations
  • Test and treat:
    • Identify tests that may prove or disprove the hypothesis
    • Execute the tests and agree on the meaning of the result
    • Move on to the next hypothesis; repeat until solved


When you’re working with a cloud provider on troubleshooting an issue, there are parts of the process you’re unable to control. But you can follow the steps on your end. Here’s what you can do when submitting a report to your cloud provider support team.

1. Communicate any troubleshooting you've already done
By the time you open an issue report, you've probably done some troubleshooting already. You may have checked the provider’s status page, for example. Share the steps you've taken and any key findings. Keep a timeline and log book of what you have done and share it with the provider; start keeping it as soon as you detect the problem. Keep in mind that while cloud providers may have telemetry that provides real-time, omniscient awareness of the state of their infrastructure, the dependencies that result from your particular implementation may be less obvious. By design, your particular use of cloud resources is proprietary and private, so your troubleshooting vantage point is vital.
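
For example, a minimal way to keep such a log book from a terminal is to append timestamped notes to a file as you work (the file name and notes here are just illustrations):

# Append timestamped notes to a local log book as you troubleshoot.
$ echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) checked provider status page: no incidents listed" >> issue-logbook.txt
$ echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) restarted my-instance in us-central1-a: error persists" >> issue-logbook.txt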

If you think you have a diagnosis, explain how you came to that conclusion. If you think others can reproduce the issue, include the steps to do so. A reproducible test in an issue report usually leads to the fastest resolution.

You may have an idea or guess about what's causing the problem. Be careful to avoid confirmation bias—looking for evidence to support your guess without considering evidence to the contrary.

2. Be specific and explicit about the issue
If you've ever played the telephone game, in which players whisper a message from person to person, you've seen how human translation and interpretation can lead to communication gaps. Rather than describing information in your provider communications, try to share it. Doing so reduces the chance that your reader will misinterpret what you're saying and can help speed up troubleshooting. Don’t assume that your provider has access to all of this information; customer privacy means that they may not, by design.

For example:

  • Use a screenshot to show exactly what you see
  • For web-based interfaces, provide a .HAR (Http ARchive) file
  • Attach information like tcpdump output, log snippets and example stack traces

3. Report production outages quickly
An issue is considered to be a production outage if your application has stopped serving traffic to users or is experiencing similar business-critical impact. Report production outages to your cloud provider support as soon as possible. Issues that block a small number of developers in a developer test environment are normally not considered production outages, so they should be reported at lower priorities.

Normally, when cloud provider support is alerted about a production outage, they quickly triage the situation with the following steps:

  1. Immediately check for known issues affecting the infrastructure.
  2. Confirm the nature of the issue.
  3. Establish communication channels.


Typically, you can expect a quick response with a brief message, which might contain:

  • Whether or not there is a known issue affecting multiple customers
  • An acknowledgement that they can observe the issue you've reported or a request for more details
  • How they intend to communicate (for example, phone, Skype, or issue report)


It’s important to quickly create an issue report including the four critical details (described in part one) and then begin deeper troubleshooting on your side of the equation. If your organization has a defined incident management process (see Managing Incidents), escalating to your cloud provider should be among your initial steps.

4. Report networking issues with specificity
Most cloud providers’ networks are huge and complex, composed of many technologies and teams. It's important to quickly identify a networking-specific problem as such and engage with the team that can repair it.

At a high level, many networking issues have similar symptoms, like "can't connect to the server." This level of detail is typically too generic to be useful in identifying the root cause, so you need to provide more diagnostic information. Network issues relate to connectivity, which always involves at least two specific points: source and destination. Always include information about these points when reporting network issues.

To structure your issue report, use the conceptual tool of a packet flow diagram (an example sketch follows the list below):

  • Describe the important hops that a packet takes along a path from source to destination, along with any significant transformations (e.g., NAT) along the way.
  • Start by identifying the affected network endpoints by Internet IP address or by RFC 1918 private address, plus an ASN for the network.
  • Note anything meaningful about the endpoints, such as who controls them and whether they are associated with a DNS hostname. 
  • Note any intermediate encapsulation and/or indirection. For example: VPN tunneling, proxies or NAT gateways.
  • Note any intermediate filtering, like firewalls, CDN or WAF.
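
For example, a packet flow for a hypothetical setup (all names and addresses below are illustrative) might be summarized like this:

# Hypothetical packet flow, client to API endpoint:
#   laptop 10.20.0.15 (office LAN, RFC 1918)
#     -> corporate NAT gateway 203.0.113.4 (AS 64500)
#     -> VPN tunnel to cloud VPN gateway 198.51.100.7
#     -> internal load balancer 10.128.0.10
#     -> backend VM api-backend-1 10.128.0.21 (serves api.example.com)
# Filtering along the path: corporate firewall, plus cloud firewall rules on the VM.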


Many problems that manifest as high latency or intermittent packet loss will require a path analysis and/or a packet capture for diagnosis. Path analysis is a list of all hops that packets traverse (for example, MTR or tcptraceroute). A packet capture (a.k.a. pcap, derived from the name of the library libpcap) is an observation of real network traffic. It's important to take a packet capture at both endpoints at the same time, which can be tricky. Practice with the necessary tools (for example, tcpdump or Wireshark) and make sure they are installed before you need them.
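
For instance, assuming the hypothetical endpoints above, you might collect a path analysis and simultaneous packet captures like this before attaching them to the issue report:

# Path analysis from the client toward the service (hypothetical hostname).
$ mtr --report --report-cycles 100 api.example.com

# Packet captures, started at the same time on both endpoints, filtered to the peer address.
$ sudo tcpdump -i any -w client-side.pcap host api.example.com   # on the client
$ sudo tcpdump -i any -w server-side.pcap host 203.0.113.4       # on the backend VM, filtering on the client's NAT address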

5. Escalate when appropriate
If circumstances change, you may need to escalate the urgency of an issue so it receives attention quickly. Take this step if business impact increases, if an issue is stuck without progress after a lot of back-and-forth with support, or if some other factor calls for quicker resolution.

The most explicit way to escalate an issue is to change the priority of the issue report (for example, from P3 to P2). Provide comments about why you need to escalate so support can respond appropriately.

6. Create a summary document for long-running or difficult issues
Issue state and relevant information change over time as new facts come to light and hypotheses are ruled out. In the meantime, new people join the investigation. Help communicate relevant, up-to-date information by collecting information in a summary document.

A good summary document has the following dimensions:

  • The latest state summarized at the top
  • Links to all relevant issue reports and internal tracking bugs
  • A list of hypotheses which are potentially true, and hypotheses that have been ruled out already. When you start investigating a particular hypothesis, note that you are doing so, and mention the tests or tools that you intend to use. Often, you can get good advice or prevent duplicate work.


SAMPLE summary document format:

$TIMESTAMP
<Current customer impact> <Working theory and actions being taken> <Next steps>

13:00:00
Customer impact has been mitigated and resolved. Our networking provider was throttling our traffic because we forgot to pay our bill last month. Next step is to be nicer to our finance team.

12:00:00
More than 100 customers are actively complaining about not being able to reach our service. Our networking provider is throttling customer traffic to one of our load balancers. The response team is actively working with our networking provider’s tier 1 support to understand why and how this happened.

11:00:00
We have now received 100 complaints from 50 customers from four different geos that they cannot consistently reach our API at api.acme.com. Our engineers currently believe that an upstream networking issue is causing this. Next steps are to reach out to our networking provider to see if there are any upstream issues.

10:00:00
We have received five complaints from five customers that they are unable to reach api.acme.com. Our engineers are looking into the issue.


Try to keep each issue report focused on a single issue. Don't reopen an issue report to bring up a new issue, even if it's related to the original issue. Do reference similar issues in your new report to help your provider recognize patterns from systemic root causes.

Keep your communication skills sharp

Communicating highly detailed technical information in a clear and actionable manner can be difficult. Doing so requires focus and specific skills. This task is particularly challenging in stressful situations, because our biological response to stress works against the need for clear cognitive reasoning. The following techniques help make communication easier for everyone.

Help reduce cognitive load by writing a detailed issue report
Many issue reports require the reader to make inferences or calculations. This introduces cognitive load, which decreases the mental energy available for solving the technical problem.

When writing an issue report, be as specific and detailed as possible. While this attention to detail requires more time on the part of the writer, consider that an issue report is written once but read many times by many people. People can solve the problem faster together when equipped with comprehensive information. Avoid acronyms and internal company code names. Also, be mindful of protecting customer privacy when disclosing any information to any third party.

Use narrative techniques
Once upon a time, in a land far, far away...

Humans are very good at absorbing information in the form of stories, so you can get your point across quite effectively this way. Start with the context: What was happening when you first observed the problem? What potential fixes did you try? Who are the characters involved, and why does the issue matter to them?
Include visuals
Illustrate your issue report with any supporting material you have available, like formatted text, charts, and screenshots.

Text formatting
Formatted text like log lines, code excerpts or MySQL results often becomes illegible when sent through plain-text emails. Add explicit markers (for example, <<<<<< at the end of the line) to help direct attention to important sections. You can use footnotes to point to long-form URLs, or use a URL shortener.
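
For example, a marker like this (hypothetical log lines) makes the problem line hard to miss:

2018-07-10 12:01:03 INFO  request received id=4711
2018-07-10 12:01:04 ERROR upstream timeout after 10s id=4711   <<<<<<
2018-07-10 12:01:05 INFO  retrying id=4711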

Use bullet points to format lists, and to call out important details like instance names. Use numbered lists to enumerate series of steps.

Charts
Charts are very useful for understanding time-series data. When you’re sending charts with an issue report, keep these best practices in mind:

  • Take a screenshot, including title and axis labels. For absolute values, specify units (requests per minute, errors per second, etc).
  • Annotate the screenshot with arrows or circles to call out important points.
  • Briefly describe what the chart is measuring.
  • Briefly describe how the chart normally looks.
  • In a few sentences, describe your interpretation of the chart and why it is relevant to the problem.


Avoid the following antipatterns:

  • The Y-axis represents a specific error (e.g., exceptions in my-handler) and has no clear relationship to the problem under investigation (e.g., high persistence-layer latency). To remedy this situation, explain why the graph is relevant to the problem.
  • The Y-axis is an absolute number (e.g., 1M per minute) that provides no context about the relative impact.
  • The X-axis doesn't have a time zone.
  • The Y-axis is not zero-based. This can make minor changes in the Y value seem very large.
  • Axis labels are missing or cut off.

Well-crafted issue reports, along with strong communication with your cloud provider, can speed up the resolution process. The cloud computing model has drastically changed the way that IT teams troubleshoot computer systems. Technical savvy is no longer the only skill you need for effective troubleshooting; you must also be able to communicate clearly and efficiently with cloud providers. While the reality of your deployment may be unique, nuanced, and complex, these building blocks can help you navigate this territory.

Related content:
SLOs, SLIs, SLAs, oh my - CRE life lessons
SRE vs. DevOps: competing standards or close friends?
Introducing Google Customer Reliability Engineering


Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Cloud Source Repositories: more than just a private Git repository



If your goal is to release software continuously at high velocity, you need to be able to automatically build, test, deploy, and debug your code changes, all within minutes. But first you need to integrate your version control systems and your build, deploy, and debugging tools—a time-consuming and complicated process that requires numerous manual configuration steps like downloading plugins and setting up webhooks. And when you’re done, the workflow still isn’t very integrated, forcing developers to jump from one tool to another as they go from code to deployment. So much for high velocity.

Cloud Source Repositories provides fully managed private Git repositories hosted on Google Cloud Platform (GCP) and is tightly integrated with other GCP tools, making it easy to automatically build, test, deploy, and debug code right out of the gate. With just a few clicks and without any additional setup or configuration, you can extend Cloud Source Repositories with other GCP tools to perform other tasks as a part of your development workflow. In this post, let’s take a closer look at some of the GCP tools that are integrated with Cloud Source Repositories, and how they simplify developer workflows:

Simplified continuous integration (CI) with Container Builder

Looking to implement continuous integration and validate each check-in to a shared repository with an automated build and test? The integration of Cloud Source Repositories with Container Builder comes in handy here, making it easy to set up CI on a branch or tag. There are no CI servers to set up or repositories to configure. In fact, you can enable a CI process on any existing or new repo in Cloud Source Repositories. Simply specify the trigger on which Container Builder should build the image. In the following example, the trigger specifies that a build runs whenever changes are pushed to any branch of Cloud Source Repositories.


To demonstrate this trigger in action, the example below changes the background color of the “Hello World” website from yellow to blue.

The first step is to set blue as the background color using the background-color CSS property. Then add the changed file to the index with git add and record the change to the repository with git commit. Finally, push the commits to the remote repository with git push.

Because of the trigger defined above, Container Builder automatically starts building the image as soon as the changes are pushed to Cloud Source Repositories. Once the image is created, the new version of the app is deployed using kubectl set image, and the “Hello World” website now shows a blue background color.
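
A rough sketch of that flow from the command line might look like the following; the repository, file, deployment, and image names are illustrative rather than taken from the example above:

# Clone the repository from Cloud Source Repositories, then commit and push the change.
$ gcloud source repos clone hello-world-repo
$ cd hello-world-repo
$ git add static/style.css
$ git commit -m "Change Hello World background color to blue"
$ git push origin master

# After Container Builder publishes the new image, roll it out to the cluster.
$ kubectl set image deployment/hello-web hello-web=gcr.io/my-project/hello-web:v2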

Follow this quickstart to begin continuous integration with Container Builder & Cloud Source Repositories.

Pre-Installed tools and programming languages in Cloud Shell and Cloud Shell Editor

Cloud Source Repositories is integrated out-of-the-box with Cloud Shell and the Cloud Shell Editor. Cloud Shell provides browser-based command-line access, giving you an easy way to build and deploy applications. It is already configured with common tools such as the MySQL client, kubectl, and Docker, as well as Java, Go, Python, Node.js, PHP, and Ruby, so you don't have to spend time looking for the latest dependencies or installing software. The Cloud Shell Editor, meanwhile, acts as a cross-platform IDE for editing code with no setup.
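
For example, you can confirm some of the preinstalled tooling directly from a new Cloud Shell session:

# Check a few of the tools that ship with Cloud Shell.
$ git --version
$ docker --version
$ kubectl version --client
$ java -version
$ go version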

Quick deployment to App Engine

The integration of Cloud Source Repositories and App Engine makes publishing applications a breeze. It lets you deploy apps quickly and lets developers focus on writing code, without worrying about managing the underlying infrastructure or scaling the app as demand grows. You can deploy source code stored in Cloud Source Repositories to App Engine with the gcloud app deploy command, which automatically builds an image and deploys it to the App Engine flexible environment. Let’s see this in action.

In the following example, we’ll change the text on the website from “Hello Universe” to “Hello World” before deploying it. As with the previous example, git add and git commit stage and commit the changed files, and git push then pushes them to the master branch in Cloud Source Repositories.

Once the changes have been pushed to Cloud Source Repositories, you can deploy the new version of the application by running the gcloud app deploy command from the directory where the app.yaml file is located.
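
In shell form, the deployment step looks roughly like this (the directory name is a placeholder):

# Deploy from the directory that contains app.yaml.
$ cd ~/hello-world-app
$ gcloud app deploy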

The text is now changed to “Hello, World!” from “Hello Universe”.

Try deploying code stored in Cloud Source Repositories to App Engine by following the quickstart here.

Debug in production with Stackdriver Debugger

If your app is running in production and has problems, you need to troubleshoot issues quickly to avoid bad customer experiences. For debugging apps in production, creating breakpoints isn't really an option as you can’t suspend the program. To help locate the root cause of production issues quickly, Cloud Source Repositories is integrated with Stackdriver Debugger, which lets you debug applications in production without stopping or slowing the application.

Stackdriver Debugger lets you use either a debug snapshot or a debug logpoint to debug production applications. A debug snapshot captures the call stack and variables at a specific code location the first time any instance of that code is executed. A debug logpoint, on the other hand, writes log messages to the log stream. You can set a debug snapshot or a debug logpoint for code stored in Cloud Source Repositories with a single click.
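
If you prefer the command line to the console UI, the gcloud debug command group offers roughly equivalent operations. This is only a sketch; the file, line number, and message format are placeholders, and the exact flags are worth checking against the current gcloud documentation:

# Take a snapshot the next time this line executes (captures the call stack and local variables).
$ gcloud debug snapshots create main.py:26

# Inject a logpoint that writes a message to the log stream without redeploying.
$ gcloud debug logpoints create main.py:26 "get() called, user={user}"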

Debug Snapshot for debugging

In the following example, a snapshot has been set up for the second line of code in the get function of the MainPage class.

The right-hand panel displays details such as the call stack and the values of local variables in scope once the snapshot location set above is reached.

Learn more about production debugging by following the quickstart here.

Debug Logpoint for Debugging

The integration of Stackdriver with Cloud Source Repositories also lets you inject logging statements without restarting the app, and you can store, search, analyze, monitor, and alert on the resulting log data and events. As an example, a logging statement introduced in the above code is highlighted below.

The logs panel highlights the logs printed by logpoint.

Version control with Cloud Functions

If you’re building a serverless app, you’ll be happy to know that Cloud Source Repositories is also integrated with Cloud Functions. You can store your function source code in Cloud Source Repositories and reference it from event-driven serverless apps. The code stored in Cloud Source Repositories can also be deployed in response to specific triggers, including HTTP requests, Cloud Pub/Sub events, and others. Changes made to function source code are automatically tracked over time, and you can roll back to the previous state of any repository.

In the following example, the “helloworld” function is deployed with an HTTP trigger, so it runs in response to HTTP requests. The source code for the function is located in the root directory of the Cloud Source repository.
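
A hedged sketch of that kind of deployment follows; the repository URL format, project and repo names, and runtime are assumptions to verify against the Cloud Functions documentation:

# Deploy an HTTP-triggered function whose source lives in Cloud Source Repositories.
# The --source URL format below is an assumption; check the Cloud Functions docs for your gcloud version.
$ gcloud functions deploy helloworld \
    --runtime nodejs10 \
    --trigger-http \
    --source https://source.developers.google.com/projects/my-project/repos/my-repo/moveable-aliases/master/paths/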

Learn more about deploying your function source code stored in Cloud Source Repositories using the quickstart here.

In short, the integration of Cloud Source Repositories with other Google Cloud tools lets your team go from code to deployment in minutes, all while managing versioning and aliasing. You even get the ability to perform production debugging on the fly using built-in monitoring and logging tools. Try Cloud Source Repositories along with these integrations here.