Tag Archives: Storage & Databases

Last month today: July on GCP

The month of July saw our Google Cloud Next ‘18 conference come and go, and there was plenty of exciting news, updates and demos to share from the show. Here’s a look at some of the most-read blog posts from July.

What caught your attention this month: Creating the open cloud
  • One of the most-read posts this month covered the launch of our Cloud Services Platform, which allows you to build a true hybrid cloud infrastructure. Some of the key components of Cloud Services Platform include the managed Istio service mesh, Google Kubernetes Engine (GKE) On-Prem and GKE Policy Management, Cloud Build for fully managed CI/CD, and several serverless offerings (more on that below). Combined, these technologies can help you gain consistency, security, speed and flexibility of the cloud in your local data center, along with the freedom of workload portability to the environment of your choice.
  • Another popular read was a rundown of Google Cloud’s new serverless offerings. These include core serverless compute announcements such as new App Engine runtimes, Cloud Functions general availability and more. It also included serverless containers, so you can run serverless workloads in a fully managed container environment; GKE Serverless add-on to easily run serverless workloads on Kubernetes Engine; and Knative, the open-source project on which that add-on is built. There are even more features included in this post, too, like Cloud Build, Stackdriver monitoring and Cloud Firestore integration with GCP. 
Bringing detailed metrics and Kubernetes apps to the forefront
  • Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
  • It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
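To make the idea of an SLI that goes beyond uptime more concrete, here is a minimal sketch that derives an availability ratio and a latency ratio per service from a list of request records. The record fields (`service`, `status`, `latency_ms`), the 300 ms threshold and the function name are assumptions made for illustration; this is not the Stackdriver API or the way Transparent SLIs are computed.

```python
from collections import defaultdict


def compute_slis(requests, latency_threshold_ms=300):
    """Return per-service SLIs as ratios between 0 and 1.

    availability: share of requests that did not end in a 5xx response.
    latency: share of requests answered within latency_threshold_ms.
    """
    totals = defaultdict(lambda: {"total": 0, "ok": 0, "fast": 0})
    for request in requests:
        counters = totals[request["service"]]
        counters["total"] += 1
        if request["status"] < 500:  # 4xx counts as served; only 5xx is a failure
            counters["ok"] += 1
        if request["latency_ms"] <= latency_threshold_ms:
            counters["fast"] += 1
    return {
        service: {
            "availability": c["ok"] / c["total"],
            "latency": c["fast"] / c["total"],
        }
        for service, c in totals.items()
    }
```

Grouping by service mirrors the per-service breakdown described above; the same pattern could group by location or API version instead.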
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.

Building the cloud back-end
  • For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
  • And on that topic, this post on best practices for building containers was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use.
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!

On GCP, your database your way

When choosing a cloud to host your applications, you want a portfolio of database options—SQL, NoSQL, relational, non-relational, scale up/down, scale in/out, you name it—so you can use the right tool for the job. Google Cloud Platform (GCP) offers a full complement of managed database services to address a variety of workload needs, and of course, you can run your own database in Google Compute Engine or Kubernetes Engine if you prefer.

Today, we’re introducing some new database features along with partnerships, beta news and other improvements that can help you get the most out of your databases for your business.

Here’s what we’re announcing today:
  • Oracle workloads can now be brought to GCP
  • SAP HANA workloads can run on GCP persistent-memory VMs
  • Cloud Firestore launching for all users developing cloud-native apps
  • Regional replication, visualization tool available for Cloud Bigtable
  • Cloud Spanner updates, by popular demand

Managing Oracle workloads with Google partners

Until now, it's been a challenge for customers to bring some of the most common workloads to GCP. Today, we’re excited to announce that we are partnering with managed service providers (MSPs) to provide a fully managed service for Oracle workloads for GCP customers. Partner-managed services like this unlock the ability to run Oracle workloads and take advantage of the rest of the GCP platform. You can run your Oracle workloads on dedicated hardware and you can connect the applications you’re running on GCP.

By partnering with a trusted managed service provider, we can offer fully managed services for Oracle workloads with the same advantages as GCP services. You can select the offering that meets your requirements, as well as use your existing investment in Oracle software licenses.

We are excited to open the doors to customers and partners whose technical requirements do not fit neatly into the public cloud. By working with partners, you’ll have the option to move these workloads to GCP and take advantage of the benefits of not having to manage hardware and software. Learn more about managing your Oracle workloads with Google partners, available this fall.

Partnering with Intel and SAP

This week we announced our collaboration with Intel and SAP to offer Compute Engine virtual machines backed by the upcoming Intel Optane DC Persistent Memory for SAP HANA workloads. Google Compute Engine VMs with this Intel Optane DC persistent memory will offer higher overall memory capacity and lower cost compared to instances with only dynamic random-access memory (DRAM). Google Cloud instances on Intel Optane DC Persistent Memory for SAP HANA and other in-memory database workloads will soon be available through an early access program. To learn more, sign up here.

We’re also continuing to scale our instance size roadmap for SAP HANA production workloads. With 4TB machine types now in general availability, we’re working on new virtual machines that support 12TB of memory by next summer, and 18TB of memory by the end of 2019.

Accelerate app development with Cloud Firestore

For app developers, Cloud Firestore brings the ability to easily store and sync app data at global scale. Today, we're announcing that we’ll soon expand the availability of the Cloud Firestore beta to more users by bringing the UI to the GCP console. Cloud Firestore is a serverless, NoSQL document database that simplifies storing, syncing and querying data for your cloud-native apps at global scale. Its client libraries provide live synchronization and offline support, while its security features and integrations with Firebase and GCP accelerate building truly serverless apps.

We're also announcing that Cloud Firestore will support Datastore Mode in the coming weeks. Cloud Firestore, currently available in beta, is the next generation of Cloud Datastore, and offers compatibility with the Datastore API and existing client libraries. With the newly introduced Datastore mode on Cloud Firestore, you don’t need to make any changes to your existing Datastore apps to take advantage of the added benefits of Cloud Firestore. After general availability of Cloud Firestore, we will transparently live-migrate your apps to the Cloud Firestore backend, and you’ll see better performance right away, for the same pricing you have now, with the added benefit of always being strongly consistent. It’ll be a simple, no-downtime upgrade. Read more here about Cloud Firestore.

Simplicity, speed and replication with Cloud Bigtable

For your analytical and operational workloads, an excellent option is Google Cloud Bigtable, a high-throughput, low-latency, and massively scalable NoSQL database. Today, we are announcing that regional replication is generally available. You can easily replicate your Cloud Bigtable data set asynchronously across zones within a GCP region, for additional read throughput, higher durability and resilience in the face of zonal failures. Get more information about regional replication for Cloud Bigtable.

Additionally, we are announcing the beta version of Key Visualizer, a visualization tool for Cloud Bigtable key access patterns. Key Visualizer helps debug performance issues due to unbalanced access patterns across the key space, or single rows that are too large or receiving too much read or write activity. With Key Visualizer, you get a heat map visualization of access patterns over time, along with the ability to zoom into specific key or time ranges, or select a specific row to find the full row key ID that's responsible for a hotspot. Key Visualizer is automatically enabled for Cloud Bigtable clusters with sufficient data or activity, and does not affect Cloud Bigtable cluster performance. Learn more about using Key Visualizer on our website.
Key Visualizer, now in beta, shows an access pattern heat map so you can debug performance issues in Cloud Bigtable.
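The heat map idea can be illustrated with a small sketch: group accesses into cells by key range and time window, then look for the cell with the most activity. This is only a conceptual illustration, not how Key Visualizer is implemented; the function names, the `(timestamp, row_key)` input shape and the split points are assumptions.

```python
import bisect
from collections import Counter


def access_heatmap(accesses, key_splits, window_seconds=60):
    """Count accesses per (key bucket, time window) cell.

    accesses: iterable of (timestamp_seconds, row_key) pairs.
    key_splits: sorted list of row-key split points. Bucket 0 holds keys
    that sort before key_splits[0]; bucket i holds keys at or after
    key_splits[i - 1] and before key_splits[i].
    """
    cells = Counter()
    for timestamp, row_key in accesses:
        bucket = bisect.bisect_right(key_splits, row_key)
        window = int(timestamp // window_seconds)
        cells[(bucket, window)] += 1
    return cells


def hottest_cell(cells):
    """Return ((bucket, window), count) for the busiest cell."""
    return max(cells.items(), key=lambda item: item[1])
```

A single cell with a count far above its neighbors corresponds to the kind of hotspot described above, for example one row receiving a disproportionate share of reads or writes.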

Finally, we launched client libraries for Node.js (beta) and C# (beta) this month. We will continue working to provide stronger language support for Cloud Bigtable, and look forward to launching Python (beta), C++ (beta), native Java (beta), Ruby (alpha) and PHP (alpha) client libraries in the coming months. Learn more about Cloud Bigtable client libraries.

Cloud Spanner updates, by popular request

Last year, we launched our Cloud Spanner database, and we’ve already seen customers do proof-of-concept trials and deploy business-critical apps to take advantage of Cloud Spanner’s benefits, which include simplified database administration and management, strong global consistency, and industry-leading SLAs.

Today we’re announcing a number of new updates to Cloud Spanner that our customers have requested. First, we recently announced the general availability of import/export functionality. With this new feature, you can move your data using Apache Avro files, which are transferred with our recently released Apache Beam-based Cloud Dataflow connector. This feature makes Cloud Spanner easier to use for a number of important use cases such as disaster recovery, analytics ingestion, testing and more.

We are also previewing data manipulation language (DML) for Cloud Spanner to make it easier to reuse existing code and tool chains. In addition, you’ll see introspection improvements with Top-N Query Statistics support to help database admins tune performance. DML (in the API as well as in the JDBC driver), and Top-N Query Stats will be released for Cloud Spanner later this year.

Your cloud data is essential to whatever type of app you’re building with GCP. You’ve now got more options than ever when picking the database to power your business.

Partnering with Intel and SAP on Intel Optane DC Persistent Memory for SAP HANA

Our customers do extraordinary things with their data. But as their data grows, they face challenges like the cost of the resources needed to handle and store it, and the sizing limitations of low-latency, in-memory computing workloads.

Our customers' use of in-memory workloads with SAP HANA for innovative data management use cases is driving the demand for even larger memory capacity. We’re constantly pushing the boundaries on GCP’s instance sizes and exploring increasingly cost-effective ways to run SAP workloads on GCP.

Today, we’re announcing a partnership with Intel and SAP to offer GCP virtual machines supporting the upcoming Intel® Optane™ DC Persistent Memory for SAP HANA workloads. These GCP VMs will be powered by the future Intel® Xeon® Scalable processors (code-named Cascade Lake) thereby expanding VM resource sizing and providing cost benefits for customers.

Compute Engine VMs with Intel Optane DC persistent memory will offer higher overall memory capacity with lower cost compared to instances with only dynamic random-access memory (DRAM). This will help you scale up your instances while keeping your costs under control. Compute Engine has consistently been focused on decreasing your operational overhead through capabilities such as Live Migration. Coupled with the native persistence benefits of Intel Optane DC Persistent Memory, you’ll get faster restart times for your most critical business applications.

Google Cloud instances on Intel Optane DC Persistent Memory for SAP HANA and other workloads will be available in alpha later this year for customer testing. To learn more, please fill out this form to register your interest.

To learn more about this partnership, visit our Intel and SAP partnership pages.

Top storage and database sessions to check out at Next 2018

Whatever your particular area of cloud interest, there will be a lot to learn at Google Cloud Next ‘18 (July 24-26 in San Francisco). When it comes to cloud storage and databases, you’ll find useful sessions that can help you better understand your options as you’re building the cloud infrastructure that will work best for your organization.

Here, we’ve chosen five not-to-miss sessions, where you’ll learn tips on migrating data to the cloud, understand types of cloud storage workloads and get a closer look at which database is best for storing and analyzing your company’s data. Wherever you are in your cloud journey, there’s likely a session you can use.

Top cloud storage sessions

First up, our top picks for those of you delving into cloud storage.

From Blobs to Tables, Where to Store Your Data
Speakers: Dave Nettleton, Robert Saxby

What’s the best way to store all the data you’re creating and moving to the cloud? The answer depends on the industry, apps and users you’re supporting. Google Cloud Platform (GCP) offers many options for storing your data. The choices range from Cloud Storage (multi-regional, regional, nearline, coldline) through Persistent Disk to various database services (Cloud Datastore, Cloud SQL, Cloud Bigtable, Cloud Spanner) and data warehousing (BigQuery). In this session, you’ll learn about the products along with common application patterns that use data storage.

Why attend: With much to consider and many options available, this session is a great opportunity to examine which storage option fits your workloads.

Caching Made Easy, with Cloud Memorystore and Redis
Speaker: Gopal Ashok

In-memory database Redis has plenty of developer fans: It’s high-performance and highly available, making it an excellent choice for caching operations. Cloud Memorystore now includes a managed Redis service. In this session, you’ll hear about its new features. You’ll also learn how you can easily migrate applications using Redis to Cloud Memorystore with minimal changes.
Why attend: Are you building an application that needs sub-millisecond response? GCP provides a fully managed service for the popular Redis in-memory datastore.

Google Cloud Storage - Best Practices for Storage Classes, Reliability, Performance and Scalability
Speakers: Geoff Noer, Michael Yu

Learn about common Google Cloud Storage workloads, such as content storage and serving, analytics/ML and data protection. Understand how to choose the best storage class, depending on what kind of data you have and what kind of workload you're supporting. You’ll also learn more about Multi-Regional, Regional, Nearline and Coldline storage.
Why attend: You’ll learn about ways to optimize Cloud Storage to the unique requirements of different storage use cases.

Top database sessions

Here are our top picks for database sessions to explore at Next ‘18.

Optimizing Applications, Schemas, and Query Design on Cloud Spanner
Speaker: Robert Kubis

Cloud Spanner was designed specifically for cloud infrastructure and scales easily to allow for efficient cloud growth. In this session, you’ll learn Cloud Spanner best practices, strategies for optimizing applications and workloads, and ways to improve performance and scalability. Through live demos, you’ll see real-time speed-ups of transactions, queries and overall performance. Additionally, this talk explores techniques for monitoring Cloud Spanner to identify performance bottlenecks. Come learn how to cut costs and maximize performance with Cloud Spanner.
Why attend: Cloud Spanner is a powerful product, but many users do not maximize its benefits. In this session, you’ll get an inside look at getting the best performance and efficiency out of this type of cloud database.

Optimizing performance on Cloud SQL for MySQL
Speakers: Stanley Feng, Theodore Tso, Brett Hesterberg

Database performance tuning can be challenging and time-consuming. In this session, you’ll get a look at the performance tuning our team has conducted in the last year to considerably improve Cloud SQL for MySQL. We’ll also highlight useful changes to the Linux kernel, EXT4 filesystem and Google's Persistent Disk storage layer to improve write performance. You'll come away knowing more about MySQL performance tuning, an underused EXT4 feature called “bigalloc” and how to let Cloud SQL handle mundane, yet necessary, tasks so you can focus on developing your next great app.
Why attend: When GCP provides fully managed services for databases, we put lots of innovations under the hood so that your database runs as efficiently as possible. Come and learn about Google’s secret sauce that lets you optimize Cloud SQL performance.

Check out the full list of Next sessions, and join your peers at the show by registering here.

Cloud Spanner adds import/export functionality to ease data movement

We launched Cloud Spanner to general availability last year, and many of you shared in our excitement: You explored it, started proof-of-concept trials, and deployed apps. Perhaps most importantly, you gave us feedback along the way. We heard you, and we got to work. Today, we’re happy to announce we’ve launched one of your most commonly requested features: importing and exporting data.

Import/export using Avro

You asked for easier ways to move data. You’ve got it. You can now import and export data easily in the Cloud Spanner Console:
  • Export any Cloud Spanner database into a Google Cloud Storage (GCS) bucket.
  • Import files from a GCS bucket into a new Cloud Spanner database.
These database exports and imports use Apache Avro files, transferred with our recently released Apache Beam-based Cloud Dataflow connector.

Adding imports and exports opens up even more possibilities for your Cloud Spanner data, including:
  • Disaster recovery: Export your database at any time and store it in a GCS location of your choice as a backup, which can be imported into a new Cloud Spanner database to restore data.
  • Testing: Export a database and then import it into Cloud Spanner as a dev/test database to use for integration tests or other experiments.
  • Moving databases: Export a database and import it back into Cloud Spanner in a new/different instance with the console’s simple, push-button functionality.
  • Ingest for analytics: Use database exports to ingest your operational data to other services such as BigQuery, for analytics. BigQuery can automatically ingest data in Avro format from a GCS bucket, which means it will become easier for you to run analytics on your operational data.
Ready to try it out? See our documentation on how to import and export data. Learn more about Cloud Spanner here, and get started with a free trial. For technical support and sales, please contact us.

We're excited to see the ways that Cloud Spanner—making application development more efficient, simplifying database administration and management, and providing the benefits of both relational and scale-out, non-relational databases—will continue to help you ship better apps, faster.

Introducing Jib — build Java Docker images better

Containers are bringing Java developers closer than ever to a "write once, run anywhere" workflow, but containerizing a Java application is no simple task: You have to write a Dockerfile, run a Docker daemon as root, wait for builds to complete, and finally push the image to a remote registry. Not all Java developers are container experts; what happened to just building a JAR?

To address this challenge, we're excited to announce Jib, an open-source Java containerizer from Google that lets Java developers build containers using the Java tools they know. Jib is a fast and simple container image builder that handles all the steps of packaging your application into a container image. It does not require you to write a Dockerfile or have Docker installed, and it is directly integrated into Maven and Gradle: just add the plugin to your build and you'll have your Java application containerized in no time.

Figure: Docker build flow

Figure: Jib build flow

How Jib makes development better:

Jib takes advantage of layering in Docker images and integrates with your build system to optimize Java container image builds in the following ways:
  1. Simple - Jib is implemented in Java and runs as part of your Maven or Gradle build. You do not need to maintain a Dockerfile, run a Docker daemon, or even worry about creating a fat JAR with all its dependencies. Since Jib tightly integrates with your Java build, it has access to all the necessary information to package your application. Any variations in your Java build are automatically picked up during subsequent container builds.
  2. Fast - Jib takes advantage of image layering and registry caching to achieve fast, incremental builds. It reads your build config, organizes your application into distinct layers (dependencies, resources, classes) and only rebuilds and pushes the layers that have changed. When iterating quickly on a project, Jib can save valuable time on each build by only pushing your changed layers to the registry instead of your whole application.
  3. Reproducible - Jib supports building container images declaratively from your Maven and Gradle build metadata, and as such can be configured to create reproducible build images as long as your inputs remain the same.
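
The layering and reproducibility points above can be sketched conceptually: if each layer's digest depends only on its file paths and contents, unchanged layers produce the same digest and do not need to be pushed again. The code below is an illustration of that idea only, not Jib's implementation; the function names, the layer names and the in-memory "registry" set are assumptions.

```python
import hashlib


def layer_digest(files):
    """Digest of a layer given a dict of {path: bytes}.

    Paths are processed in sorted order, so the same inputs always
    produce the same digest regardless of insertion order.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def layers_to_push(layers, registry_digests):
    """Return names of layers whose digest is not already in the registry."""
    return [
        name
        for name, files in layers.items()
        if layer_digest(files) not in registry_digests
    ]
```

In this sketch, changing a single class file changes only the digest of the classes layer, so the dependencies and resources layers would be skipped on the next push.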

How to use Jib to containerize your application

Jib is available as plugins for Maven and Gradle and requires minimal configuration. Simply add the plugin to your build definition and configure the target image. If you are building to a private registry, make sure to configure Jib with credentials for your registry. The easiest way to do this is to use credential helpers like docker-credential-gcr. Jib also provides additional rules for building an image to a Docker daemon if you need it.

Jib on Maven
# Builds to a container image registry.
$ mvn compile jib:build
# Builds to a Docker daemon.
$ mvn compile jib:dockerBuild
Jib on Gradle
plugins {
  id 'com.google.cloud.tools.jib' version '0.9.0'
}
jib.to.image = 'gcr.io/my-project/image-built-with-jib'
# Builds to a container image registry.
$ gradle jib
# Builds to a Docker daemon.
$ gradle jibDockerBuild

We want everyone to use Jib to simplify and accelerate their Java development. Jib works with most cloud providers; try it out and let us know what you think at github.com/GoogleContainerTools/jib.

Announcing MongoDB Atlas free tier on GCP

Earlier this year, in response to strong customer demand, we announced that we were expanding region support for MongoDB Atlas. The MongoDB NoSQL database is hugely popular, and the MongoDB Atlas cloud version makes it easy to manage on Google Cloud Platform (GCP). We heard great feedback from users, so we’re further lowering the barrier to get started on MongoDB Atlas and GCP.

We’re pleased to announce that as of today, MongoDB will offer a free tier of MongoDB Atlas on GCP in three supported regions, strategically located in North America, Europe and Asia Pacific in recognition of our wide user install base.

The free tier gives developers a no-cost sandbox environment for MongoDB Atlas on GCP. You can test any potential MongoDB workloads on the free tier and decide to upgrade to a larger paid Atlas cluster once you have confidence in our cloud products and performance.

As of today, these specific regions are supported by the Atlas free tier:
  1. Iowa (us-central1)
  2. Belgium (europe-west1)
  3. Singapore (asia-southeast1)
To get started, you’ll just need to log in to your MongoDB console, select “Build a New Cluster,” pick “Google Cloud Platform,” and look for the “Free Tier Available” message. The free tier utilizes MongoDB’s M0 instances. An M0 cluster is a sandbox MongoDB environment for prototyping and early development with 512MB of storage space. It also comes with strong enterprise features such as always-on authentication, end-to-end encryption and high availability, as well as monitoring. Happy experimenting!

Bust a move with Transfer Appliance, now generally available in U.S.

As we celebrate the upcoming Los Angeles Google Cloud Platform (GCP) region in one of the creative centers of the world, we are excited to share news about a product that can help you get your data there as fast as possible. Google Transfer Appliance is now generally available in the U.S., with a few new features that will simplify moving data to Google Cloud Storage. Customers have been using Transfer Appliance for almost a year, and we’ve heard great feedback.

The Transfer Appliance is a high-capacity server that lets you transfer large amounts of data to GCP, quickly and securely. It’s recommended if you’re moving more than 20TB of data, or data that would take more than a week to upload.
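
As a rough way to apply the rule of thumb above, the following sketch estimates how long an upload would take over a given link and checks it against the two thresholds mentioned (more than 20TB, or more than a week). The function names and the 80% link utilization default are assumptions; real throughput depends on your network.

```python
def upload_days(data_tb, bandwidth_mbps, utilization=0.8):
    """Estimate the days needed to upload data_tb terabytes.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s).
    utilization is the assumed fraction of the link available for the upload.
    """
    if bandwidth_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("bandwidth must be positive and utilization in (0, 1]")
    bits = data_tb * 10**12 * 8
    seconds = bits / (bandwidth_mbps * 10**6 * utilization)
    return seconds / 86400


def appliance_recommended(data_tb, bandwidth_mbps, utilization=0.8):
    """Apply the guideline: more than 20TB, or more than a week to upload."""
    return data_tb > 20 or upload_days(data_tb, bandwidth_mbps, utilization) > 7
```

For example, under these assumptions 10TB over a 100 Mbps link takes roughly 11.6 days, which is past the one-week mark, while the same data over a 1 Gbps link takes a little over a day.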

You can now request a Transfer Appliance directly from your Google Cloud Platform console. Indicate the amount of data you’re looking to transfer, and our team will help you choose the version that is the best fit for your needs.

The service comes in two configurations: 100TB or 480TB of raw storage capacity. We typically see data compression of about 2x, which effectively doubles the raw capacity. The 100TB model is priced at $300, plus express shipping (approximately $500); the 480TB model is priced at $1,800, plus shipping (approximately $900).
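
To compare the two configurations, here is a small sketch that works out the cost per terabyte from the prices quoted above. The shipping figures are the approximate values from the text, and the function and dictionary names are illustrative assumptions.

```python
# Prices in US dollars as quoted above; shipping values are approximate.
MODELS = {
    "100TB": {"raw_tb": 100, "fee": 300, "shipping": 500},
    "480TB": {"raw_tb": 480, "fee": 1800, "shipping": 900},
}


def cost_per_tb(model, compression_ratio=1.0):
    """Return total cost (fee plus shipping) per terabyte of data moved.

    compression_ratio scales the raw capacity; 2.0 models the typical
    2x compression mentioned above.
    """
    spec = MODELS[model]
    effective_tb = spec["raw_tb"] * compression_ratio
    return (spec["fee"] + spec["shipping"]) / effective_tb
```

With no compression this gives $8.00 per TB for the 100TB model and about $5.63 per TB for the 480TB model; at 2x compression those figures halve.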

You can mount Transfer Appliance as an NFS volume, making it easy to drag and drop files, or rsync, from your current NAS to the appliance. This feature simplifies the transfer of file-based content to Cloud Storage, and helps our migration partners expedite the move for customers.
"SADA Systems provides expert cloud consultation and technical services, helping customers get the most out of their Google Cloud investment. We found Transfer Appliance helps us transition the customer to the cloud faster and more efficiently by providing a secure data transfer strategy."
-Simon Margolis, Director of Cloud Platform, SADA Systems
Transfer Appliance can also help you transition your backup workflow to the cloud quickly. To do that, move the bulk of your current backup data offline using Transfer Appliance, and then incrementally back up to GCP over the network from there. Partners like Commvault can help you do this.

With this release, you’ll also find a more visible end-to-end integrity check, so you can be confident that every bit was transferred as is, and have peace of mind in deleting source data.

Transfer Appliance in action

In developing Transfer Appliance, we built a device designed for the data center, so it slides into a standard 19” rack. That has been a positive experience for our early customers, even those with floating data centers (yes, actually floating; see below for more).

We’ve seen our customers successfully use Transfer Appliance for the following use cases:
  • Migrate your data center (or parts of it) to the cloud.
  • Kick-start your ML or analytics project by transferring test data and staging it quickly.
  • Move large archives of content like creative libraries, videos, images, regulatory or backup data to Cloud Storage.
  • Collect data from research bodies or data providers and move it to Google Cloud for analysis.
We’ve heard about lots of innovative, interesting data projects powered by Transfer Appliance. Here are a few of them.

One early adopter, Schmidt Ocean Institute, is a private non-profit foundation that combines advanced science with state-of-the-art technology to achieve lasting results in ocean research. Their goals are to catalyze sharing of information and to communicate this knowledge to audiences around the world. For example, the Schmidt Ocean Institute owns and operates research vessel Falkor, the first oceanographic research vessel with a high-performance cloud computing system installed onboard. Scientists run models and software and can plan missions in near-real time while at sea. With the state-of-the-art technologies onboard, scientists contribute scientific data to the oceanographic community at large, very quickly. Schmidt Ocean Institute uses Transfer Appliance to safely get the data back to shore and publicly available to the research community as fast as possible.

“We needed a way to simplify the manual and complex process of copying, transporting and mailing hard drives of research data, as well as making it available to the scientific community as quickly as possible. We are able to mount the Transfer Appliance onboard to store the large amounts of data that result from our research expeditions and easily transfer it to Google Cloud Storage post-cruise. Once the data is in Google Cloud Storage, it’s easy to disseminate research data quickly to the community.”
-Allison Miller, Research Program Manager, Schmidt Ocean Institute

Beatport, a division of LiveStyle, serves an audience of electronic music DJs, producers and their fans. Google Transfer Appliance afforded Beatport the opportunity to rethink their storage architecture in the cloud without affecting their customer-facing network in the process.

“DJs, music producers and fans all rely on Beatport as the home for the world’s electronic music. By moving our library to Google Cloud Storage, we can access our audio data with the advanced tools that Google Cloud Platform has to offer. Managing tens of millions of lossless quality files poses unique challenges. Migrating to the highly performant Cloud Storage puts our wealth of audio data instantly at the fingertips of our technology team. Transfer Appliance made that move easier for our team.”
-Jonathan Steffen, CIO, Beatport
Eleven Inc. creates content, brand experiences and customer activation strategies for clients across the globe. Through years of work for their clients, Eleven built a large library of creative digital assets and wanted a way to cost-effectively store that data in the cloud. Facing ISP network constraints and a desire to free up space on their local asset server quickly, Eleven Inc. used Transfer Appliance to facilitate their migration.

“Working with Transfer Appliance was a smooth experience. Rack, capture and ship. And now that our creative library is in Google Cloud Storage, it's much easier to think about ways to more efficiently manage the data throughout its life-cycle.”
-Joe Mitchell, Director of Information Systems
amplified ai combines extensive IP industry experience with deep learning to offer instant patent intelligence to inventors and attorneys. This requires a lot of patent data for building models. Transfer Appliance helped amplified ai move TBs of this specialized essential data to the cloud quickly.

“My hands are already full building deep learning models on massive, disparate data without also needing to worry about physically moving data around. Transfer Appliance was easy to understand, easy to install, and made it easy to capture and transfer data. It just did what it was supposed to do and saved me time which, for a busy startup, is the most valuable asset.”
-Chris Grainger, Founder & CTO, amplified ai

Airbus Defence and Space Geo Inc. uses their exclusive access to radar and optical satellites to offer a stunning library of Earth observation images. As part of a major cloud migration effort, Airbus moved hundreds of TBs of this data to the cloud with Transfer Appliance so they can better serve images to clients from Cloud Storage. The migration also gave them the opportunity to improve data quality along the way.

“We needed to liberate. To flex on demand and scale in the cloud, and unleash our creativity. Transfer Appliance was a catalyst for that. In addition to migrating an amount of data that would not have been possible over the network, this transfer gave us the opportunity to improve our storage in the process—to clean out the clutter.”
-Dave Wright, CTO, Airbus Defense and Space Geo Inc.

National Collegiate Sports Archives (NCSA) is the creator and owner of the VAULT, which contains years' worth of college sports footage. NCSA digitizes archival sports footage from leading schools and delivers it via mobile, advertising and social media platforms. With a lot of precious footage to deliver to college sports fans around the globe, NCSA needed a way to move data into Google Cloud Platform quickly and with zero disruption for their users.

“With a huge archive of collegiate sports moments, we wanted to get that content into the cloud and do it in a way that provides value to the business. I was looking for a solution that would cost-effectively, simply and safely execute the transfer and let our teams focus on improving the experience for our users. Transfer Appliance made it simple to capture data in our data center and ship it to Google Cloud.”
-Jody Smith, Technology Lead, NCSA

Tackle your data migration needs with Transfer Appliance

To get detailed information on Transfer Appliance, check out our documentation. And visit our Data Transfer page to learn more about our other cloud data transfer options.

We’re looking forward to bringing Transfer Appliance to regions outside of the U.S. in the coming months. But we need your help: Where should we deploy first? If you are interested in offline data transfer but not located in the U.S., please indicate so in the request form.

If you’re interested in learning more about cloud data migration strategies, check out this session at Next 2018 next month. For more information, and to register, visit the Next ‘18 website.

Building scalable web applications with Cloud Datastore — new solution

If you manage database systems for large web applications, your job can be quite challenging. When unforeseen situations arise, making configuration changes can be complex and risky due to the stateful nature of those database systems. And before launching a new application, you have to do a lot of capacity planning, such as the number of virtual machines (VMs), the amount of disk storage, and the optimal network configuration, while contending with unknown factors such as the volume and frequency of open database connections and evolving usage patterns. You also need to do regular maintenance work to upgrade database software and scale resources to meet growing demand.

All of this planning and maintenance takes time, money, and attention away from developing new application features, so it is important to find a balance between provisioning enough resources to handle heavy loads and overspending on unused resources.

Cloud Datastore can help minimize these challenges by providing a scalable, highly available, high-performance, and fully-managed NoSQL database system.

We recently published an article that presents an overview of how to build large web applications with Cloud Datastore. The article includes scenarios of full-fledged web applications that use Cloud Datastore jointly with other products in the Google Cloud Platform (GCP) ecosystem.

Check out the article for all the details and next steps for building your own scalable solutions using Cloud Datastore!

What DBAs need to know about Cloud Spanner, part 1: Keys and indexes

Cloud Spanner is a relational and horizontally scalable database service, built from a cloud/distributed design perspective. It brings efficiency and high availability for developers and database administrators (DBAs), and differs structurally from typical databases you’re used to. In this blog series, we’ll explore some of the key differences that DBAs and developers will encounter as you migrate from traditional vertically-scaling (scale-up) relational database management systems (RDBMS) and move to Cloud Spanner. We will discuss some of the dos-and-don'ts, best practices and why things are different in Cloud Spanner.

In this series, we will explore a range of topics, including:
  • Selection of keys and use of indexes
  • How to approach business logic
  • Importing and exporting data
  • Migrating from your existing RDBMS
  • Optimizing performance
  • Access control and logging
You’ll gain an understanding of how to best use Cloud Spanner and release its potential to achieve linearly scalable performance over massive databases. In this first installment, let’s start with a closer look at how the concepts of keys and indexes work in Cloud Spanner.

Choosing keys in Cloud Spanner

Just like in other databases, the choice of key is vitally important to optimize the performance of the database. It’s even more important in Cloud Spanner, due to the way its mechanisms distribute database load. Unlike traditional RDBMS, you’ll need to take care when choosing the primary keys for the tables and choosing which columns to index.

Using well-distributed keys results in a table whose size and performance can scale linearly with the number of Cloud Spanner nodes, while using poorly distributed keys can result in hotspots, where a single node is responsible for the majority of reads and writes to the table.

In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.

Cloud Spanner is a distributed system, with many nodes reading and writing to the database at any one time. However, to achieve scalability, along with global ACID transactions and strong consistency, only one node at any one time can have write responsibility for a given row.

Cloud Spanner distributes management of rows across multiple nodes by breaking up each table into several splits, using ranges of the lexicographically sorted primary key.

This enables Cloud Spanner to achieve high availability and scalability, but it also means that using any continuously increasing or decreasing sequence as a key is detrimental to performance. To explain why, let’s explore how Cloud Spanner creates and manages its table splits.

Table splits and key choice

Cloud Spanner manages splits using Paxos (you can learn more about how in this detailed documentation: Life of Cloud Spanner Reads & Writes and Spanner: Google's Globally Distributed Database). In a regional instance of Cloud Spanner, the responsibility for reading/writing each split is distributed across a group of three nodes, one in each of the three availability zones of the Cloud Spanner instance.

One node in this group of three is elected Split Leader and manages the writes and locks for all the rows in the split. All three nodes in the group can perform reads.

To create a visual example, consider a table with 600 rows that uses a simple, continuous, monotonically increasing integer key (as is common in traditional RDBMS), broken into six splits, running on a two-node (per zone) Cloud Spanner instance. In an ideal situation, the table will have six splits, the leaders of which will be the six individual nodes available in the instance.

This distribution would result in ideal performance when reading/updating the rows, provided that the reads and updates are evenly distributed across the key range.

Problems with hotspots

The problem arises when new rows are appended to the database. Every new row will have an incrementing ID and will be added to the last split, meaning that out of the six available nodes, only one node will be handling all of the writes. In the example above, node 2c would be handling all writes. This node then becomes a hotspot, limiting the overall write performance of the database. In addition, the row distribution becomes unbalanced, with the last split becoming significantly larger, so it’s then handling more row reads than the others.
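
The effect can be illustrated with a minimal Python sketch. It assumes fixed split boundaries (real Cloud Spanner splits move over time) and the names `split_for_key`, `writes_per_split` and `SPLIT_STARTS` are hypothetical, introduced only for this illustration:

```python
import bisect
import random
from collections import Counter


def split_for_key(key, split_starts):
    """Return the index of the split whose key range contains `key`.

    `split_starts` is the sorted list of each split's first key, a
    simplified stand-in for range-based table splits.
    """
    return bisect.bisect_right(split_starts, key) - 1


def writes_per_split(new_keys, split_starts):
    """Count how many of `new_keys` land in each split."""
    counts = Counter(split_for_key(k, split_starts) for k in new_keys)
    return [counts.get(i, 0) for i in range(len(split_starts))]


# Six splits of 100 rows each over keys 0..599, as in the 600-row example.
SPLIT_STARTS = [0, 100, 200, 300, 400, 500]

# Appending with an incrementing ID: every new key sorts after 599.
sequential = writes_per_split(range(600, 700), SPLIT_STARTS)

# Appending with keys drawn at random from the same key range.
rng = random.Random(0)
scattered = writes_per_split(
    [rng.randrange(600) for _ in range(100)], SPLIT_STARTS
)

print(sequential)  # [0, 0, 0, 0, 0, 100]
print(scattered)   # writes spread over the six splits
```

All 100 sequential appends land in the last split, while the randomly keyed appends are shared among the splits.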

Cloud Spanner does try to compensate for uneven load by adding and removing splits in the background according to read and write load, and by creating a new split once the split size crosses a set threshold. However, in a frequently appended table, this will not happen quickly enough to avoid creating a hotspot.

Along with monotonically increasing or decreasing keys, this issue also affects tables that are indexed by any deterministic key—for example, an increasing timestamp in an event log table. Timestamp-keyed tables are also more likely to have a read hotspot because, in most cases, recently timestamped rows are accessed more frequently than the others. (Check out Cloud Spanner — Choosing the right primary keys for detailed information on detecting and avoiding hotspots.)

Problems with sequence generators

The concept of sequence generators, or lack thereof, is an important area to explore further. Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.

One workaround could be that the key is generated by the application (for example, by storing the next key value in a separate table in the database, or by getting the current maximum key value from the table). However, you’ll run into the same performance problems. Consider that as the application is also likely to be distributed, there may be multiple database clients trying to append a row at the same time, with two potential results depending on how the new key is generated:
  • If the SELECT for the existing key is performed in the transaction, one application instance trying to append would block all other application instances trying to append due to row locking.
  • If the SELECT for the existing key is done outside of the transaction, then there is a race between each of the application instances trying to append the new row. One will succeed, while others would have to retry (including generating a new key) after the append fails, since the key already exists.

What makes a good key

So if sequential keys will limit database performance in Cloud Spanner, what’s a good key to use? Ideally, the high-order bits should be evenly and semi-randomly distributed when keys are generated.

One simple way to generate such a key is to use random numbers, such as a random universally unique identifier (UUID). Note that there are several versions of UUID. Versions 1 and 2 use deterministic prefixes, such as timestamps or MAC addresses. Ensure that the UUID generation method you use is truly randomly distributed (i.e., version 4), at least over the higher-order bytes. This ensures that the keys are evenly distributed across the keyspace, and hence that the load is distributed evenly over the Cloud Spanner nodes.
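
As a minimal sketch, Python's standard `uuid` module can produce a version 4 UUID. The helper name `new_row_key` is hypothetical, and storing the key as a string is an assumption:

```python
import uuid


def new_row_key() -> str:
    """Return a random (version 4) UUID string to use as a primary key."""
    return str(uuid.uuid4())


key = new_row_key()
# Parsing the key back confirms which UUID version was generated.
print(key, uuid.UUID(key).version)
```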

Although another approach might be to use some real-world attributes of the data that are immutable and evenly distributed over the key range, this is quite a challenge since most uniformly distributed attributes are discrete and not continuous. For example, the random result of a dice roll is uniformly distributed and has six finite values. A continuous distribution could rely on an irrational number, for example π.

What if I really need an integer sequence as a key?

Though it’s not recommended, in some circumstances an integer sequence key is necessary, either for legacy or for external reasons, e.g., an employee ID.

To use an integer sequence key, you’ll first need a sequence generator that’s safe across a distributed system. One way of doing this is to have a table in Cloud Spanner contain a row for each required sequence that contains the next value in the sequence—so it would look something like this:
CREATE TABLE Sequences (
     Sequence_ID STRING(MAX) NOT NULL, -- The name of the sequence
     Next_Value INT64 NOT NULL
) PRIMARY KEY (Sequence_ID)
When a new ID value is required, the next value of the sequence is read, incremented and updated in the same transaction as the insert for the new row.

Note that this will limit performance when many rows are inserted, as each insert will block all other inserts due to the update of the Sequences table that we created above.

This performance issue can be reduced—though at the cost of possible gaps in the sequence—if each application instance reserves a block of, for example, 100 sequence values at once by incrementing Next_Value by 100, and then manages issuing individual IDs from that block internally.
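
The block-reservation idea might look like the following Python sketch. `InMemorySequences` is a hypothetical in-memory stand-in for the Sequences table (a lock plays the role of the read-increment-update transaction), and `BlockAllocator` is a hypothetical per-instance helper. Neither is part of any Cloud Spanner client library.

```python
import threading


class InMemorySequences:
    """Stand-in for the Sequences table; the lock mimics a transaction."""

    def __init__(self):
        self._next_value = {}
        self._lock = threading.Lock()

    def reserve(self, sequence_id: str, block_size: int) -> int:
        """Read Next_Value, advance it by block_size, return the old value."""
        with self._lock:
            start = self._next_value.get(sequence_id, 1)
            self._next_value[sequence_id] = start + block_size
            return start


class BlockAllocator:
    """Issues IDs locally and touches the store only once per block."""

    def __init__(self, store, sequence_id, block_size=100):
        self._store = store
        self._sequence_id = sequence_id
        self._block_size = block_size
        self._next = 0
        self._end = 0  # exclusive; an empty block forces a first reservation
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:
                self._next = self._store.reserve(
                    self._sequence_id, self._block_size
                )
                self._end = self._next + self._block_size
            value = self._next
            self._next += 1
            return value


store = InMemorySequences()
instance_a = BlockAllocator(store, "employee_id")
instance_b = BlockAllocator(store, "employee_id")
print(instance_a.next_id(), instance_b.next_id(), instance_a.next_id())  # 1 101 2
```

Each instance contends on the shared sequence row once per 100 IDs instead of once per insert. IDs left unused in a block when an instance stops are the gaps mentioned above.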

In the table using the sequence, the key cannot simply be the numeric sequence value itself, as that will cause the last split to become a hotspot (as explained previously). So the application must generate a complex key that randomly distributes the rows among the splits.

This is known as application-level sharding and is achieved by prefixing the sequential ID with an additional column containing a value that’s evenly distributed among the key space—e.g., a hash of the original ID, or bit-reversing the ID. That looks something like this:
CREATE TABLE Table1 (
     Hashed_Id INT64 NOT NULL,
     Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Hashed_Id, Id)
Even a simple CRC32 (cyclic redundancy check) checksum is good enough to provide a suitably pseudo-random Hashed_Id. It does not have to be secure, just enough to randomize the row order of the sequentially numbered keys.
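
A minimal Python sketch of both options follows. It assumes the ID is hashed over its 8-byte big-endian encoding and that bit reversal is done over 63 bits so the result stays within a non-negative INT64. Both choices are assumptions for illustration, and the function names are hypothetical.

```python
import zlib


def hashed_id(row_id: int) -> int:
    """CRC32 of the ID's 8-byte big-endian encoding (encoding is assumed)."""
    return zlib.crc32(row_id.to_bytes(8, "big"))


def bit_reversed_id(row_id: int, width: int = 63) -> int:
    """Reverse the lowest `width` bits; 63 keeps the value non-negative."""
    if not 0 <= row_id < (1 << width):
        raise ValueError("row_id out of range for the chosen width")
    return int(format(row_id, f"0{width}b")[::-1], 2)


# Sequential IDs 1..10, listed in (Hashed_Id, Id) primary key order.
rows = sorted((hashed_id(i), i) for i in range(1, 11))
for h, i in rows:
    print(f"0x{h:08X}  {i}")
```

Because the hash is computed from the ID alone, any client can recompute Hashed_Id when it needs to read a row by ID.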

Note that whenever a row is read directly, both the ID and Hashed_Id must be specified to prevent a table scan, as in this example:
SELECT * FROM Table1 t1
WHERE t1.Hashed_Id = 0xDEADBEEF
      AND t1.Id = 1234
Similarly, whenever this table is joined with other tables in the query by Id, the join must also use both the ID and the Hashed_Id. Otherwise, you’ll lose performance, since a table scan will be required to find the row. This means that the table that references the ID must also include the Hashed_Id, like this:
CREATE TABLE Table2 (
     Id STRING(MAX),  -- UUID
     Table1_Hashed_Id INT64 NOT NULL,
     Table1_Id INT64 NOT NULL,
     -- other columns with data values follow....
) PRIMARY KEY (Id)

SELECT * from Table2 t2 INNER JOIN Table1 t1 
     ON t1.Hashed_Id = t2.Table1_Hashed_Id
     AND t1.Id = t2.Table1_Id
WHERE ... -- some criteria

What if I really need to use a timestamp as a key?

In many cases, the row using the timestamp as a key also refers to some other table data. For example, the transactions on a bank account will refer to the source account. In this case, assuming that the source account number is already reasonably evenly distributed, you can use a complex key containing the account number first and then the timestamp:
CREATE TABLE Transactions (
     account_number INT64 NOT NULL,
     timestamp TIMESTAMP NOT NULL,
     transaction_info ...,
) PRIMARY KEY (account_number, timestamp DESC)
The splits will be made primarily using the account number and not the timestamp, thus distributing the newly added rows over various splits.

Note that in this table, the timestamp is sorted by descending order. That’s because in most cases you want to read the most recent transactions—which will be first in the table—so you won’t need to scan through the entire table to find the most recent rows.
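
The resulting row order can be mimicked in a short Python sketch. The sample rows and the helpers `key_order` and `utc` are hypothetical, and an in-memory sort stands in for the table's key order:

```python
from datetime import datetime, timezone


def key_order(row):
    """Sort key mimicking PRIMARY KEY (account_number, timestamp DESC)."""
    account_number, ts = row
    return (account_number, -ts.timestamp())


def utc(day, hour):
    """Build a UTC timestamp in January 2018 for the sample rows."""
    return datetime(2018, 1, day, hour, tzinfo=timezone.utc)


rows = [(42, utc(1, 9)), (7, utc(3, 12)), (42, utc(2, 15)), (7, utc(1, 8))]
stored = sorted(rows, key=key_order)

# The first row found for an account is its most recent transaction.
latest_for_42 = next(r for r in stored if r[0] == 42)
print(latest_for_42)
```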

If you do not have, or cannot use, an external reference or any other data in the key to distribute the rows, then you will need to perform application-level sharding, as shown in the integer sequence example above.

Note, however, that using a simple hash will make queries by timestamp range extremely slow, since retrieving a range of timestamps will require a full table scan to cover all the hashes. Instead, we recommend generating a ShardId from the timestamp. So, for example,
TimestampShardId = CRC32(Timestamp) % 100
will return a pseudo-random value between 0 and 99 from the timestamp. Then, you can use this ShardId in the table key so that sequential timestamps are distributed across multiple splits, like so:
CREATE TABLE Events (
     TimestampShardId INT64 NOT NULL,
     Timestamp TIMESTAMP NOT NULL,
) PRIMARY KEY (TimestampShardId, Timestamp DESC)
For example, a table with dates of the first 10 days of 2018 (which without ShardId would be stored in the table in date order) will give the following ordering:
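
The exact order depends on how the timestamp is hashed. As a minimal sketch, assuming the CRC32 is taken over the ISO-8601 text of each date (an assumption; the helper names are hypothetical), the ordering can be printed with:

```python
import zlib
from datetime import date

NUM_SHARDS = 100


def timestamp_shard_id(timestamp_text: str) -> int:
    """CRC32 of the timestamp text, reduced to a shard number 0..99."""
    return zlib.crc32(timestamp_text.encode("utf-8")) % NUM_SHARDS


def storage_order(timestamps):
    """Order rows as PRIMARY KEY (TimestampShardId, Timestamp DESC) would."""
    by_time_desc = sorted(timestamps, reverse=True)
    keyed = [(timestamp_shard_id(ts), ts) for ts in by_time_desc]
    # sorted() is stable, so timestamps stay descending within each shard.
    return sorted(keyed, key=lambda row: row[0])


days = [date(2018, 1, d).isoformat() for d in range(1, 11)]
for shard_id, ts in storage_order(days):
    print(f"{shard_id:2d}  {ts}")
```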

When a query is made, you must use a BETWEEN clause to be able to select across all shards without performing a table scan:
SELECT * FROM Events
WHERE TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
Note that the ShardId is only a way of improving key distribution so that Cloud Spanner can use multiple splits to store sequential timestamps. It does not identify an actual database split, and rows in different tables with the same ShardId may well be in different splits.

Migration implications

When you’re migrating from an existing RDBMS that uses keys that are not optimal for Cloud Spanner, take the above considerations into account. If necessary, add key hashes to tables or change the key ordering.

Deciding on indexes in Cloud Spanner

In a traditional RDBMS, indexes are very efficient ways of looking up rows in a table by a value that is not the primary key. Under most circumstances, a row lookup via an index will take approximately the same time as a row lookup via its key. That’s because the table and the index are managed by a single node, so the index can point directly to the on-disk row of the table.

In Cloud Spanner, indexes are actually implemented using tables, which allows them to be distributed and enables the same degree of scalability and performance as normal tables.

However, because of this type of implementation, using indexes to read the data from the table row is less efficient than in a traditional RDBMS. It’s effectively an inner join with the original table, so reading from a table using an indexed key turns into this process:
  • Look up split for index key
  • Read index row from split to get table key
  • Look up split for table key
  • Read table row from split to get row values
  • Return row values
Note that there is no guarantee that the split for the index key is on the same node as the split for the table key, so a simple index query may require cross-node communication just to read one row.
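
The two-hop read can be sketched in Python, with dictionaries standing in for the index and table splits. `CountingTable` and `read_via_index` are hypothetical names for this illustration, not Cloud Spanner APIs:

```python
class CountingTable:
    """A keyed table that counts the row reads it serves."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self._rows[key]


def read_via_index(index, base, index_key):
    """Index row yields the table key; table row yields the values."""
    table_key = index.read(index_key)
    return base.read(table_key)


base = CountingTable({
    ("user-a", 1): {"payload": "first"},
    ("user-b", 2): {"payload": "second"},
})
# The index is itself a table: indexed value -> primary key of the base row.
index = CountingTable({1: ("user-a", 1), 2: ("user-b", 2)})

row = read_via_index(index, base, 2)
print(row, index.reads + base.reads)  # {'payload': 'second'} 2
```

One logical lookup costs two row reads, and in a distributed setting each read may be served by a different node.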

Similarly, updating an indexed table will most likely require a multi-node write to update the table row and the index row. So using an index in Cloud Spanner is always a trade-off between improved read performance and reduced write performance.

Index keys and hotspots

Because indexes are implemented as tables in Cloud Spanner, you’ll encounter the same issues with the indexed columns as you did with the table keys: An index on a column with poorly distributed values (such as a timestamp) will lead to the creation of a hotspot, even if the underlying table is using well-distributed keys. That’s because when rows are appended to the table, the index will also have new rows appended, and writes for these new rows will always be sent to the same split.

Therefore, care must be taken when creating indexes, and we recommend that you create indexes only using columns which have a well-distributed set of values, just like when choosing a table key.

In some cases, you’ll need to do application-level sharding for the indexed columns in order to create a synthetic ShardId column, which can be used in the index to distribute values over the splits.

For example, the configuration below will create a hotspot when appending events due to the index, even if UserId is randomly distributed.
CREATE TABLE Events (
      UserId STRING(MAX),
      Timestamp TIMESTAMP,
) PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (Timestamp DESC);
As with a table keyed by timestamp only, a synthetic ShardId column will need to be added to the table, and then used as the first indexed column to help the distribution of the index among the splits.

A simple ShardId generator could be:
TimestampShardId = CRC32(Timestamp) % 100
which will give a hash value between 0 and 99 from the timestamp. You’ll need to add this to the original table as a new column, then use it as the first index key, like this:
CREATE TABLE Events (
     UserId STRING(MAX),
     Timestamp TIMESTAMP,
     TimestampShardId INT64,
) PRIMARY KEY (UserId, Timestamp DESC);

CREATE INDEX EventsByTimestamp ON Events (TimestampShardId,Timestamp);
This will remove the hotspot on index update, but will slow down range queries on timestamp, since you’ll have to run the query for each ShardId value (0-99) to get the timestamp range from all shards:
SELECT * FROM Events@{FORCE_INDEX=EventsByTimestamp}
WHERE TimestampShardId BETWEEN 0 AND 99
   AND Timestamp > @lower_bound
   AND Timestamp < @upper_bound;
When using this type of index and sharding strategy, you must strike a balance between the additional complexity when reading and the increase in performance of an indexed query.
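
Part of that read-side complexity is that rows come back grouped by shard rather than in global timestamp order. If the application needs them in time order, it can merge the per-shard results itself. The following Python sketch assumes each shard's rows are already sorted by ascending timestamp, as the index above would store them. `rows_by_shard`, `query_shard` and `query_time_range` are hypothetical names, and integers stand in for timestamps.

```python
import heapq


def query_shard(rows_by_shard, shard_id, lower, upper):
    """Timestamps of one shard with lower < ts < upper, in ascending order."""
    return [ts for ts in rows_by_shard.get(shard_id, []) if lower < ts < upper]


def query_time_range(rows_by_shard, lower, upper, num_shards=100):
    """Fan out over every shard, then merge the sorted per-shard results."""
    per_shard = (
        query_shard(rows_by_shard, shard_id, lower, upper)
        for shard_id in range(num_shards)
    )
    return list(heapq.merge(*per_shard))


rows_by_shard = {0: [1, 5, 9], 3: [2, 6], 99: [4]}
print(query_time_range(rows_by_shard, 1, 9))  # [2, 4, 5, 6]
```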

Other indexes you should know

When you’re migrating to Cloud Spanner, you’ll also want to understand how these other index types function and when you might need to use them:


NULL_FILTERED indexes

By default, Cloud Spanner will index rows that have NULL values in the indexed columns. A NULL is considered to be the smallest possible value, so these values will appear at the start of the index.

It’s also possible to disable this behavior by using the CREATE NULL_FILTERED INDEX syntax, which will create an index ignoring rows with NULL indexed column values.

This index will be smaller than the complete index, as it will effectively be a materialized filtered view on the table, and will be faster to query than the full table when a table scan is necessary.

UNIQUE indexes

You can use a UNIQUE index to enforce that a column of a table has unique values. This constraint will be applied at transaction commit time (and at index creation).

Covering Indexes and STORING clause

To optimize performance when reading from indexes, Cloud Spanner can store the column values of the table row in the index itself, removing the need to read the table. This is known as a Covering Index. This is achieved by using the STORING clause when defining the index. The values of the column can then be read directly from the index, so reading from the index performs as well as reading from the table. For example, this table contains employee data:
CREATE TABLE Employees (
      CompanyUUID INT64,
      EmployeeUUID INT64,
      FullName STRING(MAX)
) PRIMARY KEY (CompanyUUID,EmployeeUUID)
If you often need to look up an employee’s full name, for example, you can create an index on EmployeeUUID, storing the full name for rapid lookups:
CREATE INDEX EmployeesById
      ON Employees (EmployeeUUID)
      STORING (FullName);

Forcing index use

Cloud Spanner’s query engine will only automatically use indexes in rare circumstances (when it is a query fully covered by the index), so it is important to use a FORCE_INDEX directive in the SQL SELECT statement to ensure that Cloud Spanner looks up values from the index. (You can find more details in the documentation.)
Select * 
from  Employees@{FORCE_INDEX=EmployeesById}
Where EmployeeUUID=xxx;
Note that when using the Cloud Spanner Read APIs, you can only perform fully covered queries, i.e., queries where the index stores all of the columns requested. To read the columns from the original table using an index, you must use an SQL query. See the Use a Secondary Index section of the Getting Started docs for examples.

Continuing your Cloud Spanner education

There are some big conceptual differences when you’re using a cloud-built, horizontally scalable database like Cloud Spanner in contrast with the RDBMS you’ve been using for years. Once you’re familiar with the way keys and indexes work, you can start to take advantage of the benefits of Cloud Spanner for faster scalability.

In the next episode of this series, we will look at how to deal with business logic that would previously be implemented by triggers and stored procedures, neither of which exists in Cloud Spanner.

Want to learn more about Cloud Spanner in person? We’ll be discussing data migration and more in this session at Next 2018 in July. For more information, and to register, visit the Next ‘18 website.

Related content:

How we used Cloud Spanner to build our email personalization system—from “Soup” to nuts