Tag Archives: Storage & Databases

Cloud SQL for PostgreSQL now generally available and ready for your production workloads



Among open-source relational databases, PostgreSQL is one of the most popular—and the most sought-after by Google Cloud Platform (GCP) users. Today, we’re thrilled to announce that PostgreSQL is now generally available and fully supported for all customers on our Cloud SQL fully-managed database service.

Backed by Google’s 24x7 SRE team, high availability with automatic failover, and our SLA, Cloud SQL for PostgreSQL is ready for the demands of your production workloads. It’s built on the strength and reliability of Google Cloud’s infrastructure, scales to support critical workloads and automates all of your backups, replication, patches and updates while ensuring greater than 99.95% availability anywhere in the world. Cloud SQL lets you focus on your application, not your IT operations.

While Cloud SQL for PostgreSQL was in beta, we added high availability and replication, higher performance instances with up to 416GB of RAM, and support for 19 additional extensions. It also joined the Google Cloud Business Associates Agreement (BAA) for HIPAA-covered customers.

Cloud SQL for PostgreSQL runs standard PostgreSQL to maintain compatibility. And when we make improvements to PostgreSQL, we make them available for everyone by contributing to the open source community.

Throughout beta, thousands of customers from a variety of industries such as commercial real estate, satellite imagery, and online retail, deployed workloads on Cloud SQL for PostgreSQL. Here’s how one customer is using Cloud SQL for PostgreSQL to decentralize their data management and scale their business.

How OneMarket decentralizes data management with Cloud SQL


OneMarket is reshaping the way the world shops. Through the power of data, technology, and cross-industry collaboration, OneMarket’s goal is to create better end-to-end retail experiences for consumers.

Built out of Westfield Labs and Westfield Retail Solutions, OneMarket unites retailers, brands, venues and partners to facilitate collaboration on data insights and implement new technologies, such as natural language processing, artificial intelligence and augmented reality at scale.

To build the platform for a network of retailers, venues and technology partners, OneMarket selected GCP, citing its global locations and managed services such as Kubernetes Engine and Cloud SQL.
"I want to focus on business problems. My team uses managed services, like Cloud SQL for PostgreSQL, so we can focus on shipping better quality code and improve our time to market. If we had to worry about servers and systems, we would be spending a lot more time on important, but somewhat insignificant management tasks. As our CTO says, we don’t want to build the plumbing, we want to build the house." 
— Peter McInerney, Senior Director of Technical Operations at OneMarket 
OneMarket's platform is comprised of 15 microservices, each backed by one or more independent storage services. Cloud SQL for PostgreSQL backs each microservice with relational data requirements.

The OneMarket team employs a microservices architecture to develop, deploy and update parts of their platform quickly and safely. Each microservice is backed by an independent storage service. Cloud SQL for PostgreSQL instances back many of the platform’s 15 microservices, decentralizing data management and ensuring that each service is independently scalable.
 "I sometimes reflect on where we were with Westfield Digital in 2008 and 2009. The team was constantly in the datacenter to maintain servers and manage failed disks. Now, it is so easy to scale." 
— Peter McInerney 

Because the team was able to focus on data models rather than database management, developing the OneMarket platform proceeded smoothly and is now in production, reliably processing transactions for its global customers. Using BigQuery and Cloud SQL for PostgreSQL, OneMarket analyzes data and provides insights into consumer behavior and intent to retailers around the world.

Peter’s advice for companies evaluating cloud solutions like Cloud SQL for PostgreSQL: “You just have to give it a go. Pick a non-critical service and get it running in the cloud to begin building confidence.”

Getting started with Cloud SQL for PostgreSQL 


Connecting to a Google Cloud SQL database is the same as connecting to a PostgreSQL database—you use standard connectors and standard tools such as pg_dump to migrate data. If you need assistance, our partner ecosystem can help you get acquainted with Cloud SQL for PostgreSQL. To streamline data transfer, reach out to Google Cloud partners Alooma, Informatica, Segment, Stitch, Talend and Xplenty. For help with visualizing analytics data, try ChartIO, iCharts, Looker, Metabase, and Zoomdata.

Sign up for a $300 credit to try Cloud SQL and the rest of GCP. You can start with inexpensive micro instances for testing and development, and scale them up to serve performance-intensive applications when you’re ready.

Cloud SQL for PostgreSQL reaching general availability is a huge milestone and the best is still to come. Let us know what other features and capabilities you need with our Issue Tracker and by joining the Cloud SQL discussion group. We’re glad you’re along for the ride, and look forward to your feedback!

Improving the Google Cloud Storage backend for HashiCorp Vault



HashiCorp Vault is a powerful open source tool for secrets management, popular with many Google Cloud Platform (GCP) customers today. HashiCorp Vault provides "secret management as a service," acting as a static secret store for encrypted key-value pairs; a secret generation tool to dynamically generate on-the-fly credentials; and pass-through encryption service so that applications don’t need to roll their own encryption. Today, we're announcing exciting improvements to the existing Google Cloud Storage backend for HashiCorp Vault, including high availability.

As mentioned in our blog post announcing Google Cloud Spanner as a supported HashiCorp Vault storage backend, we strive to make Google Cloud an excellent platform on which to operationalize Vault for all users and use cases. Your feedback from the Cloud Spanner integration was overwhelmingly positive, but many of you are already leveraging the community-supported Cloud Storage backend and don’t want to migrate your existing data to a different storage system. GCP’s wealth of offerings let you choose the best storage options to meet your needs, and now you can choose from both Cloud Spanner and Cloud Storage for HashiCorp Vault storage backends.

The improved Cloud Storage HashiCorp Vault storage backend is completely backwards compatible with the existing solution, but includes a number of new features and benefits:
  • High availability - In addition to Cloud Storage's built-in multi-region architecture, the improved HashiCorp Vault storage backend also supports running Vault in "high availability" mode. By default, HashiCorp Vault runs as a single tenant, relying on the storage backend to provide distributed locking and leader election. By leveraging object metadata for read-modify-write conditions in Cloud Storage, the improved storage backend allows for a highly available Vault cluster with just a single line of configuration. You can read more about HashiCorp Vault's High Availability model in the documentation.
  • Support for default application credentials - Previously the Cloud Storage Vault storage backend required you to create a dedicated service account and credentials file. While you can still specify a credentials file, the storage backend now supports pulling default application credentials, such as those from your local gcloud installation or Application Default Credentials if you're running Vault on GCP.
  • Enterprise-grade security - Cloud Storage follows the same security best practices as other Google products. Objects stored in Cloud Storage are encrypted by default, and it uses IAM to provide granular permission management on buckets and folders. Google’s infrastructure has many security differentiators, including secure boot using Google’s custom-designed security chip Titan, and Google’s private network backbone.


Getting started


To get started, download and install the latest version of HashiCorp Vault. The improvements to the Cloud Storage backend for Vault, including high availability mode, were added in Vault 0.10 (released on April 10, 2018). Please ensure you're running Vault 0.10 or later before continuing.

Next, create a Cloud Storage bucket using the gsutil CLI tool (part of the gcloud CLI) to store the Vault data . You can also create the bucket using the web interface or API directly:

$ gsutil mb -c regional -l us-east4 gs://company-vault-data

In this example, we created a bucket named "company-vault-data." Note that Cloud Storage bucket names must be globally unique.

Next, create a Vault configuration file configured to use Cloud Storage as the storage backend:

# config.hcl
storage "gcs" {
  bucket = "company-vault-data"
}

Start Vault with the configuration file. Note that this example uses Vault's built-in development mode, which does not represent best practices or a production installation, but it's the fastest way to try the improved Cloud Storage storage backend for HashiCorp Vault. For more details on a production-grade Vault installation, please read the Vault production hardening guide.

$ export VAULT_ADDR=http://127.0.0.1:8200
$ sudo vault server -dev -config=config.hcl

During this process, Vault authenticates and connects to Cloud Storage to populate and manage objects in the provided bucket. After a few seconds, you can view the objects in the web interface and see that data has been populated.

You can now create, read, update and delete secrets:

$ vault kv write secret/my-secret foo=bar

To learn more about the backend configuration options, read the HashiCorp Vault Cloud Storage backend documentation. To learn more about Cloud Storage, check out the documentation.

Toward a seamless Vault experience on GCP


With the Cloud Spanner and Cloud Storage Vault storage backends, Vault users can choose which Google-supported storage backend is best for them. In addition to supporting our customers, we are delighted to continue our long-standing relationship with HashiCorp as part of our ongoing partnership. Be sure to follow us on Twitter and open a GitHub issue if you have any questions.

How to automatically scan Cloud Storage buckets for sensitive data: Taking charge of your security



Security in the cloud is often a matter of identifying—and sticking to—some simple best practices. A few months ago, we discussed some steps you can take to harden the security of your Cloud Storage buckets. We covered how to set up proper access permissions and provided tips on using tools like the Data Loss Prevention API to monitor your buckets for sensitive data. Today, let’s talk about how to automate data classification using the DLP API and Cloud Functions, Google Cloud Platform’s event-driven serverless compute platform that makes it easy for you to integrate and extend cloud services with code.

Imagine you need to regularly share data with a partner outside of your company, and this data cannot contain any sensitive elements such as Personally Identifiable Information (PII). You could just create a bucket, upload data to it, and grant access to your partner, but what if someone uploads the wrong file or doesn’t know that they aren’t supposed to upload PII? With the DLP API and Cloud Functions, you can automatically scan this data before it’s uploaded to the shared storage bucket.

Setting this up is easy: Simply create three buckets—one in which to upload data, one to share and one for any sensitive data that gets flagged. Then:

  1. Configure access appropriately so that relevant users can put data in the “upload” bucket 
  2. Write a Cloud Function triggered by an upload that scans the data using the DLP API 
  3. Based on any DLP findings, automatically move data into the share bucket or into a restricted bucket for further review.

To get you started, here’s a tutorial with detailed instructions including the Cloud Functions script. You can get it up and running in just a few minutes from the Cloud Console or via Cloud Shell. You can then easily modify the script for your environment, and add more advanced actions such as sending notification emails, creating a redacted copy or triggering approval workflows.

We hope we’ve showed you some proactive steps you can take to prevent sensitive data from getting into the wrong hands. To learn more, check out the documentation for the DLP API and Cloud Functions.

Best practices for securing your Google Cloud databases



If information is gold, the database is a treasure chest. Web applications store their most valuable data in a database, and lots of sites would cease to exist if their data were stolen or deleted. This post aims to give you a series of best practices to help protect and defend the databases you host on Google Cloud Platform (GCP).

Database security starts before the first record is ever stored. You must consider the impact on security as you design the hosting environment. From firewall rules to logging, there's a lot to think about both inside and outside the database.

First considerations


When it comes to database security, one of the fundamental things to consider is whether to deploy your own database servers or to use one of Google's managed storage services. This decision should be influenced heavily by your existing architecture, staff skills, policies and need for specialized functionality.

This post is not intended to sell you specific GCP products, but absent any overwhelming reasons to host your own database, we recommend using a managed version thereof. Managed database services are designed to work at Google scale, with all of the benefits of our security model. Organizations seeking compliance with PCI, SOX, HIPAA, GDPR, and other compliance regimes will appreciate the significant reduction in effort with a shared responsibility model. And even if these rules and regulations don't apply to your organization, I recommend following the PCI SAQ A (payment card industry self-assessment questionnaire type A) as a baseline set of best practices.

Access controls


You should limit access to the database itself as much as possible. Self-hosted databases should have VPC firewall rules locked down to only allow ingress from and egress to authorized hosts. All ports and endpoints not specifically required should be blocked. If possible, ensure changes to the firewall are logged and alerts are configured for unexpected changes. This happens automatically for firewall changes in GCP. Tools like Forseti Security can monitor and manage security configurations for both Google managed services and custom databases hosted on Google Compute Engine instances.

As you prepare to launch your database, you should also consider the environment in which it operates. Service accounts streamline authorization to Google databases using automatically rotating keys, and you can manage rotation for self-hosted databases in GCP using Cloud Key Management System (KMS).


Data security


Always keep your data retention policy in mind as you implement your schema. Sensitive data that you have no use for is a liability and should be archived or pruned. Many compliance regulations provide specific guidance (HIPAA, PCI, GDPR, SOX) to identify that data. You may find it helpful to operate under the pessimistic security model in which you assume your application will be cracked and your database will be exfiltrated. This can help clarify some of the decisions you need to make regarding retention, encryption at rest, etc.

Should the worst happen and your database is compromised, you should receive alerts about unexpected behavior such as spikes in egress traffic. Your organization may also benefit from using "canary" data—specially crafted information that should never be seen in the real world by your application under normal circumstances. Your application should be designed to detect a canary account logging in or canary values transiting the network. If found, your application should send you alerts and take immediate action to stem the possible compromise. In a way, canary data is similar to retail store anti-theft tags. These security devices have known detection characteristics and can be hidden inside a product. A security-conscious retailer will set up sensors at their store exists to detect unauthorized removal of inventory.

Of course, you should develop and test a disaster recovery plan. Having a copy of the database protects you from some failures, but it won't protect you if your data has been altered or deleted. A good disaster recovery plan will cover you in the event of data loss, hardware issues, network availability and any other disaster you might expect. And as always, you must regularly test and monitor the backup system to ensure reliable disaster recovery.

Configuration


If your database was deployed with a default login, you should make it a top priority to change or disable that account. Further, if any of your database accounts are password-protected, make sure those passwords are long and complex; don’t use simple or empty passwords used under any circumstances. If you're able to use Cloud KMS, that should be your first choice. Beyond that, be sure to develop a schedule for credential rotation and define criteria for out-of-cycle rekeying.

Regardless of which method you use for authentication, you should have different credentials for read, write and admin-level privileges. Even if an application performs both read and write operations, separate credentials can limit the damage caused by bad code or unauthorized access.

Everyone who needs to access the database should have their own private credentials. Create service accounts for each discrete application with only the permissions required for that service. Cloud Identity and Access Management is a powerful tool for managing user privileges in GCP; generic administrator accounts should be avoided as they mask the identity of the user. User credentials should restrict rights to the minimum required to perform their duties. For example, a user that creates ad-hoc reports should not be able to alter schema. Consider using views, stored procedures and granular permissions to further restrict access to only what a user needs to know and further mitigate SQL injection vulnerabilities.

Logging is a critical part of any application. Databases should produce logs for all key events, especially login attempts and administrative actions. These logs should be aggregated in an immutable logging service hosted apart from the database. Whether you're using Stackdriver or some other service, credentials and read access to the logs should be completely separate from the database credentials.


Whenever possible, you should implement a monitor or proxy that can detect and block brute force login attempts. If you’re using Stackdriver, you could set up an alerting policy to a Cloud Function webhook that keeps track of recent attempts and creates a firewall rule to block potential abuse.

The database should run as an application-specific user, not root or admin. Any host files should be secured with appropriate file permissions and ownership to prevent unauthorized execution or alteration. POSIX compliant operating systems offer chown and chmod to set file permissions, and Windows servers offer several tools as well. Tools such as Ubuntu's AppArmor go even further to confine applications to a limited set of resources.


Application considerations


When designing application, an important best practice is to only employ encrypted connections to the database, which eliminates the possibility of an inadvertent leak of data or credentials via network sniffing. Cloud SQL users can do this using Google's encrypting SQL proxy. Some encryption methods also allow your application to authenticate the database host to reduce the threat of impersonation or man-in-the-middle attacks.
If your application deals with particularly sensitive information, you should consider whether you actually need to retain all of its data in the first place. In some cases, handling this data is governed by a compliance regime and the decision is made for you. Even so, additional controls may be prudent to ensure data security. You may want to encrypt some data at the application level in addition to automatic at-rest encryption. You might reduce other data to a hash if all you need to know is whether the same data is presented again in the future. If using your data to train machine learning models, consider reading this article on managing sensitive data in machine learning.

Application security can be used to enhance database security, but must not be used in place of database security. Safeguards such as sanitization must be in place for any input sent to the database, whether it’s data for storage or parameters for querying. All application code should be peer-reviewed. Security scans for common vulnerabilities, including SQL injection and XSS, should be automated and run frequently.

The computers of anyone with access rights to the database should be subject to an organizational security policy. Some of the most audacious breaks in security history were due to malware, lax updates or mishandled portable data storage devices. Once a workstation is infected, every other system it touches is suspect. This also applies to printers, copiers and any other connected device.

Do not allow unsanitized production data to be used in development or test environments under any circumstances. This one policy will not only increase database security, but will all but eliminate the possibility of a non-production environment inadvertently emailing customers, charging accounts, changing states, etc.

Self-hosted database concerns


While Google's shared responsibility model allows managed database users to relieve themselves of some security concerns, we can't offer equally comprehensive controls for databases that you host yourself. When self-hosting, it's incumbent on the database administrator to ensure every attack vector is secured.

If you’re running your own database, make sure the service is running on its own host(s), with no other significant application functions allowed on it. The database should certainly not share a logical host with a web host or other web-accessible services. If this isn’t possible, the service and databases should reside in a restricted network shared and accessible by a minimum number of other hosts. A logical host is the lowest level computer or virtual computer in which an application itself is running. In GCP, this may be a container or virtual machine instance, while on-premises, logical hosts can be physical servers, VMs or containers.
A common use case for running your own database is to replicate across a hybrid cloud. For example, a hybrid cloud database deployment may have a master and one or more replicas in the cloud, with a replica on-premises. In that event, don’t connect the database to the corporate LAN or other broadly available networks. Similarly, on-premises hardware should be physically secure so it can't be stolen or tampered with. A proper physical security regime employs physical access controls, automated physical access logs and remote video monitoring.

Self-hosted databases also require your regular attention, to make sure they’ve been updated. Craft a software update policy and set up regular alerts for out-of-date packages. Consider using a system management (Ubuntu, Red Hat) or configuration management tool that lets you easily perform actions in bulk across all your instances, monitor and upgrade package versions and configure new instances. Be sure to monitor for out-of-cycle changes to key parts of the filesystem such as directories that contain configuration and executable files. 

Several compliance regimes recommend an intrusion detection system or IDS. The basic function of an IDS is to monitor and react to unauthorized system access. There are many products available that run either on each individual server or on a dedicated gateway host. Your choice in IDS may be influenced by several factors unique to your organization or application. An IDS may also be able to serve as a monitor for canary data, a security tactic described above.

All databases have specific controls that you must adjust to harden the service. You should always start with articles written by your database maker for software-specific advice on hardening the server. Hardening guides for several popular databases are linked in the further reading section below.
The underlying operating system should also be hardened as much as possible, and all applications that are not critical for database function should be disabled. You can achieve further isolation by sandboxing or containerizing the database. Use articles written by your OS maker for variant-specific advice on how to harden the platform. Guides for the most common operating systems available in Compute Engine are linked in the further reading section below.

Organizational security 


Staff policies to enforce security is an important but often overlooked part of IT security. It's a very nuanced and deep topic, but here are a few general tips that will aid in securing your database:

All staff with access to sensitive data should be considered for a criminal background check. Insist on strict adherence to a policy of eliminating or locking user accounts immediately upon transfer or termination. Human account password policies should follow the 2017 NIST Digital Identity Guidelines, and consider running social engineering penetration tests and training to reduce the chance of staff inadvertently enabling an attack.

Further reading 


Security is a journey, not a destination. Even after you've tightened security on your database, application, and hosting environment, you must remain vigilant of emerging threats. In particular, self-hosted DBs come with additional responsibilities that you must tend to. For your convenience, here are some OS- and database-specific resources that you may find useful.

How we used Cloud Spanner to build our email personalization system—from “Soup” to nuts



[Editor’s note: Today we hear from Tokyo-based Recruit Technologies Co., Ltd., whose email marketing platform is built on top of GCP. The company recently migrated its database to Cloud Spanner for lower cost and operational requirements, and higher availability compared to their previous HBase-based database system. Cloud Spanner also allows Recruit to calculate metrics (KPIs) in real time without having to transfer the data first. Read on to learn more about their workload and how Cloud Spanner fits into the picture.]

There are just under 8 billion people on Earth, depending on the source. Here at Recruit, our work is to develop and maintain an email marketing system that sends personalized emails to tens of millions of customers of hundreds of web services, all with the goal of providing the best, most relevant customer experience.

When Recruit was founded in 1960, the company was focused on helping match graduates to jobs. Over the years, we’ve expanded to help providers that deal with almost every personal moment and event a person encounters in their life. From travel plans to real estate, restaurant choices to haircuts, we offer software and services to help providers deliver on virtually everything and connect to their end-customers.

Recruit depends on email as a key marketing vehicle to end-customers and to provide a communications channel to clients and advertisers across its services. To maximize the impact of these emails, we customize each email we send. To help power this business objective, we developed a proprietary system named “Soup” that we host on Google Cloud Platform (GCP). Making use of Google Cloud Spanner, Soup is the connective tissue that manages the complex customization data needed for this system.

Of course, getting from idea to functioning product is easier said than done. We have massive datasets so requirements like high availability and serving data in real-time are particularly tricky. Add in a complex existing on-premises environment, some of which we had to maintain in our journey to the cloud creating a hybrid environment, and the project became even more challenging.

A Soup primer


First, why the name “Soup”? The name of the app is actually “dashi-wake” in Japanese, from “dashi,” a type of soup. In theory, Soup is a fairly simple application: its API returns recommendation results based on the data we compute about a user via the user's user ID. Soup ingests pre-computed recommendations and then serves those recommendations to the email generation engine and tracks metrics. While Soup doesn’t actually send the customer emails, it manages the entire volume of personalization and customization data for tens of millions of users. It also manages the computed metrics associated with these email sends such as opens, clicks, and other metadata.

Soup leverages other GCP services such as App Engine Flex (Node.js), BigQuery, Data Studio, and Stackdriver in addition to Cloud Spanner.
(click to enlarge)

Soup requirements


High availability

If the system is unavailable when a user decides to open an email they see a white screen with no content at all. Not only is that lost revenue for that particular email, it makes customers less likely to open future emails from us.

Low latency

Given a user ID, the system needs to search all its prediction data and generate the appropriate contentan HTML file, an image, multiple images, or other contentand deliver it, all very quickly.

Real-time log ingestion and fast JOINs

In today’s marketing environment, tracking user activity and being able to make dynamic recommendations based on it is a must-have. We live in an increasingly real-time world. In the past, it might have been OK to take a week or longer to adapt content based on customer behavior. Now? A delay of even a day can make the difference between a conversion and a lost opportunity.

The problem


Pushing out billions of personalized emails to tens of millions of customers comes with some unique challenges. Our previous on-premises system was based on Apache HBase, the open-source NoSQL database, and Hive data warehouse software. This setup presented three major obstacles:

Cluster sizing

Email marketing is a bursty workload. You typically send a large batch of emails, which requires a lot of compute, and then there’s a quiet period. For our email workloads, we pre-compute a large set of recommendations and then serve those recommendations dynamically upon email open. On-premises, there wasn’t much flexibility and we had to resize clusters manually. We were plagued by errors whenever loads of email opens and the resulting requests to the system outpaced the traffic we could handle, because the cluster size of our HBase/Hive system couldn’t keep up.

Performance

The next issue was optimizing the schema model for performance. Soup has a couple of main functions: services write customer tracking data to it, and downstream “customers” read that data from it to create the personalized emails. On the write side, after the data is written to Soup, the writes need to be aggregated. We initially did this on-premises, which was quite difficult and time consuming because Hbase’s doesn’t offer aggregation queries, and because it was hard to scale in response to traffic bursts.

Transfer delays

Finally, every time we needed to generate a recommendation model for a personalized email blast, we needed to transfer the necessary data from HBase to Hive to create the model, then back to HBase. These complex data transfers were taking two-to-three days. Needless to say, this didn’t allow for the type of agility that we need to provide the best service to our customers.

Cloud Spanner allows us to store all our data in one place, and simply join the data tables and do aggregates; there’s no need for a time-intensive data transfer. Using this model, we believe we can cut the recommendation generation time from days to under a minute, bringing real-time back into the equation.

Why Cloud Spanner?

Compared to the previous application running on-premises, Cloud Spanner offers lower cost, lower operations requirements and higher availability. Most critically, we wanted to calculate metrics (KPIs) in real time without data transfer. Cloud Spanner allows us to do this by pumping SQL queries into a custom dashboard that monitors KPIs in real time.

Soup now runs on GCP, although the recommendations themselves are still generated in an on-premise Hadoop cluster. The computed recommendations are stored in Cloud Spanner for the reasons mentioned above. After moving to GCP and architecting for the cloud, we see an error rate of .005% per second vs. a previous rate of 4% per second, an improvement of 1/800. This means that for an email blast sent to all users in Japan, one user won’t be able to see one image in one email. Since these emails often contain 10 images or more, this error rate is acceptable.

Cloud Spanner also solved our scaling problem. In the future, Soup will have to support one million concurrent users in different geographical areas. Likewise, Soup has to perform 5,000 queries per second (QPS) at peak times on the read side, and will expand this requirement to 20,000 to 30,000 QPS in the near future. Cloud Spanner can handle all the different, complex transactions Soup has to run, while scaling horizontally with ease.

Takeaways


In migrating our database to Cloud Spanner, we learned many things that are worth taking note of, whether you have 10 or 10 million users.

Be prepared to scale

We took scaling into account from Day One, sketching out specific requirements for speed, high availability, and other metrics. Only by having these requirements specifically laid out were we able to choose—and build—a solution that could meet them. We knew we needed elastic scale.

With Cloud Spanner, we didn’t have to make any of the common trade-offs between the relational database structure we wanted, and the scalability and availability needed to keep up with the business requirements. Likewise, with a growing company, you don’t want to place any artificial limits on growth, and Cloud Spanner’s ability to scale to “arbitrarily large” database sizes eliminates this cap, as well as the need to rewrite or migrate in the future as our data needs grow.

Be realistic about downtime

For us, any downtime can result in literally thousands of lost opportunities. That meant that we had to demand virtually zero downtime from any solution, to avoid serving up errors to our users. This was an important realization. Google Cloud provides an SLA guarantee for Cloud Spanner. This solution is more available and resistant to outages than anything we would build on our own.

Don’t waste time on management overhead

When you’re worrying about millions of users and billions of emails, the last thing you have time to do is all the maintenance and administrative tasks required to keep a database system healthy and running. Of course, this is true for the smallest installations, as well. Nobody has a lot of extra time to do things that should be taken care of automatically.

Don’t be afraid of hybrid

We found that a hybrid architecture that leverages the cloud for fast data access but still using our existing on-premises investments for batch processing to be effective. In the future, we may move the entire workload to the cloud but data has gravity, and we currently have lots of data stored on-premises.

Aim for real-time

At this time, we can only move data in and out of Cloud Spanner in small volumes. This prevents us from making real-time changes to recommendations. Once Cloud Spanner supports batch and streaming connections, we'll be able to enable an implementation to provide more real-time recommendations to deliver even more relevant results and outcomes.

Overall, we’re extremely happy with Cloud Spanner and GCP. Google Cloud has been a great partner in our move to the cloud, and the unique services provided enable us to offer the best service to our customers and stay competitive.

Expanding MongoDB Atlas availability on GCP



With over 35 million downloads and customers ranging from Cisco to Metlife to UPS, MongoDB is one of the most popular NoSQL databases for developers and enterprises alike.

MongoDB is available on Google Cloud Platform (GCP) through MongoDB’s simple-to-use, fully managed Database as a Service (DBaaS) product, MongoDB Atlas. With MongoDB Atlas, you get a globally distributed database with cross-region replication, multi-region fault tolerance and the ability to provide fast, responsive read access to data users around the globe. You can even configure your clusters to survive the outage of an entire cloud region.

Now, thanks to strong demand for MongoDB Atlas from GCP customers, Google and MongoDB have expanded the availability of MongoDB Atlas on GCP, making it available across most GCP regions as well as on Cloud Launcher.
With this expanded geographic availability, you can now join the wide variety of organizations around the world, from innovators in the social media space to industry leaders in energy, that are already running MongoDB on GCP.

How MongoDB Atlas works on GCP 


But first, to understand how MongoDB Atlas works on GCP, let’s discuss the architecture of a MongoDB cluster.

MongoDB maintains multiple copies of data called replica sets using native replication. A replica set is a fully self-healing unit that helps prevent database downtime. Failover is fully automated, eliminating the need for administrators to intervene manually.
You can configure the number of replicas in a MongoDB replica set; a larger number of replicas provides increased data availability and protection against database downtime (e.g., in case of multiple machine failures, rack failures, data center failures or network partitions). It's recommended that any MongoDB cluster include at least three replica set members, ideally geographically distributed to ensure no single point of failure.

Alternatively, you can configure operations to write to multiple replicas before returning the write acknowledgement to the application, for near-synchronous replication. This multi-data center awareness enables global data distribution and separation between operational and analytical workloads. Replica sets also provide operational flexibility by providing a way to upgrade hardware and software without taking the database offline.

MongoDB provides horizontal scale-out in the cloud via a process called sharding. Sharding distributes data across multiple physical partitions (shards). Sharding allows MongoDB to address the hardware limitations of a single server without adding complexity to the application. MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases.

Sharding is transparent to applications; whether there's one or one hundred shards, the application code for querying MongoDB is the same.
MongoDB provides multiple sharding policies, letting you distribute data across a cluster according to query patterns or data locality.

MongoDB clusters managed by MongoDB Atlas on GCP are organized into Projects. All clusters within a Project live inside a single Virtual Private Cloud (VPC) per region, ensuring network isolation from other customers. By default, clusters contain three MongoDB replica set members, which are distributed across the available zones within a single GCP region.

Creating a globally distributed database with MongoDB Atlas


Configuring cross-region replication is easy from the MongoDB Atlas UI, as shown below:
Once configured, MongoDB Atlas automatically scales the storage on clusters and makes it easy to scale the compute (memory and CPU), to larger cluster sizes, or enable sharding in the UI or API, with no manual intervention.

We hope you take advantage of the availability of MongoDB Atlas across more GCP regions to reduce the operational overhead of setting up and scaling the database, letting you focus on building better apps, faster. We look forward to seeing what you build and hearing your feedback so we can continue to make GCP the best place for running MongoDB in the cloud.

More on MongoDB


Optimizing your Cloud Storage performance: Google Cloud Performance Atlas



Google Cloud Storage is Google Cloud Platform’s powerful unified object storage that can meet most if not all of your developer needs. Out of the box, you get close-to-user-edge serving, CDN capabilities, automatic redundancy and the knowledge that you’re using a service that can even help reduce your storage carbon emissions to zero!

That being said, every developer has a unique use case and specific requirements for how they use Cloud Storage. While its out-of-the-box performance is quite impressive, here are a few pro tips, tweaks and suggestions to help you optimize for your particular use case.

Establishing baseline performance


First off, you can’t fix what you can’t measure. As such, it’s important to establish a baseline expectation of the performance of a Cloud Storage bucket. To this end, we suggest running the perfdiag utility which runs a set of tests to report the actual performance of a Cloud Storage bucket. It has lots of options, so that you can tune it as closely as possible to match your own usage pattern, which can help set expectations, and, if there’s a performance difference, help track down where the problem might be.

Improving upload performance


How to upload small files faster

Each Cloud Storage upload transaction has some small overhead associated with it, which in bulk scenarios, can quickly dominate the performance of the operation. For example, when uploading 20,000 files that are 1kb each, the overhead of individual upload takes more time than all the entire upload time altogether. This concept of overhead-per-operation is not new, nor is the solution: batch your operations together. If you’ve ever done SIMD programming on the CPU, for example, you’ve seen firsthand how batch operations mitigates the overhead of each operation across the set, improving performance.

To this end, gsutil provides the "-m" option that performs a batched, parallel (multi-threaded/multi-processing) upload from your local machine to your cloud bucket, which can significantly increase the performance of an upload. Here’s a test that compares uploading 100 200k files individually and batch uploading them using "gsutil -m cp". In the example below, upload speeds increase by more than 5X by using this flag. 
In this graph, smaller is better


More efficient large file uploads


The gsutil utility can also automatically use object composition to perform uploads in parallel for large, local files that you want to upload to Cloud Storage. It splits a large file into component pieces, uploads them in parallel and then recomposes them once they're in the cloud (and deletes the temporary components it created locally).

You can enable this by setting the `parallel_composite_upload_threshold` option on gsutil (or, updating your .boto file, like the console output suggests).

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket

Where "localbigfile" is a file larger than 150MB. This divides up your data into chunks ~150MB and uploads them in parallel, increasing upload performance. (Note, there are some restrictions on the number of chunks that can be used. Refer to the documentation for more information.)

Here’s a graph that compares the speed of uploading 100 500MB files using regular vs. composite uploads.
In this graph, smaller is better

Avoid the sequential naming bottleneck


When uploading a number of files to Cloud Storage, the Google Cloud frontends auto-balance your upload connections to a number of backend shards to handle the transfer. This auto-balancing is, by default, done through the name/path of the file, which is very helpful if the files are in different directories (since each one can be properly distributed to different shards).

This means that how you name your files can impact your upload speed.


For example, imagine you’re using a directory structure that includes a timestamp:

YYYY/MM/DD/CUSTOMER/timestamp 


This can cause an upload speed issue, since the majority of your connections will all be directed to the same shard, since the filenames are so similar. Once the number of connections gets high enough, performance can quickly degrade. This scenario is very common with datasets which have a deep connection to a timestamp (like photographs, or sensor data logs).

A simple solution to this issue is to simply re-name your folder or file structure such that they are no longer linear. For example, prepending a uniformly distributed hash (over a fixed range) to the filenames breaks the linearity and allows the load balancers to do a better job at partitioning the connections.

Below, you can see the difference between how long it took to upload a set of files named linearly, vs. ones that were prepended with a hash:

It’s worth noting that if this renaming process breaks some data-dependencies in your pipeline, you can always run a script to remove the hashes on the files once the upload is finished.

Improving download performance


Set your optimal fetch size 

As mentioned before, Cloud Storage has fixed transactional overhead per request. For downloads (much like uploads), this means you can achieve improved performance by finding the sweet spot between the size of the request, and the number of requests that have to be made to download your data.

To demonstrate this, let’s take an 8MB file, and fetch it through different sized chunks. In the graph below, we can see that as the block size gets larger, performance improves. As the chunk size decreases, the overhead per-transaction increases, and performance slows down.

In this graph, smaller is better

What this graph highlights is that Cloud Storage is extremely strong in terms of single-stream throughput. That means that for both uploads and downloads, Cloud Storage performance is at its best for larger requests of around 1MB in size. If you have to use smaller requests, try to parallelize them, so that your fixed latency costs overlap.

Optimizing GSUTIL reads for large files


If you need to download multi-gigabyte files to an instance for processing, it’s important to note that gsutil’s default settings aren’t optimized for large-file transfer. As a result, you might want to adjust your use of slicing, and threadcount to improve performance.

The default settings for GSUTIL spread file downloads to four threads, but only use a single process. To copy a single file onto a powerful Compute Engine VM, we can improve the performance by limiting the number of threads, forcing the system to use multiple processes instead.

For large files, GSUTIL has a nifty feature that can help even further. Gsutil uses HTTP Range GET requests to perform "sliced" downloads in parallel when downloading large objects from Cloud Storage.


time gsutil -o 'GSUtil:parallel_thread_count=1' -o 
'GSUtil:sliced_object_download_max_components=8' cp gs://bukket/fileSRC.dat 
./localDST.bin

You can see in the graph below that using sliced and threaded downloads with HTTP Range GET requests is a big time-saver.
In this graph, smaller is better
So there you have it folks—some quick and dirty tricks to help you get the most performance out of your Cloud Storage environment. If you have any other tips that you’ve found helpful, reach out to me on social media, and let me know if there are other topics you’d like me to cover. And don’t forget to subscribe to the Google Cloud Platform Youtube channel for more great tips and tricks for optimizing your cloud application.

Fully managed export and import with Cloud Datastore now generally available



If you store information in a database, you know how important it is to have an easy and reliable way to get data in and out — whether you’re making a copy for backup or archival purposes, or you want to restore data after a user “ooops” moment. So today, we're announcing the general availability of managed export and import for Cloud Datastore, our highly scalable NoSQL managed database service.

This feature, in beta since September, replaces the previous Cloud Datastore Admin backup module, and provides a more fully managed database experience. Some of the benefits of managed export and import include:
  • Fully managed by Google Cloud Platform (GCP) and exposed as a simple service 
  • Improved export speed achieved through sharding improvements 
  • A new API in the same form as the Cloud Datastore v1 API 
  • Integration with Cloud IAM 
  • Billing exclusively based on Cloud Datastore entity reads and writes; no longer billing for App Engine instance hours
"Batterii, a consumer insights and collaboration platform, has been using Cloud Datastore for seven years. Previously, our backups took 8-10 hours. With managed export and import, we can now complete those backups in an hour." 
Greg Fairbanks, Senior Software Engineer
Since its beta release in September 2017, early users now perform more than 10,000 exports and imports per week. Now that the new feature is generally available, we'll phase out Datastore Admin backup in 12 months, and starting on February 28th, 2019, it will no longer be available from the Google Cloud Console. To learn more and to get started with managed exports and imports, check out the documentation on exporting and importing entities and scheduling an export.

Announcing Google Cloud Spanner as a Vault storage backend



HashiCorp Vault is a powerful open source tool for secrets management, popular with many Google Cloud Platform (GCP) customers today. Vault provides "secret management as a service," acting as a static secret store for encrypted key-value pairs; a secret generation tool to dynamically generate on-the-fly credentials; and pass-through encryption service so applications do not need to roll their own encryption. We strive to make Google Cloud an excellent platform on which to operationalize Vault.

Using Vault will require up-front configuration choices such as which Vault storage backend to use for data persistence. Some storage backends support high availability while others are single tenant; some operate entirely in your own datacenter while others require outbound connections to third-party services; some require operational knowledge of the underlying technologies, while others work without configuration. These options required you to consider tradeoffs across issues such as consistency, availability, scalability, replication, operationalization and institutional knowledge . . . until now.

Today we're pleased to announce Cloud Spanner as a storage backend for HashiCorp Vault. Building on the scalability and consistency of Google Cloud Spanner, Vault users gain all the benefits of a traditional relational database, the scalability of a globally-distributed data store and the availability (99.999% SLA for multi-region configurations) of a fully managed service.

With support for high-performance transactions and global consistency, using Cloud Spanner as a Vault storage backend brings a number of features and benefits:
  • High availability - In addition to Cloud Spanner's built-in high availability for data persistence, the Vault storage backend also supports running Vault in high availability mode. By default, Vault runs as a single tenant, relying on the storage backend to provide distributed locking and leader election. Cloud Spanner's global distribution and strong, consistent transactions allow for a highly available Vault cluster with just a single line of configuration.
  • Transactional support - Vault backends optionally support batch transactions for update and delete operations. Without transactional support, large operations—such as deleting an entire prefix or bootstrapping a cluster—can result in hundreds of requests. This can bottleneck the system or overload the underlying storage backend. The Cloud Spanner Vault storage backend supports Vault's transactional interface, meaning it collects a batch of related update/delete operations and issues a single API call to Spanner. Not only does this reduce the number of HTTP requests and networking overhead, but it also ensures a much speedier experience for bulk operations.
  • Enterprise-grade security - Cloud Spanner follows the same security best practices as other Google products. Data stored at rest in Cloud Spanner is encrypted by default, and Cloud Spanner uses IAM to provide granular permission management. Google’s infrastructure has many security differentiators, including backing by Google’s custom-designed security chip Titan, and Google’s private network backbone.
  • Google supported - This backend was designed and developed by Google developers, and is available through the Google open-source program. It's open for collaboration to the entire Vault community with the added benefit of support from the Google engineering teams.

Getting started


To get started, download and install the latest version of HashiCorp Vault. The Google Cloud Spanner Vault storage backend was added in Vault 0.9.4 (released on February 20, 2018), so ensure you're running Vault 0.9.4 or later before you continue.

Next, create a Cloud Spanner instance and schema for storing our Vault data using the gcloud CLI tool. You can also create the instance and the schema using the web interface or API directly:

$ gcloud spanner instances create my-instance \
  --config=nam3 \
  --description=my-instance \
  --nodes=3

$ gcloud spanner databases create my-database --instance=my-instance

$ gcloud spanner databases ddl update my-database --instance=my-instance --ddl="$(cat <

Next, create a Vault configuration file with the Google Cloud Spanner storage backend configuration:

# config.hcl
storage "spanner" {
  database   = "projects/my-default-project/instances/my-instance/databases/my-database"
}

Start Vault with the configuration file. This example uses Vault's built-in development mode, which does not represent best practices or a production installation, but it's the fastest way to try the new Cloud Spanner Vault storage backend.

$ export VAULT_ADDR=http://127.0.0.1:8200
$ sudo vault server -dev -config=config.hcl

During this process, Vault authenticates and connects to Cloud Spanner to populate the data storage layer. After a few seconds, you can view the table data in the web interface and see that data has been populated. Vault is now up-and-running. Again, this is not a production-grade Vault installation. For more details on a production-grade Vault installation, please read the Vault production hardening guide. You can now create, read, update and delete secrets:

$ vault write secret/my-secret foo=bar

To learn more about the backend configuration options, read the HashiCorp Vault Google Cloud Spanner storage backend documentation. To learn more about Google Cloud Spanner, check out the Google Cloud Spanner documentation.


Toward a great Vault experience on GCP


The Cloud Spanner Vault storage backend enables organizations to leverage the consistency, availability, scalability, replication and security of Cloud Spanner while also supporting Vault's own high availability requirements. In addition to supporting our customers, we're delighted to continue our long-standing relationship with HashiCorp as part of our ongoing partnership. We're excited to see how this new storage backend enables organizations to be more successful with Vault on GCP. Be sure to follow us on Twitter and open a GitHub issue if you have any questions.

How Google Cloud Storage offers strongly consistent object listing thanks to Spanner



Here at Google Cloud, we're proud of the fact that, all of our listing operations are consistent across Google Cloud Storage. They're consistent across all Cloud Storage bucket locations, including regional and multi-regional buckets. They're consistent whether you're listing buckets in a project or listing objects within a bucket. If you create a Cloud Storage bucket or object, and then you follow that up with a request to fetch a list of resources, your resource will be in that response.

Why is this important? Strong list consistency is a big deal when you run data and analytics workloads. Here's an explanation from Johannes Fabian Rußek, Technical Product Owner at Spotify on why strongly consistent listing operations are so important to his business:
When you do not have consistent listings, there is a possibility of missing files. You cannot rely on the consistency of the data being read as you develop your products. Even worse, inconsistent listings lead to unforeseen issues. For example, our processing tooling will succeed reading partial data and may potentially produce seemingly valid outputs. Problems like these have a tendency to quickly propagate throughout the dependency tree. 
When that happens, in the best-case we notice the failure and recompute all datasets produced within the dependency tree. In the worst-case, the failure goes unnoticed and we create invalid reports and statistics. Considering the large amount of data pipelines we run, even with a low probability of that happening, a lack of list-consistency in cloud storage offerings was a major blocker for data-processing at Spotify.
Not all cloud storage services provide list-after-write consistency, which can cause challenges for some common use cases. Typically, when a user uploads an object to a cloud storage bucket, an unpredictable and unbounded amount of time passes before that object shows up in that bucket’s list of contents. This is a very weak consistency model called “eventual consistency.” In practice, if a user uploads a new object and then tries to find it from a browser on another computer, they might not see the object that they just uploaded. Similar issues impact workloads distributed across multiple compute nodes. By offering strong list consistency across all Google Cloud Storage objects, you avoid having to wrangle with these sorts of problems. Again, here’s Spotify’s Johannes Fabian Rußek:
We considered multiple workarounds, such as using a global consistency cache based on NFS, porting Netflix’s s3mper as well as persisting listings in a manifest file stored alongside the data. All of the considered solutions were suboptimal as they either introduced a single point of failure or required us to put significant resources into developing our own solution and adjusting our tooling. Strong list consistency in Cloud Storage means we can continue using our existing data-processing stack without modifications and without worrying that data may be corrupted.

List consistency on Cloud Storage is an essential feature for data processing at Spotify. We use a Hadoop-based data processing stack built on top of the Hadoop Distributed File System, which means we rely on its filesystem-like guarantees. Consistency is critical to running our business, and its absence creates many challenges.

Spanner: the secret to Cloud Storage strong list consistency 

Up until last year, Cloud Storage stored information about its buckets and objects (metadata) in a group of technologies powered by an internal Google technology called “Megastore." Megastore enabled Cloud Storage to provide important features like read-after-write consistency quickly and at very high volumes. But as is typical for object storage, Cloud Storage only provided eventual consistency for list-after-write operations.

Last year we migrated all of Cloud Storage metadata to Spanner, Google’s globally distributed and strongly consistent relational database. Spanner’s specialty is scaling horizontally while providing strong consistency guarantees and high availability. The same technology is available to Google Cloud customers today as Cloud Spanner.

Migrating to Spanner afforded Cloud Storage some key new features, including strong listing consistency. Other beneficiaries of strong listing consistency are MapReduce or Hadoop tasks, in which many workers need to produce separate work that will later be collected by some other processor. With strong listing consistency, workers can separately upload their results to a bucket, confident that a collecting job will always be able to collate all of the results with no exceptions. This is another example of how strong or external consistency makes application development more efficient.

Summary


Cloud Storage now provides strong consistency for the following operations in all types of buckets in all regions:

  • Read-after-write consistency: Reading an object after writing it has completed, for both new objects and overwrites of existing objects 
  • Read-after-update consistency: Reading an object’s metadata after updating its metadata 
  • Read-after-delete consistency: Reading an object will fail with a 404 immediately after it has been deleted 
  • List-after-write consistency: Fetching a list of buckets and objects will always reflect any changes that have previously completed 
  • Granting additional permissions for access to resources: For example, when you grant a new user permission to read an object, that user can immediately read the object
  • Some operations or permissions changes provide bounded consistency. Such as read operations on public cached objects, designed to achieve top cache performance. Details can be found here.

We’re excited by the new functionality that hosting Cloud Storage metadata on Spanner brings to the table. You can learn more about Cloud Storage’s consistency model in our documentation: https://cloud.google.com/storage/docs/consistency.