Logica is a logic programming language designed for intuitive and efficient data manipulation; we open sourced it in 2020. It compiles to SQL, providing access to the power of SQL engines with the convenience of a logic programming syntax.
When it was open sourced, Logica's only fully supported engine was BigQuery, a powerful data warehouse that executes queries with high parallelization and can process terabytes of data within seconds.
Modern machines can store and process significant amounts of data, even on a single computer, so relational SQL databases are as popular as ever. They hold a great deal of data, and analyzing it matters. Among open source options, PostgreSQL and SQLite are two of the most popular database engines. Logica added support for SQLite in 2021.
Now we are pleased to announce a new release of Logica that adds support for PostgreSQL.
As Logica compiles to SQL, it is natural to extend the language to use PostgreSQL as the engine. However, there are nuances in the SQL dialect of Postgres that need to be addressed. The biggest distinction is that PostgreSQL requires record types to be spelled out explicitly in your query, while BigQuery determines the types automatically.
For example, consider a Logica predicate where for each user we collect a list of records with information about their purchases.
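Such a predicate might be sketched as follows; UserPurchases is an illustrative name, and Purchase and UserInfo stand for the underlying tables with the columns used below.
UserPurchases(user_id:,
              user_name:,
              purchases? List= {item_name:, item_price:}) distinct :-
  Purchase(user_id:, item_name:, item_price:),
  UserInfo(user_id:, user_name:);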
We can translate this Logica predicate to GoogleSQL to run on BigQuery as follows:
SELECT
  user_id,
  user_name,
  ARRAY_AGG(STRUCT(item_name as item_name, item_price as item_price)) as purchases
FROM
  Purchase INNER JOIN
  UserInfo USING (user_id)
GROUP BY 1, 2;
Logica's record {item_name:, item_price:} simply compiles into GoogleSQL's STRUCT(item_name as item_name, item_price as item_price).
However, in the PostgreSQL dialect, composite types must be explicitly defined and specified. In our example, we need to define the type PurchaseRecord with fields item_name and item_price, and we should specify in the query that the purchases column aggregates records of type PurchaseRecord. Thus the PostgreSQL query for our predicate would be written like so:
CREATE TYPE PurchaseRecord AS (item_name text, item_price numeric);

SELECT
  user_id,
  user_name,
  ARRAY_AGG(ROW(item_name, item_price)::PurchaseRecord) AS purchases
FROM
  Purchase INNER JOIN
  UserInfo USING (user_id)
GROUP BY UserInfo.user_id, UserInfo.user_name;
Records and lists are also useful as intermediates in calculations, even if the input and output data are normalized. For example, say we have a table called ItemSales and want to find the most sold items in each of the stores that the table describes. Specifically, we want to assemble a table with information about the top three most sold items for each store. For each of these items, we may also want to list the department of the store where the item is sold. This can be achieved intuitively using the ArgMax3 aggregate function, which accumulates all the information about the items that we need, with no extra join required, as sketched below.
# Collecting information of top 3 most sold items for each store.
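# (A sketch: the ItemSales columns used here, item_name, department and
# num_sales, are assumed for illustration.)
StoreTopItems(store_id:,
              top_items? ArgMax3= {item_name:, department:} -> num_sales) distinct :-
  ItemSales(store_id:, item_name:, department:, num_sales:);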
To support the PostgreSQL engine, we extended the Logica compiler with type inference. Logica now infers data types for all expressions that a user employs. For records and arrays, Logica specifies their type in the produced SQL, just as PostgreSQL requires. Commands to create the necessary types are produced as part of the compiled SQL. In this Colab, we show an example of a program that writes a PostgreSQL table, and in this Colab, we show how to give type hints when the program does not have enough information for complete inference.
As a byproduct of type inference, we were able to improve error messages. Now that we know the types, we can point the user to where a mistake is made within the Logica program, rather than leaving them to debug the generated SQL statement.
PostgreSQL is a popular and powerful engine. It is easy to start your own instance (maybe just in Colab!) or use a serverless option. We are excited to provide users of Logica with the option to run on Postgres. If you already use PostgreSQL, we encourage you to give Logica a try; it is a joy to write data analysis with logic programming! If you have any feedback or questions, please share them in the discussion section of the Logica repository.
In May 2021, we released El Carro to make it easier to run Oracle databases on Kubernetes. Our follow-up blog post dove deeper into El Carro’s features, announced support for Oracle 19c, and detailed more flexibility for building database images. Today we’re excited to open source two new features to enhance El Carro and make it easier to manage your Oracle deployments: Data Migration and Point-in-Time Recovery. Automated Data Migration makes it much easier to re-platform to El Carro, and Point-in-Time Recovery is a standard feature that database professionals have come to expect because it enables you to drastically reduce your recovery point objective (RPO) and worry less about backup frequency.
Data Migration
The Data Migration feature of El Carro enables users to migrate data from their existing database to an El Carro database running on Kubernetes. This functionality allows users to re-platform to El Carro with minimal disruption. The two most common pathways shown in the image below are to 1) modernize in place by simply migrating your database to Kubernetes so you can leverage the automation of El Carro and to 2) migrate to Kubernetes in the cloud.
Migrate your database in place or to the cloud
Typical migration sources include AWS (RDS, EKS, EC2), Azure (AKS, VMs), GCP (GKE, GCE, BMS) or on-premises deployments. Typical migration targets include any Kubernetes installation on GCP (GKE), AWS (EKS), Azure (AKS), or on-premises.
The Data Migration feature offers two automated migration flows and two manual ones. They can be compared along two dimensions:
Migration Downtime: the downtime required to migrate the source database into El Carro without data loss (minimal means less downtime, long means more).
Complexity: the overall difficulty and complexity of the migration journey.
Point-in-Time Recovery
Since its release, El Carro has provided users the ability to take backups and restore via RMAN or storage-based snapshots. Today we’re excited to release a new Point-in-Time Recovery feature that enhances El Carro’s backup functionality by automatically backing up archived redo logs to a GCS (Google Cloud Storage) bucket and allowing users to seamlessly restore their databases to any point in time within a user-configurable window. This optional feature provides an additional layer of protection and enhanced restore granularity without interfering with manual backups or affecting database performance.
The diagram below contrasts the new versus old functionality. Previously, there were discrete restore points (shown in green on the top arrow) which represented limited opportunities to restore. With Point-in-Time Recovery, the entire arrow is green, meaning the recovery functionality is continuous, with restore points at any time along the green arrow.
With Point-in-Time Recovery you can restore to any point after the first backup
Conclusion
As always, you can try the open source El Carro operator for free (Apache 2.0 license) wherever you run Oracle databases. Follow the quick start guide and try out provisioning of instances, databases, and users. Import data via Data Pump, manage instance parameters, choose between different methods for backups, and try out a restore. Have a look at how we integrate with external logging and monitoring solutions. Reach out via our Google group and leave feedback for what features you would like to see next, or even create your own patch, issue, or pull request on GitHub.
By Kyle Meggs, Product Manager and Ash Gbadamassi, Software Engineer – Cloud Databases
Open Cloud enables you to develop software faster, innovate more easily, and scale more efficiently—while also reducing technology risk. Google has a long history of leadership in open source, and today I want to look back at our activities around open source projects for databases over the past year.
Give developers the best tools to be efficient
Developers choose to build applications with managed database services on Google Cloud to benefit from velocity, scalability, security, and performance. To enable you to be most efficient and deliver your best possible work, we deliver tools and frameworks that work with your preferred development environments, no matter if you develop in the cloud or on premises. To make local testing, building and continuous integration easier for our cloud-native databases, we released emulators for Cloud Spanner, Firestore, and Cloud Bigtable so that you can test your code wherever you develop it - without the need to create or re-create cloud infrastructure with every test run.
Another area where we are helping developers is with instrumentation of Cloud SQL for easier debugging and performance tuning. With Cloud SQL Insights it is easier than ever to pinpoint underperforming SQL statements. That said, without additional instrumentation, it can be cumbersome to identify the source code or microservice that issued that SQL - let alone tie a SQL statement to a client session and its context. So we released Sqlcommenter as an open source library that will automatically add this instrumentation as SQL comments in queries that are generated by popular ORMs like Hibernate, Django, SQLAlchemy, and others (repo, blog). We didn’t stop there, but merged Sqlcommenter with OpenTelemetry (blog) to add SQL insights from instrumented queries back to OpenTelemetry traces.
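For illustration, a query instrumented by Sqlcommenter ends up carrying its context in a trailing SQL comment along these lines; the keys and values here are a made-up Django-style example, and the exact fields depend on the framework and configuration:
SELECT * FROM users /*action='index',controller='users',framework='django'*/;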
Lastly, we want to broaden access to our differentiated offerings, like Spanner. The recently announced Spanner PostgreSQL interface allows organizations to access Spanner’s industry-leading consistency and availability at scale using tools and skills from the popular PostgreSQL ecosystem. This new way of working with Spanner provides familiarity for developers and portability for administrators. (blog) Learn more in the documentation or sign up for the preview today.
Provide connectivity that is simple and secure
Connecting to APIs and databases from an application running in the cloud should be simple and secure. That’s why we recommend using IAM and Application Default Credentials when authenticating to other services. The Cloud SQL Proxy (repo) has been doing this and also setting up firewalls for you for a while. It works by running a local client either inside your VM or a GKE cluster. This year, we added libraries for Java (repo) and Python (repo) that can provide similar functionality without the overhead of running an extra client such as the proxy.
Cloud Spanner also offers an open source adapter for its new PostgreSQL interface (repo). This local proxy allows tools, starting with psql, to connect to a Spanner database using the PostgreSQL wire protocol.
Manage cloud infrastructure with the tools of your choice
When it comes to provisioning, monitoring, and managing your cloud database services, flexibility and choice are important. We provide you with our cloud console, the gcloud CLI, and APIs, as well as our own Deployment Manager. That said, you may prefer different ways to manage cloud infrastructure - whether through interactive tools or scripts, or embedded into CI/CD pipelines that support GitOps or other controls, checks, and balances. Terraform is one of those open tools that is very popular - and we ensure that our cloud databases can be managed from it, as documented in this blog about creating Spanner instances with Terraform.
If you manage the majority of your resources with Kubernetes either directly or through package managers like Helm, then our Kubernetes Config Connector (KCC) might be for you. In a nutshell, KCC exposes Google Cloud services such as Cloud SQL, Spanner, and others as Custom Resources in Kubernetes. This allows you to create and reconcile cloud resources outside of Kubernetes just like K8s native objects.
Once you are managing cloud infrastructure with CI/CD, the next step is to extend that same mechanism to manage objects within your databases such as tables, indexes, and views. To that end, we have released a Liquibase extension for Cloud Spanner.
Help you to move data with confidence
Cloud journeys often involve moving data either in a lift and shift process or sometimes replatforming to a different database. Whatever your journey, we want to simplify the process and give you the confidence that your migration is successful.
For enterprise users with Oracle databases, we have several open source projects. First, we have the Optimus Prime database assessment tool (repo) that queries your database and collects information about schemas and historic performance to be analyzed for migration complexity and consolidation potential. Our own professional services teams have been using this toolset to plan migrations to Bare Metal Solution for Oracle.
Some Oracle users are looking for opportunities to transform their workloads to fit with their bigger strategy of modernizing applications with Kubernetes. For this group we developed and open sourced the El Carro Kubernetes operator for Oracle. This not only automates database lifecycle tasks for systems running on Kubernetes, but also exposes declarative APIs for these operations.
If your application supports replatforming from Oracle to PostgreSQL, then we have a toolset for schema conversion along with Dataflow pipelines that will read the output of a change data capture job and load it into a PostgreSQL database. What a great use case for Datastream - our new serverless change data capture service.
Another case of heterogeneous database migration is to move MySQL or PostgreSQL databases to Cloud Spanner. HarbourBridge helps with the evaluation and data migration, and our latest contribution was adding support for DynamoDB as a source database. Part of every heterogeneous migration should be validating that the source and target data match - we have released the Data Validation Toolkit for that use case. DVT can connect to a number of source and target databases and compare the data on each side - giving you the confidence that your migration did not miss or change any records.
Conclusion
Whether you are migrating existing databases or you are building your next application in the cloud - we want to make your journey as comfortable and seamless as possible. Open source projects play a big role in meeting you where you are and providing you with the connectivity options, language support, and tools you want for management and migrations.
By Bjoern Rost, Product Manager, Google Cloud Databases
Google Cloud is releasing El Carro, an open source tool to help you transform and modernize your Oracle database operations. El Carro implements the Kubernetes operator pattern to deliver automation for provisioning and ongoing operations like backups, patching, and high availability for databases running in hybrid and multi-cloud environments. And it does so using the same declarative syntax that DevOps teams are using to manage applications. With El Carro, users can choose to modernize and transform their database operations in place and benefit from a consistent management experience and hybrid and multi-cloud portability. Released under the Apache License 2.0, you are free to use El Carro in any Kubernetes environment—you are in control.
Containers and Kubernetes deliver portability on standardized infrastructure, and today Oracle supports databases running in containers; they’ve also released container build files and images and helm charts to simplify provisioning. What is missing for the next level of integration is support for lifecycle operations and an extension of the Kubernetes API to the primitives needed for database management.
In addition, fully managed or autonomous services for Oracle may not make available all the required features, such as Active Data Guard, Multitenant, and In-Memory, parameters/flags, versions, and patch levels. DBAs also find themselves locked out of many roles, including sysadmin and root. These restrictions make many cloud architects fall back to lift and shift Oracle databases onto infrastructure as a service offerings and miss out on opportunities to modernize and transform database operations. And with transactional databases growing in number and criticality, organizations are struggling to deliver innovation and modernization. Engineers are already busy keeping up with sprawl and mundane operational tasks while adhering to strict change management processes.
How do we solve this database operations gap?
El Carro solves this. It is built with scalability in mind, using the same container orchestration infrastructure, Kubernetes, that powers many businesses and is a top choice for modern architectures. Its open API allows you to manage your database configurations as declarative code, enabling CI/CD or GitOps workflows for auditability and control mechanisms. El Carro automates many database lifecycle operations, like backups, replication, and patching. And, when it distributes databases on the nodes of a cluster, it is aware of the priority and resource requirements of each database to optimize tight packing while respecting quality of service. Lastly, it helps DBAs by delivering automation without restrictions, leaving them in full control over their systems. You can choose to let the operator drive for you, but you can also take over the steering wheel yourself at any time.
Because Kubernetes is now the standard for portable infrastructure automation and orchestration, engineers appreciate how Kubernetes abstracts complex problems into manageable infrastructure as code. Kubernetes can scale from small projects to the large deployments that power Google products and services for billions of users around the world. Moreover, Google is pioneering the next generation of infrastructure as code, which we refer to as Configuration as Data, to declaratively establish a contract between developer intent and the runtime operation. According to the Cloud Native Survey 2020, two-thirds of respondents were either already running stateful workloads in production or were considering doing so within the next 12 months. We expect that datastores are going to drive the next wave of enterprise Kubernetes adoption.
A number of open source operators for databases, such as PostgreSQL, MySQL, and many others, have been released, are actively maintained by the community, and are popular among developers and architects looking for a hands-off approach to manage databases with their applications. El Carro extends the list of database operators to include Oracle.
What are we building with El Carro for Oracle databases?
The operator pattern emerged in late 2016 as an extension of the Kubernetes API and control loop aimed at automating more complicated and application-specific tasks that are beyond the native Kubernetes objects.
El Carro implements a custom resource definition (CRD) tailored to database management. Users set and change attributes of the custom resource using the Kubernetes API the same way they do for built-in objects such as pods, deployments, or services. The El Carro controller observes changes to the custom resources, compares the declared state with the current reality in the cluster, and then makes the necessary changes. Those changes could affect the Kubernetes resources used by the database, such as persistent volumes or the pod itself, or they may result in calls issued via SQL or command-line tools to the database to create or modify users or other database objects.
Here’s a look at how this works:
El Carro Architecture
The diagram above shows how the major components of a database managed by the El Carro Operator interact with each other. The controller monitors the CRD for any changes made by admins. It creates and manages the cluster resources that make up the actual database deployment: persistent volumes for filesystems and data, a pod to run containers with the actual database, and a daemon that allows the controller to securely run SQL commands on the database. And lastly, a service makes sqlnet connections available to applications and end users that can either run in the same Kubernetes cluster or outside of it.
At release time, the El Carro Operator can provision Oracle 12c Enterprise Edition and 18c Express Edition databases. It manages instance parameters, pluggable databases, and users. You can take and restore backups using either RMAN or storage snapshots, and we are working to add additional features.
How to get involved with El Carro?
In the development process, we collaborated with users and partners in the Oracle community to help us validate the approach. “Pythian has helped Oracle users to automate and optimize the operations of their mission-critical systems for over 20 years,” says Simon Pane, principal consultant at Pythian. “We are excited about the possibilities that El Carro brings to users on their cloud modernization journeys. We are proud to work with the community on a vision for the future of database management.”
Sean Scott covers Docker for databases on his blog oraclesean.com, and says: "There are many benefits to running Oracle databases in containers. Adding Kubernetes orchestration introduces new opportunities to bring the DevOps and Oracle communities together."
You can try out El Carro today. Follow the quick start guide and try out provisioning of instances, databases, and users. Import data via Data Pump, manage instance parameters, choose between different methods for backups, and try out a restore. Have a look at how we integrate with external logging and monitoring solutions. Reach out via our Google group and leave feedback for what features you would like to see next, or even create your own patch and pull request on GitHub.
By Bjoern Rost - Product Manager and Boris Dali - Team Lead, Engineering
We present Logica, a novel open source logic programming language. A successor to Yedalog (a language developed at Google earlier), it is a Datalog-like logic programming language. Logica code compiles to SQL and runs on Google BigQuery (with experimental support for PostgreSQL and SQLite), but it is much more concise and supports the clean and reusable abstraction mechanisms that SQL lacks. It supports modules and imports, it can be used from an interactive Python notebook, and it even makes testing your queries natural and easy.
“Data is the new oil”, they say, and SQL is so far the lingua franca for working with data. When SQL (or “Structured English Query Language”, as it was first named) was invented in the 1970s, its authors might not have imagined the popularity that it would reach half a century later. Today, systems ranging from tiny smart watch applications to enterprise IT solutions, read and write their data using SQL. Even the browser that you are using to read this post now might have a working built-in SQL database in it.
Despite the widespread adoption, SQL is not flawless. Constructing statements from long chains of English words (which are often capitalized to keep the old-fashioned COBOL spirit of the 70s alive!) can be very verbose—a single query spanning hundreds of lines is a routine occurrence. The main flaw of SQL, however, lies in its very limited support for abstraction.
Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.
This inherent resistance to decomposition of logic into bite-sized pieces is what leads to contrived, lengthy queries, copy-pasted chunks of code and, eventually, unmaintainable, unstructured (note the irony) SQL codebases. To make things worse, SQL code is rarely tested, because “testing SQL queries” sounds rather esoteric to most engineers, at best. Because of that, a number of alternative query languages and libraries have been developed. Of those, systems based on logic programming perhaps come the closest to addressing SQL’s limitations.
Logic programming languages solve these problems of SQL by using the syntax of mathematical propositional logic rather than natural English. The language of formal logic was designed by mathematicians specifically to make the expression of complex statements easier, and it suits this purpose much better than natural language. Logica extends classical logic programming syntax further, most notably with aggregation, hence the name, which stands for
Logica = Logic + Aggregation.
Let us see how it all works. SQL operates with relations, which are sets of rows. In logic programming the analog of a relation is a predicate. While a predicate is a set of rows, we think of it as a logical condition that describes the rows of a relation. Here is, for example, the definition of a simple predicate:
MagicNumber(x: 2);
MagicNumber(x: 3);
MagicNumber(x: 5);
The definition claims that the condition MagicNumber(x) must hold when x is precisely either 2, 3, or 5. That means that if we were to query this predicate (i.e. request all values of x that satisfy it), the output should be a “relation” with a single column x and rows 2, 3, and 5. The SQL equivalent would be:
SELECT 2 AS x
UNION ALL
SELECT 3 AS x
UNION ALL
SELECT 5 AS x;
Rather than listing the individual values, we could have defined the predicate by encoding a logical condition upon X as follows:
MagicNumber(x:) :-
  x in [2, 3, 5];
Now, here is where the magic starts. Firstly, any table in your database is itself already a predicate, so the following definition:
MagicComment(comment_text:) :-
  `comments`(user_id:, comment_text:),
  user_id == 5;
defines a predicate MagicComment, which includes precisely those comment_text values that are present in the comments table in rows where user_id == 5. In SQL this would read:
SELECT comment_text FROM comments WHERE user_id = 5;
Observe what happens if we replace the condition “user_id == 5” in our predicate with MagicNumber(x: user_id):
MagicComment(comment_text:) :-
  `comments`(user_id:, comment_text:),
  MagicNumber(x: user_id);
Here, we are querying for comments of users whose ID is one of the “magic numbers” we just defined above. Note how easily we could reuse a previously defined piece of code without having to copy anything around. We could now even extract MagicNumber into a common module and import it wherever it is needed:
import my_project.magic.MagicNumber;
As a final example, let us mock the comments table in a unit test of a query.
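One possible sketch of such a mock is to define test-only facts for the comments predicate directly, so that MagicComment reads them instead of the table in the database (the exact mocking mechanism may differ; see the Logica documentation on testing):
# Test-only facts standing in for the comments table.
# User IDs 2 and 3 are magic numbers; 7 is not.
`comments`(user_id: 2, comment_text: "Logic");
`comments`(user_id: 3, comment_text: "Programming");
`comments`(user_id: 7, comment_text: "Unrelated");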
If we query the MagicComment predicate here, it will not try to read the comments table in the database. Instead, it will use the predicate we just defined, thus letting us verify its correctness by testing the output (it must include two rows “Logic” and “Programming”). Observe how natural and frictionless many of the good programming practices become with Logica, and compare that to what you would have to do to achieve the same using bare SQL.
There is much more to Logica, so make sure you give it a try—chances are, you will love it! Start with this tutorial to learn Logica. Even if you do not end up using it in your next project, learning a new powerful language may open your mind to new ideas and perspectives on data processing and computing in general.
The simple examples above are only a small sample of how much more concise Logica code can be than SQL for complex queries. In particular, we did not even touch the topic of aggregation in this article. For all of this, see the examples section of the Logica open source repository.
We also hope that some of the readers consider contributing to Logica development. That’s what open source is all about!
By Konstantin Tretyakov and Evgeny Skvortsov – Logica Open Source Project
Today we’re announcing that HarbourBridge—an open source toolkit that automates much of the manual work of evaluating and assessing Cloud Spanner—supports migrations from MySQL, in addition to existing support for PostgreSQL. This provides a zero-configuration path for MySQL users to try out Cloud Spanner. HarbourBridge bootstraps early stages of migration, and helps get you to the meaty issues as quickly as possible.
Core capabilities
At its core, HarbourBridge provides an automated workflow for loading the contents of an existing MySQL or PostgreSQL database into Spanner. It requires zero configuration—no manifests or data maps to write. Instead, it imports the source database, builds a Spanner schema, creates a new Spanner database populated with data from the source database, and generates a detailed assessment report. HarbourBridge can either import dump files (from mysqldump or pg_dump) or directly connect to the source database. It is intended for loading databases up to a few tens of GB for evaluation purposes, not full-scale migrations.
Bootstrap early-stage migration
HarbourBridge bootstraps early-stage migration to Spanner by using an existing MySQL or PostgreSQL source database to quickly get you running on Spanner. It generates an assessment report with an overall migration-fitness score for Spanner, a table-by-table analysis of type mappings and a list of features used in the source database that aren't supported by Spanner.
View HarbourBridge as a way to get up and running quickly, so you can focus on critical things like tuning performance and getting the most out of Spanner. You will need to tweak and enhance what HarbourBridge produces—more on that later.
Getting started
HarbourBridge can be used with the Cloud Spanner Emulator, or directly with a Cloud Spanner instance. The Emulator is a local, in-memory emulation of Spanner that implements the same APIs as Cloud Spanner’s production service, and allows you to try out Spanner’s functionality without creating a GCP Project. The HarbourBridge README contains a step-by-step quick-start guide for using the tool with a Cloud Spanner instance.
Together, HarbourBridge and the Cloud Spanner Emulator provide a lightweight, open source toolchain to experiment with Cloud Spanner. Moreover, when you want to proceed to performance testing and tuning, switching to a production Cloud Spanner instance is a simple configuration change.
To get started on using HarbourBridge with the Emulator, follow the Emulator instructions. In particular, start the Emulator using Docker and configure the SPANNER_EMULATOR_HOST environment variable (this tells the Cloud Spanner Client libraries to use the Emulator).
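As a rough sketch (the Emulator instructions are the authoritative reference, and the image name or ports may change), this amounts to:
docker run -p 9010:9010 -p 9020:9020 gcr.io/cloud-spanner-emulator/emulator
export SPANNER_EMULATOR_HOST=localhost:9010
where the export is set in the shell from which you will run HarbourBridge.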
Next, install Go and configure the GOPATH environment variable if they are not already part of your environment. Now you can download and install HarbourBridge using
GO111MODULE=on go get github.com/cloudspannerecosystem/harbourbridge
It should be installed as $GOPATH/bin/harbourbridge. To use HarbourBridge on a MySQL database, run mysqldump and pipe its output to HarbourBridge
mysqldump <opts> db | $GOPATH/bin/harbourbridge -driver=mysqldump
where <opts> are the standard options you pass to mysqldump or mysql to specify host, port, etc., and db is the name of the database to dump. Similarly, to use HarbourBridge on a PostgreSQL database, run
pg_dump <opts> db | $GOPATH/bin/harbourbridge -driver=pg_dump
See the Troubleshooting guide if you run into any issues. In addition to creating a new Spanner database with data from the source database, HarbourBridge also generates a schema file, the assessment report, and a bad data file (if any data is dropped). See Files generated by HarbourBridge.
Sample dump files
If you don’t have ready access to a MySQL or PostgreSQL database, the HarbourBridge GitHub repository has some samples. The files cart.mysqldump and cart.pg_dump contain mysqldump and pg_dump output for a very basic shopping cart application (just two tables, one for products and one for user carts). The files singers.mysqldump and singers.pg_dump contain mysqldump and pg_dump output for a version of the Cloud Spanner singers example. To use HarbourBridge on cart.mysqldump, download the file locally and run HarbourBridge on it with the mysqldump driver, as sketched below.
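Following the invocation pattern shown earlier (the HarbourBridge README documents the exact flags), the command would look something like:
$GOPATH/bin/harbourbridge -driver=mysqldump < cart.mysqldump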
The schema created by HarbourBridge provides a starting point for evaluating Spanner. While it preserves much of the core structure of your MySQL or PostgreSQL schema, data types will be mapped based on the types supported by Spanner, and unsupported features (e.g. functions, sequences, procedures, triggers, and views) will be dropped. See the assessment report as well as HarbourBridge’s Schema conversion documentation for details.
To test Spanner’s performance, you will need to switch from the Emulator to a Cloud Spanner instance. The HarbourBridge quick-start guide provides details of how to set up a Cloud Spanner instance. To have HarbourBridge use your Cloud Spanner instance instead of the Emulator, simply unset the SPANNER_EMULATOR_HOST environment variable (see the Emulator documentation for context).
To optimize your Spanner performance, carefully review choices of primary keys and indexes—see Keys and indexes. Note that HarbourBridge preserves primary keys from the source database but drops all other indexes. This means that the out-of-the-box performance you get from the schema created by HarbourBridge can be significantly impacted. If this is the case, add appropriate Secondary indexes. In addition, consider using Interleaved tables to optimize table layout and improve the performance of joins.
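For illustration, a secondary index is added to the HarbourBridge-generated schema with ordinary Cloud Spanner DDL; the table and column names below are hypothetical:
CREATE INDEX OrdersByCustomerId ON Orders(customer_id);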
Recap
HarbourBridge is an open source toolkit for evaluating and assessing Cloud Spanner using an existing MySQL or PostgreSQL database. It automates many of the manual steps so that you can quickly get to important design, evaluation, and performance issues, such as refining the choice of primary keys, tuning indexes, and other optimizations.
We encourage you to try out HarbourBridge, send feedback, file issues, fork and modify the codebase, and send PRs for fixes and new functionality. We have big plans for HarbourBridge, including the addition of user-guided schema conversion (to customize type mappings and provide a guided exploration of indexing, primary key choices, and use of interleaved tables), as well as support for more databases. HarbourBridge is part of the Cloud Spanner Ecosystem, owned and maintained by the Cloud Spanner user community. It is not officially supported by Google as part of Cloud Spanner.
Would you like to try out Cloud Spanner with data from an existing PostgreSQL database? Maybe you’ve wanted to ‘kick the tires’ on Spanner, but have been discouraged by the effort involved?
Today, we’re announcing a tool that makes trying out Cloud Spanner using PostgreSQL data simple and easy.
HarbourBridge is a tool that loads Spanner with the contents of an existing PostgreSQL database. It requires zero configuration—no manifests or data maps to write. Instead, it ingests pg_dump output, automatically builds a Spanner schema, and creates a new Spanner database populated with data from pg_dump.
HarbourBridge is part of the Cloud Spanner Ecosystem, a collection of public, open source repositories contributed to, owned, and maintained by the Cloud Spanner user community. None of these repositories are officially supported by Google as part of Cloud Spanner.
Get up and running fast
HarbourBridge is designed to simplify Spanner evaluation, and in particular to bootstrap the process by getting moderate-size PostgreSQL datasets into Spanner (up to a few GB). Many features of PostgreSQL, especially those that don't map directly to Spanner features, are ignored, e.g. (non-primary) indexes, functions and sequences.
View HarbourBridge as a way to get up and running fast, so you can focus on critical things like tuning performance and getting the most out of Spanner. Expect that you'll need to tweak and enhance what HarbourBridge produces—more on this later.
Quick-start guide
The HarbourBridge README contains a step-by-step quick-start guide. We’ll quickly review the main steps. Before you begin, you'll need a Cloud Spanner instance, Cloud Spanner API enabled for your Google Cloud project, authentication credentials configured to use the Cloud API, and Go installed on your development machine.
To download HarbourBridge and install it, run
go get -u github.com/cloudspannerecosystem/harbourbridge
The tool should now be installed as $GOPATH/bin/harbourbridge. To use HarbourBridge on a PostgreSQL database called mydb, run
pg_dump mydb | $GOPATH/bin/harbourbridge
The tool will use the cloud project specified by the GCLOUD_PROJECT environment variable, automatically determine the Cloud Spanner instance associated with this project, convert the PostgreSQL schema for mydb to a Spanner schema, create a new Cloud Spanner database with this schema, and finally, populate this new database with the data from mydb. HarbourBridge also generates several files when it runs: a schema file, a report file (with details of the conversion), and a bad data file (if any data is dropped). See Files Generated by HarbourBridge.
Take care with ACLs
Note that PostgreSQL table-level and row-level ACLs are dropped during conversion since they are not supported by Spanner (Spanner manages access control at the database level). All data written to Spanner will be visible to anyone who can access the database created by HarbourBridge (which inherits default permissions from your Cloud Spanner instance).
Next steps
The tables created by HarbourBridge provide a starting point for evaluation of Spanner. While they preserve much of the core structure of your PostgreSQL schema and data, many important PostgreSQL features have been dropped.
In particular, HarbourBridge preserves primary keys but drops all other indexes. This means that the out-of-the-box performance you get from the tables created by HarbourBridge can be significantly slower than PostgreSQL performance. If HarbourBridge has dropped indexes that are important to the performance of your SQL queries, consider adding Secondary Indexes to the tables created by HarbourBridge. Use the existing PostgreSQL indexes as a guide. In addition, Spanner's Interleaved Tables can provide a significant performance boost.
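As a sketch, a child table interleaved in its parent is declared with Cloud Spanner DDL along the following lines; the Customers/Orders schema here is hypothetical and not something HarbourBridge emits:
CREATE TABLE Customers (
  customer_id INT64 NOT NULL,
  name        STRING(MAX),
) PRIMARY KEY (customer_id);

CREATE TABLE Orders (
  customer_id INT64 NOT NULL,
  order_id    INT64 NOT NULL,
  total       FLOAT64,
) PRIMARY KEY (customer_id, order_id),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
Because Orders rows are stored physically together with their parent Customers row, joins between the two tables become local reads.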
Other dropped features include functions, sequences, procedures, triggers, and views. In addition, types have been mapped based on the types supported by Spanner. Types such as integers, floats, char/text, bools, timestamps and (some) array types map fairly directly to Spanner, but many other types do not and instead are mapped to Spanner's STRING(MAX). See Schema Conversion for details of the type conversions and their tradeoffs.
Recap
HarbourBridge automates much of the manual work of trying out Cloud Spanner using PostgreSQL data. The goal is to bootstrap your evaluation and help get you to the meaty issues as quickly as possible. The tables generated by HarbourBridge provide a starting point, but they will likely need to be tweaked and enhanced to support a full evaluation.
We encourage you to try out the tool, send feedback, file issues, fork and modify the codebase, and send PRs for fixes and new functionality. Our plans and aspirations for developing HarbourBridge further are outlined in the HarbourBridge Whitepaper. HarbourBridge is part of the Cloud Spanner Ecosystem, owned and maintained by the Cloud Spanner user community. It is not officially supported by Google as part of Cloud Spanner.
Posted by Amit Ganesh, VP Engineering & Dan McGrath, Product Manager
As modern application development moves away from managing infrastructure and toward a serverless future, we're pleased to announce the general availability of Cloud Firestore, our serverless, NoSQL document database. We're also making it available in 10 new locations to complement the existing three, announcing a significant price reduction for regional instances, and enabling integration with Stackdriver for monitoring.
Cloud Firestore is a fully managed, cloud-native database that makes it simple to store, sync, and query data for web, mobile, and IoT applications. It focuses on providing a great developer experience and simplifying app development with live synchronization, offline support, and ACID transactions across hundreds of documents and collections. Cloud Firestore is integrated with both Google Cloud Platform (GCP) and Firebase, Google's mobile development platform. You can learn more about how Cloud Firestore works with Firebase here. With Cloud Firestore, you can build applications that move swiftly into production, thanks to flexible database security rules, real-time capabilities, and a completely hands-off auto-scaling infrastructure.
Cloud Firestore does more than just core database tasks. It's designed to be a complete data backend that handles security and authorization, infrastructure, edge data storage, and synchronization. Identity and Access Management (IAM) and Firebase Auth are built in to help make sure your application and its data remain secure. Tight integration with Cloud Functions, Cloud Storage, and Firebase's SDK accelerates and simplifies building end-to-end serverless applications. You can also easily export data into BigQuery for powerful analysis, post-processing of data, and machine learning.
Building with Cloud Firestore means your app can seamlessly transition from online to offline and back at the edge of connectivity. This helps lead to simpler code and fewer errors. You can serve rich user experiences and push data updates to more than a million concurrent clients, all without having to set up and maintain infrastructure. Cloud Firestore's strong consistency guarantee helps to minimize application code complexity and reduces bugs. A client-side application can even talk directly to the database, because enterprise-grade security is built right in. Unlike most other NoSQL databases, Cloud Firestore supports modifying up to 500 collections and documents in a single transaction while still automatically scaling to exactly match your workload.
What's new with Cloud Firestore
New regional instance pricing. This new pricing takes effect on March 3, 2019 for most regional instances, and is as low as 50% of multi-region instance prices.
Data in regional instances is replicated across multiple zones within a region. This is optimized for lower cost and lower write latency. We recommend multi-region instances when you want to maximize the availability and durability of your database.
SLA now available. You can now take advantage of Cloud Firestore's SLA: 99.999% availability for multi-region instances and 99.99% availability for regional instances.
New locations available. There are 10 new locations for Cloud Firestore:
Multi-region:
  Europe (eur3)
North America (Regional):
  Los Angeles (us-west2)
  Montréal (northamerica-northeast1)
  Northern Virginia (us-east4)
South America (Regional):
  São Paulo (southamerica-east1)
Europe (Regional):
  London (europe-west2)
Asia (Regional):
  Mumbai (asia-south1)
  Hong Kong (asia-east2)
  Tokyo (asia-northeast1)
Australia (Regional):
  Sydney (australia-southeast1)
Cloud Firestore is now available in 13 regions.
Stackdriver integration (in beta). You can now monitor Cloud Firestore read, write and delete operations in near-real time with Stackdriver.
More features coming soon. We're working on adding some of the most requested features to Cloud Firestore from our developer community, such as querying for documents across collections and incrementing database values without needing a transaction.
As the next generation of Cloud Datastore, Cloud Firestore is compatible with all Cloud Datastore APIs and client libraries. Existing Cloud Datastore users will be live-upgraded to Cloud Firestore automatically later in 2019. You can learn more about this upgrade here.
Adding flexibility and scalability across industries
Cloud Firestore is already changing the way companies build apps in media, IoT, mobility, digital agencies, real estate, and many others. The unifying themes among these workloads include: the need for mobility even when connectivity lapses, scalability for many users, and the ability to move quickly from prototype to production. Here are a few of the stories we've heard from Cloud Firestore users.
When opportunity strikes...
In the highly competitive world of shared, on-demand personal mobility via cars, bikes, and scooters, the ability to deliver a differentiated user experience, iterate rapidly, and scale are critical. The prize is huge. Skip provides a scooter-sharing system where shipping fast can have a big impact. Mike Wadhera, CTO and Co-founder, says, "Cloud Firestore has enabled our engineering and product teams to ship at the clock-speed of a startup while leveraging Google-scale infrastructure. We're delighted to see continued investment in Firebase and the broader GCP platform."
Another Cloud Firestore user, digital consultancy The Nerdery, has to deliver high-quality results in a short period of time, often needing to integrate with existing third-party data sources. They can't build up and tear down complicated, expensive infrastructure for every client app they create. "Cloud Firestore was a great fit for the web and mobile applications we built because it required a solution to keep 40,000-plus users apprised of real-time data updates," says Jansen Price, Principal Software Architect. "The reliability and speed of Cloud Firestore coupled with its real-time capabilities allowed us to deliver a great product for the Google Cloud Next conferences."
Reliable information delivery
Incident response company Now IMS uses real-time data to keep citizens safe in crowded places, where cell service can get spotty when demand is high. "As an incident management company, real-time and offline capabilities are paramount to our customers," says John Rodkey, Co-founder. "Cloud Firestore, along with the Firebase Javascript SDK, provides us with these capabilities out of the box. This new 100% serverless architecture on Google Cloud enables us to focus on rapid application development to meet our customers' needs instead of worrying about infrastructure or server management like with our previous cloud."
Regardless of the app, users want the latest information right away, without having to click refresh. The QuintoAndar mobile application connects tenants and landlords in Brazil for easier apartment rentals. "Being able to deliver constantly changing information to our customers allows us to provide a truly engaging experience. Cloud Firestore enables us to do this without additional infrastructure and allows us to focus on the core challenges of our business," says Guilherme Salerno, Engineering Manager at QuintoAndar.
Real-time, responsive apps, happy users
Famed broadsheet and media company The Telegraph uses Cloud Firestore so registered users can easily discover and engage with relevant content. The Telegraph wanted to make the user experience better without having to become infrastructure experts in serving and managing data to millions of concurrent connections. "Cloud Firestore allowed us to build a real-time personalized news feed, keeping readers informed with synchronized content state across all of their devices," says Alex Mansfield-Scaddan, Solution Architect. "It allowed The Telegraph engineering teams to focus on improving engagement with our customers, rather than becoming real-time database and infrastructure experts."
On the other side of the Atlantic, The New York Times used Cloud Firestore to build a feature in The Times' mobile app to send push notifications updated in real time for the 2018 Winter Olympics. In previous approaches to this feature, scaling had been a challenge. The team needed to track each reader's history of interactions in order to provide tailored content for particular events or sports. Cloud Firestore allowed them to query data dynamically, then send the real-time updates to readers. The team was able to send more targeted content faster.
Delivering powerful edge storage for IoT devices
Athlete testing technology company Hawkin Dynamics was an early, pre-beta adopter of Cloud Firestore. Their pressure pads are used by many professional sports teams to measure and track athlete performance. In the fast-paced, high-stakes world of professional sports, athletes can't wait around for devices to connect or results to calculate. They demand instant answers even if the WiFi is temporarily down. Hawkin Dynamics uses Cloud Firestore to bring real-time data to athletes through their app dashboard, shown below.
"Our core mission at Hawkin Dynamics is to help coaches make informed decisions regarding their athletes through the use of actionable data. With real-time updates, our users can get the data they need to adjust an athlete's training on a moment-by-moment basis," says Chris Wales, CTO. "By utilizing the powerful querying ability of Cloud Firestore, we can provide them the insights they need to evaluate the overall efficacy of their programs. The close integrations with Cloud Functions and the other Firebase products have allowed us to constantly improve on our product and stay responsive to our customers' needs. In an industry that is rapidly changing, the flexibility afforded to us by Cloud Firestore in extending our applications has allowed us to stay ahead of the game."
Getting started with Cloud Firestore
We've heard from many of you that Cloud Firestore is helping solve some of your most timely development challenges by simplifying real-time data and data synchronization, eliminating server-side code, and providing flexible yet secure database authentication rules. This reflects the state of the cloud app market, where developers are exploring lots of options to help them build better and faster while also providing modern user experiences. This glance at Stack Overflow questions gives a good picture of some of these trends, where Cloud Firestore is a hot topic among cloud databases.
We've seen close to a million Cloud Firestore databases created since its beta launch. The platform is designed to serve databases ranging in size from kilobytes to multiple petabytes of data. Even a single application running on Cloud Firestore is delivering more than 1 million real-time updates per second to users. These apps are just the beginning. To learn more about serverless application development, take a look through the archive of the recent application development digital conference.