
A closer look at the HANA ecosystem on Google Cloud Platform



Since we announced our partnership with SAP in early 2017, we’ve rapidly expanded our support for SAP HANA, SAP’s in-memory, column-oriented, relational database management system. From the beginning, we knew we’d need to build tools that integrate SAP HANA with Google Cloud Platform (GCP), making it faster and easier for developers and administrators to take advantage of the platform.

In this blog post, we’ll walk you through the evolution of SAP HANA on GCP and take a deeper look at the ecosystem we’ve built to support our customers.

The evolution of SAP HANA on GCP


For many enterprise customers running SAP HANA databases, instances with large amounts of memory are essential. That’s why we’ve been working to make virtual machines with larger memory configurations available for SAP HANA workloads.

Our initial work with SAP on the certification process for SAP HANA began in early 2017. In the 15 months since, we’ve rapidly evolved from instances with 208GB of memory to 4TB, allowing us to support single-node SAP HANA installations of up to 4TB.

Smart Data Access — Google BigQuery

Google BigQuery, our serverless data warehouse, enables low-cost, high-performance analytics at petabyte scale. We’ve worked with SAP to natively integrate SAP HANA with BigQuery through Smart Data Access, which allows you to extend SAP HANA’s capabilities and query data stored within BigQuery by means of virtual tables. This support has been available since SAP HANA 2.0 SPS 03, and you can try it out by following this step-by-step codelabs tutorial.

Fully automated deployment of SAP HANA

Manual deployment can be time-consuming, error-prone and cumbersome. It’s important to reduce or eliminate the margin for error and to make deployments conform to SAP’s best practices and standards.


To address this, we’ve launched deployment templates that fully automate the deployment of single node and scale-out SAP HANA configurations on GCP. In addition, we’ve also launched a new deployment automation template that creates a high availability SAP HANA configuration with automatic failover.

With these templates, you have access to fully configured, ready-to-go SAP HANA environments in a matter of minutes. You can also see the resources you created, and a complete catalog of all your deployments, in one location through the GCP console. We’ve also made the deployment process fully visible by providing deployment logs through Google Stackdriver.

Monitoring SAP HANA with Stackdriver

Visibility into what’s happening inside your SAP HANA database can help you identify factors impacting your database, and prepare accordingly. For example, a time series view of how resource utilization or latency within the SAP HANA database changes over time can help administrators plan in advance, and in many cases successfully troubleshoot issues.

Stackdriver provides monitoring, logging, and diagnostics to better understand the health, performance, and availability of cloud-powered applications. Stackdriver’s integration with SAP HANA helps administrators monitor their SAP HANA databases, notifying and alerting them so they can proactively fix issues.

More information on this integration is available in our documentation.

TensorFlow support in SAP HANA

SAP has offered support for TensorFlow Serving beginning with SAP HANA 2.0. This lets you build inference directly into SAP HANA through custom machine learning models hosted in TensorFlow Serving applications running on Google Compute Engine.

You can easily build a continuous training pipeline by exporting data in your SAP HANA database to Google Cloud Storage and then using Cloud TPUs to train deep learning models. These models can then be hosted with TensorFlow Serving to be used for inference within SAP HANA.
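As a rough illustration, the export step of such a pipeline might look like the following Python sketch. It assumes the hdbcli SAP HANA client and the google-cloud-storage library; the host, credentials, table and bucket names are all hypothetical.

import csv
import io

from hdbcli import dbapi
from google.cloud import storage

# Pull training rows out of SAP HANA (connection details are placeholders).
conn = dbapi.connect(address='hana.example.com', port=30015,
                     user='ML_USER', password='...')
cursor = conn.cursor()
cursor.execute('SELECT FEATURE_A, FEATURE_B, LABEL FROM ML_TRAINING_DATA')

# Serialize the result set as CSV in memory.
buf = io.StringIO()
csv.writer(buf).writerows(cursor.fetchall())
cursor.close()
conn.close()

# Stage the extract in Cloud Storage, where a training job can read it.
bucket = storage.Client().bucket('my-training-data')
bucket.blob('exports/training_data.csv').upload_from_string(buf.getvalue())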

SAP Cloud Platform, SAP HANA as a Service

SAP HANA as a Service is now also deployable to Google Cloud Platform. This fully managed cloud service lets developers leverage the power of SAP HANA without spending time on operational and administrative tasks, and it is especially well suited to customers who want to innovate rapidly and reduce time to value.

SAP HANA, express edition on Google Cloud Launcher

SAP HANA, express edition is meant for developers and technologists who prefer a hands-on learning experience. A free license is included, giving users access to a large catalog of tutorial content, online courses and samples to get started with SAP HANA. Google Cloud Launcher provides a fast and effective provisioning experience for SAP HANA, express edition.

Conclusion

These developments are all part of our continuing goal to make Google Cloud the best place to run SAP applications. We’ll continue listening to your feedback, and we’ll have more updates to share in the coming months. In the meantime, you can learn more about SAP HANA on GCP by visiting our website. And if you’d like to learn about all our announcements at SAPPHIRE NOW, read our Google Cloud blog post.

Last month today: GCP in May



When it comes to Google Cloud Platform (GCP), every month is chock full of news and information. We’re kicking off a monthly recap of key moments you may have missed.

What caught your attention this month:

Announcements about open source projects were some of the most-read posts this month.
  • Open-sourcing gVisor, a sandboxed container runtime, was by far your favorite post in May. gVisor is a sandbox that lets you run containers in strongly isolated environments. It’s isolated like a virtual machine, but more lightweight and also more flexible, since it interfaces with the host OS just like another process.
  • Our introduction of Asylo, an open-source framework for confidential computing, also got your attention. As more and more sensitive workloads move to the cloud, lots of businesses want to be able to verify that those workloads are properly isolated, inside a closed environment that’s only available to authorized users. Asylo democratizes trusted execution environments (TEEs) by allowing them to run on generic hardware. With Asylo, developers will be able to run their workloads encrypted in a highly secure environment, whether it’s on-premises or in the cloud.
  • Rounding out the open-source fun for the month was our introduction of the beta availability of Cloud Memorystore, a fully managed in-memory data store service for Redis. Cloud Memorystore gives you the caching power of Redis to reduce latency, without having to manage the details.


Hot topics: Kubernetes, DevOps and SRE

Google Kubernetes Engine 1.10 debuted in May, and we had a lot to say about the new features that this version enables—from security to brand-new monitoring functionality via Stackdriver Kubernetes Monitoring to networking. Start with this post to see what’s new and how customers like Spotify are using Kubernetes Engine on Google Cloud.

And one of our recent posts also struck a chord, as two of our site reliability engineering (SRE) experts delved into the differences—and similarities—between SRE and DevOps. They have similar goals, mostly around creating flexible, agile dev environments, but SRE generally gets much more specific and prescriptive than DevOps in accomplishing them.

Under the radar: GCP adds infrastructure options

As you look for new ways to use GCP to run your business, our engineers are adding features and new releases to give you more power, resources and coverage.

First, we introduced ultramem Google Compute Engine machine types, which offer more memory and compute resources than any other Compute Engine VM instance. These machine types are especially useful for those of you running enterprise workloads that need a lot of memory, like data analytics or high-performance applications.

We’ve been busy on the back end in other ways too, as we continue adding new regional cloud computing infrastructure. Our third zone of the Singapore region opened in May, and we’ll open a Zurich region next year.

Stay tuned in June for more on the technologies behind Google Cloud—we’ve got lots up our sleeve.

Troubleshooting tips: How to talk so your cloud provider will listen (and understand)



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and as a teaser, this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part one of two; when you’re done, check out the second installment, specifically on troubleshooting cloud provider communications.

Effective technology troubleshooting requires a systematic approach, as opposed to luck or experience. Troubleshooting can be a learned skill, as discussed in our site reliability engineering (SRE) troubleshooting primer.

But how does that change when you and your operations team are running systems and services on cloud infrastructure? Regardless of where your websites or apps live, you’re the one getting paged when the site goes down and you are the one under pressure to solve the problems and answer questions.

Cloud presents a new way of working for IT teams shifting away from legacy systems. You had full visibility and control of all aspects of your system when it was on-premises, but now you depend on off-site, cloud-based infrastructure, into which you may have limited visibility or understanding. That’s true no matter how top-notch your systems and processes are and how great a troubleshooter you are. You simply can’t see much into many cloud provider systems. The troubleshooting challenge is no longer limited to pure debugging; you also need to effectively communicate with your cloud provider. You’ll need to engage each provider’s support process and engineers to find where problems originated and fix them as soon as possible. This is an area of opportunity for IT teams as they gain new skills and adapt to the cloud-based technology model.

It’s definitely possible to do cloud troubleshooting right. When you’re choosing a provider, make sure to understand how they’ll work with you during any issues. Once you’re working with a provider, you have more control than you think over how you communicate issues and speed along their resolution. We propose this model, inspired by the SRE model of actionable improvements, for working with your cloud provider to troubleshoot more effectively and efficiently. (For more on cloud provider communications throughout the troubleshooting process, see the companion post.)

Understand your cloud provider's support workflow

Your goal here is to figure out the best way to provide the right information to your provider when an issue inevitably arises. This is a useful place to start, especially since you may depend on multiple cloud providers to run your infrastructure. Your interaction with cloud support will typically begin with questions related to migration. Your relationship then progresses into the domain of production integration, and finally, joint troubleshooting of production problems.

Keep in mind that different cloud providers have different philosophies when it comes to customer interaction. Some provide a large degree of freedom and little direct support. They expect you to find answers to most of your questions from online forums such as Stack Overflow. Other providers emphasize tight customer integration and joint troubleshooting. You have some homework to do before you begin serving real customer traffic from your cloud deployment. Talk to your cloud provider's salespeople, project managers and support engineers to get a sense of how they approach support. Ask them the following questions:
  • What does the lifecycle of a typical support issue report look like?
  • What is your internal escalation process if an issue becomes complex or critical?
  • Do you have an internal SLO for <service name>? If so, what is that SLO?
  • What types of premium support are available?
This step is critical in reducing your frustration when you have to troubleshoot an issue that involves a cloud provider’s giant black box.

Communicate with cloud provider support efficiently

Once you have a sense of the provider’s support workflow, you can figure out the best way to get your information across. There are some best practices for filing the perfect issue report with cloud provider support teams, including what to say in your issue report and why. These guidelines follow the 80/20 rule: we try to give 20% of the details that will be useful in 80% of your issue reports. The same principles apply if you're filing bug reports in issue trackers or posting to user groups and forums.

Your guiding principle when communicating to cloud providers should be clarity: specify the appropriate level of technical detail and communicate expectations explicitly.

Provide basic information
It may seem like common sense, but these basics are essential to include in an issue report. Failing to provide any of these details leads to delays and a poor experience.

Include four critical details
Effective troubleshooting starts with time, product, location and specific identifiers.

1. Time
Here are a few examples of including time in an issue report:
  • Starting at 2017-09-08 15:13 PDT and ending 5 minutes later, we observed...
  • Observed intermittently, starting no earlier than 2017-09-10 and observed 2-5 times...
  • Ongoing since 2017-09-08 15:13 PDT...
  • From 2017-09-08 15:13 PDT to 2017-09-08 22:22 PDT...
Including the onset time and duration allows support teams to focus their time-series monitoring on the relevant period. Be explicit about whether the issue is ongoing or was only observed in the past; if it is not ongoing, provide an end time if possible.

Remember to always include the time zone. ISO 8601 format is a good choice because it is unambiguous and easy to sort. If you instead specify time in a relative way, whoever you're working with must convert local time into an absolute format they can input into time-series monitoring tools. This conversion is error-prone and costly: for example, sending an email about something that happened "earlier yesterday" means that your counterpart has to search for an email header and start doing mental math. This introduces cognitive load, which decreases the person’s mental energy available for solving the technical problem.
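If you generate timestamps programmatically, it's easy to emit them in ISO 8601 with an explicit UTC offset. A minimal Python 3 sketch:

import datetime

# Capture the incident start time in UTC and print it in ISO 8601 format.
incident_start = datetime.datetime.now(datetime.timezone.utc)
print(incident_start.isoformat())  # e.g. 2017-09-08T22:13:00.386257+00:00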

If an issue was intermittent over some period of time, state when it was first observed, or perhaps note a time in the past when it was surely not happening. Include the frequency of observations and note one or two specific examples.

Meanwhile, here are some antipatterns to avoid:
  • Earlier today: Not specific enough
  • Yesterday: Requires the recipient to figure out the implied date; can be confusing especially when work crosses the International Date Line
  • 9/8: Ambiguous, as the date might be interpreted as September 8 in the United States or August 9 in other locales. Use ISO 8601 format for clarity.
2. Product
Be as specific as possible in your issue report about the product you're using, including version information where applicable. These, for example, aren’t specific enough to locate the components or logs that can help with diagnosis:
  • REST API returned errors...
  • The data mining query interface is hanging...
Ideally, you should refer to specific APIs or URLs, or include screenshots. If the issue originates in a specific component or tool (for example, the CLI or Terraform), clearly identify that tool. If multiple products are involved, be specific about each one. Describe the behavior you're observing, and the behavior you expected to occur.

Antipatterns:
  • Can't create virtual machine: It's not clear how you're attempting to create the machine, nor does it say what the failure mode is.
  • The CLI command is giving an error:
    • Instead, provide the specific error, and the command syntax so others can run the command themselves.
    • Better: I ran 'mktool create my-instance --zone us-central1' and got the following error message...
3. Location
It's important to specify the region and zone because cloud providers often roll out changes to one region or zone at a time. Therefore, region or zone is a proxy for a cloud-internal software version number.
These are examples of how you might include location information in an issue report:
  • In us-east1-a... 
  • I tried regions eu-west-1 and eu-west-3...
Given this information, the support team can see if there's a rollout underway in a given location, or map your issue to an internal release ID for use in an internal bug report.

4. Specific identifiers
Project identifiers are included in many troubleshooting tools. Specify whether you observed the error in multiple projects, or in one project but not another.

These are examples of specific identifiers:
  • In project 123412341234 or my-project-id... 
  • Across multiple projects (including 123412341234)... 
  • Connecting to cloud external IP 218.239.8.9 from our corporate gateway 56.56.56.56... 
IP addresses are another form of unambiguous identifiers, though they also require additional details when used in an issue report. When specifying an IP, try to describe the context of how it's used. For example, specify whether the IP is connected to a VM instance, a load balancer or a custom route, or if it's an API endpoint. If the IP address isn't part of the cloud platform (for example, your home internet, a VPN endpoint or an external monitoring system), specify that information. Remember, 192.168.0.1 occurs many times in the world.

Antipatterns:
  • One of our instances is unreachable…: Overly vague 
  • We can't connect from the Internet...: Overly vague 
Note that other models, such as the Five Ws (popular in fields like journalism), can provide structure to your report.

Specify impact and response expectations
Along with the basics, your issue report should tell your cloud provider how the issue is affecting your business and when it needs to be resolved.

Priority expectations
Cloud provider support commonly uses the priority you specify to route the issue report initially and to determine the urgency of the issue. The priority rating drives the speed of response, potentially paging on-call personnel.

In addition to selecting the appropriate priority, it's useful to add a sentence describing the impact. Help avoid incorrect assumptions by being explicit about why you selected P1.

Think of the priority in terms of the impact to your business. Your cloud provider may have what appear to be strict definitions of priority (e.g., P1 signifies a total outage). Don't let these definitions slow progress on issues that are business critical; extrapolate the impact if your issue isn't addressed, or describe the worst-case scenario of an exposure related to the issue. For example, the following two descriptions essentially describe impending P1 issues:
  • Our backlog is increasing. Current impact is minor, but if this issue is not fixed in 12 hrs, we're effectively down. 
  • A key monitoring component has failed. While there is no current impact, this means we have a blind spot that will cause the next failure to become a user visible outage. 
Response time expectations
If you have specific needs related to response time, indicate them clearly. For example, you might specify "I need a response by 5pm because that's when my shift ends." If you have internal outage communication SLOs, make sure you request a response time from your provider that is within that interval so you can meet those SLOs. Cloud providers likely have 24/7 support, but if this isn't the case, or if your relevant personnel are in a particular time zone, communicate your time zone-specific needs to your provider.

Including these details in your issue report will ideally save you time later and speed up your overall provider resolution process. Check out part two for tips on communicating with your cloud provider throughout the actual troubleshooting process.

Related content:
SRE vs. DevOps
Incident management at Google—adventures in SRE-land
Applying the Escalation Policy — CRE life lessons
Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Troubleshooting tips: Help your cloud provider help you



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part two of two. Check out part one on writing better issue reports for cloud provider support.

Troubleshooting computer systems is an act as old as computers themselves. Some might even call it an art. The cloud computing paradigm entails a fundamental change to how IT teams conduct troubleshooting.

Successful IT troubleshooting doesn’t depend only on luck or experience; it is a deliberate process that can be taught. When you’re using cloud-based infrastructure, you’re often troubleshooting via a cloud provider’s help desk, which adds another layer between you and the users you’re helping. Because of this shift away from the traditional IT team model, your communications with the provider are essential. (See part one for more on putting together an effective issue report to improve troubleshooting from the start.)

Once you’ve communicated the issue to your provider, you’ll be working with the provider’s support team to get the issue fixed.

The essentials of cloud troubleshooting

Those diagnosing a technical problem with cloud infrastructure are seeking possible explanations (hypotheses) and evidence that explains the problem. In the short term, they look for changes in the system that roughly correlate with the problem, and consider rolling back, as a first step to mitigate the problem and stop the bleeding. The longer-term goal is to identify and fix the root cause so the problem will not recur.

From the site reliability engineering (SRE) perspective, the general approach for troubleshooting is as follows:

  • Triage: Mitigate the impact if possible
  • Examine: Gather observations and share them
  • Diagnose: Create a hypothesis that explains the observations
  • Test and treat:
    • Identify tests that may prove or disprove the hypothesis
    • Execute the tests and agree on the meaning of the result
    • Move on to the next hypothesis; repeat until solved


When you’re working with a cloud provider on troubleshooting an issue, there are parts of the process you’re unable to control. But you can follow the steps on your end. Here’s what you can do when submitting a report to your cloud provider support team.

1. Communicate any troubleshooting you've already done
By the time you open an issue report, you've probably done some troubleshooting already. You may have checked the provider’s status page, for example. Share the steps you've taken and any key findings. Keep a timeline and log book of what you have done and share it with the provider; start keeping that log book as soon as you detect the problem. Keep in mind that while cloud providers may have telemetry that provides real-time, omniscient awareness of the state of their infrastructure, the dependencies that result from your particular implementation may be less obvious. By design, your particular use of cloud resources is proprietary and private, so your troubleshooting vantage point is vital.

If you think you have a diagnosis, explain how you came to that conclusion. If you think others can reproduce the issue, include the steps to do so. A reproducible test in an issue report usually leads to the fastest resolution.

You may have an idea or guess about what's causing the problem. Be careful to avoid confirmation bias—looking for evidence to support your guess without considering evidence to the contrary.

2. Be specific and explicit about the issue
If you've ever played the telephone game, in which players whisper a message from person to person, you've seen how human translation and interpretation can lead to communication gaps. Rather than describing information in your provider communications, try to share it. Doing so reduces the chance that your reader will misinterpret what you're saying and can help speed up troubleshooting. Don’t assume that your provider has access to all of this information; customer privacy means that they may not, by design.

For example:

  • Use a screenshot to show exactly what you see
  • For web-based interfaces, provide a .HAR (Http ARchive) file
  • Attach information like tcpdump output, logs snippets and example stack traces

3. Report production outages quickly
An issue is considered to be a production outage if your application has stopped serving traffic to users or is experiencing similar business-critical impact. Report production outages to your cloud provider support as soon as possible. Issues that block a small number of developers in a developer test environment are normally not considered production outages, so they should be reported at lower priorities.

Normally, when cloud provider support is alerted about a production outage, they quickly triage the situation with the following steps:

  1. Immediately check for known issues affecting the infrastructure.
  2. Confirm the nature of the issue.
  3. Establish communication channels.


Typically, you can expect a quick response with a brief message, which might contain:

  • Whether or not there is a known issue affecting multiple customers
  • An acknowledgement that they can observe the issue you've reported or a request for more details
  • How they intend to communicate (for example, phone, Skype, or issue report)


It’s important to quickly create an issue report including the four critical details (described in part one) and then begin deeper troubleshooting on your side of the equation. If your organization has a defined incident management process (see Managing Incidents), escalating to your cloud provider should be among your initial steps.

4. Report networking issues with specificity
Most cloud providers’ networks are huge and complex, composed of many technologies and teams. It's important to quickly identify a networking-specific problem as such and engage with the team that can repair it.

Many networking issues have similar symptoms, like "can't connect to server," at a high level. This level of detail is typically too generic to be useful in identifying the root cause, so you need to provide more diagnostic information. Network issues relate to connectivity, which always involves at least two specific points: source and destination. Always include information about these points when reporting network issues.

To structure your issue report, use the conceptual tool of a packet flow diagram:

  • Describe the important hops that a packet takes along a path from source to destination, along with any significant transformations (e.g., NAT) along the way.
  • Start by identifying the affected network endpoints by Internet IP address or by RFC 1918 private address, plus an ASN for the network.
  • Note anything meaningful about the endpoints, such as who controls them and whether they are associated with a DNS hostname. 
  • Note any intermediate encapsulation and/or indirection. For example: VPN tunneling, proxies or NAT gateways.
  • Note any intermediate filtering, like firewalls, CDN or WAF.


Many problems that manifest as high latency or intermittent packet loss will require a path analysis and/or a packet capture for diagnosis. Path analysis is a list of all hops that packets traverse (for example, MTR or tcptraceroute). A packet capture (a.k.a. pcap, derived from the name of the library libpcap) is an observation of real network traffic. It's important to take a packet capture for both endpoints, at the same time, which can be tricky. Practice with the necessary tools (for example tcpdump or Wireshark) and make sure they are installed before you need them.

5. Escalate when appropriate
If circumstances change, you may need to escalate the urgency of an issue so it receives attention quickly. Take this step if business impact increases, if an issue is stuck without progress after a lot of back-and-forth with support, or if some other factor calls for quicker resolution.

The most explicit way to escalate an issue is to change the priority of the issue report (for example, from P3 to P2). Provide comments about why you need to escalate so support can respond appropriately.

6. Create a summary document for long-running or difficult issues
Issue state and relevant information change over time as new facts come to light and hypotheses are ruled out. In the meantime, new people join the investigation. Help communicate relevant, up-to-date information by collecting information in a summary document.

A good summary document has the following dimensions:

  • The latest state summarized at the top
  • Links to all relevant issue reports and internal tracking bugs
  • A list of hypotheses that are still potentially true, and hypotheses that have been ruled out already. When you start investigating a particular hypothesis, note that you are doing so, and mention the tests or tools that you intend to use. Often, you can get good advice or prevent duplicate work.


SAMPLE summary document format:

$TIMESTAMP
<Current customer impact> <Working theory and actions being taken> <Next steps>

13:00:00
Customer impact has been mitigated and resolved. Our networking provider was throttling our traffic because we forgot to pay our bill last month. Next step is to be nicer to our finance team.

12:00:00
More than 100 customers are actively complaining about not being able to reach our service. Our networking provider is throttling customer traffic to one of our load balancers. The response team is actively working with our networking provider’s tier 1 support to understand why and how this happened.

11:00:00
We have now received 100 complaints from 50 customers from four different geos that they cannot consistently reach our API at api.acme.com. Our engineers currently believe that an upstream networking issue is causing this. Next steps are to reach out to our networking provider to see if there are any upstream issues.

10:00:00
We have received five complaints from five customers that they are unable to reach api.acme.com. Our engineers are looking into the issue.


Try to keep each issue report focused on a single issue. Don't reopen an issue report to bring up a new issue, even if it's related to the original issue. Do reference similar issues in your new report to help your provider recognize patterns from systemic root causes.

Keep your communication skills sharp

Communicating highly detailed technical information in a clear and actionable manner can be difficult. Doing so requires focus and specific skills. This task is particularly challenging in stressful situations, because our biological response to stress works against the need for clear cognitive reasoning. The following techniques help make communication easier for everyone.

Help reduce cognitive load by writing a detailed issue report
Many issue reports require the reader to make inferences or calculations. This introduces cognitive load, which decreases the mental energy available for solving the technical problem.

When writing an issue report, be as specific and detailed as possible. While this attention to detail requires more time on the part of the writer, consider that an issue report is written once but read many times by many people. People can solve the problem faster together when equipped with comprehensive information. Avoid acronyms and internal company code names. Also, be mindful of protecting customer privacy when disclosing any information to any third party.

Use narrative techniques
Once upon a time, in a land far, far away...

Humans are very good at absorbing information in the form of stories, so you can get your point across quite effectively this way. Start with the context: What was happening when you first observed the problem? What potential fixes did you try? Who are the characters involved, and why does the issue matter to them?
Include visuals
Illustrate your issue report with any supporting images you have available, like formatted text, charts and screenshots.

Text formatting
Formatted text like log lines, code excerpts or MySQL results often becomes illegible when sent through plain-text emails. Add explicit markers (for example, <<<<<< at the end of the line) to help direct attention to important sections. You can use footnotes to point to long-form URLs, or use a URL shortener.

Use bullet points to format lists, and to call out important details like instance names. Use numbered lists to enumerate series of steps.

Charts
Charts are very useful for understanding time-series data. When you’re sending charts with an issue report, keep these best practices in mind:

  • Take a screenshot, including title and axis labels. For absolute values, specify units (requests per minute, errors per second, etc).
  • Annotate the screenshot with arrows or circles to call out important points.
  • Briefly describe what the chart is measuring.
  • Briefly describe how the chart normally looks.
  • In a few sentences, describe your interpretation of the chart and why it is relevant to the problem.


Avoid the following antipatterns:

  • The Y-axis represents a specific error (e.g., exceptions in my-handler) and has no clear relationship to the problem under investigation (e.g., high persistence-layer latency). To remedy this situation, explain why the graph is relevant to the problem.
  • The Y-axis is an absolute number (e.g., 1M per minute) that provides no context about the relative impact.
  • The X-axis doesn't have a time zone.
  • The Y-axis is not zero-based. This can make minor changes in the Y value seem very large.
  • Axis labels are missing or cut off.

Well-crafted issue reports, along with strong communication with your cloud provider, can speed up the resolution process. The cloud computing model has drastically changed the way that IT teams troubleshoot computer systems. Technical savvy is no longer the sole skill set necessary for effective troubleshooting; you must also be able to communicate clearly and efficiently with cloud providers. While the reality of your deployment may be unique, nuanced, and complex, these building blocks can help you navigate this territory.

Related content:
SLOs, SLIs, SLAs, oh my - CRE life lessons
SRE vs. DevOps: competing standards or close friends?
Introducing Google Customer Reliability Engineering


Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Cloud Source Repositories: more than just a private Git repository



If your goal is to release software continuously at high velocity, you need to be able to automatically build, test, deploy, and debug your code changes, all within minutes. But first you need to integrate your version control systems and your build, deploy, and debugging tools—a time-consuming and complicated process that requires numerous manual configuration steps like downloading plugins and setting up webhooks. And when you’re done, the workflow still isn’t very integrated, forcing developers to jump from one tool to another as they go from code to deployment. So much for high velocity.

Cloud Source Repositories, our fully managed private Git repositories hosted on Google Cloud Platform (GCP), are tightly integrated with other GCP tools, making it easy to automatically build, test, deploy, and debug code right out of the gate. With just a few clicks and without any additional setup or configuration, you can extend Cloud Source Repositories with other GCP tools to perform other tasks as a part of your development workflow. In this post, let’s take a closer look at some of the GCP tools that are integrated with Cloud Source Repositories, and how they simplify developer workflows.

Simplified continuous integration (CI) with Container Builder

Looking to implement continuous integration and validate each check-in to a shared repository with an automated build and test? The integration of Cloud Source Repositories with Container Builder comes in handy here, making it easy to set up CI on a branch or tag. There are no CI servers to set up or repositories to configure. In fact, you can enable a CI process on any existing or new repo in Cloud Source Repositories; simply specify the trigger on which Container Builder should build the image. In the following example, the trigger specifies that a build runs whenever changes are pushed to any branch of Cloud Source Repositories.
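Triggers are typically created through the console UI, but you can also express one through the Container Builder (Cloud Build) API. Here's a minimal, hypothetical sketch using the Google API Python client; the project, repository and image names are placeholders, not values from this example.

import googleapiclient.discovery

cloudbuild = googleapiclient.discovery.build('cloudbuild', 'v1')

# Fire on pushes to any branch of the repo, build the image, and push it.
trigger = {
    'triggerTemplate': {
        'projectId': 'my-project',
        'repoName': 'hello-world-repo',
        'branchName': '.*',  # regular expression: any branch
    },
    'build': {
        'steps': [{
            'name': 'gcr.io/cloud-builders/docker',
            'args': ['build', '-t', 'gcr.io/my-project/hello-app:$REVISION_ID', '.'],
        }],
        'images': ['gcr.io/my-project/hello-app:$REVISION_ID'],
    },
}
cloudbuild.projects().triggers().create(
    projectId='my-project', body=trigger).execute()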


To demonstrate this trigger in action, the example below changes the background color of the “Hello World” website from yellow to blue.

The first step is to set blue as the background color using the background-color CSS property. Next, you add the changed file to the index with a git add command and record the changes to the repository with git commit. The commits are then pushed to the remote server using git push.

Because of the trigger defined above, a build starts automatically as soon as the changes are pushed to Cloud Source Repositories: Container Builder builds a new image based on the changes. Once the image is created, the new version of the app is deployed using kubectl set image. The changes take effect, and the “Hello World” website now shows a blue background color.

Follow this quickstart to begin continuous integration with Container Builder & Cloud Source Repositories.

Pre-installed tools and programming languages in Cloud Shell and Cloud Shell Editor

Cloud Source Repositories is integrated out of the box with Cloud Shell and the Cloud Shell Editor. Cloud Shell provides browser-based command-line access, giving you an easy way to build and deploy applications. It comes preconfigured with common tools such as the MySQL client, kubectl, and Docker, as well as Java, Go, Python, Node.js, PHP and Ruby, so you don't have to spend time looking for the latest dependencies or installing software. The Cloud Shell Editor, meanwhile, acts as a cross-platform IDE for editing code with no setup.

Quick deployment to App Engine

The integration of Cloud Source Repositories and App Engine makes publishing applications a breeze. It lets developers focus on writing code, without worrying about managing the underlying infrastructure or scaling the app as demand grows. You can deploy source code stored in Cloud Source Repositories to App Engine with the gcloud app deploy command, which automatically builds an image and deploys it to the App Engine flexible environment. Let’s see this in action.

In the following example, we’ll change the text on the website from “Hello Universe” to “Hello World” before deploying it. As in the previous example, git add stages the changed file and git commit records the change. Next, the git push command pushes the changes to the master branch.

Once the changes have been pushed to Cloud Source Repositories, you can deploy the new version of the application by running the gcloud app deploy command from the directory where the app.yaml file is located.

The text has now changed from “Hello Universe” to “Hello, World!”.

Try deploying code stored in Cloud Source Repositories to App Engine by following the quickstart here.

Debug in production with Stackdriver Debugger

If your app is running in production and has problems, you need to troubleshoot issues quickly to avoid bad customer experiences. For debugging apps in production, creating breakpoints isn't really an option as you can’t suspend the program. To help locate the root cause of production issues quickly, Cloud Source Repositories is integrated with Stackdriver Debugger, which lets you debug applications in production without stopping or slowing the application.

Stackdriver Debugger allows you to either use a debug snapshot or debug logpoint to debug production applications. Debug Snapshot captures the call stack and variables at a specific code location the first time any instance of that code is executed. Debug Logpoint, on the other hand, writes the log messages to the log stream. You can set a debug snapshot or a debug logpoint for code stored in Cloud Source Repositories with a single click.

Debug Snapshot for debugging

In the following example, a snapshot has been set up for the second line of code in the get function of the MainPage class.

The right-hand panel displays details such as the call stack and the values of local variables in scope once the snapshot set above is reached.

Learn more about production debugging by following the quickstart here.

Debug Logpoint for debugging

The integration of Stackdriver with Cloud Source Repositories also allows for injecting logging statements without restarting the app. It lets you store, search, analyze, monitor, and alert on log data and events. As an example, a logging statement introduced in the above code is highlighted below.

The logs panel highlights the logs printed by the logpoint.

Version control with Cloud Functions

If you’re building a serverless app, you’ll be happy to know that Cloud Source Repositories is also integrated with Cloud Functions. You can store your function source code in Cloud Source Repositories and reference it from event-driven serverless apps. The code stored in Cloud Source Repositories can also be deployed in response to specific triggers, including HTTP requests, Cloud Pub/Sub events, and others. Changes made to function source code are automatically tracked over time, and you can roll back to the previous state of any repository.

In the following example, the “helloworld” function is deployed with an HTTP trigger. The source code for the function is located in the root directory of the Cloud Source repository.
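For reference, an HTTP-triggered function of this kind is just a source file in the repository. A minimal, hypothetical sketch of what a Python main.py might contain (the function and parameter names are illustrative, not taken from the example above):

# main.py — a minimal HTTP-triggered Cloud Function.
def helloworld(request):
    """Responds to an HTTP request with a simple greeting."""
    name = request.args.get('name', 'World')
    return 'Hello, {}!'.format(name)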

Learn more about deploying your function source code stored in Cloud Source Repositories using the quickstart here.

In short, the integration of Cloud Source Repositories with other Google Cloud tools lets your team go from code to deployment in minutes, all while managing versioning and aliasing. You even get the ability to perform production debugging on the fly using built-in monitoring and logging tools. Try Cloud Source Repositories along with these integrations here.

Sharding of timestamp-ordered data in Cloud Spanner



Cloud Spanner was designed from the ground up to offer horizontal scalability and a developer-friendly SQL interface. As a managed service, Cloud Spanner handles most database management tasks, but it’s up to you to ensure that there are no hotspots, as described in Schema Design Best Practices and Optimizing Schema Design for Cloud Spanner. In this article, we’ll look at how to efficiently insert and retrieve records with timestamp ordering. We’ll start with the high-level guidance provided in Anti-pattern: timestamp ordering and explore the scenario in more detail with a concrete example.

Scenario

Let’s say we’re building an app that logs user activity along with timestamps and also allows users to query this activity by user id and time range. A good primary key for the table storing user activity (let’s call it LogEntries) is (UserId, Timestamp). Log entries arrive in timestamp order, but because UserId comes first in the key, inserts are spread naturally across the keyspace, giving us a uniform distribution of activity logs.

Table LogEntries

UserId (PK)    | Timestamp (PK)               | LogEntry
15b7bd1f-8473  | 2018-05-01T15:16:03.386257Z  | ...


Here’s a sample query to retrieve a list of log entries by user and time range:

SELECT UserId, Timestamp, LogEntry
FROM LogEntries
   WHERE UserId = '15b7bd1f-8473'
   AND Timestamp BETWEEN '2018-05-01T15:14:10.386257Z'
   AND '2018-05-01T15:16:10.386257Z';
This query takes advantage of the primary key and thus performs well.

Now let’s make things more interesting. What if we wanted to group users by the company they work for so we can segment reports by company? This is a fairly common use case for Cloud Spanner, especially with multi-tenant SaaS applications. To support this, we create a table with the following schema.
Table LogEntries


CompanyId (PK) | UserId (PK)    | Timestamp (PK)               | LogEntry
Acme           | 15b7bd1f-8473  | 2018-05-01T15:16:03.386257Z  | ...


And here’s the corresponding query to retrieve the log entries:

SELECT CompanyId, UserId, Timestamp, LogEntry
FROM LogEntries
   WHERE CompanyId = 'Acme'
   AND UserId = '15b7bd1f-8473'
   AND Timestamp BETWEEN '2018-05-01T15:14:10.386257Z'
   AND '2018-05-01T15:16:10.386257Z';


Here’s the query to retrieve log entries by CompanyId and time range (user field not specified):

SELECT CompanyId, UserId, Timestamp, LogEntry
FROM LogEntries
   WHERE CompanyId = 'Acme'
   AND Timestamp BETWEEN '2018-05-01T15:14:10.386257Z'
   AND '2018-05-01T15:16:10.386257Z';
To support the above query, we add a separate, secondary index. Initially, we include just two columns:

CREATE INDEX LogEntriesByCompany ON LogEntries(CompanyId, Timestamp)

Challenge: hotspots during inserts


The challenge here is that some companies may have far more users (orders of magnitude more) than others, resulting in a very skewed distribution of log entries. The problem is particularly acute during inserts, as described in the opening paragraph above. And even if Cloud Spanner helps out by creating additional splits, the nodes that service the new splits become hotspots due to uneven key distribution.

The above diagram depicts a scenario where Company B has three times more users than Company A or Company C. Therefore, log entries corresponding to Company B grow at a higher rate, resulting in the hotspotting of nodes that service the splits where Company B’s log entries are being inserted.

Hotspot mitigation

There are multiple aspects to our hotspot mitigation strategy: schema design, index design and querying. Let’s look at each of these below.

Schema and index design 

As described in Anti-pattern: timestamp ordering, we’ll use application-level sharding to distribute data evenly. Let’s look at one particular approach for our scenario: instead of (CompanyId, UserId, Timestamp), we’ll use (UserId, CompanyId, Timestamp).

Table LogEntries (reorder columns CompanyId and UserId in Primary Key)


UserId (PK)    | CompanyId (PK) | Timestamp (PK)               | LogEntry
15b7bd1f-8473  | Acme           | 2018-05-01T15:16:03.386257Z  | ...


By placing UserId before CompanyId in the primary key, we can mitigate the hotspots caused by the non-uniform distribution of log entries across companies.

Now let’s look at the secondary index on CompanyId and timestamp. Since this index is meant to support queries that specify just CompanyId and timestamp, we cannot address the distribution problem by simply incorporating UserId. Keep in mind that indexes are also susceptible to hotspots and we need to design them so that their distribution is uniform.

To address this, we’ll add a new column, EntryShardId, where (in pseudo-code):
entryShardId = hash(CompanyId + timestamp) % num_shards
The hash function here could be a simple crc32 operation. Here’s a Python snippet illustrating how to calculate this shard ID before a log entry is inserted:
...
import datetime
import zlib
...
timestamp = datetime.datetime.utcnow()
companyId = 'Acme'
# zlib.crc32 expects bytes; the mask yields a non-negative 32-bit value.
entryShardId = (zlib.crc32((companyId + timestamp.isoformat()).encode('utf-8'))
                & 0xffffffff) % 10
...
In this case, num_shards = 10. You can adjust this value based on the characteristics of your workload. For instance, if one company in our scenario generates 100 times more log entries on average than the other companies, then we would pick 100 for num_shards in order to achieve a uniform distribution across entries from all companies.

This hashing approach essentially takes the sequential, timestamp-ordered LogEntriesByCompany index entries for a particular company and distributes them across multiple application (or logical) shards. In this case, we have 10 such shards per company, resulting from the crc32 and modulo operations shown above.

Table LogEntries (with EntryShardId added)


UserId (PK)    | CompanyId (PK) | Timestamp (PK)               | EntryShardId | LogEntry
15b7bd1f-8473  | Acme           | 2018-05-01T15:16:03.386257Z  | 8            | ...


And the index:
CREATE INDEX LogEntriesByCompany ON LogEntries(EntryShardId, CompanyId, Timestamp)
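To see how a write lands under this scheme, here’s a hedged sketch using the google-cloud-spanner Python client; the instance and database names are placeholders, and the LogEntry payload is invented for illustration.

import datetime
import zlib

from google.cloud import spanner

def compute_shard(company_id, ts, num_shards=10):
    # Same crc32-based shard calculation as above.
    key = (company_id + ts.isoformat()).encode('utf-8')
    return (zlib.crc32(key) & 0xffffffff) % num_shards

client = spanner.Client()
database = client.instance('my-instance').database('my-database')

ts = datetime.datetime.now(datetime.timezone.utc)
with database.batch() as batch:
    batch.insert(
        table='LogEntries',
        columns=('UserId', 'CompanyId', 'Timestamp', 'EntryShardId', 'LogEntry'),
        values=[('15b7bd1f-8473', 'Acme', ts,
                 compute_shard('Acme', ts), 'user logged in')])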

Querying

Evenly distributing data using a sharding approach is great for inserts, but how does it affect retrieval? Application-level sharding is no good to us if we cannot retrieve the data efficiently. Let’s look at how we would query for a list of log entries by CompanyId and time range, but without UserId:

SELECT CompanyId, UserId, Timestamp, LogEntry
FROM LogEntries@{FORCE_INDEX=LogEntriesByCompany}
   WHERE CompanyId = 'Acme'
   AND EntryShardId BETWEEN 0 AND 9
   AND Timestamp > '2018-05-01T15:14:10.386257Z'
   AND Timestamp < '2018-05-01T15:16:10.386257Z'
ORDER BY Timestamp DESC;

The above query illustrates how to perform a timestamp range retrieval while taking sharding into account. By including EntryShardId in the query, we tell Cloud Spanner to ‘look’ in all 10 logical shards to retrieve the timestamp entries for CompanyId ‘Acme’ for a particular range.
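Programmatically, the same query can be run with bound parameters through the Python client. A sketch, again with placeholder instance and database names:

import datetime

from google.cloud import spanner

client = spanner.Client()
database = client.instance('my-instance').database('my-database')

sql = '''
    SELECT CompanyId, UserId, Timestamp, LogEntry
    FROM LogEntries@{FORCE_INDEX=LogEntriesByCompany}
    WHERE CompanyId = @company
      AND EntryShardId BETWEEN 0 AND 9
      AND Timestamp > @start AND Timestamp < @end
    ORDER BY Timestamp DESC'''

start = datetime.datetime(2018, 5, 1, 15, 14, 10, 386257,
                          tzinfo=datetime.timezone.utc)
end = datetime.datetime(2018, 5, 1, 15, 16, 10, 386257,
                        tzinfo=datetime.timezone.utc)

# Reads go through a snapshot; query parameters keep the request unambiguous.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        sql,
        params={'company': 'Acme', 'start': start, 'end': end},
        param_types={'company': spanner.param_types.STRING,
                     'start': spanner.param_types.TIMESTAMP,
                     'end': spanner.param_types.TIMESTAMP})
    for row in rows:
        print(row)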

Cloud Spanner is a full-featured relational database service that relieves you of most—but not all—database management tasks. For more information on Cloud Spanner management best practices, check out the recommended reading.

Anti-pattern: timestamp ordering
Optimizing Schema Design for Cloud Spanner
Best Practices for Schema Design

Introducing ultramem Google Compute Engine machine types



Today we are excited to announce beta availability of a new family of Google Compute Engine machine types. The n1-ultramem family of memory-optimized virtual machine (VM) instances comes with more memory—a lot more! In fact, these machine types offer more compute resources and more memory than any other VM instance we offer, making Compute Engine a great option for a whole new range of demanding, enterprise-class workloads.

The n1-ultramem machine types allow you to provision VMs with up to 160 vCPUs and nearly 4TB of RAM. The new memory-optimized n1-ultramem family of machine types is powered by four Intel® Xeon® Processor E7-8880 v4 (Broadwell) CPUs and DDR4 memory, so it’s ready for your most critical enterprise applications. The family comes in three predefined sizes:
  • n1-ultramem-40: 40 vCPUs and 961 GB of memory
  • n1-ultramem-80: 80 vCPUs and 1922 GB of memory
  • n1-ultramem-160: 160 vCPUs and 3844 GB of memory
These new machine types expand the breadth of the Compute Engine portfolio with new price-performance options. Now, you can provision compute capacity that fits your exact hardware and budget requirements, while paying only for the resources you use. These VMs are a cost-effective option for memory-intensive workloads, and provide you with the lowest $/GB of any Compute Engine machine type. For full details on machine type pricing, please check the pricing page, or the pricing calculator.

Memory-optimized machine types are well suited for enterprise workloads that require substantial vCPU and system memory, such as data analytics, enterprise resource planning, genomics, and in-memory databases. They are also ideal for many resource-hungry HPC applications.

Incorta is a cloud-based data analytics provider, and has been testing out the n1-ultramem-160 instances to run its in-memory database.
"Incorta is very excited about the performance offered by Google Cloud Platform's latest instances. With nearly 4TB of memory, these high-performance systems are ideal for Incorta's Direct Data Mapping engine which aggregates complex business data in real-time without the need to reshape any data. Using public data sources and Incorta's internal testing, we've experienced queries of three billion records in under five seconds, compared to three to seven hours with legacy systems."
— Osama Elkady, CEO, Incorta
In addition, the n1-ultramem-160 machine type, with nearly 4TB of RAM, is a great fit for the SAP HANA in-memory database. If you’ve delayed moving to the cloud because you have not been able to find big enough instances for your SAP HANA implementation, take a look at Compute Engine. Now you don’t need to keep your database on-premises while your apps move to cloud. You can run both your application and in-memory database in Google Cloud Platform where SAP HANA backend applications will benefit from the ultra-low latency of running alongside the in-memory database.

You can currently launch ultramem VMs in us-central1, us-east1 and europe-west1. Stay up-to-date on additional regions by visiting our available regions and zones page.

Visit the Google Cloud Platform Console and get started today. It’s easy to configure and provision n1-ultramem machine types programmatically, as well as via the console. Visit our SAP page if you’d like to learn more about running your SAP HANA in-memory database on GCP with ultramem machine types.
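As an illustration of the programmatic route, here’s a hedged sketch that provisions an n1-ultramem-160 VM with the Compute Engine API Python client; the project, zone, instance name and boot image are placeholders:

import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'v1')

project, zone = 'my-project', 'us-central1-a'
config = {
    'name': 'hana-ultramem',
    'machineType': 'zones/{}/machineTypes/n1-ultramem-160'.format(zone),
    'disks': [{
        'boot': True,
        'autoDelete': True,
        'initializeParams': {
            'sourceImage': 'projects/debian-cloud/global/images/family/debian-9',
        },
    }],
    'networkInterfaces': [{'network': 'global/networks/default'}],
}
compute.instances().insert(project=project, zone=zone, body=config).execute()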

Three steps to prepare your users for cloud data migration



When preparing to migrate a legacy system to a cloud-based data analytics solution, as engineers we often focus on the technical benefits: Queries will run faster, more data can be processed and storage no longer has limits. For IT teams, these are significant, positive developments for the business. End users, though, may not immediately see the benefits of this technology (and internal culture) change. For your end users, running macros in their spreadsheet software of choice or expecting a query to return data in a matter of days (and planning their calendar around this) is the absolute norm. These users, more often than not, don’t see the technology stack changes as a benefit. Instead, the changes become a hindrance: users now need to learn new tools, change their workflows and adapt to the new world of having their data stored more than a few milliseconds away—and that can seem like a lot to ask from their perspective.

It’s important that you remember these users at all stages of a migration to cloud services. I’ve worked with many companies moving to the cloud, and I’ve seen how easy it is to forget the end users during a cloud migration, until you get a deluge of support tickets letting you know that their tried-and-tested methods of analyzing data no longer work. These added tickets increase operational overhead on the support and information technology departments, and decrease the number of hours that can be spent on doing the useful, transformative work—that is, analyzing the wealth of data that you now have available. Instead, you can end up wasting time trying to mold these old, inconvenient processes to fit this new cloud world, because you don’t have the time to transform into a cloud-first approach.

There are a few essential steps you can take to successfully move your enterprise users to this cloud-first approach.

1. Understand the scope

There are a few questions you should ask your team and any other teams inside your organization that will handle any stored or accessed data.
  • Where is the data coming from?
  • How much data do we process?
  • What tools do we use to consume and analyze the data?
  • What happens to the output that we collect?

When you understand these fundamentals during the initial scoping of a potential data migration, you’ll grasp the true impact that such a project will have on the users consuming the affected data. It’s rarely as simple as “just point your tool at the new location.” A cloud migration could massively increase expected bandwidth costs if the tools aren’t well-tuned for a cloud-based approach—for example, by downloading the entire data set before analyzing the required subset.

To avoid issues like this, conduct interviews with the teams that consume the data. Seek to understand how they use and manipulate the data they have access to, and how they gain access to that data in the first place. This will all need to be replicated in the new cloud-based approach, and it likely won’t map directly. Consider using IAM unobtrusively to grant teams access to the data they need today. That sets you up to expand this scope easily and painlessly in the future. Understand the tools in use today, and reach out to vendors to clarify any points. Don’t assume a tool does something if you don’t have documentation and evidence. It might look like the tool just queries the small section of data it requires, but you can’t know what’s going on behind the scenes unless you wrote it yourself!

Once you’ve gathered this information, develop clear guidelines for what new data analytics tooling should be used after a cloud migration, and whether it is intended as a substitute or a complement to the existing tooling. It is important to be opinionated here. Your users will be looking to you for guidance and support with new tooling. Since you’ll have spoken to them extensively beforehand, you’ll understand their use cases and can make informed, practical recommendations for tooling. This also allows you to scope training requirements. You can’t expect users to just pick up new tools and be as productive as they had been right away. Get users trained and comfortable with new tools before the migration happens.

2. Establish champions

Teams or individuals will sometimes stand against technology change. This can be for a variety of reasons, including worries over job security, comfort with existing methods or misunderstanding of the goals of the project. By finding and utilizing champions within each team, you’ll solve a number of problems:
  • Training challenges. Mass training is impersonal and can’t be tailored per team. Champions can deliver custom training that will hit home with their team.
  • Transition difficulties. Each team’s individual struggles can be hard to track and manage. By giving each team a voice through its champion, users will feel more involved in the project and their issues are more likely to be addressed, reducing friction in the final stages.
  • Overloaded support teams. Champions become the voice of the project within the team too. This can have the effect of reducing support workload in the days, weeks and months during and after a migration, since the champion can be the first port of call when things aren’t running quite as expected.
Don’t underestimate the power of having people represent the project on their own teams, rather than someone outside the team proposing changes to an established workflow. The former is much more likely to be favorably received.

3. Promote the cloud transformation

It is more than likely that the current methods of data ingestion and analysis, and possibly the methods of data output and storage, will be suboptimal or, worse, impossible under the new cloud model. It is important that teams are suitably prepared for these changes. To make the transition easier, consider taking these approaches to informing users and giving them room to experiment.

  • Promote and develop an understanding of having the power of the cloud behind the data. It’s an opportunity to ask questions of data that might otherwise have been locked away, whether behind time constraints, incompatibility with software, or even a lack of awareness that the data was available to query at all. By combining data sets, can you and your teams become more evidence-driven, and get better results that answer deeper, more important questions? Invariably, the answer is yes.
  • In the case that an existing tool will continue to be used, it will be invaluable to provide teams with new data locations and instructions for reconfiguring applications. It is important that this is communicated, whether or not the change will be apparent to the user. Undoubtedly, some custom configuration somewhere will break, but you can reduce the frustration of an interruption by having the right information available.
  • By having teams develop and build new tooling early, rather than during or after migration, you’ll give them the ability to play with, learn and develop the new tools that will be required. This can be done on a static subset of data pulled from the existing setup, creating a sandbox where users can analyze and manipulate familiar data with new tools (see the sketch after this list). That way, you’ll help drive the adoption of new tools early and build some excitement around them. (Your champions are a good resource for this.)
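If BigQuery is your destination, one lightweight way to build that sandbox is to materialize a static slice of a production table into a separate dataset that teams can query freely. A minimal sketch with the bq CLI; the project, dataset, and table names are hypothetical:

```
# Create a sandbox dataset, then copy a static subset of a table into it.
bq mk --dataset my-project:sandbox
bq query --use_legacy_sql=false \
  --destination_table=my-project:sandbox.orders_2017 \
  'SELECT * FROM `my-project.warehouse.orders`
   WHERE order_date BETWEEN "2017-01-01" AND "2017-12-31"'
```

Because the subset is static, users can experiment freely with no risk to production data.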

Throughout the process of moving to the cloud, remember the benefits that shouldn’t be understated. No longer do your analyses need to take days; instead, the answers can be there when you need them. This frees up analysts to produce meaningful, useful insights rather than churning out the same reports over and over. It allows consumers of the data to access information more freely, without needing the help of a data analyst, through exposed dashboards and tools. But these high-level messages need to be supplemented with the personal needs of each team: show them the opportunities that exist and get them excited! It’ll help these big technological changes work for the people using the technology every day.

Introducing Cloud Memorystore: A fully managed in-memory data store service for Redis



At Redisconf 2018 in San Francisco last month, we announced the public beta of Cloud Memorystore for Redis, a fully-managed in-memory data store service. Today, the public beta is available for everyone to try. Cloud Memorystore provides a scalable, more secure and highly available Redis service fully managed by Google. It’s fully compatible with open source Redis, letting you migrate your applications to Google Cloud Platform (GCP) with zero code changes.

As more and more applications need to process data in real-time, you may want a caching layer in your infrastructure to reduce latency for your applications. Redis delivers fast in-memory caching, support for powerful data structures and features like persistence, replication and pub-sub. For example, data structures like sorted sets make it easy to maintain counters and are widely used to implement gaming leaderboards. Whether it’s simple session caching, developing games played by millions of users or building fast analytical pipelines, developers want to leverage the power of Redis without having to worry about VMs, patches, upgrades, firewall rules, etc.
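To make the leaderboard example concrete, here’s a minimal sketch using stock redis-cli against a Redis 3.2 instance; the host address and key names are hypothetical:

```
# Record player scores in a sorted set; ZADD inserts or updates in O(log N).
redis-cli -h 10.0.0.3 ZADD leaderboard 3200 "alice"
redis-cli -h 10.0.0.3 ZADD leaderboard 2750 "bob"
redis-cli -h 10.0.0.3 ZADD leaderboard 4100 "carol"

# Top three players, highest score first.
redis-cli -h 10.0.0.3 ZREVRANGE leaderboard 0 2 WITHSCORES

# A single player's rank (0-based, highest score first).
redis-cli -h 10.0.0.3 ZREVRANK leaderboard "alice"
```

Because sorted sets keep members ordered by score, rank queries like these stay fast no matter how many players you add.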

Early adopters of Cloud Memorystore have been using the service for the last few months, and they’re thrilled with it.
"At Descartes Labs, we have long been fans of Redis and its high performance. We have used Redis on everything from storing asynchronous task queues for tens of thousands of CPUs to a centralized persisted key-value pair store for the feature vectors output by our ML models. Cloud Memorystore provides an agile, scalable, no-operations Redis instance that we can instantly provision and scale without administration burdens."
- Tim Kelton, CoFounder and Cloud Architect, Descartes Labs
“Cloud Memorystore has provided us with a highly reliable Redis service and has been powering our critical applications. We have been using Cloud Memorystore as an early adopter and we are impressed with the reliability and performance of the service. Google has helped us forget about our Redis instances with Cloud Memorystore, and now we can focus more time on building our applications.”
- George-Cristian, Software Developer, MDG



Feature Summary (Beta)

  Redis version                                 3.2.11
  Max instance size                             300 GB
  Max network bandwidth                         12 Gbps
  High availability with automatic failover    Yes
  Memory scaling                                Yes
  Stackdriver Monitoring and Logging            Yes
  Private IP access                             Yes
  IAM roles                                     Yes
  Availability SLA¹                             Yes
  On-demand pricing                             Yes

¹Applicable for GA release only.

Simple and flexible provisioning
How you choose to deploy Cloud Memorystore for Redis depends on the availability and performance needs of your application. You can deploy Redis as a standalone instance or with a replica to provide high availability. But while replicating a Redis instance provides only data redundancy, you still need to do the heavy lifting of health checking, electing a primary, redirecting client connections on failover, and so on. The Cloud Memorystore service takes away all this complexity and makes it easy for you to deploy a Redis instance that meets your application’s needs.

Cloud Memorystore provides two tiers of service, Basic and Standard, each with different availability characteristics. Regardless of the tier, you can provision a Redis instance from as small as 1 GB up to 300 GB. With network throughput of up to 12 Gbps, Cloud Memorystore supports applications with very high bandwidth needs.

Here is a summary of the capabilities of each tier:


  Feature                           Basic Tier    Standard Tier
  Max instance size                 300 GB        300 GB
  Max network bandwidth             12 Gbps       12 Gbps
  Stackdriver Monitoring support    Yes           Yes
  Memory scaling¹                   Yes           Yes
  Cross-zone replication            No            Yes
  Automatic failover                No            Yes
  Availability SLA²                 No            99.9%
¹Basic Tier instances experience downtime and a full cache flush during scaling. Standard Tier instances experience minimal downtime and may lose some unreplicated data during the scaling operation. ²Applicable for GA release only.

Provisioning a Cloud Memorystore instance is simple: just choose a tier, a size that supports your application’s availability and performance needs, and a region. Your Redis instance will be up and running within a few minutes.
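As an illustration, here’s a minimal sketch of provisioning from the command line using the beta gcloud surface; the instance name, region, and network are hypothetical, so check the documentation for the current flags:

```
# Create a 5 GB Standard Tier (highly available) Redis instance
# attached to the authorized VPC network.
gcloud beta redis instances create my-cache \
  --size=5 \
  --region=us-central1 \
  --tier=standard \
  --network=default

# Look up the connection details (host IP and port) once it's ready.
gcloud beta redis instances describe my-cache --region=us-central1
```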


“Lift and shift” applications
Once provisioned, using Cloud Memorystore is a breeze. You can connect to the Redis instance using any of the tools and libraries you commonly use in your environment. Cloud Memorystore clients make use of IP addresses to connect to the instance: applications always connect to a single IP address, and Cloud Memorystore ensures that traffic is directed to the primary if a failover occurs.
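In practice that means standard tooling works unchanged. For example, with redis-cli and the host IP returned by the describe command above (the address here is a placeholder):

```
# Connect with stock redis-cli; no code changes or special client needed.
redis-cli -h 10.0.0.3 -p 6379 SET greeting "hello from Cloud Memorystore"
redis-cli -h 10.0.0.3 -p 6379 GET greeting
```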

Other key features
Whether it’s provisioning, monitoring or scaling memory, Cloud Memorystore simplifies common management tasks.

Security
Open-source Redis has very minimal security, and as a developer or administrator, it can be challenging to ensure that all the Redis instances in your organization are protected. With Cloud Memorystore, Redis instances are deployed using a private IP address, which prevents the instance from being accessed from the internet. You can also use Cloud Identity & Access Management (IAM) roles to grant granular access for managing the instance. Additionally, authorized networks ensure that the Redis instance is accessible only from the authorized VPC network.

Stackdriver integration
Cloud Memorystore instances publish all their key metrics to Stackdriver, Google Cloud’s monitoring and management suite. You can monitor all of your instances from the Stackdriver dashboard, and use Stackdriver Logging to get more insight into your Redis instances.


Seamless memory scaling
When a mobile application goes viral, it may be necessary to provision a larger Redis instance to meet latency and throughput needs. With Cloud Memorystore you can scale up the instance with a few clicks, and the Standard Tier lets you scale the instance with minimal disruption to the application.

On-demand pricing
Cloud Memorystore provides on-demand pricing with no upfront cost and per-second billing. Moreover, there is no charge for network traffic in and out of a Cloud Memorystore instance. For more information, refer to Cloud Memorystore pricing.

Coming soon to Cloud Memorystore
This Cloud Memorystore public beta release is just a starting point for us. Here is a preview of some of the features that are coming soon.

We are excited about what is upcoming for Cloud Memorystore and we would love to hear your feedback! If you have any requests or suggestions, please let us know through the Issue Tracker. You can also join the conversation in the Cloud Memorystore discussion group.

Sign up for a $300 credit to try Cloud Memorystore and the rest of GCP. Start with a small Redis instance for testing and development, and then when you’re ready, scale up to serve performance-intensive applications.

Want to learn more? Register for the upcoming webinar on Tuesday, June 26th at 9:00 am PT to hear all about Cloud Memorystore for Redis.

Exploring container security: Using Cloud Security Command Center (and five partner tools) to detect and manage an attack



Editor’s note: This is the sixth in a series of blog posts on container security at Google.

If you suspect that a container has been compromised, what do you do? In today’s blog post on container security, we’re focusing on container runtime security: how to detect, respond to, and mitigate suspected threats for containers running in production. There’s no one way to respond to an attack, but there are best practices that you can follow, and in the event of a compromise, we want to make it easy for you to do the right thing.

Today, we’re excited to announce that you’ll soon be able to manage security alerts for your clusters in Cloud Security Command Center (Cloud SCC), a central place on Google Cloud Platform (GCP) to unify, analyze and view security data across your organization. Further, even though we just announced Cloud SCC a few weeks ago, already five container security companies have integrated their tools with Cloud SCC to help you better secure the containers you’re running on Google Kubernetes Engine.

With your Kubernetes Engine assets in Cloud SCC, you can view security alerts for your Kubernetes Engine clusters in a single pane of glass, and choose how best to take action. You’ll be able to view, organize and index your Kubernetes Engine cluster assets within each project and across all the projects that your organization is working on. In addition, you’ll be able to associate container security findings with specific clusters, container images and/or VM instances as appropriate.

Until then, let’s take a deeper look at runtime security in the context of containers and Kubernetes Engine.

Responding to bad behavior in your containers

Security operations typically include several steps. For example, NIST’s well-known framework includes steps to identify, protect, detect, respond, and recover. In containers, this translates to detecting abnormal behavior, remediating a potential threat, performing forensics after an incident, and enforcing runtime policies in isolated environments such as the new gVisor sandboxed container environment.

But first, how do you detect that a container is acting maliciously? Typically, this requires creating a baseline of what normal behavior looks like, and using rules or machine learning to detect variation from that baseline. There are many ways to create that initial behavioral baseline (i.e., how a container should act), for example, using kprobes, tracepoints, and eBPF kernel inspection. Deviation from this baseline then triggers an alert or action.

If you do find a container that appears to be acting badly, there are several actions you might want to take, in increasing order of severity (a short command sketch follows the list):

  • Just send an alert. This notifies your security response team that something unusual has been detected. For example, if security monitoring is relatively new in your environment, you might be worried about having too many false positives. Cloud SCC lets you unify container security signals with other security signals across your organization. With Cloud SCC, you can: see the live monitored state of container security issues in the dashboard; access the details either in the UI or via the API; and set up customer-defined filters to generate Cloud Pub/Sub topics that can then trigger email, SMS, or bugs in Jira.
  • Isolate a container. This moves the container to a new network, or otherwise restricts its network connectivity. For example, you might want to do this if you think one container is being used to perform a denial of service attack on other services.
  • Pause a container, e.g., `docker pause`. This suspends all running processes in the container. For example, if you detect suspected cryptomining, you might want to limit resource use and make a backup prior to further investigation.
  • Restart a container, e.g., `docker restart` or `kubectl delete pod`. This kills and restarts a running container, resetting the current state of the application. For example, if you suspect an attacker has created a foothold in your container, this might be a first step to counter their efforts, but it won’t stop an attacker from replicating the attack; it only removes them temporarily.
  • Kill a container, i.e., `docker kill`. This kills a running container, halting all running processes (and less gracefully than `docker stop`). This is typically a last resort for a suspected malicious container.
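Here’s a hedged sketch of what those escalating responses can look like from the command line. The pod and container names are hypothetical, and the isolation step assumes your NetworkPolicies select pods by label:

```
# Isolate: remove the pod's app label and mark it quarantined, so policies
# and Services that select on app= no longer route traffic to it.
kubectl label pod suspect-pod app- quarantine=true --overwrite

# Pause: freeze every process in the container (preserves state for analysis).
docker pause suspect-container

# Restart: delete the pod; its controller (e.g., a Deployment) recreates it.
kubectl delete pod suspect-pod

# Kill: last resort; terminate the container immediately.
docker kill suspect-container
```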

Analyzing a security incident

After an incident, your security forensics team might step in to determine what really happened, and how they can prevent it the next time around. On Kubernetes Engine, you can look at a few different sources of event information:

  • Security event history and monitoring status in Cloud SCC. You can view the summary status of your assets and security findings in the dashboard, configure alerting and notification to a custom Cloud Pub/Sub topic and then query and explore specific events in detail either via the UI or API.
  • Container logs, kubelet logs, Docker logs, and audit logs in Stackdriver. Kubernetes Engine Audit Logging captures certain actions by default, both in the Kubernetes Engine API (e.g., create cluster, remove nodepool) and in the Kubernetes API (e.g., create a pod, update a DaemonSet).
  • Snapshots. You can snapshot a container’s filesystem in Docker with `docker export`, as sketched below.
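A minimal capture for later analysis might look like this (the container name is hypothetical):

```
# Export the container's filesystem to a tarball for offline forensics.
docker export suspect-container -o suspect-container.tar

# Hash the archive so you can demonstrate later that the evidence is intact.
sha256sum suspect-container.tar
```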

Announcing our container runtime security partners

To give you the best options for container runtime security on Google Cloud Platform, we’re excited to announce five partners who have already integrated with Cloud SCC: Aqua Security, Capsule8, StackRox, Sysdig Secure, and Twistlock. These technical integrations let you use their cutting-edge security tools with your deployments, and view their findings and recommendations directly in Cloud SCC.

Aqua Security

Aqua’s integration with Cloud SCC provides real-time visibility into container security events and policy violations, including:

  • Inventory of vulnerabilities in container images in Google Container Registry, and alerts on new vulnerabilities
  • Container user security violations, such as privilege escalation attempts
  • Attempts to run unapproved images
  • Policy violations of container network, process, and host resource usage

To learn more and get a demo of Aqua’s integration with Google Cloud SCC, visit aquasec.com/gcp.

Capsule8

Capsule8 is a real-time, zero-day attack detection platform purpose-built for modern production infrastructures. The Capsule8 integration with Google delivers continuous security across GCP environments to detect and help shut down attacks as they happen. Capsule8 runs entirely in the customer’s Google Compute Engine environment and accounts, and requires only a lightweight, installation-free sensor running on each Compute Engine instance to stream the behavioral telemetry used to identify and help shut down zero-day attacks in real time.

For more information on Capsule8’s integration with GCP, please visit: https://capsule8.com/capsule8-for-google-cloud-platform/

StackRox

StackRox has partnered with Google Cloud to deliver comprehensive security for customers running containerized applications on Kubernetes Engine. StackRox visualizes the container attack surface, exposes malicious activity using machine learning, and stops attacks. Under the partnership, StackRox is working closely with the GCP team to offer an integrated experience for Kubernetes and Kubernetes Engine users as part of Cloud SCC.

“My current patchwork of security vendor solutions is no longer viable – or affordable – as our enterprise is growing too quickly and cyber threats evolve constantly. StackRox has already unified a handful of major product areas into a single security engine, so moving to containers means positive ROI."

- Gene Yoo, Senior Vice President and Head of Information Security at City National Bank

For more information on StackRox’s integration with GCP, please visit: https://www.stackrox.com/google-partnership

Sysdig Secure

By bringing together container visibility and a native Kubernetes Engine integration, Sysdig Secure provides the ability to block threats, enforce compliance, and audit activity across an infrastructure through microservices-aware security policies. Security events are enriched with hundreds of container and Kubernetes metadata fields before being sent to Cloud SCC. This brings the most relevant signals to your attention and correlates Sysdig events with other security information sources, so you have a single point of view and the ability to react accordingly at all levels.

"We chose to develop on Google Cloud for its robust, cost-effective platform. Sysdig is the perfect complement because it allows us to effectively secure and monitor our Kubernetes services with a single agent. We're excited to see that Google and Sysdig are deepening their partnership through this product integration.”

- Ashley Penny, VP of Infrastructure, Cota Healthcare

For more information on Sysdig Secure’s integration with GCP, please visit: https://sysdig.com/gke-monitoring/

Twistlock

Twistlock surfaces cloud-native security intel, including vulnerability findings, compliance posture, runtime anomalies, and firewall logs, directly in Cloud SCC. Customers can use Cloud SCC’s big data capabilities to analyze and alert at scale, integrating container, serverless, and cloud-native VM security intelligence alongside other apps and workloads connected to Cloud SCC.

"Twistlock enables us to pinpoint vulnerabilities, block attacks, and easily enforce compliance across our environment, giving our team the visibility and control needed to run containers at scale."

- Anthony Scodary, Co-Founder of Gridspace

For more information on Twistlock’s integration with GCP, please visit: https://twistlock.com/partners/google-cloud-platform

Now you have the tools you need to protect your containers! Safe computing!

And if you’re at KubeCon in Copenhagen, join us at our booth for a demo and discussion around container security.