Tag Archives: Stackdriver

Announcing Enhanced Smart Home Analytics

Posted by Toni Klopfenstein, Developer Advocate

When creating scalable applications, consistent and reliable monitoring of resources is a valuable tool for any developer. Today we are releasing enhanced analytics and logging for Smart Home Actions. This feature enables you to more quickly identify and respond to errors or quality issues that may arise.

Request Latency Dashboard

You can now access the smart home dashboard with pre-populated metrics charts for your Actions on the Analytics tab in the Actions Console, or through Cloud Monitoring. These metrics help you quantify the health and usage of your Action, and gain insight into how users engage with your Action. You can view:

  • Execution types and device traits used
  • Daily users and request counts
  • User query response latency
  • Success rate for Smart Home engagements
  • Comparison of cloud and local fulfilment interactions

Successful Requests Dashboard

Cloud Logging provides detailed logs based on the events observed in Cloud Monitoring.

We've added additional features to the error logs to help you quickly debug why intents fail, which particular device commands malfunction, or if your local fulfilment falls back to cloud fulfilment.

New details added to the event logs include:

  • Cloud vs. local fulfilment
  • EXECUTE vs. QUERY intents
  • Locale of request
  • Device Type

You can additionally export these logs through Cloud Pub/Sub, and build log-based metrics and alerts for your development teams to gain insights into common issues.
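For example, a log-based metric that counts error-level entries from your Action could be created along these lines; the metric name and filter below are placeholders, so tailor the filter to the intents, traits or error codes you care about:

gcloud logging metrics create smarthome_errors \
    --description="Error-level smart home log entries" \
    --log-filter="severity>=ERROR"

You can then chart this metric in Cloud Monitoring or attach an alerting policy to it.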

For more guidance on accessing your Smart Home Action analytics and logs, check out the developer guide or watch the video.

We want to hear from you! Continue sharing your feedback with us through the issue tracker, and engage with other smart home developers in the /r/GoogleAssistantDev community. Follow @ActionsOnGoogle on Twitter for more of our team's updates, and tweet using #AoGDevs to share what you’re working on. We can’t wait to see what you build!

Drilling down into Stackdriver Service Monitoring



If you're responsible for application performance and availability, you know how hard it can be to see them through the eyes of your customers and end users. We think that's really going to change with last week's introduction of Stackdriver Service Monitoring, a new tool that monitors how your customers perceive your applications and then lets you drill down to the underlying infrastructure when there's a problem.

Most IT operations tools take a bottom-up understanding of IT systems: they look at compute, storage, and networking metrics to infer the customer experience. Application performance management (APM) tools like tracing systems, debuggers, and profilers consider the application from the code level, but lose sight of the underlying infrastructure. Sometimes, a logs analytics solution can provide the glue between those two layers, but often with great effort and expense.

IT operators have been missing a cost-effective, easy-to-use, general-purpose tool to monitor the customer-facing behavior of their applications. It’s hard to know how end users experience your software and it’s difficult to measure services and applications in a standardized way. Ops staff risk burning out from all the spurious alerts. The result of all this is that mean-time-to-resolution (MTTR) is longer than necessary, and customer satisfaction is lower than desired. The situation is exacerbated with microservice architectures where the app itself is broken into many small pieces, which makes it hard to understand how all the pieces fit together and where to start investigating when there is a problem.

That all changes with the release of Stackdriver Service Monitoring. Service Monitoring takes advantage of service-aware, “opinionated” infrastructure so you can monitor how end users perceive your systems, letting you drill down to the infrastructure level when necessary. Initially, we are supporting this functionality for Google App Engine and for Istio service meshes running on Google Kubernetes Engine. We will expand to more platforms over time.

With Stackdriver Service Monitoring, you get the answers to the following questions:
  • What are your services? What functionality do those services expose to internal and external customers?
  • What are your promises and commitments regarding the availability and performance of those services, and are your services meeting them?
  • For microservices-based apps, what are the inter-service dependencies? How can you use that knowledge to double check new code rollouts and triage problems in the event of service degradation?
  • Can you look at all the monitoring signals for a service holistically to reduce MTTR?

Anatomy of Stackdriver Service Monitoring

Service Monitoring has three pieces: the service graph, Service Level Objectives (SLOs), and multi-signal service dashboards. Together, these give you an inventory of your services, visually display the dependencies between them, let you set and measure availability and performance promises, help you triage application problems to quickly find the root cause, and finally, help you debug broken services more quickly than ever before. Let’s look at each piece in turn.

The service graph: This is a service-specific view of your infrastructure. It starts out with a real-time top level display of all services in the Istio service mesh and the communication links between them. Selecting one service displays charts with error rates and latency metrics. Double-clicking on a service allows you to drill down into its underlying Kubernetes infrastructure, providing the long elusive connection between app behavior and infrastructure. There is also a time slider which allows you to see the graph at previous points in time. Using the service graph you can see your application architecture for reference purposes or to triage problems. You can explore metrics about service behavior, and determine whether an upstream service is causing problems to a downstream service. Finally, you can compare the service graph at different points in time to determine whether there was a significant architectural change right before a problem was reported. There is no quicker way to get started exploring and understanding complex multi-service applications.

SLOs: Internally at Google, our Site Reliability Engineering (SRE) teams only alert themselves on customer-facing symptoms of problems, not on all potential causes. This better aligns them with customer interests, lowers their toil, frees them to do value-added reliability engineering, and increases job satisfaction. Stackdriver Service Monitoring lets you set, monitor, and alert on SLOs. Because Istio and App Engine are instrumented in an opinionated way, we know exactly what the transaction counts, error counts, and latency distributions are between services. All you need to do is set your targets for availability and performance, and we automatically generate the graphs for service level indicators (SLIs), compliance with your targets over time, and your remaining error budget. You can configure the maximum allowed drop rate for your error budget; if that rate is exceeded, we notify you and create an incident so that you can take action. To learn more about SLO concepts, including error budget, we encourage you to read the SLO chapter of the SRE book.
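To make the error budget idea concrete, here is a minimal sketch of the arithmetic (generic SRE math, not Stackdriver-specific code), assuming a 99.9% availability target over a 30-day window:

// Generic error-budget arithmetic for an availability SLO (illustrative only).
const sloTarget = 0.999;               // 99.9% availability target
const windowMinutes = 30 * 24 * 60;    // 30-day window, in minutes

// The error budget is the fraction of the window allowed to be "bad".
const errorBudgetMinutes = windowMinutes * (1 - sloTarget);
console.log(errorBudgetMinutes.toFixed(1));   // ~43.2 minutes of allowed unavailability

// Request-based view: with 10 million requests in the window,
// up to 0.1% of them may fail before the SLO is violated.
const allowedFailures = Math.round(10000000 * (1 - sloTarget));
console.log(allowedFailures);                 // 10000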

Service Dashboard: At some point, you will need to dig deeper into a service’s signals. Maybe you received an SLO alert and there’s no obvious upstream cause. Maybe the service is implicated by the service graph as a possible cause for another service’s SLO alert. Maybe you have a customer complaint outside of an SLO alert that you need to investigate. Or, maybe you want to see how the rollout of a new version of code is going.

The service dashboard provides a single coherent display of all signals for a specific service, all of them scoped to the same timeframe with a single control, providing you the fastest possible way to get to the bottom of a problem with your service. Service monitoring lets you dig deep into the service’s behavior across all signals without having to bounce between different products, tools, or web pages for metrics, logs, and traces. The dashboard gives you a view of the SLOs in one tab, the service metrics (transaction rates, error rates, and latencies) in a second tab, and diagnostics (traces, error reports, and logs) in the third tab.

Once you’ve validated an error budget drop in the first tab and isolated anomalous traffic in the second tab, you can drill down further in the diagnostics tab. For performance issues, you can drill down into long tail traces, and from there easily get into Stackdriver Profiler if your app is instrumented for it. For availability issues you can drill down into logs and error reports, examine stack traces, and open the Stackdriver Debugger, if the app is instrumented for it.

Stackdriver Service Monitoring gives you a whole new way to view your application architecture, reason about its customer-facing behaviors, and get to the root of any problems that arise. It takes advantage of infrastructure software enhancements that Google has championed in the open-source world, and leverages the hard-won knowledge of our SRE teams. We think this will fundamentally transform the ops experience of cloud-native and microservice development and operations teams. To learn more, see the presentation and demo with Descartes Labs at GCP Next last week. We hope you will sign up to try it out and share your feedback.

How to connect Stackdriver to external monitoring



Google Stackdriver lets you track your cloud-powered applications with monitoring, logging and diagnostics. Using Stackdriver to monitor Google Cloud Platform (GCP) or Amazon Web Services (AWS) projects has many advantages—you get detailed performance data and can set up tailored alerts. However, we know from our customers that many businesses are bridging cloud and on-premises environments. In these hybrid situations, it’s often necessary to also connect Stackdriver to an on-prem monitoring system. This is especially important if there is already a monitoring process in place that involves classic IT Business Management (ITBM) tasks, like opening and closing tickets and incidents automatically.

Luckily, you can use Stackdriver in these circumstances by sending alerting policy notifications via webhooks. We'll explain how in this blog post, using the example of monitoring the uptime of a web server. Setting up the monitoring condition and alerting policy is really where Stackdriver shines, since it auto-detects GCP instances and can analyze log files. The exact setup differs depending on the customer environment. (You can also find more here about alerting and incident management in Stackdriver.)

Get started with server and firewall policies

To keep it simple, we’ll start with explaining how to do an HTTP check on a freshly installed web server (nginx). This is called an uptime check in Stackdriver.

First, let's set up the server and firewall policy. In order for the check to succeed, make sure you've created a firewall rule in the GCP console that allows HTTP traffic to the public IP of the web server. The best way to do that is to create a tag-based firewall rule that allows all IP addresses (0.0.0.0/0) on the tag "http." You can then add that tag to your newly created web server instance. (We created ours by spinning up a micro instance from an Ubuntu image, then installing nginx with apt-get.)
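Equivalently, from the command line, a rule and instance along these lines should work (the instance name, zone and image family here are only examples; adjust them for your project):

gcloud compute firewall-rules create allow-http \
    --allow=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http

gcloud compute instances create web-1 \
    --machine-type=f1-micro --tags=http --zone=us-central1-a \
    --image-family=ubuntu-1604-lts --image-project=ubuntu-os-cloud

The rule name "allow-http" comes back later in this post, when we test the alerting policy by deleting the rule.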

If you prefer containers, you can use Kubernetes to spin up an nginx container.

Make sure to check the firewall rule by manually entering the web server's public IP in a browser. If everything is configured correctly, you should see the nginx greeting page:

Setting up the uptime check

Now let’s set up the website uptime check. Open the Stackdriver monitoring menu in your GCP cloud console.

In this case, we created a little web server instance with a public IP address. We want to monitor this public IP address to check the web server’s uptime. To set this up, select “Uptime Checks” from the right-side menu of the Stackdriver monitoring page.

Remember: This is a test case, so we set the check interval to one minute. For real-world use cases, this value might change according to the service monitoring requirements.

Once you have set up the Uptime Check, you can now go ahead and set up an alerting policy. Click on “Create New Policy” in the following popup window (only appears the first time you create an Uptime Check). Or you can click on “Alerting” on the left-side Stackdriver menu to set it up. Click on “Create a Policy” in the popup menu.

Setting up the alert policy

Once you click on “Create a Policy,” you should see a new popup with four steps to complete.

The first step will ask for a condition “when” to trigger the alert. This is where you have to make sure the Uptime Check is added. To do this, simply click on the “Add Condition” button.

A new window will appear from the right side:

Specify the Uptime Check by clicking on Select under “Basic Health.”

This will bring up this window (also from the right side) to select the specific Uptime Check to alert on. Simply choose “URL” in the “Resource Type” field and the “IF UPTIME CHECK” section will appear automatically. Here, we select the previously created Uptime Check.


You can also set the duration of the service downtime to trigger an alert. In this case, we used the default of five minutes. Click “Save Condition” to continue with the Alert Policy setup.

This leads us to step two:

This is where things get interesting. To include an external monitoring system, you can use so-called webhooks: callouts that use an HTTP POST method to send JSON-formatted messages to the external system. The on-prem or third-party monitoring system needs to understand this format in order to process the alerts properly. Webhooks are widely supported across the monitoring system industry.
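To make the format concrete, here is a minimal sketch of what a receiving endpoint might look like, assuming Node.js with Express; any HTTP server that can parse JSON will work just as well. The field names follow the incident payload described later in this post (policy name, state, start time and summary):

// Minimal sketch of an on-prem webhook receiver for Stackdriver alert notifications.
// Assumes Node.js with Express installed (npm install express).
const express = require('express');
const app = express();

app.post('/stackdriver-webhook', express.json(), (req, res) => {
  const incident = req.body.incident || {};
  // The payload carries the policy name, the state ("open" or "closed"),
  // a Unix "started_at" timestamp and a human-readable summary.
  console.log(`[${incident.state}] ${incident.policy_name}: ${incident.summary}`);
  // This is where you would open or close a ticket in your on-prem ITBM tooling.
  res.sendStatus(200);
});

app.listen(8080, () => console.log('Listening for Stackdriver webhooks on :8080'));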

Setting up the alerts

Now you’ll set up the alerts. In this example, we’re configuring a webhook only. You can set up multiple ways to get alerted simultaneously. If you want to get an email and a webhook at the same time, just configure it that way by adding the second (or third) method. In this example, we’ll use a free webhook receiver to monitor if our setup works properly.

Once the site has generated a webhook receiver for you, you’ll have a link you can use that will list all received tokens for you. Remember, this is for testing purposes only. Do not send in any user-specific data such as private IP addresses or service names.

Next you have to configure the notification to use a webhook so it’ll send a message over to our shiny new webhook receiver. Click on “Add Notification.”

By default a field will appear saying “Email”—click on the drop-down arrow to see the other options:

Select “Webhook” in the drop-down menu.

The system will most likely tell you that there is no webhook setup present. That's because you haven't specified any webhook receiver yet. Click on "Setup Webhook."

(If you’ve already set up a webhook receiver, the system won’t offer you this option here.)

In that case, go to the "select project" drop-down list (at the top left, right next to the Stackdriver logo in the gray bar). Click on the down arrow next to your project ID and select "Account Settings" at the bottom of the drop-down box.

In the popup window, select “Notifications” (bottom of the left-side list under “Settings”) and then click on “Webhooks” at the top menu. Here you can add additional webhooks if needed.

Click on “Create webhook.”

Remember to put in your webhook endpoint URL. In our test case, we do not need any authentication.

Click on “Test Connection” to verify and see your first webhook appearing on the test site!

It should say “This is a test alert notification from Stackdriver.”

Now let’s continue with the Alerting Policy. Choose the newly created webhook by selecting “Webhook” as notification type and the webhook name (created earlier) as the target. If you want to have additional notification settings (like SMS, email, etc.), feel free to add those as well by clicking on “Add another notification.”

Once you add a notification, you can optionally add documentation by creating a so-called “Markdown document.” Learn more here about the Markdown language.

Last but not least, give the Alert Policy a descriptive name:

We decided to go super creative and call it “HTTP - uptime alert.” Once you have done this, click “Save Policy” at the bottom of the page.

Done! You just created your first policy, including a webhook to trigger alerts on incidents.

The policy should be green and the uptime check should report your service being healthy. If not, check your firewall rules.

Test your alerting

If everything is normal and works as expected, it's time to test your alerting policy. To do that, simply delete the "allow-http" firewall rule created earlier. This should result in a "service unavailable" condition for our Uptime Check. Remember to give it a little while: the Uptime Check waits 10 seconds per region and up to a minute overall before it declares the service down (remember, we configured the check interval earlier).

Now you’ll see that you can’t reach the nginx web server instance anymore:

Now let’s go to the Stackdriver overview page to see if we can find the incident. Click on “Monitoring Overview” in the left-side menu at the very top:

Indeed, the Uptime Check comes back red, telling us the service is down. Also, our Alerting Policy has created an incident saying that the “HTTP - uptime alert” has been triggered and the service has been unavailable for a couple of minutes now.

Let’s check the test receiver site to see if we got the webhook to trigger there:

You can see we got the webhook alert with the same information regarding the incident. This information is passed on in JSON format for easy parsing at the receiving end. You will see the policy name that was triggered (first red rectangle), the state "open," and the "started at" timestamp in Unix time format (seconds elapsed since January 1, 1970). The "summary" field also tells you that the service is failing. If you had configured any optional documentation, you'd see it in the JSON payload as well.

Bring the service back

Now, recreate the firewall rule to see if we get an “incident resolved” message.

Let's check the overview screen again (remember to give it five or six minutes after recreating the rule for the check to react):

You can see that the service is back up. Stackdriver automatically resolves open incidents once the triggering condition clears. So in our case, the formerly open incident is now resolved, since the Uptime Check comes back as "healthy" again. This information is also passed on through the alerting policy. Let's see if we got a "condition restored" webhook message as well.

By the power of webhooks, it also told our test monitoring system that this incident is closed now, including useful details such as the ending time (Unix timestamp format) and a summary telling us that the service has returned to a normal state.

If you need to connect Stackdriver to a third-party monitoring system, webhooks are an extremely flexible way to do it. They let your operations team keep using their familiar go-to tools on-premises, while taking advantage of everything Stackdriver offers in a GCP (or AWS) environment. Furthermore, existing monitoring processes can be reused to bridge into the Google Cloud world.

Remember that Stackdriver can do far more than uptime checks, from log monitoring and source-code debugging to tracing user interactions with your application. Whether it's alerting policy functionality, webhook messaging or any other checks you define in Stackdriver, all of it can be forwarded to a third-party monitoring tool. Even better, you can close incidents automatically once they have been resolved.

Have fun monitoring your cloud services!

Related content:

New ways to manage and automate your Stackdriver alerting policies
How to export logs from Stackdriver Logging: new solution documentation
Monitor your GCP environment with Cloud Security Command Center

Gain visibility and take control of Stackdriver costs with new metrics and tools



A few months back, we announced new simplified Stackdriver pricing that will go into effect on June 30. We're excited to bring this change to our users. With this change, you get advanced notifications and alerting on the performance and diagnostics data you track for your cloud applications, plus flexibility in creating dashboards, without having to opt in to a premium pricing tier.

We’ve added new metrics and views to help you understand your Stackdriver usage now as you prepare for the new pricing to take effect. We’ve got some tips to help you maximize value while minimizing costs for your monitoring, logging and application performance management (APM) solutions.

Getting visibility into your monitoring and logging usage

In anticipation of the pricing changes, we’ve added new metrics to make it easier than ever to understand your logs and metrics volume. There are three different ways to view your usage, depending on which tool you prefer: the billing console; updated summary pages in the Stackdriver console; or metrics available via the API and Metrics Explorer.

1. Analyzing Stackdriver costs using the billing console
Stackdriver is now reporting logging and monitoring usage on the new SKUs (fancy name for something you can buy—in this case, volume of metrics or logs), which are visible in the billing console. Don’t worry—until June 30, the costs will still be $0, but you can view your existing volume across your billing account by going to the new reports page in the billing console. To view your current Stackdriver logging and monitoring usage volume, select group by SKU, filter for Log Volume, Metric Volume or Monitoring API Requests, and you’ll see your usage across your billing account. (See more in our documentation). You can also analyze your usage by exporting your billing data to BigQuery. Once you understand your usage, you can easily estimate what your cost will be after June 30 using the pricing calculator under the Upcoming Model tab.

2. Analyzing Stackdriver costs using the Stackdriver console
We’ve also updated the tools for viewing and managing volumes of logs and metrics within Stackdriver itself.


The Logs Ingestion page, above, now shows last month’s volume in addition to the current month’s volume for the project and by resource type. We’ve also added handy links to view detailed usage in Metrics Explorer right from this page as well.

The Monitoring Resource Usage page, above, now shows your metrics volume month-to-date vs. the last calendar month (note that these metrics are brand-new, so they will take some time to populate). All projects in your Stackdriver account are broken out individually. We’ve also added the capability to see your projected total for the month and added links to see the details in Metrics Explorer.

3. Analyzing Stackdriver costs using the API and Metrics Explorer
If you’d like to understand which logs or metrics are costing the most, you’re in luck—we now have even better tools for viewing, analyzing and alerting on metrics. For Stackdriver Logging, we’ve added two new metrics:
  • logging.googleapis.com/billing/bytes_ingested provides real-time incremental delta values that can be used to calculate your rates of log volume ingestion. It does not cover excluded logs volume. This metric provides a resource_type label to analyze log volume by various monitored resource types that are sending logs.
  • logging.googleapis.com/billing/monthly_bytes_ingested provides your usage as a month-to-date sum every 30 minutes and resets to zero every month. This can be useful for alerting on month-to-date log volume so that you can create or update exclusions as needed.
We’ve also added a new metric for Stackdriver Monitoring to make it easier to understand your costs:
  • monitoring.googleapis.com/billing/bytes_ingested provides real-time incremental deltas that can be used to calculate your rate of metrics volume ingestion. You can drill down and group or filter by metric_domain to separate out usage for your agent, AWS, custom or logs-based metrics. You can also drill down by individual metric_type or resource_type.
You can access these metrics via the monitoring API, create charts for them in Stackdriver or explore them in real time in Metrics Explorer (shown below), where you can easily group by the provided labels in each metric, or use Outlier mode to detect top metric or resource type with the highest usage. You can read more about aggregations in our documentation.
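As a sketch of what reading one of these metrics programmatically might look like, here's an example using the Node.js client library for the Monitoring API (the project ID is a placeholder, and the exact client version you use may differ slightly):

// Sketch: read the month-to-date log ingestion metric via the Monitoring API.
// Assumes the @google-cloud/monitoring Node.js client library is installed.
const monitoring = require('@google-cloud/monitoring');

async function monthlyLogBytes(projectId) {
  const client = new monitoring.MetricServiceClient();
  const now = Math.floor(Date.now() / 1000);
  const [series] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter: 'metric.type="logging.googleapis.com/billing/monthly_bytes_ingested"',
    // Look at the last hour of points; the metric itself is a month-to-date sum.
    interval: {startTime: {seconds: now - 3600}, endTime: {seconds: now}},
  });
  for (const ts of series) {
    // Points are returned newest-first, so points[0] holds the latest value.
    console.log(ts.resource.labels, ts.points[0].value.int64Value, 'bytes so far this month');
  }
}

monthlyLogBytes('my-project').catch(console.error);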

If you’re interested in an even deeper analysis of your logs usage, check out this post by one of Google’s Technical Solutions Consultants that will show you how to analyze your log volume using logs-based metrics in Datalab.


Controlling your monitoring and logging costs
Our new pricing model is designed to make the same powerful log and metric analysis we use within Google accessible to everyone who wants to run reliable systems. That means you can focus on building great software, not on building logging and monitoring systems. This new model brings you a few notable benefits:
  • Generous allocations for monitoring, logging and trace, so many small or medium customers can use Stackdriver on their services at no cost.
    • Monitoring: All Google Cloud Platform (GCP) metrics and the first 150 MB of non-GCP metrics per month are available at no cost.
    • Logging: 50 GB free per month, plus all admin activity audit logs, are available at no cost.
  • Pay only for the data you want. Our pricing model is designed to put you in control.
    • Monitoring: When using Stackdriver, you pay for the volume of data you send, so a metric sent once an hour costs 1/60th as much as a metric sent once a minute. You'll want to keep that in mind when setting up your monitoring schedules. We recommend collecting key logs and metrics via agents or custom metrics for everything in production; development environments may not need the same level of visibility. For custom metrics, you can write points at a coarser time granularity, i.e., less frequently. Another way is to reduce the number of time series sent by avoiding unnecessary labels for custom and logs-based metrics that may have high cardinality.
    • Logging: The exclusion filter in Logging is an incredible tool for managing your costs. The way we’ve designed our system to manage logs is truly unique. As the image below shows, you can choose to export your logs to BigQuery, Cloud Storage or Cloud Pub/Sub without needing to pay to ingest them into Stackdriver.
      You can even use exclusion filters to collect a percentage of logs, such as 1% of successful HTTP responses. Plus, exclusion filters are easy to update, so if you’re troubleshooting your system, you can always temporarily increase the logs you’re ingesting.

Putting it all together: managing to your budget
Let’s look at how to combine the visibility from the new metrics with the other tools in Stackdriver to follow a specific monthly budget. Suppose we have $50 per month to spend on logs, and we’d like to make that go as far as possible. We can afford to ingest 150 GB of logs for the month. Looking at the Log Ingestion page, shown below, we can easily get an idea of our volume from last month—200 GB. We can also see that 75 GB came from our Cloud Load Balancer, so we’ll add an exclusion filter for 99% of 200 responses.
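That exclusion filter might look something like this, with the exclusion percentage set to 99% so that 1% of successful responses is still sampled (the exact resource type and fields depend on which load balancer logs you're ingesting):

resource.type="http_load_balancer"
httpRequest.status=200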

To make sure we don’t go over our budget, we’ll also set a Stackdriver alert, shown below, for when we reach 145 GB on the monthly log bytes ingested. Based on the cost of ingesting log bytes, that’s just before we’ll reach the $50 monthly budget threshold.
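(Working backward from the numbers in this example: a $50 budget covering 150 GB, 50 GB of which is free, implies an effective rate of roughly $0.50 per GB beyond the free allotment. Alerting at 145 GB therefore leaves about 5 GB, or roughly $2.50, of headroom in which to react.)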

Based on this alerting policy, suppose we get an email near the end of the month that our volume is at 145 GB for the month to date. We can turn off ingestion of all logs in the project with an exclusion filter like this:
logName:*

Now only admin activity audit logs will come through, since they don’t count toward any quota and can’t be excluded. Let’s suppose we also have a requirement to save all data access logs on our project. Our sinks to BigQuery for these logs will continue to work, even though we won’t see those logs in Stackdriver Logging until we disable the exclusion filter. So we won’t lose that data during that period of time.


Like managing your household budget, running out of funds at the end of the month isn’t a best practice. Turning off your logs should be considered a last option, similar to turning off your water in your house toward the end of the month. Both these scenarios run the risk of making it harder to put out fires or incidents that may come up. One such risk is that if you have an issue and need to contact GCP support, they won’t be able to see your logs and may not be able to help you.


With these tools, you'll be able to plan ahead and avoid ingesting less useful logs throughout the month. You might turn off unnecessary logs based on use, rejigger production and development environment monitoring or logging, or decide to offload data to another service or database. Our new metrics, views and dashboards give you many more tools to see how much you're spending, in both resources and IT budget, on Stackdriver. You'll be able to bring flexibility and efficiency to logging and monitoring, and avoid unpleasant surprises.


To learn more about Stackdriver, check out our documentation or join in the conversation in our discussion group.



Getting more value from your Stackdriver logs with structured data



Logs contain some of the most valuable data available to developers, DevOps practitioners, Site Reliability Engineers (SREs) and security teams, particularly when troubleshooting an incident. It’s not always easy to extract and use, though. One common challenge is that many log entries are blobs of unstructured text, making it difficult to extract the relevant information when you need it. But structured log data is much more powerful, and enables you to extract the most valuable data from your logs. Google Stackdriver Logging just made it easier than ever to send and analyze structured log data.

We’ve just announced new features so you can better use structured log data. You’ve told us that you’d like to be able to customize which fields you see when searching through your logs. You can now add custom fields in the Logs Viewer in Stackdriver. It’s also now easier to generate structured log data using the Stackdriver Logging agent.

Why is structured logging better?
Using structured log data has some key benefits, including making it easier to quickly parse and understand your log data. The chart below shows the differences between unstructured and structured log data. 

You can see here how much more detail is available at a glance:



Example from custom logs

Unstructured log data:

...
textPayload: A97A7743 purchased 4 widgets.
...

Structured log data:

...
jsonPayload: {
  "customerIDHash": "A97A7743",
  "action": "purchased",
  "quantity": "4",
  "item": "widgets"
}
...

Example from Nginx logs (now available as structured data through the Stackdriver Logging agent)

Unstructured log data:

textPayload: 127.0.0.1 10.21.7.112 - [28/Feb/2018:12:00:00 +0900] "GET / HTTP/1.1" 200 777 "-" "Chrome/66.0"

Structured log data:

time: 1362020400 (28/Feb/2018:12:00:00 +0900)

jsonPayload: {
  "remote" : "127.0.0.1",
  "host"   : "10.21.7.112",
  "user"   : "-",
  "method" : "GET",
  "path"   : "/",
  "code"   : "200",
  "size"   : "777",
  "referer": "-",
  "agent"  : "Chrome/66.0"
}


Making structured logs work for you
You can send both structured and unstructured log data to Stackdriver Logging. Most logs Google Cloud Platform (GCP) services generate on your behalf, such as Cloud Audit Logging, Google App Engine logs or VPC Flow Logs, are sent to Stackdriver automatically as structured log data.

Since Stackdriver Logging also passes the structured log data through export sinks, sending structured logs makes it easier to work with the log data downstream if you’re processing it with services like BigQuery and Cloud Pub/Sub.

Using structured log data also makes it easier to alert on log data or create dashboards from your logs, particularly when creating a label or extracting a value with a distribution metric, both of which apply to a single field. (See our previous post on techniques for extracting values from Stackdriver logs for more information.)
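If you write logs through the Cloud client libraries rather than the logging agent, you can emit structured payloads directly. Here's a minimal sketch with the Node.js library; the log name and payload fields are just illustrative:

// Sketch: write a structured (jsonPayload) entry with the Node.js client library.
// Assumes @google-cloud/logging is installed and application default credentials are set up.
const {Logging} = require('@google-cloud/logging');

const logging = new Logging();
const log = logging.log('purchase-events');   // illustrative log name

const entry = log.entry(
  {resource: {type: 'global'}},               // monitored resource for the entry
  {                                           // this object becomes the jsonPayload
    customerIDHash: 'A97A7743',
    action: 'purchased',
    quantity: 4,
    item: 'widgets',
  }
);

log.write(entry).then(() => console.log('Wrote structured log entry'));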

Try Stackdriver Logging for yourself
To start using Stackdriver structured logging today, you’ll just need to install (or reinstall) the Stackdriver logging agent with the --structured flag. This also enables automatic parsing of common log formats, such as syslog, Nginx and Apache.

curl -sSO "https://dl.google.com/cloudagents/install-logging-agent.sh"
sudo bash ./install-logging-agent.sh --structured

For more information on installation and options, check out the Stackdriver structured logging installation documentation.

To test Stackdriver Logging and see the power of structured logs for yourself, you can try one of our most asked-for Qwiklab courses, Creating and alerting on logs-based metrics, for free, using a special offer of 15 credits. This offer is good through the end of May 2018. Or try our new structured logging features out on your existing GCP project by checking out our documentation.

Announcing variable substitution in Stackdriver alerting notifications



When an outage occurs in your cloud application, having fast insight into what’s going on is crucial to resolving the issue quickly. If you use Google Stackdriver, you probably rely on alerting policies to detect these issues and notify you with relevant information. To improve the organization and readability of the information contained in these alerts, we’ve added some new features to make our alerting notifications more descriptive, useful and actionable. We’ll gradually roll out these updates over the next few weeks.

One of these new features is the ability to add variables to your alerting notifications. You can use this to include more metadata in your notifications, for example information on Kubernetes clusters and other resources. You can also use this to construct specific playbook information and links using the variable substitution.

In addition, we're transitioning to HTML-formatted emails that are easier to read and more clearly organized. We're also adding the documentation field to Slack and webhook notifications, so teams using these notification methods can take advantage of these new features.

New variable substitution in alerting policy documentation

You can now include variables in the documentation section of your alerting policies. The contents of this field are also now included in Slack and webhook notifications, in addition to email.

The following syntax:

${varname}


will be formatted by replacing the expression ${varname} with the value of varname. We support only simple variable substitution; more complex expressions, for example ${varname1 + varname2}, are not supported. We also support the use of $$ as an escape sequence (so that the literal text "${" may be written as "$${").

The supported variables are:

  • condition.name: The REST resource name of the condition (e.g. "projects/foo/alertPolicies/12345/conditions/5678")
  • condition.display_name: The display name of the triggering condition
  • metadata.user_label.key: The value of the metadata label "key" (replace "key" appropriately)
  • metric.type: The metric type (e.g. "compute.googleapis.com/instance/cpu/utilization")
  • metric.display_name: The display name associated with this metric type
  • metric.label.key: The value of the metric label "key" (replace "key" appropriately)
  • policy.user_label.key: The value of the user label "key" (replace "key" appropriately)
  • policy.name: The REST resource name of the policy (e.g. "projects/foo/alertPolicies/12345")
  • policy.display_name: The display name associated with the alerting policy
  • project: The project ID of the Stackdriver host account
  • resource.project: The project ID of the monitored resource of the alerting policy
  • resource.type: The type of the monitored resource (e.g. "gce_instance")
  • resource.display_name: The display name of the resource
  • resource.label.key: The value of the resource label "key" (replace "key" appropriately)


Note: You can only set policy user labels via the Monitoring API.
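For example, a documentation field along the following lines combines several of these variables into a more actionable notification (the playbook URL is a placeholder):

${policy.display_name} is firing: condition ${condition.display_name} triggered on ${resource.type} "${resource.display_name}" for metric ${metric.type}.
Playbook: https://example.com/playbooks/${policy.display_name}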

@mentions for Slack

Slack notifications now include the alerting policy documentation. This means that you can include customized Slack formatting and control sequences for your alerts. For the various options, please refer to the Slack documentation.

One useful feature is linking to a user. So for example, including this line in the documentation field

@backendoncall policy ${policy.display_name} triggered an incident


notifies the user backendoncall in addition to sending the message to the relevant Slack channel configured in the policy's notification options.

Notification examples

Now, when you look at a Stackdriver notification, all notification methods (with the exception of SMS) include the following fields:

  • Incident ID/link: the incident that triggered the notification along with a link to the incident page 
  • Policy name: the name of the configured alerting policy
  • Condition name: the name of the alerting policy condition that is in violation

Email:


Slack:


Webhook:


{
   "incident":{
      "incident_id":"0.kmttg2it8kr0",
      "resource_id":"",
      "resource_name":"totally-new cassweb1",
      "started_at":1514931579,
      "policy_name":"Backend processing utilization too high",
      "condition_name":"Metric Threshold on Instance (GCE) cassweb1",
      "url":"https://app.google.stackdriver.com/incidents/0.kmttg2it8kr0?project=totally-new",
      "documentation":{
         "content":"CPU utilization sample. This might affect our backend processing.\u000AFollowing playbook here: https://my.sample.playbook/cassweb1",
         "mime_type":"text/markdown"
      },
      "state":"open",
      "ended_at":null,
      "summary":"CPU utilization for totally-new cassweb1 is above the threshold of 0.8 with a value of 0.994."
   },
   "version":"1.2"
}


Next steps

We’ll be rolling out these new features in the coming weeks as part of the regular updating process. There’s no action needed on your part, and the changes will not affect the reliability or latency of your existing alerting notification pipeline. Of course, we encourage you to give meaningful names to your alerting policies and conditions, as well as add a “documentation” section to configured alerting policies to help oncall engineers understand the alert notification when they receive it. And as always, please send us your requests and feedback, and thank you for using Stackdriver!

Google Cloud Audit Logging now available across the GCP stack



Google Cloud Audit Logging helps you to determine who did what, where and when on Google Cloud Platform (GCP). This fall, Cloud Audit Logging became generally available for a number of products. Today, we're significantly expanding the set of products integrated with Cloud Audit Logging; these new integrations are all currently in beta.

We’re also pleased to announce that audit logging for Google Cloud Dataflow, Stackdriver Debugger and Stackdriver Logging is now generally available.

Cloud Audit Logging provides log streams for each integrated product. The primary log stream is the admin activity log that contains entries for actions that modify the service, individual resources or associated metadata. Some services also generate a data access log that contains entries for actions that read metadata as well as API calls that access or modify user-provided data managed by the service. Right now only Google BigQuery generates a data access log, but that will change soon.

Interacting with audit logs in Cloud Console

You can see a high-level overview of all your audit logs on the Cloud Console Activity page. Click on any entry to display a detailed view of that event, as shown below.

By default, data access logs are not displayed in this feed. To enable them from the Filter configuration panel, select the “Data Access” field under Categories. (Please note, you also need to have the Private Logs Viewer IAM permission in order to see data access logs). You can also filter the results displayed in the feed by user, resource type and date/time.

Interacting with audit logs in Stackdriver

You can also interact with the audit logs just like any other log in the Stackdriver Logs Viewer. With Logs Viewer, you can filter or perform free text search on the logs, as well as select logs by resource type and log name (“activity” for the admin activity logs and “data_access” for the data access logs).

Here are some log entries in their JSON format, with a few important fields highlighted.
In addition to viewing your logs, you can also export them to Cloud Storage for long-term archival, to BigQuery for analysis, and/or Google Cloud Pub/Sub for integration with other tools. Check out this tutorial on how to export your BigQuery audit logs back into BigQuery to analyze your BigQuery spending over a specified period of time.
"Google Cloud Audit Logs couldn't be simpler to use; exported to BigQuery it provides us with a powerful way to monitor all our applications from one place.Darren Cibis, Shine Solutions

Partner integrations

We understand that there are many tools for log analysis out there. For that reason, we’ve partnered with companies like Splunk, Netskope, and Tenable Network Security. If you don’t see your preferred provider on our partners page, let us know and we can try to make it happen.

Alerting using Stackdriver logs-based metrics

Stackdriver Logging provides the ability to create logs-based metrics that can be monitored and used to trigger Stackdriver alerting policies. Here’s an example of how to set up your metrics and policies to generate an alert every time an IAM policy is changed.

The first step is to go to the Logs Viewer and create a filter that describes the logs for which you want to be alerted. Be sure that the scope of the filter is set correctly to search the logs corresponding to the resource in which you are interested. In this case, let’s generate an alert whenever a call to SetIamPolicy is made.
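With the advanced filter syntax, that filter can be as simple as the following single line (scope it further, for example by log name or resource, if you only care about a subset of resources):

protoPayload.methodName="SetIamPolicy"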

Once you're satisfied that the filter captures the correct events, create a logs-based metric by clicking on the "Create Metric" option at the top of the screen.

Now, choose a name and description for the metric and click "Create Metric." You should then receive a confirmation that the metric was saved.
Next, select “Logs-based Metrics” from the side panel. You should see your new metric listed there under “User Defined Metrics.” Click on the dots to the right of your metric and choose "Create alert from metric."

Now, create a condition to trigger an alert if any log entries match the previously specified filter. To do that, set the threshold to "above 0" in order to catch this occurrence. Logs-based metrics count the number of matching entries seen per minute. With that in mind, set the duration to one minute; the duration specifies how long this per-minute rate needs to be sustained in order to trigger an alert. For example, if the duration were set to five minutes, there would have to be at least one matching log entry per minute for a five-minute period in order to trigger the alert.

Finally, choose “Save Condition” and specify the desired notification mechanisms (e.g., email, SMS, PagerDuty, etc.). You can test the alerting policy by giving yourself a new permission via the IAM console.

Responding to audit logs using Cloud Functions


Cloud Functions is a lightweight, event-based, asynchronous compute solution that allows you to execute small, single-purpose functions in response to events such as specific log entries. Cloud Functions are written in JavaScript and execute in a standard Node.js environment. They can be triggered by events from Cloud Storage or Cloud Pub/Sub; in this case, we'll trigger Cloud Functions when logs are exported to a Cloud Pub/Sub topic. Cloud Functions is currently in alpha; please sign up to request enablement for your project.

Let’s look at firewall rules as an example. Whenever a firewall rule is created, modified or deleted, a Compute Engine audit log entry is written. The firewall configuration information is captured in the request field of the audit log entry. The following function inspects the configuration of a new firewall rule and deletes it if that configuration is of concern (in this case, if it opens up any port besides port 22). This function could easily be extended to look at update operations as well.

/**
 * Copyright 2017 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

'use strict';

exports.processFirewallAuditLogs = (event) => {
  // The Pub/Sub message data is a base64-encoded audit log entry.
  const msg = JSON.parse(Buffer.from(event.data.data, 'base64').toString());
  const logEntry = msg.protoPayload;
  // Only act on firewall insert operations.
  if (logEntry &&
      logEntry.request &&
      logEntry.methodName === 'v1.compute.firewalls.insert') {
    let cancelFirewall = false;
    // The "allowed" blocks of the new rule are captured in the request field.
    const allowed = logEntry.request.alloweds;
    if (allowed) {
      for (let key in allowed) {
        const entry = allowed[key];
        for (let port in entry.ports) {
          // Flag the rule if it opens any port other than 22 (SSH).
          if (parseInt(entry.ports[port], 10) !== 22) {
            cancelFirewall = true;
            break;
          }
        }
      }
    }
    // Delete the offending firewall rule.
    if (cancelFirewall) {
      const resourceArray = logEntry.resourceName.split('/');
      const resourceName = resourceArray[resourceArray.length - 1];
      const compute = require('@google-cloud/compute')();
      return compute.firewall(resourceName).delete();
    }
  }
  return true;
};

As the function above uses the @google-cloud/compute Node.js module, be sure to include it as a dependency in the package.json file that accompanies the index.js file containing your source code:
{
  "name" : "audit-log-monitoring",
  "version" : "1.0.0",
  "description" : "monitor my audit logs",
  "main" : "index.js",
  "dependencies" : {
    "@google-cloud/compute" : "^0.4.1"
  }
}

In the image below, you can see what happened to a new firewall rule ("bad-idea-firewall") that did not meet the acceptable criteria as determined by the Cloud Function. It's important to note that this Cloud Function is not applied retroactively, so existing firewall rules that allow traffic on ports 80 and 443 are preserved.

This is just one example of many showing how you can leverage the power of Cloud Functions to respond to changes on GCP.


Conclusion


Cloud Audit Logging offers enterprises a simple way to track activity in applications built on top of GCP, and integrate logs with monitoring and logs analysis tools. To learn more and get trained on audit logging as well as the latest in GCP security, sign up for a Google Cloud Next ‘17 technical bootcamp in San Francisco this March.

Explore Stackdriver Monitoring data with Cloud Datalab



Google Stackdriver Monitoring allows users to create charts and alerts on monitoring metrics gathered across their Google Cloud Platform (GCP) and Amazon Web Services environments. Stackdriver users who want to drill deeper into their monitoring data can use Cloud Datalab, an easy-to-use tool for large-scale data exploration, analysis and visualization. Based on Jupyter (formerly IPython), Cloud Datalab allows you access to a thriving ecosystem, including Google BigQuery and Google Cloud Storage, plus many statistics and machine learning packages, including TensorFlow. We include notebooks of detailed tutorials to help you get started with your Stackdriver data, and the vibrant Jupyter community is a great source for more published notebooks and tips.

Libraries from the Jupyter community open up a variety of visualization options. For example, a heatmap is a compact representation of data, often used to visually highlight patterns. With a few lines of code included in the sample notebook, Getting Started.ipynb, we can visualize utilization across different instances to look for opportunities to reduce spend.
The Datalab environment also makes it possible to do advanced analytics. For example, in the included notebook, Time-shifted data.ipynb, we walk through time-shifting the data by day to compare today vs. historical data. This powerful analysis allows you to identify anomalies in your system metrics at a glance, by visualizing how they change from their historical values.

Compare today’s CPU utilization to the weekly average by zone

Stackdriver metrics, viewed with Cloud Datalab


Get started


The first step is to sign up for a 30-day free trial of Stackdriver Premium, which can monitor workloads on GCP and AWS. It takes two minutes to set up. Next, set up Cloud Datalab, which can be easily configured to run on Docker with this Quickstart. Sample code and notebooks for exploring trends in your data, analyzing group performance and heat map visualizations are included in the Datalab container.

Let us know what you think, and we’ll do our best to address your feedback and make analysis of your monitoring data even simpler for you.

Stackdriver Trace + Zipkin: distributed tracing and performance analysis for everyone



Editor's Note: You can now use Zipkin tracers with Stackdriver Trace. Go here to get started.

Part of the promise of the Google Cloud Platform is that it gives developers access to the same tools and technologies that we use to run at Google-scale. As the evolution of our Dapper distributed tracing system, Stackdriver Trace is one of those tools, letting developers analyze application latency and quickly isolate the causes of poor performance. While it was initially focused on Google App Engine projects, Stackdriver Trace also supports applications running on virtual machines or containers via instrumentation libraries for Node.js, Java, and Go (Ruby and .Net support will be available soon), and also through an API. Trace is available at no charge for all projects, and our instrumentation libraries are all open source with permissive licenses.
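For example, with the Node.js instrumentation library, enabling tracing is typically a one-line change at the very top of your application's entry point, along these lines (a minimal sketch; see the library's documentation for configuration options):

// Must be loaded before any other module so HTTP requests can be patched for tracing.
require('@google-cloud/trace-agent').start();

const express = require('express');
const app = express();
app.get('/', (req, res) => res.send('hello, traced world'));
app.listen(8080);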

Another popular distributed tracing system is Zipkin, which Twitter open-sourced in 2012. Zipkin provides a plethora of instrumentation libraries for capturing traces from applications, as well as a backend system for storing and presenting traces through a web interface. Zipkin is widely used; in addition to Twitter, Yelp and Salesforce are major contributors to the project, and organizations around the world use it to view and diagnose the performance of their distributed services.

Zipkin users have been asking for interoperability with Stackdriver Trace, so today we’re releasing a Zipkin server that allows Zipkin-compatible clients to send traces to Stackdriver Trace for analysis.

This will be useful for two groups of people: developers whose applications are written in a language or framework that Stackdriver Trace doesn’t officially support, and owners of applications that are currently instrumented with Zipkin who want access to Stackdriver Trace’s advanced analysis tools. We’re releasing this code open source on GitHub with a permissive license, as well as a container image for quick set-up.
As described above, the new Stackdriver Trace Zipkin Connector is a drop-in replacement for an existing Zipkin backend and continues to use the same Zipkin-compatible tracers. You no longer need to set up, manage or maintain a Zipkin backend. Alternatively, you can run the new collector on each service that's instrumented with Zipkin tracers.

There are currently Zipkin clients available for Java, .Net, Node.js, Python, Ruby and Go, with built-in integration to a variety of popular web frameworks.

Setup Instructions

Read the Using Stackdriver with Zipkin Collector guide to configure and collect trace data from your distributed tracer. If you're not already using a tracer client, you can find one in a list of the most popular Zipkin tracers.

FAQ

Q: What does this announcement mean if I’ve been wanting to use Stackdriver Trace but it doesn’t yet support my language?

If a Zipkin tracer supports your chosen language and framework, you can now use Stackdriver Trace by having the tracer library send its data to the Stackdriver Trace Zipkin Collector.

Q: What does this announcement mean if I currently use Zipkin?

You're welcome to set up the Stackdriver Trace Zipkin server and use it in conjunction with, or as a replacement for, your existing Zipkin backend. In addition to displaying traces, Stackdriver Trace includes advanced analysis tools like Insights and Latency Reports that work with trace data collected from Zipkin tracers. And because Stackdriver Trace is hosted by Google, you won't need to maintain your own backend services for trace collection and display.
Latency reports are available to all Stackdriver Trace customers

Q: What are the limitations of using the Stackdriver Trace Zipkin Collector?
This release has two known limitations:
  1. Zipkin tracers must support the correct Zipkin time and duration semantics.
  2. Zipkin tracers and the Stackdriver Trace instrumentation libraries can't append spans to the same traces, meaning that traces captured with one type of library won't contain spans for services instrumented with the other. For example, requests made to a Node.js web application traced with a Zipkin library and sent to Stackdriver Trace won't contain spans generated within a downstream API application or for the RPC calls it makes to the database, because Zipkin and Stackdriver Trace use different formats for propagating trace context between services.
For this reason, we recommend that projects wanting to use Stackdriver Trace either exclusively use Zipkin-compatible tracers along with the Zipkin Connector, or use instrumentation libraries that work natively with Stackdriver Trace (like the official Node.js, Java or Go libraries).

Q: Will this work as a full Zipkin server?

No, as the initial release only supports write operations. Let us know if you think that adding read operations would be useful, or submit a pull request through GitHub.

Q: How much does Stackdriver Trace cost?

You can use Zipkin with Stackdriver Trace at no cost.

Q: Can I use Stackdriver Trace to analyze my AWS, on-premises, or hybrid applications or is it strictly for services running on Google Cloud Platform?

Several projects already do this today! Stackdriver Trace will analyze all data submitted through its API, regardless of where the instrumented service is hosted, including traces and spans collected from the Stackdriver Trace instrumentation libraries or through the Stackdriver Trace Zipkin Connector.

Wrapping up

We here on the Stackdriver team would like to send out a huge thank you to Adrian Cole of the Zipkin open source project. Adrian’s enthusiasm, technical assistance, design feedback and help with the release process have been invaluable. We hope to expand this collaboration with Zipkin and other open source projects in the future. A huge shout out is also due to the developers on the Stackdriver team who developed this feature.

Like the Stackdriver Trace instrumentation libraries, the Zipkin Connector has been published on GitHub under the Apache license. Feel free to file issues there or submit pull requests for proposed changes.

Happy holidays and an anomalously great New Year



2016 is winding down, and we wanted to take this chance to thank you, our loyal readers, and wish you happy holidays. As a little gift to you, here’s a poem, courtesy of Mary Koes, a product manager on the Stackdriver team channeling the Clement Clarke Moore classic.

Twas the night before Christmas and all through the Cloud
Not a creature was deploying; it wasn't allowed.
The servers were all hosted in GCP or AWS
And Stackdriver was monitoring them so no one was stressed.


The engineers were nestled all snug in their beds
While visions of dashboards danced in their heads.
When then from my nightstand, there arose such a clatter,
I silenced my phone and checked what was the matter.


Elevated error rates and latency through the roof?
At this rate our error budget soon would go poof!
The Director OOO, the CTO on vacation,
Who would I find still manning their workstation?


Dutifully, I opened the incident channel on Slack
And couldn't believe when someone answered back.
SClaus was the user name of this tireless engineer.
I wasn't aware that this guy even worked here.


He wrote, "Wait while I check your Stackdriver yule Logs . . .
Yep, it seems the errors are all coming from your blogs."
Then in Error Reporting, he found the root cause
"Quota is updated. All fixed. :-)" typed SClaus.


Who this merry DevOps elf was, I never shall know.
For before we did our postmortem, away did he go.
Just before vanishing, he took time to write,
"Merry monitoring to all and to all a silent night!"

Happy holidays everyone, and see you in 2017!