OpenMetrics project accepted into CNCF Sandbox

For the past several months, engineers from Google Cloud, Prometheus, and other vendors have been aligning on OpenMetrics, a specification for metrics exposition. Today, the project was formally announced and accepted into the CNCF Sandbox, and we’re currently working on ways to support OpenMetrics in OpenCensus, a set of uniform tracing and stats libraries that work with multiple vendors’ services. This multi-vendor approach works to put architectural choices in the hands of developers.
OpenMetrics stems from the stats formats used inside of Prometheus and Google’s Monarch time-series infrastructure, which underpins both Stackdriver and internal monitoring applications. As such, it is designed to be immediately familiar to developers and capable of operating at extreme scale. With additional contributions and review from AppOptics, Cortex, Datadog, InfluxData, Sysdig, and Uber, OpenMetrics has begun the cross-industry collaboration necessary to drive adoption of a new specification.
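To give a feel for what "metrics exposition" means, here is a minimal sketch that renders a single counter as text. It assumes the conventions of the OpenMetrics text format (a `# TYPE` line, counter samples suffixed with `_total`, and a closing `# EOF`); the function names are ours and purely illustrative, and openmetrics.io remains the authority on the details.

```python
def escape_label_value(value: str) -> str:
    # Backslash, double quote, and newline are escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family as text.

    `samples` maps a tuple of (label_name, label_value) pairs to a count.
    Illustrative only: help-text escaping and timestamps are left out.
    """
    lines = [f"# TYPE {name} counter", f"# HELP {name} {help_text}"]
    for labels, count in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}_total{suffix} {count}")
    lines.append("# EOF")  # marks the end of the exposition
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests",
        "Requests served.",
        {(("method", "GET"),): 3, (("method", "POST"),): 1},
    ), end="")
```

A scraper that understands the format can read this output from any vendor's endpoint, which is the point of agreeing on one specification.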

OpenCensus provides automatic instrumentation, APIs, and exporters for stats and distributed traces across C++, Java, Go, Node.js, Python, PHP, Ruby, and .NET. Each OpenCensus library allows developers to automatically capture distributed traces and key RPC-related statistics from their applications, add custom data, and export telemetry to their back-end of choice. Google has been a key collaborator in defining the OpenMetrics specification, and we’re now focusing on how best to implement it inside of OpenCensus.

“Google has a history of innovation in the metric monitoring space, from its early success with Borgmon, which has been continued in Monarch and Stackdriver. OpenMetrics embodies our understanding of what users need for simple, reliable and scalable monitoring, and shows our commitment to offering standards-based solutions,” said Sumeer Bhola, Lead Engineer on Monarch and Stackdriver at Google.

For more information about OpenMetrics, please visit openmetrics.io. For more information about OpenCensus and how you can quickly enable trace and metrics collection from your application, please visit opencensus.io.

By Morgan McLean, Product Manager for OpenCensus and Stackdriver APM

OpenCensus’s journey ahead: enhanced feature set

This is the second half of a series of blog posts about what’s coming next for OpenCensus. The OpenCensus Roadmap is composed of two pillars: increased language, framework, and platform coverage, and the addition of more powerful features.

In this blog post we’re going to discuss the second pillar: new functionality that makes OpenCensus more powerful. This includes dramatically improved sampling capabilities and new types of telemetry that we’re looking to capture.

More Power

Intelligent Sampling

In addition to expanding the list of languages and frameworks that OpenCensus supports out of the box, we’ll also be increasing the usefulness of existing functionality.

Services instrumented with OpenCensus currently sample new requests at random, at a configurable rate. New requests are those that arrive without trace context, usually directly from clients. While this provides an effective view into application latency, developers are mostly interested in traces of particularly slow requests, or of requests that capture a useful event such as an exception.
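As a rough illustration of this behavior, the sketch below makes a random decision only for requests that arrive without context and otherwise follows the caller's decision. The class and parameter names are hypothetical and are not the OpenCensus API.

```python
import random
from typing import Optional


class ProbabilitySampler:
    """Samples root requests at a fixed rate (illustrative, not the OpenCensus API)."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def should_sample(self, parent_sampled: Optional[bool] = None) -> bool:
        # A request that arrives with trace context inherits the decision
        # already made upstream, so a trace is never half-recorded.
        if parent_sampled is not None:
            return parent_sampled
        # A new request (no context) is sampled at random at the configured rate.
        return self._rng.random() < self.rate


if __name__ == "__main__":
    sampler = ProbabilitySampler(rate=0.1)
    kept = sum(sampler.should_sample() for _ in range(10_000))
    print(f"kept {kept} of 10000 root requests")
```

The limitation is visible in the code: the decision is made before anything is known about the request, so a slow or failing request is no more likely to be kept than a fast one.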

We’re adding support for OpenCensus to make deferred sampling decisions - that is, to sample requests after they’ve propagated through several systems, while still preserving the full critical path of the trace. Though the feature is just starting development, we’re focusing on making sampling more intelligent - for example, by triggering traces based on accumulated latency, errors, and debugging events. Expect to hear more about this soon.
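The feature is still being designed, so the following is only a sketch of the general idea under our own assumptions: spans are buffered until a trace finishes, and the trace is kept if its accumulated latency crosses a threshold or any span recorded an error. It runs in a single process and sums span durations as a stand-in for accumulated latency; a real implementation has to coordinate across services. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


@dataclass
class DeferredSampler:
    """Buffers spans and decides whether to keep a trace once it has finished."""

    latency_threshold_ms: float
    _pending: Dict[str, List[Span]] = field(default_factory=dict)

    def record(self, trace_id: str, span: Span) -> None:
        # Hold every span until the sampling decision can be made.
        self._pending.setdefault(trace_id, []).append(span)

    def finish(self, trace_id: str) -> List[Span]:
        """Return the buffered spans if the trace is worth keeping, else an empty list."""
        spans = self._pending.pop(trace_id, [])
        accumulated_ms = sum(s.duration_ms for s in spans)
        keep = accumulated_ms >= self.latency_threshold_ms or any(s.error for s in spans)
        return spans if keep else []


if __name__ == "__main__":
    sampler = DeferredSampler(latency_threshold_ms=100)
    sampler.record("trace-1", Span("frontend", 20))
    sampler.record("trace-1", Span("database", 95))
    print([s.name for s in sampler.finish("trace-1")])
```

Because the full set of spans is held until the decision, a kept trace still contains its whole critical path, at the cost of buffering spans for traces that end up discarded.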

New Telemetry, Including Logs and Errors

As we mentioned in our last blog post, our ambition is for OpenCensus to become a ubiquitous observability framework, meaning that collecting traces and stats alone won’t be enough. Correlating traces and tags with logs and errors represents an obvious next step, and we’re currently working through what this might look like. Longer term, this list could grow to include profiles and other signals.

The topic of what signals will come next is worthy of its own blog post, and you can expect us to start talking about this more in the coming months.

Server-provided Traces and Metrics

Distributed applications can obtain observability into their own performance by instrumenting themselves with OpenCensus; however, visibility into the performance of the external services or APIs that they call is still limited. For example, imagine a web service that calls into Google Cloud Platform’s Cloud Bigtable service: the application developer would have visibility into their client-side traces, but would not be able to tell how much time Cloud Bigtable took to respond versus the time spent on the network. We’re working on adding server-side traces and metrics: essentially a way for service providers to share a summary of their server-side traces and metrics with callers.
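To make the Bigtable example concrete: if the service reported how long it spent handling a call, the client could subtract that from the latency it observed. The sketch below shows only that arithmetic; the function name and the idea of a single server-reported duration are our assumptions, not a description of the eventual feature.

```python
def split_latency(client_observed_ms: float, server_reported_ms: float) -> dict:
    """Split a client-observed latency into server time and everything else.

    "Everything else" is network transit plus any client-side overhead;
    the two cannot be separated from these numbers alone.
    """
    if server_reported_ms < 0 or client_observed_ms < 0:
        raise ValueError("durations must be non-negative")
    if server_reported_ms > client_observed_ms:
        raise ValueError("server time cannot exceed the client-observed latency")
    return {
        "server_ms": server_reported_ms,
        "network_and_client_ms": client_observed_ms - server_reported_ms,
    }


if __name__ == "__main__":
    # The client saw 120 ms in total and the service reports 45 ms of work.
    print(split_latency(client_observed_ms=120.0, server_reported_ms=45.0))
```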

Cluster-wide z-pages

Today, OpenCensus provides a stand-alone application called z-pages that includes an embedded web server and displays configuration parameters and trace information in real time, as captured from any OpenCensus libraries running on the same host. By accessing a z-page, developers can configure the sampling rate for the local instance, or view traces, tags, and stats as they’re being processed.

Longer term, we wish to extend this functionality to enable cluster-wide z-pages, which would provide the same functionality as the current z-pages experience, aggregated over all of the instances of a particular service. We’re still discussing different implementation options, and whether we can tie this into other aggregation-related workstreams that we’re already pursuing.
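The implementation is undecided, but the aggregation itself is easy to picture. As a minimal sketch, assume each instance can hand back a snapshot of its counters as a plain dictionary (a hypothetical shape); a cluster-wide view then sums them by name.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_snapshots(snapshots: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-instance counters into a single service-wide view."""
    total: Counter = Counter()
    for snapshot in snapshots:
        total.update(snapshot)  # Counter.update adds counts rather than replacing them
    return dict(total)


if __name__ == "__main__":
    # Hypothetical snapshots from three instances of the same service.
    per_instance = [
        {"rpc_started": 120, "rpc_errors": 2},
        {"rpc_started": 95},
        {"rpc_started": 130, "rpc_errors": 7},
    ]
    print(aggregate_snapshots(per_instance))
```

Counters sum cleanly; latency distributions and sampled traces need more care, which is part of why the design is still open.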

Wrapping up

Do the strategy and roadmap above resonate with what you’d want from the OpenCensus libraries? We’d love to hear your ideas and what you’d like to see prioritized.

As we mentioned in our last post, none of this is possible without the support and participation from the community. Check out our repo and start contributing. No contribution or idea is too small. Join other developers and users on the OpenCensus Gitter channel. We’d love to hear from you.

By Pritam Shah and Morgan McLean, Census team

OpenCensus’s journey ahead: platforms and languages

We recently blogged about the value of OpenCensus and how Google uses Census internally. Today, we want to share more about our long-term vision for OpenCensus.

The goal of OpenCensus is to be a ubiquitous observability framework that allows developers to automatically collect, aggregate, and export traces, metrics, and other telemetry from their applications. We plan on getting there by building easy-to-use libraries that automatically integrate with as many technologies and frameworks as possible.

Our roadmap has two themes: increased language, framework, and platform coverage, and the addition of more powerful features. Today, we’ll discuss the first theme: increased coverage.

Increasing Coverage

More Language Coverage

In January, we released OpenCensus for Java, Go, and C++ as well as tracing support for Python, PHP, and Ruby. We’re about to start development of OpenCensus for Node.js and .NET, and you’ll see activity on these repositories ramp up in the coming quarter.

Integration with more Frameworks, Platforms, and Clients

We want to provide a great out-of-the-box experience, so we need to automatically capture traces and metrics with as little developer effort as possible. To achieve this, we’ll be creating integrations for popular web frameworks, RPC frameworks, and storage clients. This will enable automatic context propagation, span creation, and trace annotations, without requiring extra work on behalf of developers.

As a basic example, OpenCensus already integrates with Go’s default gRPC and HTTP handlers to generate spans (with relevant annotations) and to pass context.
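To show what "passing context" involves, here is a minimal sketch of the mechanism that such an integration automates: write the current trace context into outgoing request headers, read it back on the receiving side, and start a child span in the same trace. The header name and wire format are hypothetical and chosen only for illustration; they are not what OpenCensus or gRPC use.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

TRACE_HEADER = "x-example-trace"  # hypothetical header name, for illustration only


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    headers[TRACE_HEADER] = f"{ctx.trace_id}:{ctx.span_id}:{int(ctx.sampled)}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the context from incoming headers; None if absent or malformed."""
    raw = headers.get(TRACE_HEADER)
    if raw is None:
        return None
    parts = raw.split(":")
    if len(parts) != 3 or parts[2] not in ("0", "1"):
        return None
    return SpanContext(parts[0], parts[1], parts[2] == "1")


def start_span(headers: Dict[str, str]) -> SpanContext:
    """Start a span for an incoming request, continuing the caller's trace if any."""
    parent = extract(headers)
    span_id = secrets.token_hex(8)
    if parent is None:
        # No context: this request starts a new trace (left unsampled here).
        return SpanContext(secrets.token_hex(16), span_id, sampled=False)
    return SpanContext(parent.trace_id, span_id, parent.sampled)


if __name__ == "__main__":
    outgoing: Dict[str, str] = {}
    inject(SpanContext("trace-abc", "span-1", True), outgoing)
    print(start_span(outgoing))
```

When a framework integration performs these steps inside its handlers, application code gets connected traces without touching headers at all.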

More complex integrations will provide more information to developers. Here’s an example of a trace captured with our upcoming MongoDB instrumentation, shown on Stackdriver Trace and AWS X-Ray:

[Image: A MongoDB trace shown in Stackdriver Trace]

[Image: The same trace captured in X-Ray]

Istio

OpenCensus will soon have out-of-the-box tracing and metrics collection in Istio. We’re currently working through our initial designs and implementation for integrations with the Envoy sidecar and the Istio Mixer service. Our goal is to provide Istio users with a great out-of-the-box tracing and metrics collection experience.

Kubernetes

We have two primary use cases in mind for Kubernetes deployments: providing cluster-wide visibility via z-pages, and better labeling of traces, stats, and metrics. Cluster-wide z-pages will allow developers to view telemetry in real time across an entire Kubernetes deployment, independently of their back-end. This is incredibly useful when debugging immediate high-impact issues like service outages.

Client Application Support

OpenCensus currently provides observability into back-end services; however, this doesn’t tell the whole story about end-to-end application performance. Throughout 2018, we plan to add instrumentation for client and front-end web applications, so developers can get traces that begin on customers’ devices and reflect actual perceived latency, as well as metrics captured from client code.

We aim to add support for instrumenting Android, iOS, and front-end JavaScript, though this list may grow or change. Expect to hear more about this later in 2018.

Next Up

Next week we’ll discuss some of the new features that we’re looking to bring to OpenCensus, including notable enhancements to the trace sampling logic.

None of this is possible without the support and participation of the community. Please check out our repository and start contributing; we welcome contributions of any size, however you want to take part. You can join other developers and users on the OpenCensus Gitter channel. We’d love to hear from you!

By Pritam Shah and Morgan McLean, Census team

The value of OpenCensus

This post is the second in a series about OpenCensus. You can find the first post here.

Early this year we open sourced OpenCensus, a distributed tracing and stats instrumentation framework. Today, we continue our journey by discussing the history and motivation behind the project here at Google, and what benefits OpenCensus has to offer. As OpenCensus continues to gain partners we’ll be shifting the focus away from Google, but we wanted to use this post as an opportunity to answer some of the questions that we’re most commonly asked at meetings and events.

Why did Google open source this? Why now?

Google open sources a lot of projects and we’ve begun documenting some of the reasons why on the Google Open Source website. What about OpenCensus specifically? There are many reasons it made sense for us to release this project and get others involved.

We had already released other related projects. The Census team had been eager to share their work with the public for a while. With projects like gRPC and Istio going open source, it made sense to release OpenCensus as well.

It helped us serve our customers better. Teams with performance-sensitive APIs like Bigtable and Spanner needed more insight into their customers’ calling patterns while debugging issues, and wanted a way to connect customers’ traced requests to equivalent traces inside of Google.

Managing integrations ourselves is costly. The Stackdriver Trace engineering team had been investing considerable resources building their own instrumentation libraries across seven languages, and it became apparent that the cost of building and maintaining integrations into web and RPC frameworks would continue in perpetuity. Releasing these libraries might encourage framework providers to manage these integrations instead.

We have a vested interest in everyone else’s reliability and performance. As a web search and cloud services provider, Google’s users benefit as web services and applications become increasingly reliable and performant. Popularizing distributed tracing and app-level metrics is one way to achieve this. This is especially important with the rising popularity of microservices-based architectures, which are difficult to debug without distributed tracing.

This expands the market for other services. By making tracing and app-level metrics more accessible, we grow the overall market for monitoring and application performance management (APM) tools, which benefits Stackdriver Monitoring and Stackdriver Trace.

As these factors came into focus, the decision to open source the project became clear.

Benefits to Partners and the Community

Google’s reasons for developing and promoting OpenCensus apply to partners at all levels.

Service developers reap the benefits of having automatic traces and stats collection, along with vendor-neutral APIs for manually interacting with these. Developers who use open source backends like Prometheus or Zipkin benefit from having a single set of well-supported instrumentation libraries that can export to both services at once.

For APM vendors, being able to take advantage of already-provided language support and framework integrations is huge, and the exporter API allows traces and metrics to be sent to an ingestion API without much additional work. Developers who might have been working on instrumentation code can now focus on other more important tasks, and vendors get traces and metrics back from places they previously didn’t have coverage for.
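The shape of this arrangement can be sketched in a few lines: instrumentation records a span once, and every registered exporter receives it. The classes below are hypothetical stand-ins written for this post, not the OpenCensus exporter API; a real exporter would translate the span and send it to a vendor's ingestion API instead of storing it in memory.

```python
from typing import Dict, List, Protocol


class Exporter(Protocol):
    def export(self, span: Dict[str, object]) -> None: ...


class InMemoryExporter:
    """Stand-in for a backend-specific exporter; it just remembers what it was sent."""

    def __init__(self, backend: str):
        self.backend = backend
        self.spans: List[Dict[str, object]] = []

    def export(self, span: Dict[str, object]) -> None:
        self.spans.append(span)


class Tracer:
    """Fans each finished span out to every registered exporter."""

    def __init__(self) -> None:
        self._exporters: List[Exporter] = []

    def register(self, exporter: Exporter) -> None:
        self._exporters.append(exporter)

    def end_span(self, span: Dict[str, object]) -> None:
        for exporter in self._exporters:
            exporter.export(span)


if __name__ == "__main__":
    tracer = Tracer()
    first, second = InMemoryExporter("backend-a"), InMemoryExporter("backend-b")
    tracer.register(first)
    tracer.register(second)
    tracer.end_span({"name": "GET /users", "duration_ms": 12.5})
    print(len(first.spans), len(second.spans))
```

The instrumentation code never names a backend, which is what lets a vendor add support by writing one exporter rather than a full set of language libraries.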

Cloud and API providers have the added benefit of being able to include OpenCensus in client libraries, allowing customers to gain insight into performance characteristics and debug issues without having to contact support. In situations where customers are still not able to diagnose their own issues, customer traces can be matched with internal traces for faster root cause analysis, regardless of which tracing or APM product they use.

What’s Next

If you missed the first post in our series, you can read it now. In upcoming blog posts and videos we’ll discuss:
  • The current schedule for OpenCensus on a language-by-language basis
  • Guides on how to add custom instrumentation to your application
  • Techniques for adding more automatic integrations to OpenCensus
  • Our long-term vision for OpenCensus
Thanks for reading – we’ll see you on GitHub!

By Pritam Shah and Morgan McLean, Census team

How Google uses Census internally

This post is the first in a series about OpenCensus, a set of open source instrumentation libraries based on what we use inside Google. This series will cover the benefits of OpenCensus for developers and vendors, Google’s interest in open sourcing instrumentation tools, how to get started with OpenCensus, and our long-term vision.

If you’re new to distributed tracing and metrics, we recommend Adrian Cole’s excellent talk on the subject: Observability Three Ways.

Gaining Observability into Planet-Scale Computing

Google adopted or invented new technologies, including distributed tracing (Dapper) and metrics processing, in order to operate some of the world’s largest web services. However, building analysis systems didn’t solve the difficult problem of instrumenting and extracting data from production services. This is what Census was created to do.

The Census project provides uniform instrumentation across most Google services, capturing trace spans, app-level metrics, and other metadata like log correlations from production applications. One of the biggest benefits of uniform instrumentation to developers inside of Google is that it’s almost entirely automatic: any service that uses gRPC automatically collects and exports basic traces and metrics.

OpenCensus offers these capabilities to developers everywhere. Today we’re sharing how we use distributed tracing and metrics inside of Google.

Incident Management

When latency problems or new errors crop up in a highly distributed environment, visibility into what’s happening is critical. For example, when the latency of a service crosses expected boundaries, we can view distributed traces in Dapper to find where things are slowing down. Or when a request is returning an error, we can look at the chain of calls that led to the error and examine the metadata captured during a trace (typically logs or trace annotations). This is effectively a bigger stack trace. In rare cases, we enable custom trigger-based sampling which allows us to focus on specific kinds of requests.

Once we know there’s a production issue, we can use Census data to determine the regions, services, and scope (one customer vs many) of a given problem. We can use service-specific diagnostics pages, called “z-pages,” to monitor problems and the results of the fixes we deploy. These pages are hosted locally on each service and provide a firehose view of recent requests, stats, and other performance-related information.

Performance Optimization

At Google’s scale, we need to be able to instrument and attribute costs for services. We use Census to help us answer questions like:
  • How much CPU time does my query consume?
  • Does my feature consume more storage resources than before?
  • What is the cost of a particular user operation at a particular layer of the stack?
  • What is the total cost of a particular user operation across all layers of the stack?
We’re obsessed with reducing the tail latency of all services, so we’ve built sophisticated analysis systems that process traces and metrics captured by Census to identify regressions and other anomalies.
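The last two questions in the list above come down to grouping the same measurements in different ways. As a simplified sketch, assume each measurement is a tuple of operation, layer, and CPU milliseconds (a hypothetical record shape, not how Census stores data):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (operation, layer, cpu_ms); a hypothetical record shape for illustration.
Record = Tuple[str, str, float]


def cost_by_layer(records: Iterable[Record], operation: str) -> Dict[str, float]:
    """Cost of one user operation at each layer of the stack."""
    costs: Dict[str, float] = defaultdict(float)
    for op, layer, cpu_ms in records:
        if op == operation:
            costs[layer] += cpu_ms
    return dict(costs)


def total_cost(records: Iterable[Record], operation: str) -> float:
    """Cost of one user operation summed across all layers."""
    return sum(cost_by_layer(records, operation).values())


if __name__ == "__main__":
    records = [
        ("search", "frontend", 1.5),
        ("search", "storage", 4.0),
        ("search", "frontend", 0.5),
        ("upload", "storage", 9.0),
    ]
    print(cost_by_layer(records, "search"))
    print(total_cost(records, "search"))
```

This only works if every layer labels its measurements with the same operation name, which is what uniform instrumentation and tag propagation provide.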

Quality of Service

Google also improves performance dynamically depending on the source and type of traffic. Using Census tags, traffic can be directed to more appropriate shards, or we can do things like load shedding and rate limiting.
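As a toy illustration of tag-driven load shedding, the sketch below admits or rejects a request based on a priority tag and the server's current utilization. The tag key, priority names, and thresholds are all invented for this example and say nothing about Google's actual policies.

```python
from typing import Dict


def admit(tags: Dict[str, str], current_load: int, capacity: int) -> bool:
    """Decide whether to serve a request, using a tag carried with it.

    The "priority" tag key, its values, and the 80% threshold are
    hypothetical choices made for this sketch.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    priority = tags.get("priority", "batch")  # untagged traffic is treated as batch
    utilization = current_load / capacity
    if utilization >= 1.0:
        return priority == "critical"  # at capacity: only critical traffic gets through
    if utilization >= 0.8:
        return priority != "batch"  # under pressure: shed batch traffic first
    return True


if __name__ == "__main__":
    print(admit({"priority": "batch"}, current_load=85, capacity=100))
    print(admit({"priority": "interactive"}, current_load=85, capacity=100))
```

Because tags travel with the request, a service deep in the stack can make this decision without knowing anything else about where the traffic came from.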

Next week we’ll discuss Google’s motivations for open sourcing Census, then we’ll shift the focus back onto the open source project itself.

By Pritam Shah and Morgan McLean, Census team