This is the second half of a series of blog posts about
what’s coming next for OpenCensus. The OpenCensus Roadmap is composed of two pillars: increased language, framework, and platform coverage, and the addition of more powerful features.
In this blog post we’re going to discuss the second pillar: new functionality that makes OpenCensus more powerful. This includes dramatically improved sampling capabilities and new types of telemetry that we’re looking to capture.
More Power
Intelligent Sampling
In addition to expanding the list of languages and frameworks that OpenCensus supports out of the box, we’ll also be increasing the usefulness of existing functionality.
Services instrumented with OpenCensus currently randomly (at a configurable rate) sample new requests (without context, usually received directly from clients). While this does provide an effective view into application latency, developers are mostly interested in traces of particularly slow requests or requests that also capture a useful event, such as an exception.
We’re adding support for OpenCensus to make deferred sampling decisions - that is, to sample requests after they’ve propagated through several systems, while still preserving the full critical path of the trace. Though the feature is just starting development, we’re focusing on making sampling more intelligent - for example, by triggering traces based on accumulated latency, errors, and debugging events. Expect to hear more about this soon.
New Telemetry, Including Logs and Errors
As we mentioned in
our last blog post, our ambition is for OpenCensus to become a ubiquitous observability framework, meaning that collecting traces and stats alone won’t be enough. Correlating traces and tags with logs and errors represents an obvious next step, and we’re currently working through what this might look like. Longer term, this list could grow to include profiles and other signals.
The topic of what signals will come next is worth of its own blog post, and you can expect us to start talking about this more in the coming months.
Server-provided Traces and Metrics
Distributed applications can obtain observability into their own performance by instrumenting themselves with OpenCensus, however visibility into the performance of external services or APIs that they call into is still limited. For example, imagine a web service that calls into Google Cloud Platform’s Cloud Bigtable service: the application developer would have visibility into their client side traces but would not be able to tell how much time Cloud BigTable took to respond vs time taken by network. We’re working on adding server side traces and metrics - essentially a way for service providers to summarize server side traces and metrics.
Cluster wide Z pages
Today, OpenCensus provides a stand-alone application called
z-pages that includes an embedded web server and displays configuration parameters and trace information in real-time, as captured from any OpenCensus libraries running on the same host. By accessing a z-page, developers can configure sampling rate for the local instance, or view traces, tags, and stats as they’re being processed in real-time.
Longer-term, we wish to extend this functionality to enable cluster wide z-pages, which could provide the same functionality as the current z-pages experience, aggregated over all of the instances of a particular service. We’re still discussing different implementation options, and if we can tie this into other aggregation-related workstreams that we’re already pursuing.
Wrapping up
Does the strategy and roadmap above resonate with what you’d want to get from OpenCensus library? We’d love to hear your ideas and what you’d like to see prioritized.
As we mentioned in our last post, none of this is possible without the support and participation from the community. Check out our
repo and start contributing. No contribution or idea is too small. Join other developers and users on the OpenCensus
Gitter channel. We’d love to hear from you.
By Pritam Shah and Morgan McLean, Census team