Tag Archives: Tracing

W3C Trace Context Specification: What it Means for You

Since the first days of Google Cloud Platform (GCP), Google has been at the forefront of making your applications more observable. Beyond Stackdriver, our most visible impact in this space is OpenTelemetry, which we initiated in 2017 (as OpenCensus) and which has grown into a huge community that includes the majority of APM / monitoring vendors and cloud platforms.

While OpenTelemetry allows developers to easily capture distributed traces and metrics from their own services, there’s also a need to trace requests as they propagate through components that developers don’t directly control, like managed services, load balancers, network hardware, etc. To solve this we co-defined a prototype HTTP header that these components can rely on, gathered partners, and moved the work into the W3C.

This work is now complete, and the W3C Trace Context format is now an official standard. Once implemented in GCP, this will make our services even easier to manage, both with Stackdriver and with other third-party distributed tracing tools. We explain more in the official post on the W3C blog, which I’ve copied below:

The W3C Distributed Tracing working group has moved the Trace Context specification to the next maturity level. The specification is already being adopted and implemented by many platforms and SDKs. This article describes the Trace Context specification and how it improves troubleshooting and monitoring of modern distributed apps.

The W3C Trace Context specification defines the format for propagating distributed tracing context between services. Distributed tracing makes it easy for developers to find the causes of issues in highly-distributed microservices applications by tracking how a single interaction was processed across multiple services. Each step of a trace is correlated through an ID that is passed between services, and W3C Trace Context now defines a standard for these context propagation headers.
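To make the header format concrete, here is a minimal sketch of reading and writing the traceparent header that the specification defines (version, trace ID, parent span ID, and flags, separated by hyphens). The function names are illustrative and not part of any library, and the sketch only handles the four-field layout used by version 00.

```javascript
// Layout of a traceparent value: version-traceid-parentid-flags (lowercase hex).
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a traceparent value; returns null when the value is malformed.
function parseTraceparent(value) {
  const match = TRACEPARENT_PATTERN.exec(String(value).trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // The specification treats version "ff" and all-zero IDs as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serializes a context back into a version "00" header value.
function formatTraceparent({ traceId, parentId, sampled }) {
  return `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;
}

// Example: a service reads the incoming header and forwards the same trace ID
// with its own span ID (the span ID below is a made-up value).
const incomingContext = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoingHeader = formatTraceparent({ ...incomingContext, parentId: 'b7ad6b7169203331' });
```

The trace ID stays constant across services while each hop substitutes its own span ID, which is what lets a backend stitch the spans into one trace.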

Until now, different tracing systems have defined their own headers. Examples include Zipkin’s B3 format and X-Google-Cloud-Trace. Developers, APM vendors, and cloud platform hosts have long wanted a common context propagation format, as compatibility provides numerous benefits:
  • Web and RPC frameworks that use this standard to provide context propagation out of the box will also offer cross-service log correlation, even for developers who haven’t set up distributed tracing.
  • API producers can record the trace IDs of requests from API consumers and provide additional spans or metadata to their customers for a given traced request. Producers can also correlate customer trace IDs to internal traces when debugging technical issues raised by consumers.
  • Networking infrastructure (proxies, load balancers, routers, etc.) can both ensure that context propagation headers are not removed from requests passing through them, and can record spans or logs for a given trace, without having to support multiple vendor-specific formats. Potential examples of these include router appliances, cloud load balancers, and sidecar proxies like Envoy.
  • Instrumentation can be further decoupled from a developer’s choice of APM vendor. For example, using both OpenTelemetry and a given vendor’s agents, a developer can instrument different services in an application, and traces will flow through the system and be processed correctly by the vendor’s backend.
  • Web browsers and other clients can use these identifiers to correlate their telemetry with traces collected from backend services. This functionality is currently being defined.
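The log correlation benefit from the first bullet can be sketched as follows. This is an illustration only: the helper names and the trace_id field name are assumptions, and the headers object is assumed to use lowercase keys, as Node.js does for incoming requests.

```javascript
// Hypothetical helper: extracts the trace ID from incoming request headers.
function traceIdFromHeaders(headers) {
  const value = typeof headers.traceparent === 'string' ? headers.traceparent : '';
  const match = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/.exec(value);
  return match ? match[1] : null;
}

// Builds a structured log line that carries the trace ID, so logs written by
// different services for the same request can be joined on that field.
function logRecord(headers, message) {
  return JSON.stringify({ message, trace_id: traceIdFromHeaders(headers) });
}
```

A framework that does this automatically gives developers correlated logs even when no tracing backend is configured.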
To address this need, a group of cloud providers, open source contributors, and APM vendors started defining a standard HTTP context propagation header that would replace their homegrown formats. This specification has been discussed and iterated on over the past two years, and the group working on it has grown significantly over that time. Sponsors include Google, Microsoft, Dynatrace, and New Relic (W3C members), and the group was officially moved into the W3C in 2018 for the work to proceed under the guidance of an official standards body and to spur even greater adoption.

Trace Context has since been adopted by OpenTelemetry (which enables it by default and also serves as the reference implementation), Azure services, Dynatrace, Elastic, Google Cloud Platform, Lightstep, and New Relic. We are tracking adoption in this list.

This first phase of work has focused on HTTP, as it is commonly used and has no built-in affordances for trace context propagation (gRPC and some newer RPC systems do). The same group of committee members is also working to define trace context propagation in other formats, starting with AMQP and MQTT for IoT; other upcoming topics include context propagation from clients and web browsers.

By Morgan McLean, OpenTelemetry + Stackdriver

OpenCensus Web: Unlocking Full End-to-End Observability for Your Entire Stack

OpenCensus Web is a tool to trace and monitor the user-perceived performance of your web pages. It can help determine whether or not your web pages are experiencing performance issues that you might otherwise not know how to diagnose.

Web application owners want to monitor the operational health of their applications so that they can better understand actual user performance; however, capturing relevant telemetry from your web applications is often very difficult. Today, we are introducing OpenCensus Web (OC Web) to make instrumenting and exporting metrics and distributed traces from web applications simple and automatic.

Background

The OpenCensus project provides a set of language-specific instrumentation libraries that collect traces and metrics from applications and export them to tracing and monitoring backends like Prometheus, Zipkin, Jaeger, Stackdriver, and others.

The OpenCensus Web library is an implementation of OpenCensus that focuses on frontend web application code that executes in the browser. OC Web instruments web pages and collects user-side performance data, including latency and distributed traces, which gives developers the information to diagnose frontend issues and monitor overall application health.

Overshadowing the work on OC Web, the wider OpenCensus family of projects is merging with OpenTracing into OpenTelemetry. OpenCensus Web’s functionality will be migrated into OpenTelemetry JS once this project is ready, although OC Web will continue working as an alpha release in the meantime.

Architecture

OC Web interacts with three application components:
  • Frontend web server: renders the initial HTML to the browser including the OC Web library code and configuration. This would typically be instrumented with an OpenCensus server-side library (Go, Java, etc.). We also suggest that you create an endpoint in the server that receives HTTP/JSON traces and proxies to the OpenCensus Agent.
  • Browser JS: the OC Web library code that runs in the browser. This measures user interactions and collects browser data and writes them to the OpenCensus Agent as spans via HTTP/JSON.
  • OpenCensus Agent: receives traces from the frontend web server proxy endpoint or directly from the browser JS, and exports them to a trace backend (e.g. Stackdriver, Zipkin).
OC Web requires the OpenCensus Agent, which will proxy and re-export telemetry to your backend of choice. For more details see the documentation.


Features

Initial page load tracing

You can use OC Web to capture traces of initial page loads, which will even capture events that take place before the OC Web library was loaded by the browser! Initial page load traces show you which resources may be causing poor website performance, and contain data that you can’t typically capture from a distributed tracing system.

To measure the time of the overall initial page load interaction, OC Web waits until after the document load event and generates spans from the initial load performance timings via the browser's Navigation Timing and Resource Timing APIs. Below is a sample trace from OC Web that has been exported to Zipkin and captured from the initial load example app. Notice that there is an overall ‘nav./’ span for the user navigation experience until the browser load event fires.

This example also includes ‘/’ spans for the client and server side measurements of the initial HTML load. These spans are connected by the server sending back a ‘window.traceparent’ variable in the W3C Trace Context format, which is necessary because the browser does not send a trace context header for the initial page load. The server side spans also indicate how much time was spent parsing and rendering the template:

Notice the long js task span in the previous image, which indicates a CPU-bound JavaScript event loop that took 80.095ms, as measured by the Long Tasks browser API.
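As a rough sketch of how such an initial load trace could be assembled, the following code builds span-like objects from Navigation Timing, Resource Timing, and Long Tasks entries, and reuses the trace ID found in window.traceparent. The function names and the object shape are assumptions made for illustration and are not OC Web's internal span model; only the ‘nav.’ prefix comes from the trace described above.

```javascript
// Reads the trace ID that the server rendered into window.traceparent, if any.
function serverTraceId(globalObject) {
  const value = typeof globalObject.traceparent === 'string' ? globalObject.traceparent : '';
  const match = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/.exec(value);
  return match ? match[1] : null;
}

// Builds span-like objects for the initial page load from timing entries.
function initialLoadSpans(globalObject, navigation, resources, longTasks) {
  const traceId = serverTraceId(globalObject);
  const toSpan = (name, startMs, endMs) => ({ traceId, name, startMs, endMs });
  return [
    // Root span covering the navigation until the load event has finished.
    toSpan('nav.' + new URL(navigation.name).pathname, navigation.startTime, navigation.loadEventEnd),
    // One span per fetched resource, taken from the Resource Timing entries.
    ...resources.map((entry) => toSpan(new URL(entry.name).pathname, entry.startTime, entry.responseEnd)),
    // One span per long task; its end is the start time plus the duration.
    ...longTasks.map((entry) => toSpan('Long JS task', entry.startTime, entry.startTime + entry.duration)),
  ];
}

// Browser usage, run once the load event handler has returned so that
// loadEventEnd is populated (long task entries would be collected separately
// with a PerformanceObserver and passed in as the last argument):
//   const [navigation] = performance.getEntriesByType('navigation');
//   const resources = performance.getEntriesByType('resource');
//   const spans = initialLoadSpans(window, navigation, resources, []);
```

Because all of these timings are recorded by the browser itself, the spans can describe work that happened before any tracing library had loaded.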

Span annotations for DOM and network events

Spans captured by OC Web also include detailed annotations for DOM events like `domInteractive` and `first-paint`, as well as network events like domainLookupStart and secureConnectionStart. Here is a similar trace exported to Stackdriver Trace with the annotations expanded:


User Interactions

For single page applications there are often subsequent interactions after the initial load (e.g. user clicks a button or navigates to a different section of the page). Measuring end-user interactions within a browser application adds useful data for your application:
  • Ability to relate an initial page render with subsequent on-page interactions
  • Visibility into slowness as perceived by the end user, for example, an unresponsive page after clicking
Currently, OC Web tracks clicks and route transitions by monkey-patching the Angular Zone.js library. OC Web tracks the subsequent synchronous and asynchronous tasks (e.g. setTimeouts, XHRs, etc.) caused by the interaction even if there are several concurrent interactions.

Automatic tracing for click events

All browser click events are traced as long as the click is made on a DOM element (e.g. button) and the clicked element is not disabled. When the user clicks the element, a new Zone is created to measure this interaction and determine the total time.

To name this root span, we provide developers with the option of adding the attribute data-ocweb-id to elements to give the interaction a custom name. In the following example, the resulting name will be ‘Save edit user info’:
<button type="submit" data-ocweb-id="Save edit user info"> Save changes </button>
This helps you to identify the traces related to a specific element. It can also avoid ambiguity when there are similar interactions. If you don’t add this attribute, OC Web builds the name from the tag name, the DOM element ID, and the event involved in the interaction. For example, clicking this button:
<button id="save_changes"> Save changes </button>
will generate a span named “button#save_changes click”.
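The naming rule above can be sketched as a small function. This is an illustration under stated assumptions, not OC Web's actual implementation: the function name is made up, and the handling of elements without an ID is a guess.

```javascript
// Derives a root span name for an interaction on a DOM element.
// `element` needs getAttribute(), tagName and id, as DOM elements provide.
function interactionSpanName(element, eventType = 'click') {
  // A data-ocweb-id attribute, when present, takes precedence.
  const customName = element.getAttribute('data-ocweb-id');
  if (customName) return customName;
  // Otherwise combine the tag name, the element ID (if any) and the event.
  const idSuffix = element.id ? `#${element.id}` : '';
  return `${element.tagName.toLowerCase()}${idSuffix} ${eventType}`;
}

// Browser usage:
//   document.addEventListener('click', (event) => {
//     console.log(interactionSpanName(event.target));
//   });
```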

Automatic tracing for route transitions

OC Web traces route transitions between the different sections of your page by monkey-patching the History API. OC Web will name these interactions with the pattern ‘Navigation /path/to/page’. The following screenshot of a trace exported to Stackdriver from the user interaction example shows a Navigation trace which includes several network calls before the route transition is complete:
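As a sketch of the monkey-patching technique mentioned above, the following code wraps pushState on a history object and reports a name in the ‘Navigation /path/to/page’ pattern. The function name, the callback, and the baseUrl parameter are assumptions for illustration; resolving every URL against a fixed base is a simplification, and OC Web's real patching covers more than this.

```javascript
// Wraps historyObject.pushState so each route change reports an interaction name.
function traceRouteTransitions(historyObject, baseUrl, onInteraction) {
  const originalPushState = historyObject.pushState;
  historyObject.pushState = function patchedPushState(state, title, url) {
    if (url !== undefined && url !== null) {
      // Keep only the path; query string and fragment are dropped.
      const path = new URL(String(url), baseUrl).pathname;
      onInteraction(`Navigation ${path}`);
    }
    // Always delegate to the original so navigation behaves as before.
    return originalPushState.apply(this, arguments);
  };
  // Returns a function that undoes the patch.
  return function restore() {
    historyObject.pushState = originalPushState;
  };
}

// Browser usage:
//   traceRouteTransitions(window.history, window.location.origin, (name) => {
//     console.log('route transition:', name);
//   });
```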

Creating your own custom spans

OC Web allows you to instrument your web application with custom spans for tasks or code involved in a user interaction. Here is a code snippet that shows how to do this:

import { tracing } from '@opencensus/web-instrumentation-zone';

function handleClick() {
  // Start a child span of the current root span for the current interaction.
  // This must run in code that executes as part of the button's click handling.
  const childSpan = tracing.tracer.startChildSpan({
    name: 'name of your child span'
  });
  // Do some operation...
  // Finish the child span at the end of its operation.
  childSpan.end();
}

See the OC Web documentation for more details.
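One pattern worth considering on top of the snippet above is a small wrapper that always ends the child span, even when the traced operation throws. The helper below is hypothetical and not part of OC Web; it relies only on the startChildSpan() and end() calls shown above, and it suits synchronous work (for a promise-returning operation the span would end before the promise settles).

```javascript
// Runs `operation` inside a child span and guarantees that the span is ended.
// `tracer` must expose startChildSpan(), as tracing.tracer does above.
function withChildSpan(tracer, name, operation) {
  const childSpan = tracer.startChildSpan({ name });
  try {
    return operation(childSpan);
  } finally {
    // Runs on both normal return and exception.
    childSpan.end();
  }
}

// Usage inside a click handler (validateForm is a placeholder):
//   const isValid = withChildSpan(tracing.tracer, 'validate form', () => validateForm());
```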

Automatic spans for HTTP requests and Browser performance data

OC Web automatically intercepts and generates spans for HTTP requests generated by user interactions. Additionally, OC Web attaches Trace Context Headers to each intercepted HTTP request, using the W3C Trace Context format. This is only done for same-origin requests or requests that match a provided regex.

If your servers are also instrumented with OpenCensus, these requests will continue to be traced throughout your backend services! This lets you know if the issues are related to either the front-end or the server-side.
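The propagation rule described above (same-origin requests, or requests that match a provided regex) can be sketched like this. The function and parameter names are assumptions for illustration and do not reflect OC Web's configuration API.

```javascript
// Decides whether a trace context header should be attached to a request.
// `allowedUrls` is an optional RegExp supplied by the application.
function shouldPropagateTraceContext(requestUrl, pageOrigin, allowedUrls) {
  // Relative URLs are resolved against the page origin.
  const target = new URL(requestUrl, pageOrigin);
  if (target.origin === pageOrigin) return true;
  return allowedUrls instanceof RegExp && allowedUrls.test(target.href);
}

// Returns a copy of `headers` with traceparent added when propagation is allowed.
function withTraceContext(headers, requestUrl, pageOrigin, allowedUrls, traceparent) {
  if (!shouldPropagateTraceContext(requestUrl, pageOrigin, allowedUrls)) return headers;
  return { ...headers, traceparent };
}
```

Restricting propagation this way keeps trace identifiers from leaking to unrelated third-party hosts, and avoids triggering CORS preflight requests for cross-origin calls that would not otherwise need one.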

OC Web also includes Performance API data to make annotations like domainLookupStart and responseEnd and generates spans for any CORS preflight requests.

The next screenshot shows a trace exported to Stackdriver as a result of the user interaction example. There, you can see several network calls with automatically generated spans (e.g. ‘Sent./sleep’) and their annotations, the server-side spans (e.g. ‘/sleep’ and ‘ocweb.handlerequest’), and the spans related to CORS preflight requests:

Relate user interactions back to the initial page load tracing

OC Web attaches the initial page load trace ID to the user interactions as an attribute and a span link. This lets you find the initial load trace and its interaction traces with a single attribute query, and helps you understand the whole navigation of a user through the application for a given page load.
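A minimal sketch of this linking, assuming a simplified span object: the attribute name initial_load_trace_id is the one used in this section's search example, while the function names and the span shape are made up for illustration.

```javascript
// Returns a copy of an interaction span tagged with the initial load trace,
// both as an attribute (for searching) and as a span link (for navigation).
function linkToInitialLoad(span, initialLoadTraceId, initialLoadSpanId) {
  return {
    ...span,
    attributes: { ...span.attributes, initial_load_trace_id: initialLoadTraceId },
    links: [...(span.links || []), { traceId: initialLoadTraceId, spanId: initialLoadSpanId }],
  };
}

// Mimics the attribute search a tracing backend would run.
function interactionsForPageLoad(spans, initialLoadTraceId) {
  return spans.filter(
    (span) => span.attributes && span.attributes.initial_load_trace_id === initialLoadTraceId
  );
}
```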

The next screenshot shows a search by initial_load_trace_id attribute containing all user interaction traces after the initial page loaded:


Making it Real

With OC Web and a few lines of instrumentation, you can now export distributed traces from your web application. Start exploring the initial load and user interaction examples. You're welcome to poke around the source code and send us feedback, either via Gitter or by contributing pull requests!

By Cristian González – OpenCensus Team – Software Engineering intern at Google Summer 2019 and student of Computer and Systems Engineering at Universidad Nacional de Colombia.

Special thanks to Dave Raffensperger for being the initial creator of OC Web and for guiding me through the process of developing it.

OpenTelemetry: The Merger of OpenCensus and OpenTracing

We’ve talked about OpenCensus a lot over the past few years, from the project’s initial announcement, roots at Google and partners (Microsoft, Dynatrace) joining the project, to new functionality that we’re continually adding. The project has grown beyond our expectations and now sports a mature ecosystem with Google, Microsoft, Omnition, Postmates, and Dynatrace making major investments, and a broad base of community contributors.

We recently announced that OpenCensus and OpenTracing are merging into a single project, now called OpenTelemetry, which brings together the best of both projects and has a frictionless migration experience. We’ve made a lot of progress so far: we’ve established a governance committee, a Java prototype API + implementation, workgroups for each language, and an aggressive implementation schedule.

Today we’re highlighting the combined project at the keynote of Kubecon and announcing that OpenTelemetry is now officially part of the Cloud Native Computing Foundation! Full details are available in the CNCF’s official blog post, which we’ve copied below:

A Brief History of OpenTelemetry (So Far)

After many months of planning, discussion, prototyping, more discussion, and more planning, OpenTracing and OpenCensus are merging to form OpenTelemetry, which is now a CNCF sandbox project. The seed governance committee is composed of representatives from Google, Lightstep, Microsoft, and Uber, and more organizations are getting involved every day.

And we couldn't be happier about it – here’s why.

Observability, Outputs, and High-Quality Telemetry

Observability is a fashionable word with some admirably nerdy and academic origins. In control theory, “observability” measures how well we can understand the internals of a given system using only its external outputs. If you’ve ever deployed or operated a modern, microservice-based software application, you have no doubt struggled to understand its performance and behavior, and that’s because those “outputs” are usually meager at best. We can’t understand a complex system if it’s a black box. And the only way to light up those black boxes is with high-quality telemetry: distributed traces, metrics, logs, and more.

So how can we get our hands – and our tools – on precise, low-overhead telemetry from the entirety of a modern software stack? One way would be to carefully instrument every microservice, piece by piece, and layer by layer. This would literally work, but it’s also a complete non-starter – we’d spend as much time on the measurement as we would on the software itself! We need telemetry as a built-in feature of our services.

The OpenTelemetry project is designed to make this vision a reality for our industry, but before we describe it in more detail, we should first cover the history and context around OpenTracing and OpenCensus.

OpenTracing and OpenCensus

In practice, there are several flavors (or “verticals” in the diagram) of telemetry data, and then several integration points (or “layers” in the diagram) available for each. Broadly, the cloud-native telemetry landscape is dominated by distributed traces, timeseries metrics, and logs; and end-users typically integrate with a thin instrumentation API or via straightforward structured data formats that describe those traces, metrics, or logs.



For several years now, there has been a well-recognized need for industry-wide collaboration in order to amortize the shared cost of software instrumentation. OpenTracing and OpenCensus have led the way in that effort, and while each project made different architectural choices, the biggest problem with either project has been the fact that there were two of them. And, further, that the two projects weren’t working together and striving for mutual compatibility.

Having two similar-yet-not-identical projects out in the world created confusion and uncertainty for developers, and that made it harder for both efforts to realize their shared mission: built-in, high-quality telemetry for all.

Getting to One Project

If there’s a single thing to understand about OpenTelemetry, it’s that the leadership from OpenTracing and OpenCensus are co-committed to migrating their respective communities to this single and unified initiative. Although all of us have numerous ideas about how we could boil the ocean and start from scratch, we are resisting those impulses and focusing instead on preparing our communities for a successful transition; our priorities for the merger are clear:
  • Straightforward backwards compatibility with both OpenTracing and OpenCensus (via software bridges)
  • Minimizing the time where OpenTelemetry, OpenTracing, and OpenCensus are being co-developed: we plan to put OpenTracing and OpenCensus into “readonly mode” before the end of 2019.
  • And, again, to simplify and standardize the telemetry solutions available to developers.
In many ways, it’s most accurate to think of OpenTelemetry as the next major version of both OpenTracing and OpenCensus. Like any version upgrade, we will try to make it easy for both new and existing end-users, but we recognize that the main benefit to the ecosystem is the consolidation itself – not some specific and shiny new feature – and we are prioritizing our own efforts accordingly.

How you can help

OpenTelemetry’s timeline is an aggressive one. While we have many open-source and vendor-licensed observability solutions providing guidance, we will always want as many end-users involved as possible. The single most valuable thing any end-user can do is also one of the easiest: check out the actual work we’re doing and provide feedback via GitHub, Gitter, email, or whatever feels easiest.

Of course we also welcome code contributions to OpenTelemetry itself, code contributions that add OpenTelemetry support to existing software projects, documentation, blog posts, and the rest of it. If you’re interested, you can sign up to join the integration effort by filling in this form.

By Ben Sigelman, co-creator of OpenTracing and member of the OpenTelemetry governing committee, and Morgan McLean, Product Manager for OpenCensus at Google since the project’s inception