
Traces

Understanding why a distributed system is slow is a hard problem. A single request might trigger dozens of requests behind the scenes across a large number of discrete components.

Traces are an extraordinarily useful tool for this problem, and with OpenTelemetry they are now easy to implement in a vendor-agnostic way.

What Is a Trace

The core idea of a trace is quite simple:

  • A trace is a collection of one or more spans
  • A span is data about an event that's sent across the wire in a machine-readable format

Each span must have at least the following fields:

  • ID: ID of the span
  • Trace ID: ID of the overall trace
  • Parent ID: Optional ID of a parent span
  • Timestamp: When this span started
  • Duration: How long the span lasted
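
To make this concrete, here is a rough sketch of what a single span might look like as a plain object. The field names and values are illustrative only, not any particular wire format:

const span = {
  id: "b7ad6b7169203331",                       // ID of the span
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",  // ID of the overall trace
  parentId: "00f067aa0ba902b7",                 // omitted on the root span
  timestamp: "2023-09-04T12:00:00.000Z",        // when the span started
  durationMs: 180,                              // how long the span lasted
};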

The parent ID is where the magic happens, since it means that these spans can be nested in one another. The most common way traces are visualized is with a waterfall view:

#+begin_src mermaid :file "waterfall.svg" :pupeteer-config-file "~/.emacs.d/pupeteer-config.json" :mermaid-config-file "~/.emacs.d/mermaid-config.json" :background-color "transparent" gantt dateFormat YYYY-MM-DD axisFormat %M tickInterval 1minute POST /dog :parent, 2023-09-04, 5m Authenticate :authn, 2023-09-04, 1m Authorize :authz, after authn, 1m Insert into dogs :after authz, 3m #+end_src #+RESULTS: [[file:waterfall.svg]]

An example of a waterfall view

POST /dog is the root span, which doesn't have a parent ID. It encompasses all the work that was done to service that request. Below it are three child spans representing the three requests the back-end service made while handling it:

  • Authenticate: A call to an authentication service
  • Authorize: A call to an authorization service
  • Insert into dogs: A call made to a relational database

These three calls were made serially. You can quickly intuit that most of the time was spent in the relational database. At the per-request level, waterfall views are a wonderful tool for visualizing performance.

Adding Tracing to a Service

Adding tracing to a service has gotten a lot easier thanks to OpenTelemetry. By adding its libraries to your back-end service you can quickly achieve a baseline level of vendor-neutral tracing support.

To get started you add a tracer to your project:

// The import ordering is the way it is because of this bug:
// https://github.com/open-telemetry/opentelemetry-js/issues/3796

import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
const httpInstrumentation = new HttpInstrumentation();

import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import {
  AlwaysOnSampler,
  ConsoleSpanExporter,
  SimpleSpanProcessor,
} from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

export const provider = new NodeTracerProvider({
  sampler: new AlwaysOnSampler(),
});

// Span processors

if (process.env.PROD_TRACE_EXPORT) {
  provider.addSpanProcessor(new SimpleSpanProcessor(new OTLPTraceExporter()));
}

if (process.env.DEBUG_TRACE_EXPORT) {
  provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
}

provider.register();

// Instrumentations

registerInstrumentations({
  instrumentations: [httpInstrumentation],
});

The first thing this code does is instantiate a provider from the OpenTelemetry SDK for Node.js, which takes care of all the low-level plumbing of adding tracing support to your runtime. From there you have to configure two important parts.

The first is the instrumentations. These are libraries that map to a dependency your back-end service has. If your back-end service uses Redis, say, you would install the @opentelemetry/instrumentation-ioredis package, and your service would then begin collecting spans for Redis calls automatically.
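
Registering it is just another entry in the instrumentations array passed to registerInstrumentations. A sketch, alongside the HTTP instrumentation from the tracer file above:

import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { IORedisInstrumentation } from "@opentelemetry/instrumentation-ioredis";

// Each instrumentation patches its library so spans are created automatically.
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new IORedisInstrumentation()],
});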

The second is the span processors, which determine where the spans get written. In this case spans are written to the console if the DEBUG_TRACE_EXPORT environment variable is present, and exported over gRPC if the PROD_TRACE_EXPORT environment variable is present.
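
SimpleSpanProcessor exports each span as soon as it ends, which is convenient for debugging but chatty under load. The SDK also ships a BatchSpanProcessor you could swap in for the production branch; a minimal sketch:

import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

// Buffers finished spans in memory and exports them in batches over gRPC.
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter()));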

This is not quite enough to collect spans, but the rest depends on your framework. We'll take Fastify as an example here. We need to add a library that integrates OpenTelemetry with Fastify: @autotelic/fastify-opentelemetry. The initialization code needed would be:

// Order matters.
import "dotenv/config";
import "./tracer";
import Fastify from "fastify";
import openTelemetryPlugin from "@autotelic/fastify-opentelemetry";

const server = Fastify();

// Configure Fastify

await server.register(openTelemetryPlugin, { wrapRoutes: true });

Note the import of the tracer just after dotenv. Your back-end service is now collecting spans that can be sent to a large number of observability vendors.
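
Inside a route handler you can also reach the active span through the plugin. A rough sketch, assuming the plugin's request.openTelemetry() request decorator and a hypothetical /dogs/:id route:

server.get("/dogs/:id", async (request) => {
  // The plugin decorates each request with an openTelemetry() helper
  // exposing, among other things, the active span for that request.
  const { activeSpan } = request.openTelemetry();
  activeSpan?.setAttribute("dog.id", (request.params as { id: string }).id);
  return { status: "ok" };
});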

Telemetry Pipelines

A telemetry pipeline is an intermediary between services that write spans, and the ultimate span destinations (i.e. an observability vendor). They will increase the reliability and performance of your overall tracing implementation with retries and batching.

Right now, when PROD_TRACE_EXPORT is set, the spans are written out over gRPC. This works with a basic telemetry pipeline: the OpenTelemetry Collector. You run the Collector's Docker image as a sidecar container alongside your back-end service and write spans to that sidecar, which in turn writes them to their ultimate destination(s).

#+begin_src mermaid :file "pipeline.svg" :pupeteer-config-file "~/.emacs.d/pupeteer-config.json" :mermaid-config-file "~/.emacs.d/mermaid-config.json" :background-color "transparent" flowchart LR OTC["OpenTelemetry Collector Container"]-->CT["Cloud Trace"] subgraph "Cloud Run" API["Backend Service Container"]-->OTC["OpenTelemetry Collector Container"] end #+end_src #+RESULTS: [[file:pipeline.svg]]

Primitive telemetry pipeline

In this case we are running our back-end service in Cloud Run. The Cloud Run instance has a sidecar collector image that the service writes spans to. The sidecar collector then writes the spans to Cloud Trace.
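
On the service side, the only thing that has to point at the sidecar is the OTLP exporter. The gRPC exporter targets the Collector's default local endpoint out of the box, but you can be explicit about it; a sketch of what the PROD_TRACE_EXPORT branch from the tracer file could look like (the URL assumes a sidecar listening on the default port):

if (process.env.PROD_TRACE_EXPORT) {
  provider.addSpanProcessor(
    new SimpleSpanProcessor(
      // 4317 is the Collector's default OTLP/gRPC port; adjust the URL if
      // your sidecar is configured differently.
      new OTLPTraceExporter({ url: "http://localhost:4317" })
    )
  );
}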

Exporting Your Traces

The real magic happens when you start to write queries against your traces. Many observability vendors, like Datadog or Honeycomb, support this out of the box. With Cloud Trace you need to export the traces to BigQuery.

The reasons this is so important are out of scope for this blog post, but at a high level:

  • It's easier to track down what caused errors or performance regressions
  • You can craft SLOs against your traces

Custom Traces and Attributes

You are not limited to the spans that come out of the box from the SDK and instrumentations: you can (and should) add your own. For example:

import opentelemetry from "@opentelemetry/api";

// The tracer name identifies your service or library.
const tracer = opentelemetry.trace.getTracer("my-service");
tracer.startActiveSpan("my span", async (span) => {
  // Do stuff
  span.end();
});
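
In real code it's worth ending the span in a finally block and recording failures, so the span still closes, and shows up as errored, when the work throws. A sketch using the status helpers from @opentelemetry/api:

import opentelemetry, { SpanStatusCode } from "@opentelemetry/api";

const tracer = opentelemetry.trace.getTracer("my-service");

await tracer.startActiveSpan("insert dog", async (span) => {
  try {
    // Do the actual work here
  } catch (err) {
    // Mark the span as failed and attach the exception details.
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
});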

The fields on spans are called attributes, and in general you want a lot of them. Ones with a high degree of cardinality (think uniqueness) like user ID are particularly desirable. These will make tracking down issues much faster. You can add custom attributes like this:

const activeSpan = opentelemetry.trace.getActiveSpan();
// getActiveSpan() returns undefined when no span is active.
activeSpan?.setAttribute("test", "1234");

Conclusion

Tracing is easier to add to your back-end service than it has ever been, and by adding it you will gain much better observability into your systems.