# The Dapr observability building block
Modern distributed systems are complex. You start with small, loosely coupled, independently deployable services. These services cross process and server boundaries. They then consume different kinds of infrastructure backing services (databases, message brokers, key vaults). Finally, these disparate pieces compose together to form an application.
With so many separate, moving parts, how do you make sense of what is going on? Unfortunately, monitoring approaches from the past aren't enough. Instead, the system must be observable from end to end. Modern observability practices provide visibility and insight into the health of the application at all times. They enable you to infer the internal state of the system by observing its output. Observability is mandatory for monitoring and troubleshooting distributed applications.
The system information used to gain observability is referred to as telemetry. It can be divided into four broad categories:
- Distributed tracing provides insight into the traffic between the services involved in distributed transactions.
- Metrics provide insight into the performance of a service and its resource consumption.
- Logging provides insight into how the code is executing and if errors have occurred.
- Health endpoints provide insight into the availability of a service.
The depth of telemetry is determined by the observability features of an application platform. Consider the Azure cloud. It provides a rich telemetry experience that includes all of the telemetry categories. Without any configuration, most Azure IaaS and PaaS services propagate and publish telemetry to the Azure Application Insights service. Application Insights presents system logging, tracing, and problem areas with highly visual dashboards. It can even render a diagram showing the dependencies between services based on their communication.
However, what if an application can't use Azure PaaS and IaaS resources? Is it still possible to take advantage of the rich telemetry experience of Application Insights? The answer is yes. A non-Azure application can import libraries, add configuration, and instrument code to emit telemetry to Azure Application Insights. However, this approach tightly couples the application to Application Insights. Moving the app to a different monitoring platform could involve expensive refactoring. Wouldn't it be great to avoid tight coupling and consume observability outside of the code?
With Dapr, you can. Let's look at how Dapr can add observability to our distributed applications.
## What it solves
The Dapr observability building block decouples observability from the application. It automatically captures traffic generated by Dapr sidecars and Dapr system services that make up the Dapr control plane. The block correlates traffic from a single operation that spans multiple services. It also exposes performance metrics, resource utilization, and the health of the system. Telemetry is published in open-standard formats enabling information to be fed into your monitoring back end of choice. There, the information can be visualized, queried, and analyzed.
As Dapr abstracts away the plumbing, the application is unaware of how observability is implemented. There's no need to reference libraries or implement custom instrumentation code. Dapr allows the developer to focus on building business logic, not observability plumbing. Observability is configured at the Dapr level and is consistent across services, even when services are created by different teams and built with different technology stacks.
## How it works
Dapr's sidecar architecture enables built-in observability features. As services communicate, Dapr sidecars intercept the traffic and extract tracing, metrics, and logging information. Telemetry is published in an open-standards format. By default, Dapr supports OpenTelemetry and Zipkin.
Dapr provides collectors that can publish telemetry to different back-end monitoring tools. These tools present Dapr telemetry for analysis and querying. Figure 9-1 shows the Dapr observability architecture:
Figure 9-1. Dapr observability architecture.
- Service A calls an operation on Service B. The call is routed from a Dapr sidecar for Service A to a sidecar for Service B.
- When Service B completes the operation, a response is sent back to Service A through the Dapr sidecars. They gather and publish all available telemetry for every request and response.
- The configured collector ingests the telemetry and sends it to the monitoring back end.
As a developer, keep in mind that adding observability is different from configuring other Dapr building blocks, like pub/sub or state management. Instead of referencing a building block, you add a collector and a monitoring back end. Figure 9-1 shows it's possible to configure multiple collectors that integrate with different monitoring back ends.
At the beginning of this chapter, four categories of telemetry were identified. The following sections provide detail for each category. They include instructions on how to configure collectors that integrate with popular monitoring back ends.
## Distributed tracing
Distributed tracing provides insight into the traffic that flows across services in a distributed application. The log of exchanged request and response messages is an invaluable source of information for troubleshooting issues. The hard part is correlating messages that originate from the same operation.
Dapr uses the W3C Trace Context to correlate related messages. It injects the same context information into requests and responses that form a unique operation. Figure 9-2 shows how correlation works:
Figure 9-2. W3C Trace Context example.
- Service A invokes an operation on Service B. As Service A starts the call, Dapr creates a unique trace context and injects it into the request.
- Service B receives the request and invokes an operation on Service C. Dapr detects that the incoming request contains a trace context and propagates it by injecting it into the outgoing request to Service C.
- Service C receives the request and handles it. Dapr detects that the incoming request contains a trace context and propagates it by injecting it into the outgoing response back to Service B.
- Service B receives the response and handles it. It then creates a new response and propagates the trace context by injecting it into the outgoing response back to Service A.
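In practice, this trace context travels in HTTP headers defined by the W3C Trace Context specification. Here's a minimal sketch of the `traceparent` header (the IDs shown are illustrative):

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```

The four dash-separated fields are the version, the trace ID shared by every request in the operation, the ID of the parent span, and trace flags (such as whether the trace is sampled).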
A set of requests and responses that belong together is called a trace. Figure 9-3 shows a trace:
Figure 9-3. Traces and spans.
In the figure, note how the trace represents a unique application transaction that takes place across many services. A trace is a collection of spans. Each span represents a single operation or unit of work done within the trace. Spans are the requests and responses that are sent between services that implement the unique transaction.
The next sections discuss how to inspect tracing telemetry by publishing it to a monitoring back end.
### Use a Zipkin monitoring back end
Zipkin is an open-source distributed tracing system. It can ingest and visualize telemetry data. Dapr offers default support for Zipkin. The following example demonstrates how to configure Zipkin to visualize Dapr telemetry.
#### Enable and configure tracing
To start, tracing must be enabled for the Dapr runtime using a Dapr configuration file. Here's an example of a configuration file named `tracing-config.yaml`:
```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing-config
  namespace: default
spec:
  tracing:
    samplingRate: '1'
    zipkin:
      endpointAddress: 'http://zipkin.default.svc.cluster.local:9411/api/v2/spans'
```
The `samplingRate` attribute specifies the interval used for publishing traces. The value must be between `0` (tracing disabled) and `1` (every trace is published). With a value of `0.5`, for example, every other trace is published, significantly reducing published traffic. The `endpointAddress` points to an endpoint on a Zipkin server running in a Kubernetes cluster. The default port for Zipkin is `9411`. The configuration must be applied to the Kubernetes cluster using the Kubernetes CLI.
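For example:

```bash
kubectl apply -f tracing-config.yaml
```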
#### Install the Zipkin server
When installing Dapr in self-hosted mode, a Zipkin server is automatically installed and tracing is enabled in the default configuration file, located in `$HOME/.dapr/config.yaml` (or `%USERPROFILE%\.dapr\config.yaml` on Windows).
When installing Dapr on a Kubernetes cluster though, Zipkin isn't added by default. The following Kubernetes manifest file, named `zipkin.yaml`, deploys a standard Zipkin server to the cluster:
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: zipkin
  namespace: eshop
  labels:
    service: zipkin
spec:
  replicas: 1
  selector:
    matchLabels:
      service: zipkin
  template:
    metadata:
      labels:
        service: zipkin
    spec:
      containers:
        - name: zipkin
          image: openzipkin/zipkin-slim
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 9411
              protocol: TCP
---
kind: Service
apiVersion: v1
metadata:
  name: zipkin
  namespace: eshop
  labels:
    service: zipkin
spec:
  type: NodePort
  ports:
    - port: 9411
      targetPort: 9411
      nodePort: 32411
      protocol: TCP
      name: zipkin
  selector:
    service: zipkin
```
The deployment uses the standard `openzipkin/zipkin-slim` container image. The Zipkin service exposes the Zipkin web front end, which you can use to view the telemetry, on port `32411`. Use the Kubernetes CLI to apply the Zipkin manifest file to the Kubernetes cluster and deploy the Zipkin server.
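For example:

```bash
kubectl apply -f zipkin.yaml
```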
#### Configure the services to use the tracing configuration
Now everything is set up correctly to start publishing telemetry. Every Dapr sidecar that is deployed as part of the application must be instructed to emit telemetry when started. To do that, add a `dapr.io/config` annotation that references the `tracing-config` configuration to the deployment of each service. Here's an example of the eShop ordering API service's manifest file containing the annotation:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ordering-api
  namespace: eshop
  labels:
    app: eshop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eshop
  template:
    metadata:
      labels:
        app: eshop  # the template labels must match the selector above
      annotations:
        dapr.io/enabled: 'true'
        dapr.io/app-id: 'ordering-api'
        dapr.io/config: 'tracing-config'
    spec:
      containers:
        - name: ordering-api
          image: eshop/ordering.api:linux-latest
```
#### Inspect the telemetry in Zipkin
Once the application is started, the Dapr sidecars will emit telemetry to the Zipkin server. To inspect this telemetry, point a web browser to `http://localhost:32411`. You'll see the Zipkin web front end:
On the Find a trace tab, you can query traces. Pressing the RUN QUERY button without specifying any restrictions will show all the ingested traces:
Clicking the SHOW button next to a specific trace will show the details of that trace:
Each item on the details page is a span that represents a request that is part of the selected trace.
#### Inspect the dependencies between services
Because Dapr sidecars handle traffic between services, Zipkin can use the trace information to determine the dependencies between the services. To see it in action, go to the Dependencies tab on the Zipkin web page and select the button with the magnifying glass. Zipkin will show an overview of the services and their dependencies:
The animated dots on the lines between the services represent requests and move from source to destination. Red dots indicate a failed request.
### Use a Jaeger or New Relic monitoring back end
Beyond Zipkin itself, other monitoring back-end software also supports ingesting telemetry using the Zipkin format. Jaeger is an open-source tracing system created by Uber Technologies. It's used to trace transactions between distributed services and troubleshoot complex microservices environments. New Relic is a full-stack observability platform. It links relevant data from a distributed application to provide a complete picture of your system. To try them out, specify an `endpointAddress` pointing to either a Jaeger or New Relic server in the Dapr configuration file. Here's an example of a configuration file that configures Dapr to send telemetry to a Jaeger server. The URL for Jaeger is identical to the Zipkin URL; the only difference is the port on which the server runs:
```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing-config
  namespace: default
spec:
  tracing:
    samplingRate: '1'
    zipkin:
      endpointAddress: 'http://localhost:9415/api/v2/spans'
```
To try out New Relic, specify the endpoint of the New Relic API. Here's an example of a configuration file for New Relic:
```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing-config
  namespace: default
spec:
  tracing:
    samplingRate: '1'
    zipkin:
      endpointAddress: 'https://trace-api.newrelic.com/trace/v1?Api-Key=<NR-API-KEY>&Data-Format=zipkin&Data-Format-Version=2'
```
Check out the Jaeger and New Relic websites for more information on how to use them.
## Metrics
Metrics provide insight into performance and resource consumption. Under the hood, Dapr emits a wide collection of system and runtime metrics. Dapr uses Prometheus as its metrics standard. Dapr sidecars and system services expose a metrics endpoint on port `9090`. A Prometheus scraper calls this endpoint at a predefined interval to collect the metrics and sends the metric values to a monitoring back end. Figure 9-4 shows the scraping process:
Figure 9-4. Scraping Prometheus metrics.
In the figure above, each sidecar and system service exposes a metrics endpoint that listens on port 9090. The Prometheus metrics scraper captures the metrics from each endpoint and publishes the information to the monitoring back end.
### Service discovery
You might wonder how the metrics scraper knows where to collect metrics. Prometheus can integrate with discovery mechanisms built into target deployment environments. For example, when running in Kubernetes, Prometheus can integrate with the Kubernetes API to find all available Kubernetes resources running in the environment.
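As an illustrative sketch (not a Dapr default), a Prometheus scrape configuration could use Kubernetes pod discovery and keep only the pods carrying the Dapr annotation. The job name and relabeling rules below are assumptions for demonstration:

```yaml
scrape_configs:
  - job_name: 'dapr-sidecars'      # illustrative job name
    kubernetes_sd_configs:
      - role: pod                  # discover pods through the Kubernetes API
    relabel_configs:
      # Keep only pods annotated with dapr.io/enabled: 'true'.
      - source_labels: [__meta_kubernetes_pod_annotation_dapr_io_enabled]
        action: keep
        regex: 'true'
```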
### Metrics list
Dapr generates a large set of metrics for Dapr system services and its runtime. Some examples include:
| Metric | Source | Description |
|--------|--------|-------------|
| `dapr_operator_service_created_total` | System | The total number of Dapr services created by the Dapr Operator service. |
| `dapr_injector_sidecar_injection/requests_total` | System | The total number of sidecar injection requests received by the Dapr Sidecar-Injector service. |
| `dapr_placement_runtimes_total` | System | The total number of hosts reported to the Dapr Placement service. |
| `dapr_sentry_cert_sign_request_received_total` | System | The number of certificate signing requests (CSRs) received by the Dapr Sentry service. |
| `dapr_runtime_component_loaded` | Runtime | The number of successfully loaded Dapr components. |
| `dapr_grpc_io_server_completed_rpcs` | Runtime | Count of gRPC calls by method and status. |
| `dapr_http_server_request_count` | Runtime | Number of HTTP requests started in an HTTP server. |
| `dapr_http/client/sent_bytes` | Runtime | Total bytes sent in request body (not including headers) by an HTTP client. |
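As a sketch of how such metrics are consumed, a PromQL query like the following charts the per-second rate of HTTP requests handled by the sidecars over a five-minute window (the exact labels available depend on your setup):

```
rate(dapr_http_server_request_count[5m])
```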
For more information on available metrics, see the Dapr metrics documentation.
### Configure Dapr metrics
At runtime, you can disable the metrics collection endpoint by including the `--enable-metrics=false` argument in the Dapr command. Or, you can change the default port for the endpoint with the `--metrics-port 9090` argument.
You can also use a Dapr configuration file to statically enable or disable runtime metrics collection:
```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-config
  namespace: eshop
spec:
  tracing:
    samplingRate: '1'
  metric:
    enabled: false
```
### Visualize Dapr metrics
With the Prometheus scraper collecting and publishing metrics into the monitoring back end, how do you make sense of the raw data? A popular visualization tool for analyzing metrics is Grafana. With Grafana, you can create dashboards from the available metrics. Here's an example of a dashboard displaying Dapr system services metrics:
The Dapr documentation includes a tutorial for installing Prometheus and Grafana.
## Logging
Logging provides insight into what is happening with a service at runtime. When running an application, Dapr automatically emits log entries from Dapr sidecars and Dapr system services. However, logging entries instrumented in your application code aren't automatically included. To emit logging from application code, you can import a specific SDK like the OpenTelemetry SDK for .NET. Logging application code is covered later in this chapter in the section Use the Dapr .NET SDK.
### Log entry structure
Dapr emits structured logging. Each log entry has the following format:
| Field | Description | Example |
|-------|-------------|---------|
| `time` | ISO8601 formatted timestamp | `2021-01-10T14:19:31.000Z` |
| `level` | Level of the entry (`debug` \| `info` \| `warn` \| `error`) | `info` |
| `type` | Log type | `log` |
| `msg` | Log message | `metrics server started on :62408/` |
| `scope` | Logging scope | `dapr.runtime` |
| `instance` | Hostname where Dapr runs | `TSTSRV01` |
| `app_id` | Dapr app ID | `ordering-api` |
| `ver` | Dapr runtime version | `1.0.0-rc.2` |
When searching through logging entries in a troubleshooting scenario, the `time` and `level` fields are especially helpful. The `time` field orders log entries so that you can pinpoint specific time periods. When troubleshooting, log entries at the `debug` level provide more information on the behavior of the code.
### Plain text versus JSON format
By default, Dapr emits structured logging in plain-text format. Every log entry is formatted as a string containing key/value pairs. Here's an example of logging in plain text:
```text
== DAPR == time="2021-01-12T16:11:39.4669323+01:00" level=info msg="starting Dapr Runtime -- version 1.0.0-rc.2 -- commit 196483d" app_id=ordering-api instance=TSTSRV03 scope=dapr.runtime type=log ver=1.0.0-rc.2
== DAPR == time="2021-01-12T16:11:39.467933+01:00" level=info msg="log level set to: info" app_id=ordering-api instance=TSTSRV03 scope=dapr.runtime type=log ver=1.0.0-rc.2
== DAPR == time="2021-01-12T16:11:39.467933+01:00" level=info msg="metrics server started on :62408/" app_id=ordering-api instance=TSTSRV03 scope=dapr.metrics type=log ver=1.0.0-rc.2
```
While simple, this format is difficult to parse. If viewing log entries with a monitoring tool, you'll want to enable JSON formatted logging. With JSON entries, a monitoring tool can index and query individual fields. Here are the same log entries in JSON format:
```json
{"app_id": "ordering-api", "instance": "TSTSRV03", "level": "info", "msg": "starting Dapr Runtime -- version 1.0.0-rc.2 -- commit 196483d", "scope": "dapr.runtime", "time": "2021-01-12T16:11:39.4669323+01:00", "type": "log", "ver": "1.0.0-rc.2"}
{"app_id": "ordering-api", "instance": "TSTSRV03", "level": "info", "msg": "log level set to: info", "scope": "dapr.runtime", "type": "log", "time": "2021-01-12T16:11:39.467933+01:00", "ver": "1.0.0-rc.2"}
{"app_id": "ordering-api", "instance": "TSTSRV03", "level": "info", "msg": "metrics server started on :62408/", "scope": "dapr.metrics", "type": "log", "time": "2021-01-12T16:11:39.467933+01:00", "ver": "1.0.0-rc.2"}
```
To enable JSON formatting, you need to configure each Dapr sidecar. In self-hosted mode, you can specify the `--log-as-json` flag on the command line.
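For example, a sketch of a run command for the ordering API (the app ID and launch command are illustrative):

```bash
dapr run --app-id ordering-api --log-as-json -- dotnet run
```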
In Kubernetes, you can add a `dapr.io/log-as-json` annotation to each deployment for the application:
```yaml
annotations:
  dapr.io/enabled: 'true'
  dapr.io/app-id: 'ordering-api'
  dapr.io/app-port: '80'
  dapr.io/config: 'dapr-config'
  dapr.io/log-as-json: 'true'
```
When you install Dapr in a Kubernetes cluster using Helm, you can enable JSON formatted logging for all the Dapr system services:
```bash
helm repo add dapr https://dapr.github.io/helm-charts/
helm repo update
kubectl create namespace dapr-system
helm install dapr dapr/dapr --namespace dapr-system --set global.logAsJson=true
```
### Collect logs
The logs emitted by Dapr can be fed into a monitoring back end for analysis. A log collector is a component that collects logs from a system and sends them to a monitoring back end. A popular log collector is Fluentd. Check out How-To: Set up Fluentd, Elastic search and Kibana in Kubernetes in the Dapr documentation. It contains instructions for setting up Fluentd as a log collector and the ELK Stack (Elasticsearch and Kibana) as a monitoring back end.
## Health status
The health status of a service provides insight into its availability. Each Dapr sidecar exposes a health API that can be used by the hosting environment to determine the health of the sidecar. The API has one operation:
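```
GET http://localhost:3500/v1.0/healthz
```

Here, `3500` is the default port on which the Dapr sidecar exposes its HTTP API.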
The operation returns two HTTP status codes:
- 204: When the sidecar is healthy
- 500: When the sidecar isn't healthy
When running in self-hosted mode, the health API isn't automatically invoked. You can, however, invoke the API from application code or with a health monitoring tool.
When running in Kubernetes, the Dapr sidecar-injector automatically configures Kubernetes to use the health API for executing liveness probes and readiness probes.
Kubernetes uses liveness probes to determine whether a container is up and running. If a liveness probe returns a failure code, Kubernetes will assume the container is dead and automatically restart it. This feature increases the overall availability of your application.
Kubernetes uses readiness probes to determine whether a container is ready to start accepting traffic. A pod is considered ready when all of its containers are ready. Readiness determines whether a Kubernetes service can direct traffic to a pod in a load-balancing scenario. Pods that aren't ready are automatically removed from the load-balancer.
Liveness and readiness probes have several configurable parameters. Both are configured in the container spec section of a pod's manifest file. By default, Dapr uses the following configuration for each sidecar container:
```yaml
livenessProbe:
  httpGet:
    path: v1.0/healthz
    port: 3500
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: v1.0/healthz
    port: 3500
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
The following parameters are available for the probes:
- `path` specifies the Dapr health API endpoint.
- `port` specifies the Dapr health API port.
- `initialDelaySeconds` specifies the number of seconds Kubernetes will wait before it starts probing a container for the first time.
- `periodSeconds` specifies the number of seconds Kubernetes will wait between each probe.
- `timeoutSeconds` specifies the number of seconds Kubernetes will wait on a response from the API before timing out. A timeout is interpreted as a failure.
- `failureThreshold` specifies the number of failed status codes Kubernetes will accept before considering the container not alive or not ready.
## Dapr dashboard
Dapr offers a dashboard that presents status information on Dapr applications, components, and configurations. Use the Dapr CLI to start the dashboard as a web application on the local machine on port 8080:
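```bash
dapr dashboard
```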
For a Dapr application running in Kubernetes, use the following command:
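```bash
dapr dashboard -k
```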
The dashboard opens with an overview of all services in your application that have a Dapr sidecar. The following screenshot shows the Dapr dashboard for the eShopOnDapr application running in Kubernetes:
The Dapr dashboard is invaluable when troubleshooting a Dapr application. It provides information about Dapr sidecars and system services. You can drill down into the configuration of each service, including the logging entries.
The dashboard also shows the configured components (and their configuration) for your application:
There's a large amount of information available through the dashboard. You can discover it by running a Dapr application and browsing the dashboard. You can use the accompanying eShopOnDapr application to start.
Check out the Dapr dashboard CLI command reference in the Dapr docs for more information on the Dapr dashboard commands.
## Use the Dapr .NET SDK
The Dapr .NET SDK doesn't contain any specific observability features. All observability features are offered at the Dapr level.
If you want to emit telemetry from your .NET application code, consider the OpenTelemetry SDK for .NET. The OpenTelemetry project is cross-platform, open source, and vendor agnostic. It provides an end-to-end implementation to generate, emit, collect, process, and export telemetry data. There's a single instrumentation library per language that supports automatic and manual instrumentation. Telemetry is published using the OpenTelemetry standard. The project has broad industry support and adoption from cloud providers, vendors, and end users.
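As a minimal sketch (assuming the OpenTelemetry and OpenTelemetry.Exporter.Zipkin NuGet packages, with an illustrative source name), application code can emit a custom span to the same Zipkin back end that ingests the Dapr telemetry:

```csharp
using System;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Configure a tracer that exports spans in Zipkin format.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("eShop.OrderingApi") // illustrative source name
    .AddZipkinExporter(options =>
        options.Endpoint = new Uri("http://localhost:9411/api/v2/spans"))
    .Build();

// Emit a custom span from application code.
var activitySource = new ActivitySource("eShop.OrderingApi");
using (var activity = activitySource.StartActivity("ProcessOrder"))
{
    activity?.SetTag("order.id", 42);
}
```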
## Reference application: eShopOnDapr
Observability in the accompanying eShopOnDapr reference application consists of several parts. Telemetry from all of the sidecars is captured. Additionally, there are other observability features inherited from the earlier eShopOnContainers sample.
### Custom health dashboard
The WebStatus project in eShopOnDapr is a custom health dashboard that gives insight into the health of the eShop services. This dashboard doesn't use the Dapr health API; instead, it uses the built-in health checks mechanism of ASP.NET Core. The dashboard not only provides the health status of the services, but also the health of their dependencies. For example, a service that uses a database also reports the health status of that database, as shown in the following screenshot:
### Seq log aggregator
Seq is a popular log aggregation server used in eShopOnDapr. Seq ingests logging from the application services, but not from the Dapr system services or sidecars. Seq indexes the application logging and offers a web front end for analyzing and querying the logs. It also offers functionality for building monitoring dashboards.
The eShopOnDapr application services emit structured logging using the Serilog logging library. Serilog publishes log events to a construct called a sink. A sink is simply a target platform to which Serilog writes its logging events. Many Serilog sinks are available, including one for Seq, which is the sink eShopOnDapr uses.
### Application Insights
eShopOnDapr services also send telemetry directly to Azure Application Insights using the Microsoft Application Insights SDK for .NET Core. For more information, see Azure Application Insights for ASP.NET Core applications in the Microsoft docs.
## Summary
Good observability is crucial when running a distributed system in production.
Dapr provides different types of telemetry, including distributed tracing, logging, metrics, and health status.
Dapr only produces telemetry for the Dapr system services and sidecars. Telemetry from your application code isn't automatically included. You can, however, use a specific SDK like the OpenTelemetry SDK for .NET to emit telemetry from your application code.
Dapr telemetry is produced in an open-standards-based format so that it can be ingested by a large set of available monitoring tools. Some examples are Zipkin, Azure Application Insights, the ELK Stack, New Relic, and Grafana. See Monitor your application with Dapr in the Dapr documentation for tutorials on how to monitor your Dapr applications with specific monitoring back ends.
You'll need a telemetry scraper that ingests telemetry and publishes it to the monitoring back end.
Dapr can be configured to emit structured logging. Structured logging is favored as it can be indexed by back-end monitoring tools. Indexed logging enables users to execute rich queries when searching through the logging.
Dapr offers a dashboard that presents information about the Dapr services and configuration.