Monitoring your clusters

In this guide, we'll look at how you can use Datadog to collect metrics and logs, monitor clusters, and receive alerts for events in your Release environments.

Observability is a crucial part of modern DevOps and software development. By collecting logs and stack traces in a central place, developers can keep an eye on how their distributed systems function. Having access to logs and system metrics in a single view can speed up debugging, help identify bottlenecks or performance issues, and alert developers about possible security and scaling issues before they happen.

At Release, we use Datadog to monitor our systems. Datadog automatically collects system metrics, monitors selected metrics, and can be configured to send alerts for events.

Datadog is especially suited for collecting metrics from Kubernetes clusters, which is why we recommend adding Datadog monitoring to your Release clusters.

Setting up your Datadog integration

To get started, you'll need a Datadog account, and you'll need to activate the Datadog agent on your Kubernetes clusters in Release.

Set up a Datadog account

If you don't have a Datadog account yet, you can follow our guide on setting up a new Datadog account. Keep in mind that you'll need to use the same region for Datadog (for example, US1) that you use for your Release clusters (for example, us-east-1).

Enable the Datadog agent in Release

If you haven't set up your integration yet, follow our guide on integrating Datadog in Release. Make sure to enable the Datadog agent on any new clusters you create in Release.

How the Datadog agent works in Release

When you activate the Datadog agent on a Release cluster, Release installs and starts a Datadog cluster agent to record cluster metrics, and runs a Datadog agent on each node in your cluster.

Release automatically configures your Datadog agents with the API key you provided, and sets the Datadog site to the appropriate region based on your cluster's region.

Metrics tracked by Datadog

Release configures the Datadog agents running on your nodes to collect all standard system metrics (such as CPU usage, memory usage, and disk usage). Node agents also collect any logs that your containers write to stdout or stderr.

The Datadog cluster agent records metrics about the entire cluster, such as the number of active pods, Kubernetes network throughput, dropped packets, etc.

The best way to get a full view of all the metrics recorded is to log in to Datadog and use the metrics explorer.

Setting up a monitor with alerts in Datadog

Before we get started, make sure you're logged in to your Datadog account.

Decide what to monitor by viewing a dashboard

If you don't have a specific statistic in mind yet, or to get an overview of how metrics can be combined to form meaningful statistics, start by viewing the dashboards that Datadog creates automatically.

For instance, if you are interested in monitoring the state of Kubernetes pods, navigate to the Kubernetes Pods Overview dashboard.

Dashboards created by Datadog have filters to help you view only relevant statistics. On the Kubernetes Pods Overview dashboard, click the namespace dropdown.

Here you'll notice that Release creates a Kubernetes namespace per environment, and a separate namespace for supporting services, called kube-system.

It is useful to get an overview of everything Release runs on your cluster, but you may want to filter your statistics by environment later on.

Since we're interested in the state of pods, scroll down to the Pods section on the dashboard.

The Pods in Bad State by Namespace statistic shows pods that are not in the ready state at any given time. This would be a useful statistic to keep an eye on.

To see how this statistic is calculated:

Click on the Pods in Bad State by Namespace statistic for one of the namespaces.
Datadog displays the metric used for this statistic.
Click on the Copy icon to copy the metric to your clipboard.

In this case, the metric that shows pods in a bad state is:

exclude_null(sum:kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded})
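To make the structure of this metric easier to read, here is a minimal Python sketch that splits a query of this shape into its parts: the wrapping function, the aggregator, the metric name, and the tag filters (a leading ! marks an exclusion). The parse_metric_query helper is hypothetical and only handles simple queries like the one above; it is not part of Datadog or Release.

```python
import re

# Hypothetical helper: handles only simple queries of the form
#   function(aggregator:metric.name{filter,!excluded_filter})
QUERY_PATTERN = re.compile(
    r"^(?:(?P<function>\w+)\()?"   # optional wrapping function, e.g. exclude_null(
    r"(?P<aggregator>\w+):"        # aggregator, e.g. sum
    r"(?P<metric>[\w.]+)"          # metric name
    r"\{(?P<filters>[^}]*)\}"      # comma-separated tag filters in braces
    r"\)?$"                        # optional closing parenthesis
)


def parse_metric_query(query: str) -> dict:
    """Split a simple Datadog metric query into its parts."""
    match = QUERY_PATTERN.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported query shape: {query!r}")
    filters = [f.strip() for f in match.group("filters").split(",") if f.strip()]
    return {
        "function": match.group("function"),
        "aggregator": match.group("aggregator"),
        "metric": match.group("metric"),
        # Filters without "!" must match; filters with "!" are excluded.
        "include": [f for f in filters if not f.startswith("!")],
        "exclude": [f[1:] for f in filters if f.startswith("!")],
    }


parts = parse_metric_query(
    "exclude_null(sum:kubernetes_state.pod.ready"
    "{condition:false,!pod_phase:succeeded})"
)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Read this way, the metric sums kubernetes_state.pod.ready over pods whose ready condition is false, leaving out pods that have already completed successfully.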

Create a metric monitor

To create a monitor based on our chosen metric:

Click on monitors in the sidebar.
Click on new monitor.
Finally, click on metric.

Now you can configure your new monitor.

Under Choose the detection method, select threshold alert.
Under Define the metric, click source.
Paste the metric you copied earlier:
exclude_null(sum:kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded})
To exclude Release's kube-system namespace, add !kube_namespace:kube-system to the list of filters. The final metric should look like this:
exclude_null(sum:kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded,!kube_namespace:kube-system})
Under Set alert conditions, select the option above or equal to.
Enter the number 1 as the alert threshold.

Under Notify your team, enter a subject for this alert.
Enter an alert message. Note that you can use variables from a monitor in the monitor's alert message. Type {{ in this field to see suggested variables.
Select Datadog users or groups of users who should be notified.
Click create to save your new monitor.

Your new monitor will alert your team when the metric reaches or exceeds the alert threshold.
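If you prefer to keep monitors in code, a roughly equivalent monitor can be created through Datadog's Monitors API. The sketch below is a starting point under stated assumptions, not the way Release configures monitors: the monitor name, the last_5m evaluation window, the avg time aggregation, the @oncall@example.com recipient, and the DD_API_KEY and DD_APP_KEY environment variable names are all placeholders. The message uses Datadog's template variables, such as {{value}} and {{threshold}}.

```python
import json
import os
import urllib.request

# The metric from this guide, with the kube-system namespace excluded.
QUERY = (
    "exclude_null(sum:kubernetes_state.pod.ready"
    "{condition:false,!pod_phase:succeeded,!kube_namespace:kube-system})"
)


def build_monitor_payload(query: str, threshold: int = 1, window: str = "last_5m") -> dict:
    """Build a request body for a threshold monitor (assumed values, adjust as needed)."""
    return {
        "name": "Pods in bad state",  # placeholder subject
        "type": "query alert",
        # Alert when the value is above or equal to the threshold.
        "query": f"avg({window}):{query} >= {threshold}",
        "message": (
            "{{#is_alert}}{{value}} pods are not ready "
            "(threshold: {{threshold}}).{{/is_alert}} "
            "{{#is_recovery}}All pods are ready again.{{/is_recovery}} "
            "@oncall@example.com"  # placeholder recipient
        ),
        "options": {"thresholds": {"critical": threshold}},
    }


def create_monitor(payload: dict, api_host: str = "https://api.datadoghq.com") -> dict:
    """Send the payload to the Monitors API. Use the API host for your Datadog site."""
    request = urllib.request.Request(
        f"{api_host}/api/v1/monitor",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],          # assumed variable name
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],  # assumed variable name
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_monitor_payload(QUERY)
print(json.dumps(payload, indent=2))  # review before calling create_monitor(payload)
```

Printing the payload first lets you compare the query with what the monitor editor shows before anything is created in your account.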

Datadog's recommended monitors

Datadog offers many monitor examples that are ready to use with data collected from your clusters.

To use a recommended monitor:

Click on monitors in the sidebar.
Click new monitor.
Select recommended from the tabs at the top of the page.

Viewing Release application logs in Datadog

Metrics offer a clear indication of when unexpected events take place, while logs often provide the answer to why a metric had an unexpected value.

The Datadog agents installed by Release are configured to collect logs from your clusters and nodes automatically.

To view logs in Datadog, click on logs in the sidebar, then on search.

Logging in a Kubernetes environment can be quite verbose, so it is helpful to filter by service or container when viewing logs.
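As an illustration of this kind of filtering, the sketch below builds a log search query from a service, container, namespace, and status, and wraps it in a request body shaped for Datadog's v2 Logs Search API (POST /api/v2/logs/events/search). The helper names and example values are hypothetical, and the tag keys container_name and kube_namespace are assumptions about how your logs are tagged, so check the facets shown in your own account.

```python
from typing import Optional


def build_log_query(
    service: Optional[str] = None,
    container: Optional[str] = None,
    namespace: Optional[str] = None,
    status: Optional[str] = None,
) -> str:
    """Join the given filters into a search query; terms are combined with AND."""
    terms = []
    if service:
        terms.append(f"service:{service}")
    if container:
        terms.append(f"container_name:{container}")  # assumed tag key
    if namespace:
        terms.append(f"kube_namespace:{namespace}")  # assumed tag key
    if status:
        terms.append(f"status:{status}")
    return " ".join(terms) if terms else "*"


def build_search_payload(query: str, time_from: str = "now-15m",
                         time_to: str = "now", limit: int = 25) -> dict:
    """Wrap a query in a request body for the v2 Logs Search API."""
    return {
        "filter": {"query": query, "from": time_from, "to": time_to},
        "sort": "-timestamp",  # newest first
        "page": {"limit": limit},
    }


# "backend" is a placeholder service name.
query = build_log_query(service="backend", status="error")
print(query)
print(build_search_payload(query))
```

The same query string can be pasted into the search bar on the logs page.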

Troubleshooting

No metrics in Datadog

There might be a communication problem between the Datadog agents in your cluster and Datadog's servers, or the agent might not be installed correctly.

See our Datadog integration troubleshooting steps for more help.
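One quick check you can run yourself is whether your API key is accepted by the Datadog site you expect, since a key is only valid on the site where it was created. The sketch below calls Datadog's API key validation endpoint from your own machine. It does not test connectivity from inside the cluster. The DD_API_KEY environment variable name and the helper names are assumptions.

```python
import json
import os
import urllib.error
import urllib.request

# API hosts for common Datadog sites.
API_HOSTS = {
    "US1": "https://api.datadoghq.com",
    "US3": "https://api.us3.datadoghq.com",
    "US5": "https://api.us5.datadoghq.com",
    "EU1": "https://api.datadoghq.eu",
    "AP1": "https://api.ap1.datadoghq.com",
}


def validate_url(site: str) -> str:
    """Return the API key validation URL for a Datadog site such as 'US1'."""
    return f"{API_HOSTS[site]}/api/v1/validate"


def api_key_is_valid(api_key: str, site: str = "US1", timeout: float = 10.0) -> bool:
    """Return True if the site accepts the key, False if it rejects it."""
    request = urllib.request.Request(validate_url(site), headers={"DD-API-KEY": api_key})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return bool(json.load(response).get("valid", False))
    except urllib.error.HTTPError as err:
        if err.code == 403:  # the key is not valid on this site
            return False
        raise


if __name__ == "__main__":
    key = os.environ.get("DD_API_KEY")  # assumed variable name
    if key is None:
        print("Set DD_API_KEY to run the check.")
    else:
        print("API key accepted:", api_key_is_valid(key, site="US1"))
```

If the key is rejected on the site that matches your cluster's region, compare it with the key configured in your Datadog integration.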

Missing application logs in Datadog

If you see metrics in Datadog, such as resource usage statistics, but few or no application logs, make sure your Docker containers write all logs to stdout or stderr.

Read more about logging in Docker's guide on viewing logs for a container or service.

Of particular interest in Docker's guide are two examples of applications that log to files by default: the official nginx Docker image and the official Apache httpd Docker image. To output logs to stdout and stderr, the nginx image creates symbolic links from its log files to /dev/stdout and /dev/stderr, while the httpd image is configured to write logs directly to /proc/self/fd/1 and /proc/self/fd/2.
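For your own applications, the simplest fix is usually to log to the standard streams rather than to files. As a minimal sketch, assuming a Python application that uses the standard logging module, the configuration below sends INFO and lower to stdout and WARNING and higher to stderr. The function name, log format, and the "app" logger name are illustrative choices.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level."""

    def __init__(self, max_level: int):
        super().__init__()
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno <= self.max_level


def configure_container_logging(level: int = logging.INFO) -> logging.Logger:
    """Route root logger output to stdout and stderr instead of files."""
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    # INFO and below go to stdout.
    stdout_handler = logging.StreamHandler(sys.stdout)
    stdout_handler.addFilter(MaxLevelFilter(logging.INFO))
    stdout_handler.setFormatter(formatter)

    # WARNING and above go to stderr.
    stderr_handler = logging.StreamHandler(sys.stderr)
    stderr_handler.setLevel(logging.WARNING)
    stderr_handler.setFormatter(formatter)

    root = logging.getLogger()
    root.handlers.clear()  # drop any file handlers configured earlier
    root.setLevel(level)
    root.addHandler(stdout_handler)
    root.addHandler(stderr_handler)
    return root


configure_container_logging()
logging.getLogger("app").info("service started")
logging.getLogger("app").error("could not reach database")
```

With this in place, the first message is written to stdout and the second to stderr, where the node agents can pick them up.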

