Monitoring your clusters
When it comes to performance tuning, having the right tools to troubleshoot your applications can save valuable time and money and prevent monitoring headaches. Without them, you could spend days or even weeks diagnosing complex issues.
There is no shortage of application performance management tools to choose from, but here at ReleaseHub we use Datadog internally to monitor our Kubernetes clusters and applications, and we recommend it.
Datadog provides a 360-degree view of your infrastructure and applications and helps you quickly identify bottlenecks. You can also send yourself alerts via email, Slack, or other Datadog-supported channels.
Here is some information to help you gain insight into your clusters using Datadog.
Your first step is to identify a metric that you would like to monitor. For example, you may notice that your application has been performing poorly, and you've traced this back to pod health checks failing. You investigate the metrics in the Kubernetes Pods dashboard and find an interesting rise in "Pods in Bad State (Not Ready) by Namespace".
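Each dashboard widget like this one is backed by a metric query. As a rough sketch, the query behind "Pods in Bad State (Not Ready) by Namespace" might look something like the following; the exact tag names (such as `phase` and `kube_namespace`) vary by Datadog Agent version, so check the widget's edit view on your own dashboard:

```
sum:kubernetes_state.pod.status_phase{phase:failed} by {kube_namespace}
```

Knowing the underlying metric and tags makes it much easier to build a monitor on it in the next step.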
This seems like a good metric to set an alert for.
Notice that this metric is broken out by pods and namespaces, which is how your application environments and services are separated in the ReleaseHub UI.
Next you'll want to zero in on the metric. In the left-hand pane, select Monitor then New Monitor as shown:
Create a new monitor in Datadog
Select Metric to create a monitor based on the metric you are interested in.
Select a Metric type to create the monitor
For this example, we'll select a Threshold Alert and then define the metric. If you're not sure which metric to choose, start typing metric names you see in the dashboard and experiment. We select kubernetes_state.pod.status_phase to monitor the "Pods in Bad State (Not Ready) by Namespace" metric we identified earlier.
We've marked the steps in red to show you our example.
Choose a threshold alert and a metric to alert on
Fill in the rest of the fields as shown below. Here we're specifying that we'd like to be alerted when pods have been in the "failed" phase for more than 15 minutes. You can adjust the threshold and exclude clusters or namespaces you aren't interested in.
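In query form, the monitor configured above is roughly equivalent to something like this (a sketch only; the evaluation window, threshold, and tag names such as `phase` and `kube_namespace` are example values you should adapt to your setup):

```
min(last_15m):sum:kubernetes_state.pod.status_phase{phase:failed} by {kube_namespace} > 0
```

Using `min` over the window means the alert fires only when the failed-pod count has stayed above the threshold for the entire 15 minutes, which filters out brief transient failures.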
Enter fields for your monitor trigger, threshold, and metrics
Finally, complete the remaining fields to compose the notification message, choose who to notify, and test that the alert sends as expected.
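The notification message supports Datadog's conditional template variables and @-mentions, so you can route alerts and include the affected namespace automatically. A minimal sketch, assuming the monitor is grouped by `kube_namespace` (the Slack channel name here is a placeholder for one of your own):

```
{{#is_alert}}
Pods are in a failed state in namespace {{kube_namespace.name}}.
Check the Kubernetes Pods dashboard for details.
@slack-your-alerts-channel
{{/is_alert}}
```

The `{{#is_alert}}` block only renders when the monitor transitions into the alert state, so recovery notifications can carry a different message if you add an `{{#is_recovery}}` block.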