Release uses Datadog internally to monitor our Kubernetes clusters and applications, and we think you should too. This guide covers useful information you can use to gain insight and visibility into your cluster or clusters. You can even send yourself alerts via email, Slack, or other Datadog-supported channels.
If you do not see the metrics described here, verify that the integrations are installed in your Datadog account and that your cluster is reporting correctly, or contact us and we'd be more than happy to help.
Creating the Monitor
Selecting a Metric
You will first want to find a metric that is interesting or that you would like to alert on. For example, you may notice that your application was acting up and that the problem traced back to failing pod health checks. You decide that you want to know when something like this happens again. Investigating the metrics, you find an interesting rise in the Kubernetes Pods dashboard under the metric called "Pods in Bad State (Not Ready) by Namespace".
This seems like a perfect metric to alert on. Keep in mind this is just an example; you might want to build an alert from a combination of events, metrics, and logs instead. But this metric will be a good starting point for now. Notice that this metric is broken down by "pods" and "namespaces" (which is how your application environments and services are separated in our UI).
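As a rough sketch, a dashboard widget like this is typically backed by a Datadog metric query over `kubernetes_state.pod.status_phase`. The exact query behind the dashboard may differ; the `phase` tag value and the `kube_namespace` grouping below are assumptions for illustration:

```python
def bad_pod_query(phase: str = "failed", group_by: str = "kube_namespace") -> str:
    """Build a Datadog metric query string for pods in a given status phase,
    grouped by namespace. The tag names here are assumed for illustration."""
    return f"sum:kubernetes_state.pod.status_phase{{phase:{phase}}} by {{{group_by}}}"

print(bad_pod_query())
# sum:kubernetes_state.pod.status_phase{phase:failed} by {kube_namespace}
```

You can paste a query like this into the dashboard's graph editor to experiment with different phases (for example `pending`) before committing to an alert.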
Creating a Monitor
The next part requires you to zero in on the metric and decide how to respond. In the left pane, select Monitors -> New Monitor as shown:
Create a new monitor in Datadog
Next, select a monitor type based on the metric you are interested in.
Select a Metric type to create the monitor
For this example we're keeping it simple with a threshold alert based on the "kubernetes_state.pod.status_phase" metric we saw above (if you are not sure which metric to choose, start typing metric names you see in the dashboard and experiment!). We've marked the steps in red to show you our example.
Choose a threshold alert and a metric to alert on
Setting the Threshold
Fill in the rest of the fields as shown below. We think an interesting alert here is one that fires when the pod phase status has been "failed" for more than 15 minutes. You can adjust the threshold to whatever you prefer and exclude clusters or namespaces you are not interested in.
Enter fields for your monitor trigger, threshold, and metrics
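If you later want to manage this monitor as code, the same configuration can be expressed as a payload for Datadog's Monitors API (`POST /api/v1/monitor`). This is a minimal sketch, not a definitive definition: the monitor name, message, notification handle, and exact query form below are assumptions mirroring the 15-minute "failed" threshold chosen above.

```python
def build_failed_pod_monitor(threshold: float = 0.0) -> dict:
    """Build a Datadog monitor definition (as a plain dict) that alerts when
    pods sit in the 'failed' phase for more than 15 minutes, per namespace.
    Name, message, and handle are placeholders for illustration."""
    query = (
        "sum(last_15m):sum:kubernetes_state.pod.status_phase"
        f"{{phase:failed}} by {{kube_namespace}} > {threshold}"
    )
    return {
        "type": "metric alert",
        "name": "Pods in failed phase by namespace",  # placeholder name
        "query": query,
        "message": "Pods have been in the failed phase for 15+ minutes. @slack-alerts",
        "options": {"thresholds": {"critical": threshold}},
    }

monitor = build_failed_pod_monitor()
print(monitor["query"])
```

Sending this payload with your API and application keys would create the same monitor you just built in the UI, which makes it easy to version-control alerts alongside your cluster configuration.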