COURSE

Datadog - Supervision of Ubuntu servers

DIFFICULTY
Normal
APPROXIMATE TIME
1h30


Datadog


III - Ubuntu server supervision

a - Presentation

Ubuntu is an operating system based on Linux. It is designed for desktop computers, smartphones and network servers, and is developed by Canonical Ltd, a UK-based company, following open-source software development principles.

DataDog is a popular cross-platform service for monitoring servers, services, databases and tools via a data analysis platform. Users can check available RAM and free disk space, web request latency or CPU utilization on their systems.

These issues may seem insignificant, but they end up causing real problems on servers running production applications. That's why alerts are created to inform administrators whenever critical events occur. We'll see how to create monitors using the DataDog tool, and we'll also create alerts to check RAM and CPU utilization.

b - Installing DataDog on Ubuntu

We're going to generate an API key within our Datadog account and use it to install the Datadog agent on Ubuntu. In the command below, we need to replace the DD_API_KEY value with the key we've been provided:

DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX DD_SITE="us5.datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

We can click on the search field (Go to...) and search for the API Keys menu.

We can click on the API Keys menu and we'll be redirected to the page where we can manage our API keys.

We can retrieve the existing key by clicking on its record, or click on the New Key button to generate a new one. We'll use the current key: let's click on the record, then on the Copy button to copy the key we'll use later to register the Datadog agent on our machine.

We can now run the following command to register our agent:

DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=0720dfddfd929a528d558ecbcee9eac3 DD_SITE="us5.datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

Output display:

* Adding your API key to the Datadog Agent configuration: /etc/datadog-agent/datadog.yaml
* Setting SITE in the Datadog Agent configuration: /etc/datadog-agent/datadog.yaml
/usr/bin/systemctl
* Starting the Datadog Agent...

  Your Datadog Agent is running and functioning properly.
  It will continue to run in the background and submit metrics to Datadog.
  If you ever want to stop the Datadog Agent, run:

      sudo systemctl stop datadog-agent

  And to run it again run:

      sudo systemctl start datadog-agent

We can check the state of the Datadog agent to make sure it's running:

sudo systemctl status datadog-agent

Output display:

● datadog-agent.service - Datadog Agent
     Loaded: loaded (/lib/systemd/system/datadog-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-07-16 08:20:08 UTC; 1min 3s ago
   Main PID: 4025942 (agent)
      Tasks: 22 (limit: 147274)
     Memory: 84.8M
     CGroup: /system.slice/datadog-agent.service
             └─4025942 /opt/datadog-agent/bin/agent/agent run -p /opt/datadog-agent/run/agent.pid

The DataDog agent runs correctly in the background, and will continue to run. If we wish to stop the DataDog agent, we can execute the following command:

sudo systemctl stop datadog-agent

To start the agent:

sudo systemctl start datadog-agent

After creating the API key and installing the Datadog agent, we can now configure a monitor for our machine.
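Before configuring monitors, it can be worth confirming that the agent is actually collecting and forwarding metrics. The agent ships its own CLI for this; as a quick sketch (both commands assume the agent was installed as above):

```shell
# Print the installed agent version
sudo datadog-agent version

# Full health report: running checks, forwarder status, detected hostname
sudo datadog-agent status
```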

c - Datadog monitor

c.1 - Overview

Nowadays, alert monitoring is crucial for any organization running applications or microservices. Let's say we've created new Linux virtual machines or a Kubernetes cluster: if one of our nodes goes offline or CPU usage spikes, we need a way to inform our team. To monitor our entire infrastructure from a single location, we need to receive notifications when critical changes occur. Datadog allows us to create monitors that actively check our metrics, the availability of our integrations, our network or storage devices, etc.

Metric monitors allow us to define alerts for specific events and receive notifications according to thresholds we've configured beforehand. We can now configure a metric monitor that alerts us on high CPU usage, low disk space, or RAM reaching a certain usage threshold.


c.2 - Configuring a monitor for server status

We can now configure a monitor for our Datascientest machine to set up monitoring and alert elements for this server. Let's go to our Datadog dashboard and click on the Monitor menu and the new monitor sub-menu.

On the next page we will select the host menu to configure the monitor for our server.

We need to choose a host from the Pick hosts by name or tag drop-down menu. Let's choose our Datascientest server and move on to the second field, labeled Set alert conditions.

We can leave Check Alert selected. This means the monitor will track whether a host stops reporting to Datadog at any time. Let's also choose the value Host in the Trigger a separate alert for each field. We can leave the other options unchanged.

Under the Notify your team field, we're going to create a message that will be sent to the various users so that they can be notified. In the EDIT section, we'll fill in the following title:

The host {{host.name}} with IP {{host.ip}} is down.

This title includes {{host.name}} and {{host.ip}}, the template variables for the host's name and IP address.

It's also possible to click on Use Message Template Variables to view the message templates and the variables available.

In the body of the message, we'll fill in the following message:

{{#is_alert}}
Hello, the host {{host.name}} with IP {{host.ip}} is not up and running.
{{/is_alert}}

Then select @all from the drop-down menu. This option notifies every user we've added to our Datadog organization. We'll talk about adding users to our Datadog organization later. We can leave the rest of the settings as default and click on the Create button below the form.
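As an alternative to the drop-down, Datadog also resolves notification handles typed directly into the message body. As a sketch, the same conditional body with an inline @all handle (assuming the handle exists in our organization, as it does here):

```
{{#is_alert}}
The host {{host.name}} with IP {{host.ip}} is down. @all
{{/is_alert}}
```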

We can then click on the Test Notifications button to run a simulated monitoring alert and make sure the monitor is working. We can then select Alert and click on the Run Test button.

We can check our mailbox to make sure we've received the alert message if our server is not available.

We can finally save our monitor and are redirected to a page that returns our monitor details.

And that's how we create a monitoring alert in Datadog. We can now try to create different types of monitors, such as metric and audit log monitors.

c.3 - Monitor RAM usage

In order to check RAM usage, the amount available and generate an alert if the limit is exceeded, we're going to create a metric alert. Let's click on the Monitor menu and under the New Monitor sub-menu.

Let's click on the Metric menu.

We arrive on a new page where we can select several alert types. Let's select Threshold Alert and, under the Define the metric field, set the following value:

system.mem.pct_usable # usable memory as a fraction of total memory

In the from field next to it, let's choose our Datascientest server, leave avg by selected (it will return the average value of usable RAM) and choose Host in the next field.

Below, in the Set Alert Conditions part, we define the threshold at which we'll be alerted for RAM usage. In the Alert Threshold field we set the value 0.05, and in the Warning Threshold field we set the value 0.01.

Here, in Define the metric, we've chosen system.mem.pct_usable from the metrics analyzed by Datadog and selected our host. In the alert condition, we simply define that whenever usable RAM drops below 5%, Datadog should generate an alert and save the parameters. We can define messages for the various conditions in the Say what's happening field:
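Behind the form, Datadog stores each metric monitor as a single query string. Assuming a hypothetical host tag of datascientest, the configuration above corresponds roughly to:

```
avg(last_5m):avg:system.mem.pct_usable{host:datascientest} < 0.05
```

The warning threshold is not part of the query; it lives in the monitor's alert-condition options.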

  • In the Edit field, we'll add the message: Memory Load is High for host {{host.name}}

  • For the message content, we'll use conditions so we can display different messages depending on the RAM condition.

  • {{#is_alert}}The RAM is below 5%{{/is_alert}} to specify that usable RAM is below 5 percent. This message will be sent in the case of an alert.

  • {{#is_warning}}The RAM is at a warning Level {{/is_warning}} to specify that the RAM is at Warning level. This will be sent if it is a Warning type message.

  • {{#is_recovery}}The RAM looks good {{/is_recovery}} to specify that the RAM is at an acceptable level. This will be sent if it is a Recovery type message.

Let's then select @all from the drop-down menu that will notify each user we've added to our Datadog organization. We can leave the rest of the information as default and click the create button to create this new Monitor.

If the Datascientest machine is in the alert state, the message is displayed in red color. Otherwise, the color is green.

We can also click on the Monitor menu and the Manage Monitor sub-menu to check the list of monitors present on our Datadog account.

In the context of the course, the server used is precisely in a state of alert because the RAM is used beyond the alert threshold and therefore an email has been received.

The lower part of the e-mail is more explicit: we can see that over the last 5 minutes, RAM consumption has been around 96.9%.
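We can roughly cross-check this figure on the server itself. As a simplification of what the agent computes, system.mem.pct_usable is close to MemAvailable / MemTotal from /proc/meminfo:

```shell
# Approximate system.mem.pct_usable locally (a value near 0.03
# would match the ~96.9% RAM usage reported above).
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "pct_usable=%.3f\n", a/t}' /proc/meminfo
```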

c.4 - Monitoring processor usage

In order to be alerted whenever CPU usage crosses a specific threshold, we're going to create another monitor for CPU metrics. Let's go back to the monitor creation form: click on the Monitor menu and the New Monitor sub-menu.

Next, let's click on the Metric field.

We arrive on a new page where we can select several alert types. Let's select Threshold Alert and, under the Define the metric field, set the following value:

system.cpu.user # percentage of time the processor spends executing user-space processes

In the from field next to it, let's choose our Datascientest server, leave avg by selected (it will return the average CPU usage value) and choose Host in the next field.

Below, in the Set Alert Conditions part, we define the threshold at which we'll be alerted for CPU usage. In the Alert Threshold field we set the value 90, and in the Warning Threshold field we set the value 75.

Here again the alert type is "Threshold", but this time the metric is system.cpu.user. An alert will be generated when CPU utilization exceeds 90%, and a warning when it exceeds 75%. The appropriate message is displayed depending on the condition; we'll proceed as in the previous case, when alerting on RAM consumption.
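As with the RAM monitor, this form reduces to a single monitor query. With the same hypothetical datascientest host tag:

```
avg(last_5m):avg:system.cpu.user{host:datascientest} > 90
```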

We're going to create a message that will be sent to the various users so that they can be notified. In the EDIT section, we'll fill in the following message:

CPU Usage is critical on host {{host.name}} with IP {{host.ip}}.

In the body of the message, we'll fill in the following message:

{{#is_alert}} # if an alert is generated (CPU usage above 90%), display the following message
CPU usage is critical on host {{host.name}}
{{/is_alert}}
{{#is_warning}} # if a warning is generated (CPU usage above 75%), display the following message
CPU usage is above 75% on host {{host.name}}
{{/is_warning}}

Let's then select @all from the drop-down menu that will notify each user we've added to our Datadog organization.

We can leave the rest of the settings as default and click on the create button below the form.

We will now be alerted if the CPU usage of our Datascientest machine is beyond the configured thresholds.

c.5 - Processor stress and alert checking

We will now run a tool called stress which will allow us to run a load test on our processor to check whether the system is reacting as we predicted.

The stress tool is a minimal utility we can use to test our system's processor, memory and I/O. By default, Stress is not installed on most distributions. However, it is available on most official package repositories.

To install for Ubuntu-based distributions, let's use our apt package manager:

sudo apt install stress -y

Output display:

(Reading database ... 113292 files and directories currently installed.)
Preparing to unpack .../stress_1.0.4-6_amd64.deb ...
Unpacking stress (1.0.4-6) ...
Setting up stress (1.0.4-6) ...
Processing triggers for install-info (6.7.0.dfsg.2-5) ...
Processing triggers for man-db (2.9.1-1) ...

Once installation is complete, we can use Stress by specifying the resource to be tested:

stress --cpu 20

Output display:

stress: info: [278381] dispatching hogs: 20 cpu, 0 io, 0 vm, 0 hdd

In our case, we want to stress the CPU, so we passed the --cpu argument followed by the number of workers to spawn. We can press CTRL-C to kill the running stress process.
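Rather than killing the workers by hand, stress also accepts a --timeout option that stops them after a given duration. For example, to run 4 CPU workers for two minutes (long enough for the monitor's evaluation window to notice):

```shell
# Spawn 4 CPU workers, then stop automatically after 120 seconds
stress --cpu 4 --timeout 120s
```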

After a few minutes, we get an alert for the CPU load that drastically increased due to the Stress.

If we go below the alert, we can check when the CPU load started to be high.

Looking at our emails, we see that we've received two new ones: one for the warning when the CPU went beyond 75% utilization, and a second when the 90% mark was reached.

c.6 - Monitoring processes

In order to keep an eye on the various processes running on our system, or to watch one in particular, we can create a monitor that generates an alert. This is useful because it tells us which processes are running and which application processes have been killed. We'll create a process monitor that tracks whether a particular process is running on the machine. This can be quite useful for reasons such as the following:

  • If we have Nginx running and want to know it's still working
  • If we're running our web application and want to make sure its process isn't being killed by outside interference

There are also a few drawbacks to this monitor: often a process stops working internally due to its own exceptions without being killed by the system. In this case, DataDog won't flag it as an alert.

In order to create a monitor for the process, we'll go to the directory where Datadog's configuration files are stored:

cd /etc/datadog-agent/conf.d

Now let's go to the process.d directory:

cd process.d

We find a file called conf.yaml.example that we will copy and rename conf.yaml:

sudo cp conf.yaml.example conf.yaml

Now let's open the conf.yaml file and insert the following content:

sudo nano conf.yaml

We'll open the file in which we'll find the following fields:

  • name: the name displayed in Datadog for our process
  • search_string: a unique string that appears when we search for the process on our system
  • exact_match: set it to false so the string is matched as a substring of the full command line rather than as an exact process name

The file will therefore have the following contents:

init_config:

instances:
    - name: ssh
      search_string: ['ssh', 'sshd']
      exact_match: false

    - name: nginx
      search_string: ['nginx']
      exact_match: false
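Before restarting the agent, we can sanity-check that a search_string would actually match something. With exact_match: false the agent does a loose substring match; a dependency-free sketch of the same idea over /proc (on Ubuntu, pgrep -af 'ssh' is the more convenient equivalent):

```shell
# List a few processes whose name contains "sh" (hypothetical demo
# string), analogous to the loose matching done by exact_match: false.
grep -l 'sh' /proc/[0-9]*/comm 2>/dev/null | head -n 3
```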

Now we need to install nginx in order to have the service present on our machine.

sudo apt-get install nginx -y
sudo systemctl enable --now nginx

Next we'll configure the datadog agent to monitor our processes in real time. To configure Live Process monitoring, we need to enable it in our Datadog Agent.

sudo nano /etc/datadog-agent/datadog.yaml

and we need to add the code below:

process_config:
  process_collection:
    enabled: true

Once the configuration is complete, we can restart the agent.

sudo systemctl restart datadog-agent
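Once the agent is back up, it can be worth verifying that our new conf.yaml was picked up. The agent's configcheck subcommand lists every check configuration it has loaded, and our process check should appear in its output:

```shell
sudo datadog-agent configcheck
```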

We can go to the Datadog interface and we'll click on the Monitor menu and the New Monitor sub-menu to create a new monitor for our processes. We'll choose the Process check menu.

We arrive on a new page and need to fill in the form to configure an alarm for our process. In the Pick a Process field, we'll choose nginx from the drop-down menu. In the Pick monitor scope field, we'll choose the Datascientest machine. In the Set alert conditions field, we can leave Check Alert selected and, below that, in the Trigger a separate alert for each field, select the Host value.

Now we can define the process alert thresholds. In the Trigger the alert after selected consecutive failures field, we set the Warning status value to 2, Critical to 4 and OK to 2. We can leave the fields below at their defaults.

We're going to create a message that will be sent to the various users so that they can be notified. In the EDIT section, we'll fill in the following message:

Nginx state for the host {{host.name}}.

In the body of the message, we'll fill in the following message:

{{#is_alert}} # if an alert is generated (4 consecutive failures), display the following message
You have to take care of the Nginx process: there are 4 failures for this process, it's critical.
{{/is_alert}}
{{#is_warning}} # if a warning is generated (2 consecutive failures), display the following message
Keep an eye on the Nginx process: there are already 2 failures for this process.
{{/is_warning}}

Next, let's select @all from the drop-down menu that will notify each user we've added to our Datadog organization. We can leave the rest of the settings as default and save our configuration.

We can now stop the Nginx service to check Datadog's behavior.

sudo systemctl stop nginx

After a few minutes, we are notified by Datadog that the process is no longer running.

We can go and check our Email box to verify the alerts.

We can now restart the service to check that everything is Ok now.

sudo systemctl start nginx

After a few minutes, we can see that everything is Ok on the Datadog interface.

Datadog is an incredible service through which we can track CPU usage, RAM usage and various processes running on our system. We can do this by creating monitors that give us alerts whenever a threshold on each monitor is reached.

Although these may seem like minor problems, on servers running production applications they can create real trouble. We've seen how to create alerts so that stakeholders are immediately notified whenever one of the events we want to monitor occurs.
