Guillaume Azerad

Senior DevOps engineer with Docker/Ansible/Gitlab expertise and full stack developer capability (Go, PHP, Javascript, Python)


Automatically scale an application with an Azure hybrid cloud - part 3: monitoring alerts


Published on: 2024-11-13
Reading time: 17 min
Last update: 2024-11-24
Also available in: French

Hybrid cloud monitoring

Introduction

In the first two articles of this hybrid cloud series, we set up the infrastructure combining the private network with the Azure public cloud, then implemented the application scaling mechanism based on an Azure function.

What remains is to trigger VM creation and deletion on the cloud automatically, based on monitoring events. This is what we will address here by configuring a Grafana / Prometheus / Alertmanager monitoring stack.

Use cases

Let’s recall the local VM monitoring rules that we set for ourselves to determine the creation or deletion of VMs on the Azure cloud:

  • CPU > 70% over the last 5 minutes => creation of an Azure VM (up to two maximum)
  • CPU < 30% over the last hour, with at least one peak > 70% (in 5 minute steps) => deletion of an Azure VM

We also recall the scaling mechanism put in place, which links these monitoring alerts to the VM creation/deletion actions on Azure.

Autoscaling with Azure Functions

So we will look at how to generate monitoring events and make them interact with Azure Functions to ensure the scalability of our application.

Installation of local monitoring

As mentioned, we will install a Grafana / Prometheus / Alertmanager stack: Prometheus collects the metrics and evaluates the alert rules, Alertmanager routes the resulting alerts, and Grafana displays the data graphically. The system metrics of the server where the application runs are exposed by Node exporter, which Prometheus scrapes directly.

Harvesting metrics with Node exporter

Our first goal is to monitor the system resources (CPU, RAM, storage, network, …) of the local network server on which our application runs.

Node exporter is an open source tool, primarily used in system monitoring environments, to collect and expose system metrics on Linux servers. It is specifically designed to be used with Prometheus, a monitoring and alerting solution.

In a very simple way, Node exporter can be deployed using Docker Compose; it exposes the collected data on port 9100 by default. Note that in our case it shares a common docker-compose.yml file with the application running on the server we wish to monitor.

Since we are using a containerized Node exporter installation, we need to mount the relevant host directories, with sufficient rights, so that the collected metrics describe the server itself and not the container.

services:
  # Server Web Application
  application:
    image: crccheck/hello-world
    container_name: application
    ports:
      - "80:8000"

  # Service Node Exporter
  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node_exporter
    # We expose port 9100 which will be accessible for Prometheus
    # deployed on another server dedicated to monitoring
    ports:
      - "9100:9100"
    volumes:
      # The following volumes define the system directories
      # from the server where the metrics will be collected
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      # This volume allows correcting a possible launch error
      - /srv/app/node-exporter/textfile_collector:/var/lib/node_exporter/textfile_collector:ro
      # Here we associate the system date of the server and that of the container
      - /etc/localtime:/etc/localtime:ro
    command:
      # node-exporter launch parameters
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.textfile.directory=/var/lib/node_exporter/textfile_collector'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: always

Once the environment is launched with the command docker compose up -d, we have access to a page on port 9100 of the server (here IP address 192.168.10.5).

Node exporter home page / metrics page
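As a quick check (assuming the IP address 192.168.10.5 used above for the application server), the metrics endpoint can be queried directly before wiring it into Prometheus:

# Fetch the metrics exposed by Node exporter and show a few CPU counters
curl -s http://192.168.10.5:9100/metrics | grep '^node_cpu_seconds_total' | head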

Deploying Grafana / Prometheus / Alertmanager

The services that analyze the system data sent back by Node exporter (dashboards, alert definitions) are also deployed with Docker Compose, on another server dedicated to monitoring.

services:
  grafana:
    image: grafana/grafana:10.4.4
    container_name: grafana
    ports:
      - "3000:3000"
    networks:
      - monitoring
    volumes:
      - grafana-data:/var/lib/grafana
      - /etc/localtime:/etc/localtime:ro
      # - /home/guaz/certs/perso/perso.com.key:/etc/grafana/grafana.key:ro
      # - /home/guaz/certs/perso/perso.com.crt:/etc/grafana/grafana.crt:ro
    environment:
      # Setting admin account credentials
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: NV1lgz7ViL4xibQ8NtqV
      # Setting Grafana server URL
      # GF_SERVER_DOMAIN: "grafana.perso.com"
      # GF_SERVER_ROOT_URL: "https://grafana.perso.com/"
    restart: always

  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
      - /etc/localtime:/etc/localtime:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    restart: always

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    networks:
      - monitoring
    volumes:
      - alertmanager-data:/data
      - ./alert-manager/alertmanager.yml:/config/alertmanager.yml
      - /etc/localtime:/etc/localtime:ro
    command: --config.file=/config/alertmanager.yml --log.level=debug

volumes:
  grafana-data:
  prometheus-data:
  alertmanager-data:

networks:
  monitoring:
    driver: 'bridge'

Some notes on the installation:

  • The ports exposed for each of the services are: 3000 for Grafana, 9090 for Prometheus and 9093 for Alertmanager. We have chosen to open them directly to the outside in order to access their respective interfaces. This is not necessary for the services to communicate with each other, since they all belong to the same Docker network, monitoring.
  • Each service needs a volume to store data: grafana-data, prometheus-data, alertmanager-data.
  • Grafana configuration is managed by environment variables while Prometheus and Alertmanager require creating bind mount volumes for their configuration files.
  • The /etc/localtime bind mount has no purpose other than ensuring that the containers and the host share the same system time.
  • The Prometheus launch command parameters ensure the following:
    • --config.file: specification of the configuration file in a volume allowing its modification from the outside
    • --storage.tsdb.path: The data storage path is correctly assigned to the persistent volume.
    • --web.enable-lifecycle: this option allows reloading the configuration without restarting the container, which is useful for applying changes dynamically (see the example just after this list).
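For example, after editing one of the Prometheus configuration files mounted as volumes, the running instance can be asked to reload them without restarting the container (a minimal sketch, assuming the command is run from the monitoring server where port 9090 is exposed):

# Reload the Prometheus configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload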

Implementing alerting

Prometheus configuration

We have seen previously that the Prometheus configuration consists of two files: /etc/prometheus/prometheus.yml and /etc/prometheus/alert.rules.yml (i.e. ./prometheus/prometheus.yml and ./prometheus/alert.rules.yml on the monitoring server where Docker Compose is located).

prometheus.yml

global:
  scrape_interval: 1m

rule_files:
  - /etc/prometheus/alert.rules.yml

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['vm-debian-3:9100']

The general Prometheus configuration defines the following:

  • scrape_interval: the interval for retrieving metrics (here, one minute)
  • rule_files: the file containing the alert rules
  • alerting: the part defining the sending of alerts, here to the alertmanager service accessible on port 9093
  • scrape_configs: the configuration of the servers to monitor. Here, we only have the VM of our application, named vm-debian-3, on which Node exporter is reachable on port 9100 (a quick check of this target is shown just after this list)
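A minimal check of the scrape target, assuming the command is run on the monitoring server where port 9090 is exposed:

# List the scrape targets known to Prometheus and their health status
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'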

alert.rules.yml

groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: (1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100", mode="idle"}[5m])) * 100 > 70
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on instance {{ $labels.instance }}"
          description: "The CPU usage on instance {{ $labels.instance }} has been over 70% for more than 5 minutes."
      - alert: NormalCPUUsage
        expr: |
          max_over_time((1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100",mode="idle"}[5m]))[1h:5m]) * 100 > 70 
          and (1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100", mode="idle"}[1h])) * 100 < 30          
        for: 1h
        labels:
          severity: none
        annotations:
          summary: "CPU back to normal usage on instance {{ $labels.instance }}"
          description: "The CPU usage on instance {{ $labels.instance }} has come down under 30% after a peak."

This file therefore contains the definition of CPU usage alerts as previously defined:

  • CPU > 70% over the last 5 minutes
  • CPU < 30% over the last hour, with at least one peak > 70% (in 5 minute steps)

These alerts are defined under the expr parameter by PromQL expressions.

The next configuration parameter, for, defines how long the condition must remain true before the alert is actually fired. Without this clause, alerts would fire as soon as the expression is first satisfied.

Finally, we indicate the severity of the alert in the labels, and the annotations add an explicit definition of the alerts.

On the Prometheus interface (available in our case at the URL http://localhost:9090) we can see these alerts and their status.

Prometheus Alerts

In the Graph part, it is possible to visualize metrics by indicating PromQL expressions.

Prometheus: graph
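The same PromQL expressions can also be evaluated outside the web interface through the Prometheus HTTP API. As a sketch (again assuming the monitoring server answers on localhost:9090), the instantaneous value feeding the HighCPUUsage rule can be queried like this:

# Evaluate the CPU usage expression used by the HighCPUUsage alert
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100",mode="idle"}[5m])) * 100'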

Alert definition details

The PromQL expressions defining monitoring alerts are quite complex at first glance. We will explain some elements of them here.

  • HighCPUUsage

    expr: (1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100", mode="idle"}[5m])) * 100 > 70
    
    • rate(node_cpu_seconds_total{instance="vm-debian-3:9100", mode="idle"}[5m]): the rate function calculates the average idle rate of the CPU of the VM we wish to monitor over the last 5 minutes.
    • (1 - ...) * 100: we get the CPU usage rate by subtracting the previous result from 1, then multiplying it by 100 to get a percentage
    • > 70: checks that this CPU usage rate exceeds the 70% threshold in order to trigger the alert
  • NormalCPUUsage

    expr: |
      max_over_time((1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100",mode="idle"}[5m]))[1h:5m]) * 100 > 70
      and (1 - rate(node_cpu_seconds_total{instance="vm-debian-3:9100", mode="idle"}[1h])) * 100 < 30
    

    Here, the expression is divided into two parts:

    • Detection of a CPU usage peak greater than 70% in the last hour
      • 1 - rate(...[5m]): as before, we get the average CPU usage rate per 5 minute period
      • max_over_time((...)[1h:5m]): this function captures the maximum CPU activity peak observed over the last hour, with 5 minute measurement steps ([1h:5m])
      • * 100 > 70: we check that this peak activity was greater than 70%
    • Check that the average CPU usage has fallen below 30% over the last hour: this is the same expression as for the HighCPUUsage alert but with the < 30 test.

For an infrastructure made up of several monitored servers, we would wrap the rate calculation in avg by(instance) to get the average per instance: avg by(instance)(rate(node_cpu_seconds_total{instance=~"server1|server2|server3", mode="idle"}[5m]))

This way, the NormalCPUUsage alert is triggered once resource usage on the local VM hosting our application has returned to normal, and it leads, via Alertmanager, to the deletion of the VMs created on the cloud.

We will now describe how to configure Alertmanager to associate the desired tasks when monitoring alerts are triggered.

Managing alerts with Alertmanager

We have seen that Prometheus has the ability to retrieve system metrics and generate alerts from them.

Now, Alertmanager will allow us to act on these alerts by associating a task with each of them. In our case, we want to create/delete VMs on the Azure cloud in order to implement our hybrid cloud strategy. This is done through API calls to the Azure function that we detailed in the article dedicated to it.
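Before going through Alertmanager, the Azure function endpoint can be exercised by hand. The call below is a hypothetical test, assuming the function only inspects the action query parameter (its exact contract was detailed in the previous article):

# Manually trigger the scale-out action of the Azure function (test call)
curl -X POST 'https://tp-cloud-autoscale-vpn.azurewebsites.net/api/autoscale-vpn?action=create'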

Now let’s look at the contents of the alertmanager.yml configuration file:

route:
  group_by: ['alertname']
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - receiver: 'create-vm'
      matchers:
        - alertname = "HighCPUUsage"
    - receiver: 'delete-vm'
      matchers:
        - alertname = "NormalCPUUsage"

receivers:
  - name: 'default'
  - name: 'create-vm'
    webhook_configs:
    - url: https://tp-cloud-autoscale-vpn.azurewebsites.net/api/autoscale-vpn?action=create
  - name: 'delete-vm'
    webhook_configs:
    - url: https://tp-cloud-autoscale-vpn.azurewebsites.net/api/autoscale-vpn?action=delete

This file configures how alerts are processed and sent to specific destinations (receivers).

The route section defines how alerts are grouped, how long they are held, and to which destinations, called “receivers”, they are forwarded.

  • group_by: ['alertname']: alerts are grouped by name (alertname).
  • receiver: 'default': if an alert does not match any specific route, it is sent to the ‘default’ receiver.
  • group_wait: 30s, group_interval: 5m, repeat_interval: 12h: define the delays for sending and repeating alerts.

Under the routes key, specific routes are defined for particular alert types, each directed to a different receiver.

  • create-vm: sends HighCPUUsage alerts to a webhook to create a VM.
  • delete-vm: sends NormalCPUUsage alerts to a webhook to delete a VM.

Finally, the receivers section defines the webhook URLs corresponding to each of the routes.

Alertmanager also offers a web page available on port 9093 of the monitoring server.

Alertmanager Home Page
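Besides this web page, Alertmanager exposes an HTTP API that can be used to list the alerts it currently holds (assuming the command is run on the monitoring server where port 9093 is exposed):

# List the alerts currently known to Alertmanager (v2 API)
curl -s http://localhost:9093/api/v2/alerts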

Testing scaling on monitoring alerts

CPU load simulation on local server

We will install the stress-ng tool, which allows us, on Linux systems, to put the target server under a chosen load.

On a Debian distribution, this is done very simply with the following command:

sudo apt-get install stress-ng

In order to trigger the previously defined HighCPUUsage monitoring alert, we will run the command below which will increase the CPU usage to 75% for a single processor (which is sufficient since our application’s local server is a 1 vCPU VM).

$ stress-ng -c 1 -l 75
stress-ng: info:  [948] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [948] dispatching hogs: 1 cpu
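As the output shows, stress-ng defaults to a one-day run per stressor. If you prefer the load to stop by itself rather than interrupting it manually later, a timeout can be added (the 20 minute value below is arbitrary):

# Same 75% load on one CPU, stopped automatically after 20 minutes
stress-ng -c 1 -l 75 -t 20m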

As expected, we can see in Prometheus the CPU usage, averaged over 5 minute windows, quickly climbing above the 70% threshold that is supposed to trigger the HighCPUUsage alert.

Prometheus high CPU test

We can then see that the alert is indeed raised on the Prometheus interface, initially in Pending state for the duration defined by the for parameter of the Prometheus configuration.

Prometheus test high CPU alert - pending

In our case, we must therefore wait 5 minutes before the alert goes into Firing status, meaning it has been sent to Alertmanager.

Prometheus test high CPU alert - firing

The Alertmanager interface allows us to see that the HighCPUUsage alert has been received.

Alertmanager test high CPU alert

Now let’s look at the logs of the Alertmanager container: we can see the reception of the HighCPUUsage alert sent by Prometheus, which triggers the call to the Azure Functions API (receiver=create-vm) to create a VM on the Azure cloud (here after an initial failure).

2024-07-18 09:02:46 ts=2024-07-18T07:02:46.332Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=HighCPUUsage[b709916][active]
2024-07-18 09:02:46 ts=2024-07-18T07:02:46.334Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{alertname=\"HighCPUUsage\"}:{alertname=\"HighCPUUsage\"}" msg=flushing alerts=[HighCPUUsage[b709916][active]]
2024-07-18 09:02:54 ts=2024-07-18T07:02:54.347Z caller=notify.go:848 level=warn component=dispatcher receiver=create-vm integration=webhook[0] aggrGroup="{}/{alertname=\"HighCPUUsage\"}:{alertname=\"HighCPUUsage\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp: lookup tp-cloud-autoscale-vpn.azurewebsites.net on 127.0.0.11:53: server misbehaving"
2024-07-18 09:04:46 ts=2024-07-18T07:04:46.308Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=HighCPUUsage[b709916][active]
2024-07-18 09:05:45 ts=2024-07-18T07:05:45.824Z caller=notify.go:860 level=info component=dispatcher receiver=create-vm integration=webhook[0] aggrGroup="{}/{alertname=\"HighCPUUsage\"}:{alertname=\"HighCPUUsage\"}" msg="Notify success" attempts=2 duration=2m51.482925757s

Return to normal CPU load and delete the VM

Now we interrupt the CPU load test launched by the stress-ng command on our application’s local server.

We can see on Prometheus that the hourly CPU average never exceeded 30% even during the peak activity generated.

Prometheus CPU test normal - average

In addition, we can see that the peak remains captured in the one-hour lookback window for some time after the load stops.

Prometheus normal CPU test - peak

We are now in the conditions where the return-to-normal NormalCPUUsage alert can be raised, which we can first observe on Prometheus.

Prometheus normal CPU test - pending alert

Since the for parameter of this alert is set to 1h in the Prometheus configuration, we now have to wait an hour before the alert actually fires. This delay was chosen so as not to overreact to CPU variations and to allow a smoother transition between states.

Prometheus normal CPU test - firing alert

As before, the alert is received on the Alertmanager side and results in a call to the Azure Functions API to delete the VM created on Azure (receiver delete-vm).

Alertmanager test CPU normal

2024-07-18 12:13:01 ts=2024-07-18T10:13:01.283Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=NormalCPUUsage[ccec9d5][active]
2024-07-18 12:13:01 ts=2024-07-18T10:13:01.286Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{alertname=\"NormalCPUUsage\"}:{alertname=\"NormalCPUUsage\"}" msg=flushing alerts=[NormalCPUUsage[ccec9d5][active]]
2024-07-18 12:13:09 ts=2024-07-18T10:13:09.298Z caller=notify.go:848 level=warn component=dispatcher receiver=delete-vm integration=webhook[0] aggrGroup="{}/{alertname=\"NormalCPUUsage\"}:{alertname=\"NormalCPUUsage\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp: lookup tp-cloud-autoscale-vpn.azurewebsites.net on 127.0.0.11:53: server misbehaving"
2024-07-18 12:14:35 ts=2024-07-18T10:14:35.741Z caller=notify.go:860 level=info component=dispatcher receiver=delete-vm integration=webhook[0] aggrGroup="{}/{alertname=\"NormalCPUUsage\"}:{alertname=\"NormalCPUUsage\"}" msg="Notify success" attempts=2 duration=1m26.446343094s