1. Alertmanager Cluster
Alertmanager supports cluster mode natively: the instances form a cluster and replicate notification state to each other over a gossip protocol between peers (by default on port 9094). We already covered Alertmanager's features in the single-node article, so we won't repeat them here and will go straight to building the cluster.
We will install and configure Alertmanager on 10.0.1.13 and 10.0.1.14.
- Initial installation
Download page: https://prometheus.io/download/
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xf alertmanager-0.24.0.linux-amd64.tar.gz
cd alertmanager-0.24.0.linux-amd64
mv alertmanager amtool /usr/local/sbin
mkdir /etc/alertmanager
mv alertmanager.yml /etc/alertmanager/
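A quick sanity check that the binaries landed where we expect (assuming /usr/local/sbin is on your PATH):
alertmanager --version
amtool --version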
- Edit the systemd unit file /etc/systemd/system/alertmanager.service. Configure both machines the same way so that the two instances peer with each other:
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/docs/alerting/latest/overview/
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/sbin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=10.0.1.13:9094 \
--cluster.peer=10.0.1.14:9094
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
[Install]
WantedBy=multi-user.target
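Then reload systemd and start the service on both nodes. Creating the storage directory up front is a cautious extra step, in case the binary does not create it itself:
mkdir -p /var/lib/alertmanager
systemctl daemon-reload
systemctl enable --now alertmanager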
- Edit alertmanager.yml
global:
  # global parameters
  resolve_timeout: 5m
  # SMTP parameters
  smtp_smarthost: smtp.qq.com:465
  smtp_from: 29371962@qq.com
  smtp_auth_username: 29371962@qq.com
  smtp_auth_password: qjrqmfxodukpbihc
  smtp_require_tls: false
route:
  receiver: default-receiver
  group_wait: 15s
  group_interval: 45s
  repeat_interval: 24h
  group_by: [alertname]
  routes:
    # the route below is for the DBAs; we have Oracle, MySQL and SQL Server databases
    - receiver: 'mssql-dba-qa'
      continue: true
      match:
        dbserver: "mysql"
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - send_resolved: false
        url: 'http://127.0.0.1:15000'
  - name: 'mssql-dba-qa'
    email_configs:
      - send_resolved: true
        to: '29371962@qq.com'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'env', 'project']
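Before restarting, validate the file with amtool, then confirm the two instances have found each other. The /api/v2/status endpoint reports the gossip cluster state; the jq filter is only for readability and assumes jq is installed:
amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://10.0.1.13:9093/api/v2/status | jq '.cluster'
Both peers (10.0.1.13 and 10.0.1.14) should appear in the peers list, with the cluster status "ready".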
2. Configuring Alerts
2.1. The alerting workflow
Alert evaluation and delivery happen in the following steps:
- The ruler, whether Prometheus's rule engine or Thanos's ruler, evaluates the rules from its configuration files against the corresponding time-series database; every rule that fires produces an alert that is sent to Alertmanager.
- When Alertmanager receives an alert, it walks the routing tree and dispatches the alert, based on its labels, to the matching receiver (see the sketch after this list for injecting a test alert by hand).
- Each receiver in Alertmanager defines a notification channel, e.g. WeChat or SNS.
- Alertmanager also compares related alerts against the inhibit rules; only if the alert still satisfies the sending conditions is the notification actually sent.
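A quick way to exercise steps 2-4 without waiting for a ruler is to POST an alert to Alertmanager's v2 API by hand. This is only a test sketch; the dbserver="mysql" label is chosen so the alert matches the 'mssql-dba-qa' route from the config above:
curl -s -XPOST http://10.0.1.13:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ManualTest", "severity": "warning", "dbserver": "mysql"},
       "annotations": {"summary": "manually injected test alert"}}]'
If routing works, the mail receiver should get the message once group_wait (15s) has elapsed.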
2.2. Thanos-ruler
Thanos rule is functionally identical to Prometheus's rule component, but for clustering we again run two identical instances. If both instances fire the same alert, the replica label is dropped, just as with query, so that the duplicates can be merged; both instances then send the alert to Alertmanager, which deduplicates it.
Configure /etc/systemd/system/thanos-ruler.service:
[Unit]
Description=thanos-ruler
Documentation=https://thanos.io/
After=network.target
[Service]
Type=simple
# --alert.query-url is the query UI URL embedded in alerts;
# --web.external-prefix is the URL the ruler is exposed at externally;
# on the other instance set --label=replica="B".
# (systemd does not allow inline comments after a trailing backslash,
# so the flags themselves must stay comment-free.)
ExecStart=/usr/local/sbin/thanos rule \
--http-address=0.0.0.0:10910 \
--grpc-address=0.0.0.0:10911 \
--data-dir=/app/thanos/ruler \
--rule-file=/etc/thanos/rules/*/*.yml \
--alert.query-url=http://10.0.0.11:10903 \
--alertmanagers.config-file=/etc/thanos/thanos-ruler-alertmanager.yml \
--query.config-file=/etc/thanos/thanos-ruler-query.yml \
--objstore.config-file=/etc/thanos/thanos-minio.yml \
--web.external-prefix=http://10.0.0.11:10910 \
--label=cluster="aws" \
--label=replica="A" \
--alert.label-drop="replica"
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
[Install]
WantedBy=multi-user.target
The /etc/thanos/thanos-ruler-alertmanager.yml file:
alertmanagers:
  - http_config: {}
    static_configs: ['10.0.1.13:9093', '10.0.1.14:9093']
    scheme: http
    timeout: 30s
The /etc/thanos/thanos-ruler-query.yml file:
- http_config: {}
  static_configs: ["10.0.0.11:10903"]
  scheme: http
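With the config files in place, reload systemd and start the ruler on both nodes. Thanos components expose /-/ready and /-/healthy probes on their HTTP port, which gives a quick startup check:
systemctl daemon-reload
systemctl enable --now thanos-ruler
curl -s http://10.0.0.11:10910/-/ready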
2.3. Alert rules
The ruler picks up rule files via the --rule-file glob above (/etc/thanos/rules/*/*.yml). The rules we use for the Linux hosts:
groups:
- name: linux_OS
  rules:
  - alert: NodeExporterDown
    expr: up{job="consul-node-exporter",project="monitoring"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Node_exporter on {{ $labels.instance }} is unreachable"
      description: "Node_exporter on {{ $labels.instance }} has been unreachable for 1m"
  # To simulate memory pressure, a fork bomb works (careful on real hosts):
  # function a() { $(a) ; }
  # a &
  - alert: MemoryUsage
    expr: round(((node_memory_MemTotal_bytes{project="monitoring"}-node_memory_MemAvailable_bytes{project="monitoring"})/node_memory_MemTotal_bytes{project="monitoring"}) * 100) > 80
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Not enough memory on {{ $labels.instance }}"
      description: "Memory usage of {{ $labels.instance }} has been too high for more than 15 minutes. (current value: {{ $value }}%)"
  # To drive a CPU to 100% for testing, use: cat /dev/urandom | md5sum
  - alert: CPUUsage
    expr: round((1 - avg(rate(node_cpu_seconds_total{mode="idle",project="monitoring"}[15m])) by (instance)) * 100) > 80
    #expr: round((1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100) > 80
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage of {{ $labels.instance }} is too high"
      description: "CPU usage of {{ $labels.instance }} has been too high for more than 15 minutes. (current value: {{ $value }}%)"
  # To fill a disk for testing: dd if=/dev/zero of=test bs=1024M count=40
  - alert: FilesystemUsage
    expr: round((node_filesystem_size_bytes{job="consul-node-exporter",device!="tmpfs",project="monitoring"}-node_filesystem_free_bytes{job="consul-node-exporter",device!="tmpfs",project="monitoring"})/node_filesystem_size_bytes{job="consul-node-exporter",device!="tmpfs",project="monitoring"} * 100) > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Not enough space for file system on {{ $labels.instance }}"
      description: "Not enough space for file system {{ $labels.device }} on {{ $labels.instance }}. (current value: {{ $value }}%)"