Step 1: Deploy the Prometheus Operator environment

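Git repository: https://github.com/prometheus-operator/kube-prometheus.git

Pick the release that matches your Kubernetes version. For example, my cluster runs Kubernetes 1.13, so I chose release-0.1.

All the deployment manifests are in the manifests/ folder, so everything can be deployed in one go.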

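Step 2: Modify the Alertmanager alerting configuration

The built-in notification setup did not meet my requirements, so I changed it to add email and webhook receivers.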

alertmanager.yaml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'
  smtp_auth_username: '***@163.com'
  smtp_auth_password: '***'
  smtp_hello: '163.com'
  smtp_require_tls: false
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1m
  receiver: 'webhook'
receivers:
- name: 'default'
  email_configs:
  - to: '****@qq.com'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: 'http://172.16.3.63:9006/webhook/'
    send_resolved: true

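This configuration defines both an email receiver and a webhook receiver, but the route only sends alerts to the webhook receiver. The webhook endpoint itself is simple to implement. For example: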

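import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Slf4j
@RequestMapping("/webhook")
public class WebHookController {

    // Alertmanager sends each notification as an HTTP POST with a JSON body.
    @RequestMapping("/")
    public String webhook(@RequestBody String body) {
        log.info("webhook alert received, body: {}", body);
        return "success";
    }
}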

All of the alert information arrives in the request body. Once the program has it, it can process the alerts further.
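As a rough illustration of that further processing, here is a minimal sketch. It assumes the controller parameter is changed to `@RequestBody Map<String, Object>` so that Spring deserializes the JSON into a map; the class `AlertSummarizer` and its method are hypothetical names, not part of the original code. The keys `alerts`, `status`, `labels` and `alertname` are the ones Alertmanager uses in its webhook payload.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original post.
class AlertSummarizer {

    /** Returns one line per alert in the payload, in the form "status alertname [severity]". */
    @SuppressWarnings("unchecked")
    static List<String> summarize(Map<String, Object> payload) {
        List<String> lines = new ArrayList<>();
        Object alerts = payload.get("alerts");
        if (!(alerts instanceof List)) {
            return lines; // no "alerts" array, nothing to report
        }
        for (Object item : (List<Object>) alerts) {
            if (!(item instanceof Map)) {
                continue; // skip malformed entries
            }
            Map<String, Object> alert = (Map<String, Object>) item;
            Object labelsObj = alert.get("labels");
            Map<String, Object> labels = labelsObj instanceof Map
                    ? (Map<String, Object>) labelsObj
                    : Collections.<String, Object>emptyMap();
            String status = String.valueOf(alert.getOrDefault("status", "unknown"));
            String name = String.valueOf(labels.getOrDefault("alertname", "unnamed"));
            String severity = String.valueOf(labels.getOrDefault("severity", "none"));
            lines.add(status + " " + name + " [" + severity + "]");
        }
        return lines;
    }
}
```

Inside the controller, the returned lines could be logged or forwarded to whatever notification channel the application uses.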

Step 3: Deploy a PrometheusRule

Once the Prometheus Operator is deployed, a default set of Prometheus rules is already present, as shown below:

[root@master manifests]# kubectl get prometheusRule --all-namespaces
NAMESPACE    NAME                   AGE
default      prometheus-k8s-rules   18h
fline        rule                   15h
monitoring   etcd-rules             12h

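Of these, prometheus-k8s-rules is the built-in configuration, which defines a large number of monitoring rules.

PrometheusRule is a custom resource type that defines the alerting rules whose alerts are delivered to Alertmanager. The key part of each rule is its expression. Here is an example: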

etcd-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)

      for: 3m
      labels:
        severity: critical

The expr field in the file is a Prometheus expression. It is evaluated periodically, and an alert is raised when the condition holds.
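To make the threshold in the etcd example concrete, the sketch below reproduces its arithmetic in plain Java. The class and method names are hypothetical. It only mirrors the comparison itself; in the real rule the condition must also stay true for the 3m given in `for` before the alert fires.

```java
// Hypothetical illustration of the comparison in the etcd rule:
//   count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
class EtcdQuorumRule {

    /** Returns true when the rule's condition holds for the given member counts. */
    static boolean fires(int totalMembers, int downMembers) {
        if (downMembers == 0) {
            // In PromQL, count() over an empty result yields no series,
            // so the comparison produces nothing and no alert is raised.
            return false;
        }
        // PromQL uses floating-point division, so 3 / 2 - 1 is 0.5, not 0.
        return downMembers > totalMembers / 2.0 - 1;
    }
}
```

With 3 members the threshold is 0.5, so a single member going down satisfies the condition. With 5 members the threshold is 1.5, so it takes 2 members down. In both cases, losing one more member would cost the cluster its majority, which matches the rule's description.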

Step 4: Manage PrometheusRule from code with fabric8

Code example:

// Build the PrometheusRule manifest as a YAML string.
// byId and PrometheusExprUtil come from the surrounding application code (not shown).
String rule = "apiVersion: monitoring.coreos.com/v1\n" +
        "kind: PrometheusRule\n" +
        "metadata:\n" +
        "  name: " + byId.getAlertName() + "\n" +
        "  namespace: monitoring\n" +
        "  labels:\n" +
        "    prometheus: k8s\n" +
        "    role: alert-rules\n" +
        "spec:\n" +
        "  groups:\n" +
        "  - name: " + byId.getAlertName() + "\n" +
        "    rules:\n" +
        "      - alert: Prometheus scraping errors\n" +
        "        expr: >-\n" +
        "          " + PrometheusExprUtil.getExpr(byId.getTarget(), Double.valueOf(byId.getQuota()), byId.getAppName()) + "\n" +
        "        for: 5m\n" +
        "        labels:\n" +
        "          page: monitoring\n" +
        "          team: monitoring\n" +
        "        annotations:\n" +
        "          summary: " + byId.getAlertDesc() + "\n" +
        "          description: |\n" +
        "            Check failing services";

// Describe the PrometheusRule custom resource so the client knows which API endpoint to call.
CustomResourceDefinitionContext crdContext = new CustomResourceDefinitionContext.Builder()
        .withGroup("monitoring.coreos.com")
        .withPlural("prometheusrules")
        .withScope("Namespaced")
        .withVersion("v1")
        .build();
try {
    // Create the rule in the monitoring namespace, or replace it if it already exists.
    kubernetesClient.customResource(crdContext)
            .createOrReplace("monitoring", rule);
} catch (Exception e) {
    e.printStackTrace();
    return ResultVo.renderErr(CodeEnum.ERR).withRemark("Operation failed: " + e.getMessage());
}

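In essence, you concatenate strings into a valid YAML document and pass it in directly.

Summary

fabric8 supports working with custom resources, but the support is clearly less convenient than for built-in resources such as Deployment and Service.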
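One thing to watch with string concatenation is that a value containing a colon, a quote or a line break can produce invalid YAML. The sketch below shows one possible way to guard against that by double-quoting the interpolated values. `PrometheusRuleYaml`, `quote` and `build` are hypothetical names introduced for illustration, and the name check assumes the usual Kubernetes rule of lowercase alphanumerics, `-` and `.`.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of fabric8 or the original code.
class PrometheusRuleYaml {

    // Assumed naming rule: lowercase alphanumerics, '-' and '.', starting and ending alphanumeric.
    private static final Pattern NAME = Pattern.compile("[a-z0-9]([-a-z0-9.]*[a-z0-9])?");

    /** Wraps a value in YAML double quotes, escaping backslashes, quotes and newlines. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\")
                           .replace("\"", "\\\"")
                           .replace("\n", "\\n") + "\"";
    }

    /** Builds a single-rule PrometheusRule manifest in the monitoring namespace. */
    static String build(String name, String alert, String expr, String summary) {
        if (!NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("invalid resource name: " + name);
        }
        return String.join("\n",
                "apiVersion: monitoring.coreos.com/v1",
                "kind: PrometheusRule",
                "metadata:",
                "  name: " + name,
                "  namespace: monitoring",
                "  labels:",
                "    prometheus: k8s",
                "    role: alert-rules",
                "spec:",
                "  groups:",
                "  - name: " + name,
                "    rules:",
                "    - alert: " + quote(alert),
                "      expr: " + quote(expr),
                "      for: 5m",
                "      annotations:",
                "        summary: " + quote(summary));
    }
}
```

The resulting string can be passed to createOrReplace("monitoring", ...) in the same way as the concatenated string above.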