Nginx Active Health Checks for Upstreams

Column: Cloud Cosmic (云苍穹) Knowledge | Author: Kingdee (金蝶) | Source: Kingdee Cloud Community | Published: 2024-09-23

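1 Background

nginx is a widely used reverse proxy and load balancer, known for its stability, high concurrency, rich feature set, simple configuration files, and low resource consumption. This article describes nginx's health check features, which ensure that requests are forwarded only to healthy backend servers.

nginx supports two kinds of health checks. Passive health checks are provided by the built-in ngx_http_upstream_module. Active health checks are provided by the third-party module nginx_upstream_check_module.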

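Drawbacks of passive health checks in nginx

(1) Nginx probes a backend node only when a request arrives.

(2) If a node happens to fail while a request is in flight, Nginx still hands the request to the failed node first and only then passes it on to a healthy node. The request itself still succeeds, but efficiency suffers because of the extra forwarding hop.

(3) The built-in module cannot raise alerts ahead of time.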

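Active health checks in nginx

(1) Unlike nginx's built-in, non-active heartbeat detection, Tengine (developed by Taobao) ships with a module that actively checks backend servers. When the check type is http and health checking is enabled, nginx sends health check packets to the configured backend port at the configured interval and decides whether the service is healthy from the expected HTTP status code of the reply.

(2) When a real backend node is unavailable, requests are not forwarded to it.

(3) Once the failed node recovers, requests are forwarded to it again.

Tip: for Kubernetes NodePort forwarding, if a VIP is available for the apiserver, traffic can be forwarded through the VIP. If no VIP can be provided, this active health check module can be used to monitor the health of the backend server group.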

2 Configuration and Usage

Sample configuration

    http {
        upstream cluster {
            # simple round-robin
            server 192.168.0.1:80 weight=1;
            server 192.168.0.2:80 weight=1;
            check interval=5000 rise=1 fall=3 timeout=4000;
            #check interval=3000 rise=2 fall=5 timeout=1000 type=ssl_hello;
            #check interval=3000 rise=2 fall=5 timeout=1000 type=http;
            #check_http_send "HEAD / HTTP/1.0\r\n\r\n";
            #check_http_expect_alive http_2xx http_3xx;
        }
        server {
            listen 80;
            location / {
                proxy_pass http://cluster;
            }

            location /status {
                check_status;
                access_log   off;
                #allow SOME.IP.ADD.RESS;
                #deny all;
            }
        }
    }

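The check directive:

interval: interval between health check packets sent to the backend, in milliseconds
rise: after rise_count consecutive successes, the server is considered up
fall: after fall_count consecutive failures, the server is considered down
timeout: timeout for a health check request to the backend, in milliseconds
type: type of health check packet; tcp, ssl_hello, http, mysql, and ajp are supported

Syntax: check interval=milliseconds [fall=count] [rise=count] [timeout=milliseconds] [default_down=true|false] [type=tcp|http|ssl_hello|mysql|ajp] [port=check_port]
Default: if no parameters are given, the defaults are: interval=30000 fall=5 rise=2 timeout=1000 default_down=true type=tcp
Context: upstream block
port: the port to check on the backend server. It may differ from the port that serves real traffic; for example, if the backend application listens on 443, you can check port 80 to judge backend health. The default is 0, which means the same port the backend server uses for real traffic.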
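
The rise and fall counters described above behave like a small state machine. The following is only an illustrative sketch in Python, not the module's actual implementation: the class and function names are hypothetical, and the initial "down" state is an assumption based on reading default_down=true as "start in the down state".

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    """Tracks one backend the way the rise/fall description suggests."""
    rise: int = 2        # consecutive successes needed to mark the peer up
    fall: int = 5        # consecutive failures needed to mark the peer down
    up: bool = False     # assumption: peers start down (default_down=true)
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting up/down state."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.rise:
                self.up = True
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.fall:
                self.up = False
        return self.up


def replay(results, rise=2, fall=5):
    """Return the up/down state after each probe result in `results`."""
    peer = PeerState(rise=rise, fall=fall)
    return [peer.record(ok) for ok in results]
```

With rise=1 and fall=3 (the values in the sample configuration), a single successful probe brings a server up, and it takes three failures in a row to take it out of rotation again.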

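The check_http_send directive:

Syntax: check_http_send "HEAD /ierp/ HTTP/1.0\r\n\r\n"
Default: "GET / HTTP/1.0\r\n\r\n"
Context: upstream block
Description: by default the check probes http://IP:8080/. The problem is that a service is not always reachable at /; sometimes a path suffix is needed to reach a resource. For example, if no context path is configured in the backend Tomcat, an application deployed as test.war is normally reached at http://IP:8080/test, and for such non-root applications the default check always reports an error state.
The / after HEAD in check_http_send is the path to probe. For this example, a request that is recognized correctly is "HEAD /ierp/checkk8shealth HTTP/1.0\r\n\r\n". The path can be any URL in the project, as long as it can be requested successfully.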
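
When several applications with different context paths need checks, the directive line can be generated rather than typed by hand. The helper below is a hypothetical sketch (the function name is not part of nginx or the module); it emits the backslash sequences literally, the way they appear in the configuration examples in this article.

```python
def check_http_send_directive(path: str, method: str = "HEAD") -> str:
    """Build a check_http_send line for an upstream block.

    The \\r\\n sequences are written as literal backslash escapes,
    matching how they are typed in nginx.conf.
    """
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if any(ch.isspace() for ch in path):
        raise ValueError("path must not contain whitespace")
    return f'check_http_send "{method} {path} HTTP/1.0\\r\\n\\r\\n";'


if __name__ == "__main__":
    # Prints the directive for the health URL used as an example above.
    print(check_http_send_directive("/ierp/checkk8shealth"))
```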

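The check_http_expect_alive directive:

Syntax: check_http_expect_alive [ http_2xx | http_3xx | http_4xx | http_5xx ]
Default: http_2xx | http_3xx
Context: upstream block
Description: these status classes indicate that the upstream server's HTTP response is normal and the backend is alive.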
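
To make the matching rule concrete, the sketch below classifies a status code into one of the http_Nxx classes and tests it against the configured set. It is a simplified model of the behaviour described above, with a hypothetical function name, not code taken from the module.

```python
def is_alive(status_code: int, expect_alive=("http_2xx", "http_3xx")) -> bool:
    """Return True if the status code falls into one of the expected classes."""
    status_class = f"http_{status_code // 100}xx"
    return status_class in expect_alive
```

For instance, with the default set a 404 from the health URL counts as a failure, which is why pointing check_http_send at a path that really exists matters.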

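The check_keepalive_requests directive:

Syntax: check_keepalive_requests num
Default: check_keepalive_requests 1
Context: upstream block
Description: sets the number of requests sent over one connection. The default of 1 means nginx always closes the connection after a single request completes.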

The check_fastcgi_param directive:

Syntax: check_fastcgi_param parameter value. The default directives are:
          check_fastcgi_param "REQUEST_METHOD" "GET";
          check_fastcgi_param "REQUEST_URI" "/";
          check_fastcgi_param "SCRIPT_FILENAME" "index.php";
Context: upstream block
Description: if the check type is set to fastcgi, the check sends these FastCGI headers to probe the upstream server.

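The check_shm_size directive:

Syntax: check_shm_size size
Default: 1M
Context: http block
Description: the default size is 1M. When thousands of servers are checked, the shared memory used for health checks may run out; use this directive to enlarge it.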

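The check_status directive:

Syntax: check_status [html|csv|json]
Default: none
Context: location block
Description: exposes the status of the checked servers over HTTP. The directive is used in a location inside the http block. The health status of every backend server group configured in an upstream can then be viewed in a browser.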

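3 Installation and Configuration

3.1 Building from Source

Download location for nginx_upstream_check_module: https://github.com/yaoweibin/nginx_upstream_check_module

Download and unpack nginx_upstream_check_module (this step is not shown here).

Download the nginx source; skip this if nginx is already installed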
# wget http://nginx.org/download/nginx-1.22.0.tar.gz
# tar -xzvf nginx-1.22.0.tar.gz

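Apply the module patch to the nginx source tree
# cd nginx-1.22.0/
# patch -p1 < /absolute/path/to/unpacked/module/check_1.20.1+.patch
About choosing the patch: pick the patch according to the nginx version and the version range each patch covers. In this example nginx is 1.22.0, so the 1.20.1+ patch applies. If nginx is 1.18.0 and the available patches are 1.16.0+ and 1.20.1+, use the 1.16.0+ patch.
Tip: this step may fail. If the patch does not apply, see the manual patching procedure in the appendix of this article.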
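
The selection rule above (take the patch with the highest minimum version that does not exceed the nginx version) can be written down as a short sketch. The function names are hypothetical, and the patch labels in the usage example are just the two version ranges mentioned in this article, not a listing of the files actually shipped in the repository.

```python
import re


def version_tuple(text: str) -> tuple:
    """Turn '1.20.1+' or 'check_1.20.1+.patch' into (1, 20, 1)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def pick_patch(nginx_version: str, patch_labels: list) -> str:
    """Pick the patch whose minimum version is the highest one <= nginx_version."""
    target = version_tuple(nginx_version)
    candidates = [label for label in patch_labels if version_tuple(label) <= target]
    if not candidates:
        raise ValueError(f"no patch covers nginx {nginx_version}")
    return max(candidates, key=version_tuple)


if __name__ == "__main__":
    labels = ["1.16.0+", "1.20.1+"]     # ranges named in the text above
    print(pick_patch("1.22.0", labels))  # the 1.20.1+ patch
    print(pick_patch("1.18.0", labels))  # the 1.16.0+ patch
```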


Check which options the existing nginx was built with
# /usr/local/nginx/sbin/nginx -V
Example: --prefix=/usr/local/nginx --with-http_ssl_module --with-http_gunzip_module --with-pcre=/kingdee/cosmic/nginx-appstatic/nginx/pcre-8.34 --with-stream --with-stream_ssl_preread_module

Keeping the original build options, add the health check module, then build and install nginx
# ./configure  --prefix=/usr/local/nginx --with-http_ssl_module --with-http_gunzip_module --with-pcre=/kingdee/cosmic/nginx-appstatic/nginx/pcre-8.34 --add-module=/kingdee/cosmic/nginx-appstatic/nginx/nginx_upstream_check_module  --with-stream --with-stream_ssl_preread_module
# make 
# make install
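
Copying the old configure arguments by hand is error-prone, so a small helper can rebuild the command from the text that nginx -V prints (nginx writes it to stderr, on a line starting with "configure arguments:"). This is a sketch only; the function name is hypothetical and the module directory passed in is whatever path the module was unpacked to.

```python
import shlex


def configure_command(nginx_v_output: str, module_dir: str) -> list:
    """Rebuild ./configure from `nginx -V` output and append one --add-module."""
    marker = "configure arguments:"
    for line in nginx_v_output.splitlines():
        if line.startswith(marker):
            args = shlex.split(line[len(marker):])
            break
    else:
        raise ValueError("no 'configure arguments:' line found")
    extra = f"--add-module={module_dir}"
    # Avoid listing the same module twice if it was already compiled in.
    args = [arg for arg in args if arg != extra]
    return ["./configure", *args, extra]
```

The returned list can be joined with spaces for review, or passed to subprocess.run from inside the nginx source directory.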

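3.2 Configuration Example

# Excerpt from upstream.conf
# Check that the backend HTTP path /ierp/ returns a 2xx or 3xx status. Probe every 3 seconds; 2 consecutive successes mark the server healthy, 5 consecutive failures mark it unhealthy; response timeout is 1 second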
upstream next-ierp {
    server 172.18.11.61:30002;
    server 172.18.11.62:30002;
    server 172.18.11.63:30002;
    check interval=3000 rise=2 fall=5 timeout=1000 type=http port=30002;
    check_http_send "HEAD /ierp/ HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}



# Excerpt from ierp.conf
server {
    listen 80;  
    ......
    ......

    underscores_in_headers on;

    # Backend request routing
    location ~(/ierp/.*\.(do|jsp)$)|(/ierp/(kapi|kdctlres|attachment|excelpreview|kws)/)|(/ierp/?$)|(/ierp/(index\.html|mobile\.html|login\.html|login-mobile\.html)$)|(monitor/) {
        proxy_pass http://next-ierp;
        proxy_set_header Cookie $http_cookie;
        proxy_set_header X-Real-IP $remote_addr;
        .......
    }

......
......
......

    # Health check status page
    location /nstatus {
        check_status ;
        access_log off;
        #allow SOME.IP.ADD.RESS;
        #deny all;
    }
}

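Health check status page: http://<address>:<port>/nstatus

In the status page screenshot from the original post, one node is shown in the down state, meaning its check is failing. The page also shows the failure count, the check port, the check type, and other details.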
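
Because check_status also accepts a json format, the same information can be consumed by a script, for example to raise an alert when a node goes down. The sketch below assumes the JSON layout documented in the module's README (a top-level "servers" object holding a "server" list with "upstream", "name" and "status" fields); verify this against the output of your own build before relying on it. The function name and the sample payload are hypothetical.

```python
import json


def down_servers(status_json: str) -> list:
    """Return (upstream, name) pairs for every server not reported as 'up'."""
    payload = json.loads(status_json)
    servers = payload.get("servers", {}).get("server", [])
    return [
        (entry.get("upstream"), entry.get("name"))
        for entry in servers
        if entry.get("status") != "up"
    ]


if __name__ == "__main__":
    # Hypothetical payload shaped like the assumed JSON status output.
    sample = json.dumps({
        "servers": {
            "total": 2,
            "server": [
                {"upstream": "next-ierp", "name": "172.18.11.61:30002", "status": "up"},
                {"upstream": "next-ierp", "name": "172.18.11.62:30002", "status": "down"},
            ],
        }
    })
    print(down_servers(sample))
```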

Appendix: Applying the Patch Manually

Error log from the failed patch run:
[root@kd-app-01 nginx-1.22.0]# patch p1 < ../nginx_upstream_check_module/check_1.20.1+.patch
patching file p1
Hunk #1 FAILED at 9.
Hunk #2 FAILED at 238.
Hunk #3 FAILED at 560.
3 out of 3 hunks FAILED -- saving rejects to file p1.rej
patching file p1
Hunk #1 FAILED at 9.
Hunk #2 FAILED at 208.
2 out of 2 hunks FAILED -- saving rejects to file p1.rej
patching file p1
Hunk #1 FAILED at 9.
Hunk #2 FAILED at 147.
Hunk #3 FAILED at 202.
3 out of 3 hunks FAILED -- saving rejects to file p1.rej
patching file p1
Hunk #1 FAILED at 9.
Hunk #2 FAILED at 104.
Hunk #3 FAILED at 174.
Hunk #4 FAILED at 241.
Hunk #5 FAILED at 358.
Hunk #6 FAILED at 392.
Hunk #7 FAILED at 457.
Hunk #8 FAILED at 551.
8 out of 8 hunks FAILED -- saving rejects to file p1.rej
patching file p1
Hunk #1 FAILED at 38.
1 out of 1 hunk FAILED -- saving rejects to file p1.rej
patch: **** Can't reopen file p1 : No such file or directory

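The log shows that the patch was not applied to the nginx source. Note that the command in this log was typed as patch p1, without the leading dash of -p1, so patch treated p1 as the name of the file to patch; that is why every section reads "patching file p1" and the rejects went to p1.rej. Rerunning with -p1 is worth trying first. If the patch still does not apply, it has to be handled by hand by writing the module hooks into the nginx source yourself. Start by inspecting the reject file p1.rej: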

# cat p1.rej
--- ngx_http_upstream_hash_module.c
+++ ngx_http_upstream_hash_module.c
@@ -9,6 +9,9 @@
#include <ngx_core.h>
#include <ngx_http.h>
+#if (NGX_HTTP_UPSTREAM_CHECK)
+#include "ngx_http_upstream_check_module.h"
+#endif
typedef struct {
     uint32_t                            hash;
@@ -238,6 +241,14 @@
             goto next;
         }
+#if (NGX_HTTP_UPSTREAM_CHECK)
+        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, pc->log, 0,
+                       "get hash peer, check_index: %ui", peer->check_index);
+        if (ngx_http_upstream_check_peer_down(peer->check_index)) {
+            goto next;
+        }
+#endif
+
         if (peer->max_fails
             && peer->fails >= peer->max_fails
             && now - peer->checked <= peer->fail_timeout)
@@ -560,6 +571,15 @@
                 continue;
             }
......
(remaining content omitted)
......

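How to read the reject file:

(1) The line +++ ngx_http_upstream_hash_module.c names the file that the diff applies to.

(2) A header such as @@ -9,6 +9,9 @@ marks a hunk that starts at line 9 (6 lines before the change, 9 lines after it).

(3) Lines beginning with + hold the content (everything to the right of the +) that must be added to the nginx source.

Following this approach, copy the + lines into the corresponding source files, using the hunk line numbers and the surrounding context lines to decide where each addition belongs.

For example:

Locate the source file to modify: # find nginx-1.22.0 -name ngx_http_upstream_hash_module.c

Edit the source file and add the patch content: # vi nginx-1.22.0/src/http/modules/ngx_http_upstream_hash_module.c

Judging from the patch log above, five files are involved: ngx_http_upstream_hash_module.c, ngx_http_upstream_ip_hash_module.c, ngx_http_upstream_least_conn_module.c, ngx_http_upstream_round_robin.c, and ngx_http_upstream_round_robin.h.

In total, 17 patch fragments were added. This count is for reference only.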
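
When working through the fragments by hand, it helps to list what each hunk wants to add and roughly where. The following sketch parses unified-diff text such as the p1.rej content shown earlier and groups the + lines by hunk. It is a reading aid with a hypothetical function name, not a replacement for patch: it does not apply anything, and an added line whose own content starts with "++" would be skipped by the simple header filter.

```python
import re

# Matches hunk headers such as "@@ -9,6 +9,9 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines_by_hunk(rej_text: str) -> list:
    """Return [(new_start_line, [added lines]), ...], one entry per hunk."""
    hunks = []
    for line in rej_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            hunks.append((int(header.group(2)), []))
        elif hunks and line.startswith("+") and not line.startswith("+++"):
            hunks[-1][1].append(line[1:])
    return hunks


if __name__ == "__main__":
    with open("p1.rej", encoding="utf-8") as handle:
        for start, lines in added_lines_by_hunk(handle.read()):
            print(f"near line {start}: {len(lines)} line(s) to add")
```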