Does the JDK handle connection reset differently on Mac and Linux?

I ran into a case where the JDK (1.8.0_51-b16 on both systems) behaves differently on Mac and Linux. The scenario is how the client handles the Redis server closing a connection: redis-server is configured with an idle timeout, and once a connection stays idle longer than that, the server closes it. On my Mac the behavior is what I'd expect: issuing another request on the timed-out connection throws a SocketException "Connection reset". On Linux it does not.

A quick reproduction: set timeout to 3 seconds on the Redis server (in redis.conf, or for example via CONFIG SET timeout 3), then have the client sleep for 4 seconds (or longer) after establishing the connection so that the connection times out:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnRestTest {

    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress("localhost", 6379));

        InputStream is = socket.getInputStream();
        OutputStream os = socket.getOutputStream();

        // sleep past the server-side idle timeout (3 seconds) so redis closes the connection
        Thread.sleep(4000);

        // send a PING command in the redis protocol
        os.write("*1\r\n$4\r\nPING\r\n".getBytes());
        os.flush();

        // a normal reply is "+PONG\r\n", 7 bytes
        for (int i = 0; i < 7; i++) {
            int r = is.read();
            System.out.print(r == -1 ? String.valueOf(r) : String.valueOf((char) r));
        }

        socket.close();
    }
}

Running the code above on the Mac produces the following exception (comment out Thread.sleep(4000) and you get "+PONG\r\n" instead):

Exception in thread "main" java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:209)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:223)
    at com.wacai.common.redis.io.Test.main(Test.java:23)    

On Linux, however, it prints seven consecutive "-1"s; no "Connection reset" is thrown.

Capturing with tcpdump on each system. On the Mac:

 ➜  sudo tcpdump -i lo0 port 6379

23:13:43.398003 IP localhost.63864 > localhost.6379: Flags [S], seq 1658673677, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 1063188198 ecr 0,sackOK,eol], length 0
23:13:43.398064 IP localhost.6379 > localhost.63864: Flags [S.], seq 355134851, ack 1658673678, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 1063188198 ecr 1063188198,sackOK,eol], length 0
23:13:43.398075 IP localhost.63864 > localhost.6379: Flags [.], ack 1, win 12759, options [nop,nop,TS val 1063188198 ecr 1063188198], length 0
23:13:43.398085 IP localhost.6379 > localhost.63864: Flags [.], ack 1, win 12759, options [nop,nop,TS val 1063188198 ecr 1063188198], length 0

23:13:47.063640 IP localhost.6379 > localhost.63864: Flags [F.], seq 1, ack 1, win 12759, options [nop,nop,TS val 1063191852 ecr 1063188198], length 0
23:13:47.063671 IP localhost.63864 > localhost.6379: Flags [.], ack 2, win 12759, options [nop,nop,TS val 1063191852 ecr 1063191852], length 0

23:13:48.403144 IP localhost.63864 > localhost.6379: Flags [P.], seq 1:15, ack 2, win 12759, options [nop,nop,TS val 1063193184 ecr 1063191852], length 14
23:13:48.403255 IP localhost.6379 > localhost.63864: Flags [R], seq 355134853, win 0, length 0  

On Linux:

$ sudo tcpdump -i lo port 6379

00:26:13.303233 IP localhost.34609 > localhost.6379: Flags [S], seq 1094106697, win 43690, options [mss 65495,sackOK,TS val 8462190 ecr 0,nop,wscale 7], length 0
00:26:13.303272 IP localhost.6379 > localhost.34609: Flags [S.], seq 2755621045, ack 1094106698, win 43690, options [mss 65495,sackOK,TS val 8462190 ecr 8462190,nop,wscale 7], length 0
00:26:13.303298 IP localhost.34609 > localhost.6379: Flags [.], ack 1, win 342, options [nop,nop,TS val 8462190 ecr 8462190], length 0

00:26:17.037992 IP localhost.6379 > localhost.34609: Flags [F.], seq 1, ack 1, win 342, options [nop,nop,TS val 8465925 ecr 8462190], length 0
00:26:17.038450 IP localhost.34609 > localhost.6379: Flags [.], ack 2, win 342, options [nop,nop,TS val 8465926 ecr 8465925], length 0

00:26:18.305591 IP localhost.34609 > localhost.6379: Flags [P.], seq 1:15, ack 2, win 342, options [nop,nop,TS val 8467193 ecr 8465925], length 14
00:26:18.305630 IP localhost.6379 > localhost.34609: Flags [R], seq 2755621047, win 0, length 0     

Compared with Linux, the Mac trace has one extra ACK from server to client right after the connection is established. Roughly 3 seconds later the connection exceeds the idle timeout and the server sends a FIN, which the client ACKs. When the client program then writes to the socket, tcpdump shows the outgoing packet flagged P (PSH), and the server replies with a RST.

The connection-state transitions observed on Linux also match expectations:

$ netstat -antp | grep 6379
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:6379          127.0.0.1:34607         ESTABLISHED -
tcp6       0      0 127.0.0.1:34607         127.0.0.1:6379          ESTABLISHED 3683/java

tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:6379          127.0.0.1:34607         FIN_WAIT2   -
tcp6       1      0 127.0.0.1:34607         127.0.0.1:6379          CLOSE_WAIT  3683/java

Given that the server sends a RST, the Mac behavior of throwing the exception seems more in line with expectations. The difference may come down to platform-specific details in the JDK's networking implementation, or the tcpdump output above may already explain it and I just can't see how.
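Whatever the underlying cause, client code that wants the same behavior on both platforms can treat the two outcomes uniformly. Below is a minimal sketch, not part of any library: readReply is a hypothetical helper that returns null whenever the peer has closed the connection, whether that shows up as an EOF (the Linux case above) or as a SocketException (the Mac case).

import java.io.IOException;
import java.io.InputStream;
import java.net.SocketException;

public class PortableRead {

    // Read exactly len bytes, or return null if the peer has closed the connection.
    static byte[] readReply(InputStream is, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        try {
            while (off < len) {
                int r = is.read(buf, off, len - off);
                if (r == -1) {
                    return null; // orderly EOF, what the Linux run shows
                }
                off += r;
            }
        } catch (SocketException e) {
            return null; // "Connection reset", what the Mac run shows
        }
        return buf;
    }
}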

A quick way to check whether HttpURLConnection reuses the underlying socket

To test whether the JDK's own HttpURLConnection reuses the underlying socket, a REPL plus lsof gives a quick answer:

// start a local http server that returns a dozen or so characters
 ➜  curl "http://localhost:8080/sleep?time=1000"
{"code":"ok"}

// request this url several times in a row from the repl
scala> val is = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.getInputStream; for(i <- 1 to 15) is.read; is.close
is: java.io.InputStream = sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@1ba9117e

scala> val is = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.getInputStream; for(i <- 1 to 15) is.read; is.close
is: java.io.InputStream = sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@a82c5f1

Meanwhile, in another terminal, watch the sockets with lsof refreshing once per second; the client socket stays the same each time:

 ➜  /usr/sbin/lsof -Pan -iTCP -r 1 -p 43280
=======
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)  

This topic came up from the question of whether close or disconnect should be called when you are done with a URLConnection; I won't go into the JDK's keep-alive related parameters here.
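Roughly speaking, fully reading the response body and then closing the stream allows the JDK to return the underlying socket to its keep-alive cache, while disconnect() tells the implementation the connection will not be needed again and tears the socket down. A minimal sketch of the two usages against the same hypothetical local test server as above:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CloseVsDisconnect {
    public static void main(String[] args) throws Exception {
        // Variant 1: drain the body and close the stream; the socket can go
        // back into the keep-alive cache and be reused by the next request.
        HttpURLConnection c1 = (HttpURLConnection)
                new URL("http://localhost:8080/sleep?time=100").openConnection();
        try (InputStream in = c1.getInputStream()) {
            while (in.read() != -1) { /* drain */ }
        }

        // Variant 2: call disconnect(); as the tcpdump further below shows,
        // each request then comes from a new client port, i.e. no socket reuse.
        HttpURLConnection c2 = (HttpURLConnection)
                new URL("http://localhost:8080/sleep?time=100").openConnection();
        InputStream in2 = c2.getInputStream();
        while (in2.read() != -1) { /* drain */ }
        c2.disconnect();
    }
}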

Verifying the disconnect method:

// run this several times
scala> val conn = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.asInstanceOf[java.net.HttpURLConnection]; val is=conn.getInputStream; for(i <- 1 to 15) is.read; conn.disconnect

lsof is no longer useful here: its minimum refresh interval is 1 second, and because each connection is closed immediately there is no chance to catch it, so tcpdump is needed instead:

 ➜  sudo tcpdump -i lo0  -s 1024 -l -A  port 8080

04:59:57.066577 IP localhost.57355 > localhost.http-alt: Flags [S]
 ...
 -`=.-`=.GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

...

05:00:05.407691 IP localhost.57356 > localhost.http-alt: Flags [P.], seq 1:168, ack 1, win 12759, options [nop,nop,TS val 761290281 ecr 761290281], length 167: HTTP: GET /sleep?time=100 HTTP/1.1
E...LF@.@.........................1........
-`^)-`^)GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive 

... 

05:00:07.045830 IP localhost.57357 > localhost.http-alt: Flags [P.], seq 1:168, ack 1, win 12759, options [nop,nop,TS val 761291915 ecr 761291915], length 167: HTTP: GET /sleep?time=100 HTTP/1.1
E.....@.@................l.;.\.,..1........
-`d.-`d.GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

Three connections are visible, and the client socket port is different every time.

Simulating Tomcat BIO mode disabling keep-alive when thread-pool usage exceeds 75%

Let's simulate the scenario where, in BIO mode, keep-alive is automatically turned off once thread usage exceeds 75%.

We can observe this with curl. By default curl enables keep-alive, but note that curl only reuses a socket when the same process hits the same address multiple times; two separate curl invocations will not reuse it. For example:

$ curl http://localhost:7001/main
$ curl http://localhost:7001/main

Running curl twice in a row like this does not reuse the socket: the socket is closed when the process exits, and the next process creates a fresh connection. tcpdump shows that the two runs above use different sockets:

$ sudo tcpdump -l -i lo0 port 7001

23:43:19.236948 IP6 localhost.62625 > localhost.afs3-callback
......
23:43:26.071504 IP6 localhost.62626 > localhost.afs3-callback
......

If the same curl process accesses the same address multiple times, the socket is reused, which the -v flag makes visible:

$ curl -v  http://localhost:7001/main  http://localhost:7001/main
* Hostname was NOT found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 7001 (#0)
> GET /main HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:7001
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Transfer-Encoding: chunked
< Date: Mon, 18 Aug 2014 15:49:28 GMT
<
* Connection #0 to host localhost left intact
ok
* Found bundle for host localhost: 0x7fa7d8c08c50
* Re-using existing connection! (#0) with host localhost
* Connected to localhost (::1) port 7001 (#0)
> GET /main HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:7001
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Transfer-Encoding: chunked
< Date: Mon, 18 Aug 2014 15:49:28 GMT
<
* Connection #0 to host localhost left intact
ok

Note the line "Re-using existing connection!" at the start of the second request, indicating the previous socket was reused; tcpdump likewise shows the same client port connecting to Tomcat.

By default the thread pool's maximum size is 200 threads, and in BIO mode the server stops offering keep-alive to new connections once thread usage exceeds 75% (the connector's disableKeepAlivePercentage attribute, which defaults to 75). First, let's establish 151 connections (with keep-alive enabled, as is the default):

for i in {1..151}; do
    ( { echo -ne "GET /main HTTP/1.1\nhost: localhost:7001\n\n"; sleep 20; } | telnet localhost 7001 ) &
done

The zsh snippet above simulates 151 connections (each task runs as a background subprocess inside the for loop, which you can also see with the jobs command). Each one opens a new socket and, after the server has responded, keeps the connection open for 20 seconds, which is also the server's default keep-alive timeout. Tomcat keeps all 151 connections alive, and in BIO mode that ties up 151 worker threads: even though the request on a socket has been handled and no further request arrives, the thread is not released and stays blocked on the socket. At this point we are at 151/200 = 75.5%, just past the 75% threshold, so any socket established from now on no longer gets keep-alive.
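For reference, the same 151 held connections can be set up from Java with plain sockets; a rough sketch under the same assumptions (localhost:7001, a 20-second hold) as the zsh snippet above:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HoldKeepAliveConnections {
    public static void main(String[] args) throws Exception {
        List<Socket> sockets = new ArrayList<>();
        // Open 151 keep-alive connections and send one request on each,
        // tying up 151 of the 200 BIO worker threads.
        for (int i = 0; i < 151; i++) {
            Socket s = new Socket("localhost", 7001);
            OutputStream os = s.getOutputStream();
            os.write("GET /main HTTP/1.1\r\nHost: localhost:7001\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            os.flush();
            sockets.add(s);
        }
        // Hold the connections for 20 seconds (the keep-alive timeout used
        // in the test above), then close them.
        Thread.sleep(20000);
        for (Socket s : sockets) {
            s.close();
        }
    }
}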

Now, right at this threshold, run curl again to simulate a new connection:

$  curl -v  http://localhost:7001/main http://localhost:7001/main

* Hostname was NOT found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 7001 (#0)
> GET /main HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:7001
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Transfer-Encoding: chunked
< Date: Mon, 18 Aug 2014 15:22:50 GMT
< Connection: close
<
* Closing connection 0
ok
* Hostname was found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 7001 (#1)
> GET /main HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:7001
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Transfer-Encoding: chunked
< Date: Mon, 18 Aug 2014 15:22:50 GMT
< Connection: close
<
* Closing connection 1
ok

Note that this curl run made two requests, yet the "Re-using existing connection!" line never appears: after each request the server explicitly closed the connection, as shown by the Connection: close header. This confirms that once usage passes 75%, newly established connections no longer get keep-alive.
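If you would rather check this from code than from curl's output, the response header is enough. A small sketch against the same local Tomcat; in the traces above the header is simply absent while keep-alive is still in effect:

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckKeepAlive {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:7001/main").openConnection();
        // Tomcat answers with "Connection: close" once keep-alive is disabled.
        System.out.println("Connection header: " + conn.getHeaderField("Connection"));
        conn.disconnect();
    }
}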