Tag Archives: java

Problems with OneApm

1) We did not use its big-data edition. With the regular edition, if the server side cannot keep up during load testing, the agent buffers large amounts of data and ends up in full GC; in our case it was holding 1 GB of memory.

2) Its interception of HttpURLConnection can trigger a NullPointerException, terminating the business thread:

java.lang.NullPointerException
 at com.blueware.monitor.bridge.reflect.ClassReflection$2.run(ClassReflection.java:30) ~[?:?]
 at com.blueware.monitor.bridge.reflect.ClassReflection$2.run(ClassReflection.java:28) ~[?:?]
 at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_65]
 at com.blueware.monitor.bridge.reflect.ClassReflection.loadClass(ClassReflection.java:28) ~[?:?]
 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441) ~[?:1.8.0_65]
 at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2979) ~[?:1.8.0_65]
 at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:489) ~[?:1.8.0_65]
 at com...

Does the JDK handle connection reset differently on Mac and Linux?

I hit a case where the JDK (1.8.0_51-b16 on both systems) behaves differently on Mac and Linux. It concerns how the client handles the Redis server closing a connection: redis-server is configured with an idle timeout, and once a connection has been idle longer than that, the server actively closes it. On my Mac the behavior is as expected: issuing another request on the timed-out connection throws a SocketException "Connection reset". On Linux it does not.

A simple reproduction: configure timeout to 3 seconds on the Redis server, and have the client sleep 4 seconds (or longer) after establishing the connection so that it times out:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnRestTest {

    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress("localhost", 6379));

        InputStream is = socket.getInputStream();
        OutputStream os = socket.getOutputStream();

        // Sleep past the server-side idle timeout (3 seconds) so the server closes the connection.
        Thread.sleep(4000);

        // Send a PING command in the Redis protocol.
        os.write("*1\r\n$4\r\nPING\r\n".getBytes());
        os.flush();

        // A normal reply is "+PONG\r\n", i.e. 7 bytes.
        for (int i = 0; i < 7; i++) {
            int r = is.read();
            System.out.print( r == -1 ? r : (char) r);
        }

        socket.close();
    }
}

Running the code above on the Mac produces the following exception (if you comment out Thread.sleep(4000) you get "+PONG\r\n" instead):

Exception in thread "main" java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:209)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:223)
    at com.wacai.common.redis.io.Test.main(Test.java:23)    

On Linux, however, it prints seven consecutive "-1"s; no "Connection reset" is thrown.

Capturing with tcpdump on each system, on the Mac:

 ➜  sudo tcpdump -i lo0 port 6379

23:13:43.398003 IP localhost.63864 > localhost.6379: Flags [S], seq 1658673677, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 1063188198 ecr 0,sackOK,eol], length 0
23:13:43.398064 IP localhost.6379 > localhost.63864: Flags [S.], seq 355134851, ack 1658673678, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 1063188198 ecr 1063188198,sackOK,eol], length 0
23:13:43.398075 IP localhost.63864 > localhost.6379: Flags [.], ack 1, win 12759, options [nop,nop,TS val 1063188198 ecr 1063188198], length 0
23:13:43.398085 IP localhost.6379 > localhost.63864: Flags [.], ack 1, win 12759, options [nop,nop,TS val 1063188198 ecr 1063188198], length 0

23:13:47.063640 IP localhost.6379 > localhost.63864: Flags [F.], seq 1, ack 1, win 12759, options [nop,nop,TS val 1063191852 ecr 1063188198], length 0
23:13:47.063671 IP localhost.63864 > localhost.6379: Flags [.], ack 2, win 12759, options [nop,nop,TS val 1063191852 ecr 1063191852], length 0

23:13:48.403144 IP localhost.63864 > localhost.6379: Flags [P.], seq 1:15, ack 2, win 12759, options [nop,nop,TS val 1063193184 ecr 1063191852], length 14
23:13:48.403255 IP localhost.6379 > localhost.63864: Flags [R], seq 355134853, win 0, length 0  

On Linux:

$ sudo tcpdump -i lo port 6379

00:26:13.303233 IP localhost.34609 > localhost.6379: Flags [S], seq 1094106697, win 43690, options [mss 65495,sackOK,TS val 8462190 ecr 0,nop,wscale 7], length 0
00:26:13.303272 IP localhost.6379 > localhost.34609: Flags [S.], seq 2755621045, ack 1094106698, win 43690, options [mss 65495,sackOK,TS val 8462190 ecr 8462190,nop,wscale 7], length 0
00:26:13.303298 IP localhost.34609 > localhost.6379: Flags [.], ack 1, win 342, options [nop,nop,TS val 8462190 ecr 8462190], length 0

00:26:17.037992 IP localhost.6379 > localhost.34609: Flags [F.], seq 1, ack 1, win 342, options [nop,nop,TS val 8465925 ecr 8462190], length 0
00:26:17.038450 IP localhost.34609 > localhost.6379: Flags [.], ack 2, win 342, options [nop,nop,TS val 8465926 ecr 8465925], length 0

00:26:18.305591 IP localhost.34609 > localhost.6379: Flags [P.], seq 1:15, ack 2, win 342, options [nop,nop,TS val 8467193 ecr 8465925], length 14
00:26:18.305630 IP localhost.6379 > localhost.34609: Flags [R], seq 2755621047, win 0, length 0     

Compared with Linux, the Mac capture has one extra ack sent from the server to the client right after the connection is established. Three seconds later the connection exceeds the idle timeout and the server sends a fin to the client, which the client acks. When the client program then writes to the socket, tcpdump shows a packet flagged P, after which the server sends a reset.

On Linux the connection state transitions also look as expected:

$ netstat -antp | grep 6379
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:6379          127.0.0.1:34607         ESTABLISHED -
tcp6       0      0 127.0.0.1:34607         127.0.0.1:6379          ESTABLISHED 3683/java

tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:6379          127.0.0.1:34607         FIN_WAIT2   -
tcp6       1      0 127.0.0.1:34607         127.0.0.1:6379          CLOSE_WAIT  3683/java

Since the server does send a reset, the Mac behavior of throwing the exception seems the more reasonable one. The difference may come down to platform-specific details in the JDK's network implementation, or perhaps the tcpdump captures already explain it and I simply cannot see it.

Graceful shutdown and session draining

I recently noticed that the paid nginx plus edition has a feature called session draining. With nginx as a reverse proxy, shutting down a backend application gracefully means first telling nginx to stop routing new requests to that node, while connections that are already established are kept open; only after those connections have finished is the application stopped.

The free nginx/tengine can actually do the same. In an earlier post on gracefully taking a backend node offline behind an nginx reverse proxy, I tested a very long-running request on a backend Tomcat: even after tengine's health check had marked the node as unavailable, the request was not cut off by tengine but waited for the backend response to complete before returning to the client.

Graceful shutdown is not something the load balancer or reverse proxy layer can solve completely on its own; the backend application itself has many details to take care of. One example is shutting down actors in an orderly way; see the Governor pattern shared earlier, where all worker actors are created by a Governor and given a priority at creation time.

When the application shuts down, send a PoisonPill to the master; sample code:

// Trigger shutdown when your container or micro-container sees the application-shutdown event; in the simplest case, from a shutdown hook
def shutdown(): Unit = {
    gracefulStopActors
    system.terminate
    Await.result(system.whenTerminated, Duration.Inf)
    springContext.close
}

private def gracefulStopActors {
    master ! PoisonPill
    signal.acquire //wait
}

In the governor, once a Terminated message is received, stop the child actors in order:

...
case Terminated(actor) if actor == observable => {
  logger.info("===receive terminated message from: " + actor)
  // First stop all managed child actors in order, and pipe the result back to self
  stopAll(managedActors.toList.sorted) pipeTo self
}
...

// Stop all child actors sequentially
protected def stopAll(kids: List[OrderedActorRef]): Future[Any] = {
    kids match {
      case first :: Nil =>
        logger.info("===graceful stop: " + first.actor)
        gracefulStop(first.actor, stopTimeout).flatMap { _ =>
          Future { AllDead }
        }
      case first :: rest =>
        logger.info("===graceful stop: " + first.actor)
        gracefulStop(first.actor, stopTimeout).flatMap { _ =>
          stopAll(rest)
        }
      case Nil =>
        Future { AllDead }
    }
}

Another example: the application has a thread that keeps polling Redis for data; how do you terminate that thread gracefully on shutdown? The code below simulates the scenario. Assuming the application runs without any container and is stopped simply by a kill signal, the shutdown hook must block and wait (for some amount of time) until the worker thread has finished before it is safe to exit:

public class Test {
    static volatile boolean running = true;

    public static void main(String[] args) {
        Thread worker = new Thread() {
            public void run() {
                while (running) {
                    try {
                        System.out.println("i'm running.1");
                        Thread.sleep(1000);
                        System.out.println("i'm running.2");
                        Thread.sleep(1000);
                        System.out.println("i'm running.3");
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        };
        worker.start();

        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                System.out.println("receive shutdown signal.");
                running = false;

                // Block until the worker thread has exited its loop.
                while (worker.isAlive()) {
                    System.out.println("waiting for worker thread finish.");
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                    }
                }
                System.out.println("worker thread finished!");
            }
        });
    }
}

A quick way to verify whether HttpURLConnection reuses the underlying socket

To check whether the JDK's own HttpURLConnection reuses the underlying socket, a REPL plus lsof gives a quick answer:

// Start a local http server that returns a dozen or so characters
 ➜  curl "http://localhost:8080/sleep?time=1000"
{"code":"ok"}

// In the REPL, request this URL several times in a row
scala> val is = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.getInputStream; for(i <- 1 to 15) is.read; is.close
is: java.io.InputStream = sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@1ba9117e

scala> val is = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.getInputStream; for(i <- 1 to 15) is.read; is.close
is: java.io.InputStream = sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@a82c5f1

Meanwhile, watch the sockets from another terminal with lsof refreshing once per second; the client socket stays the same:

 ➜  /usr/sbin/lsof -Pan -iTCP -r 1 -p 43280
=======
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)
=======
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    43280 hongjiang   47u  IPv6 0x43acdfd2ea5b0c01      0t0  TCP 127.0.0.1:57304->127.0.0.1:8080 (ESTABLISHED)  

This came up from the question of whether to call close or disconnect when finishing with a URLConnection; I won't go into the JDK's keep-alive related parameters here.
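
To illustrate the difference, here is a minimal sketch against the same assumed local test server (my own example, not from the original post): closing the stream after fully reading the body lets the JDK return the socket to its keep-alive cache, while disconnect() signals that the connection will not be reused and normally closes the socket.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CloseVsDisconnect {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/sleep?time=100");

        // Variant 1: fully read the body, then close the stream.
        // The underlying socket goes back to the keep-alive cache and can be reused.
        HttpURLConnection c1 = (HttpURLConnection) url.openConnection();
        try (InputStream in = c1.getInputStream()) {
            while (in.read() != -1) { /* drain */ }
        }

        // Variant 2: call disconnect() afterwards.
        // This indicates no further requests are likely; the socket is normally
        // closed instead of being cached for reuse.
        HttpURLConnection c2 = (HttpURLConnection) url.openConnection();
        try (InputStream in = c2.getInputStream()) {
            while (in.read() != -1) { /* drain */ }
        }
        c2.disconnect();
    }
}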

Verifying the disconnect method:

// Run this several times
scala> val conn = new java.net.URL("http://localhost:8080/sleep?time=100").openConnection.asInstanceOf[java.net.HttpURLConnection]; val is=conn.getInputStream; for(i <- 1 to 15) is.read; conn.disconnect

lsof is no longer useful here: its minimum refresh interval is 1 second and each connection is closed immediately, so there is no chance to catch it. Use tcpdump instead:

 ➜  sudo tcpdump -i lo0  -s 1024 -l -A  port 8080

 04:59:57.066577 IP localhost.57355 > localhost.http-alt: Flags [S]
 ...
 -`=.-`=.GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

...

05:00:05.407691 IP localhost.57356 > localhost.http-alt: Flags [P.], seq 1:168, ack 1, win 12759, options [nop,nop,TS val 761290281 ecr 761290281], length 167: HTTP: GET /sleep?time=100 HTTP/1.1
E...LF@.@.........................1........
-`^)-`^)GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive 

... 

05:00:07.045830 IP localhost.57357 > localhost.http-alt: Flags [P.], seq 1:168, ack 1, win 12759, options [nop,nop,TS val 761291915 ecr 761291915], length 167: HTTP: GET /sleep?time=100 HTTP/1.1
E.....@.@................l.;.\.,..1........
-`d.-`d.GET /sleep?time=100 HTTP/1.1
User-Agent: Java/1.8.0_51
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

You can see three connections, and the client socket port changes each time.

A jmap bug

On this JDK version, jmap manages to compute the old generation usage percentage as a negative number.

$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


$ jmap -heap 15903
Attaching to process ID 15903, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 3221225472 (3072.0MB)
   NewSize                  = 348913664 (332.75MB)
   MaxNewSize               = 348913664 (332.75MB)
   OldSize                  = 2872311808 (2739.25MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 314048512 (299.5MB)
   used     = 242824544 (231.57553100585938MB)
   free     = 71223968 (67.92446899414062MB)
   77.32071152115505% used
Eden Space:
   capacity = 279183360 (266.25MB)
   used     = 207959392 (198.32553100585938MB)
   free     = 71223968 (67.92446899414062MB)
   74.48846234961854% used
From Space:
   capacity = 34865152 (33.25MB)
   used     = 34865152 (33.25MB)
   free     = 0 (0.0MB)
   100.0% used
To Space:
   capacity = 34865152 (33.25MB)
   used     = 0 (0.0MB)
   free     = 34865152 (33.25MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2872311808 (2739.25MB)
   used     = 17503507292142 MB
   free     = 92986414215918728 (8.86787550124347E10MB)
   -3.237336945265481E9% used

Why jstack stopped working

This morning I hit a problem where jstack produced no output. I vaguely remember running into the same thing at Alibaba a few years ago but forgot the details. Looking into it tonight, it turns out to be a bug in certain jdk1.6 releases (see this blog post). In short, the hsperfdata_$user directory is by convention always placed under the system temp directory, which on Linux is "/tmp".

Using strace while starting a JVM on an unaffected version shows that the hsperfdata_$user directory is not influenced by the java.io.tmpdir property:

$ strace -ff -e trace=file  java -Djava.io.tmpdir=/data/test -version 2>&1 | perl -ne 's/^[^"]+"(([^\\"]|\\[\\"nt])*)".*/$1/ && print'  | grep hsperfdata

/tmp/hsperfdata_hongjiang
/tmp/hsperfdata_hongjiang/17019

But jdk1.6.0_21 through 24 introduced a bug: the JVM property -Djava.io.tmpdir was used as the location of hsperfdata_$user. Some applications, Tomcat for example, set this property to their own path such as "/server/tomcat/temp", so on the affected versions the hsperfdata_$user directory also ends up under the directory Tomcat configured, and tools like jps and jstack can no longer find the target process. The workaround given in the bug report, running jstack with -J-Djava.io.tmpdir=$CATALINA_HOME/temp, did not work either. The best way to deal with this is simply to upgrade the JDK.

Addendum:

Although setting the property for jstack does not work, it turns out setting it for jinfo does; for example, jinfo -J-Djava.io.tmpdir=/server/tomcat/temp -sysprops $pid works on jdk1.6.0_24.

A thread-blocking problem with BoneCP

A service using the BoneCP connection pool ran into a situation where 200 dubbo threads were all blocked in com.jolbox.bonecp.BoneCPDataSource.maybeInit, all waiting on the same lock:

"DubboServerHandler-thread-203" daemon prio=10 tid=0x00007fc2400e7800 nid=0x658c waiting on condition [0x00007fc233dbb000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000c071f518> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
at com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:133)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
at org.mybatis.spring.SqlSessionUtils.getSqlSession(SqlSessionUtils.java:116)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:333)
at com.sun.proxy.$Proxy11.selectList(Unknown Source)
at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:189)
at org.apache.ibatis.binding.MapperMethod.executeForList(MapperMethod.java:100)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:70)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:38)

Nothing obviously wrong shows up in the BoneCP code; it is just a ReentrantReadWriteLock being acquired:

private void maybeInit() throws SQLException {
    this.rwl.readLock().lock();  // -------> line 133, where all the threads are blocked
    if (this.pool == null){
        this.rwl.readLock().unlock();
        this.rwl.writeLock().lock();
        if (this.pool == null){ //read might have passed, write might not
            try {
                if (this.getDriverClass() != null){
                    loadClass(this.getDriverClass());
                }
            }
            catch (ClassNotFoundException e) {
                throw new SQLException(PoolUtil.stringifyException(e));
            }


            logger.debug(this.toString());

            this.pool = new BoneCP(this);
        }

        this.rwl.writeLock().unlock(); // Unlock write
    } else {
        this.rwl.readLock().unlock(); // Unlock read
    }
}

Normally a deadlock means some thread A holds lock X and then tries to acquire lock Y, which is held by thread B, while B in turn needs lock X. Here, however, all the threads are waiting on the same lock, and they stayed that way until the service became unavailable. Running jstack -F to check for deadlocks also reports "No deadlocks found." (Possibly because the read lock of a ReentrantReadWriteLock has no ownership, the JVM cannot track it for deadlock detection; see Doug Lea's reply in this bug report.) I also suspect it may be related to the JDK and OS versions, perhaps a bug in that particular release (or something else I am not aware of).
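
One scenario that would produce exactly this picture, offered only as a hypothesis rather than the confirmed cause: if some thread acquired the read lock and never released it (for example an exception thrown before unlock), then as soon as any thread queues for the write lock, every subsequent readLock().lock() call parks behind that queued writer. A minimal sketch:

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockLeakDemo {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

        // Simulate a leaked read lock: acquired but never released.
        rwl.readLock().lock();

        // A writer now queues behind the held read lock and waits forever.
        new Thread(() -> rwl.writeLock().lock(), "writer").start();
        Thread.sleep(500);

        // With a writer waiting at the head of the queue, new readers park as well,
        // matching the WAITING (parking) stacks in the thread dump above.
        new Thread(() -> {
            rwl.readLock().lock();
            System.out.println("reader acquired"); // never printed
        }, "reader").start();

        // The demo never exits on its own; kill it once the behavior is clear.
    }
}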

The JDK's HttpURLConnection has a logging switch

This afternoon a JUnit program would occasionally "hang" at startup without printing any log output. jstack showed that Spring initialization was blocked in sun.net.www.protocol.http.HttpURLConnection while accessing the network. The Spring jars on the classpath are supposed to bundle these dtd/xsd files, so why was it still going out over the network?

The JDK's built-in HttpURLConnection has a logging (JUL) switch that shows which URLs are being requested. Specify the configuration at runtime with:

-Djava.util.logging.config.file=/tmp/logging.properties

If it is not specified, the configuration is loaded from jre/lib/logging.properties by default. In the configuration file:

# Add an entry for HttpURLConnection
sun.net.www.protocol.http.HttpURLConnection.level = ALL

# And raise the console output level
java.util.logging.ConsoleHandler.level = ALL
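
The same switch can also be flipped in code, a rough sketch of my own (it assumes the JDK's platform loggers are redirected to java.util.logging once the application touches JUL, and the URL below is just an arbitrary example):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class HttpUrlConnectionLogging {
    // Keep a strong reference so the logger is not garbage collected.
    private static final Logger HTTP_LOGGER =
            Logger.getLogger("sun.net.www.protocol.http.HttpURLConnection");

    public static void main(String[] args) throws Exception {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        HTTP_LOGGER.setLevel(Level.ALL);
        HTTP_LOGGER.addHandler(handler);

        // Any HttpURLConnection request made after this point should show up in the log.
        new java.net.URL("http://www.example.com/").openConnection().getInputStream().close();
    }
}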

With logging enabled, startup produces output like this:

FINEST: ProxySelector Request for http://www.springframework.org/dtd/spring-beans.dtd
Jul 25, 2015 4:49:45 PM sun.net.www.protocol.http.HttpURLConnection plainConnect0

FINEST: Proxy used: DIRECT
Jul 25, 2015 4:49:45 PM sun.net.www.protocol.http.HttpURLConnection writeRequests

FINE: sun.net.www.MessageHeader@531be3c55 pairs: 
{GET /dtd/spring-beans.dtd HTTP/1.1: null}
{User-Agent: Java/1.8.0_20}
{Host: www.springframework.org}
{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}
{Connection: keep-alive}
Jul 25, 2015 4:49:45 PM sun.net.www.protocol.http.HttpURLConnection getInputStream0

FINE: sun.net.www.MessageHeader@52af6cff16 pairs: {null: HTTP/1.1 200 OK}...

The log confirms that it really was reaching the Spring website. Following the URL, I found the XML configuration containing the spring-beans.dtd reference; this is a very old project, and the declaration reads:

<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">

To avoid instability caused by springframework.org (or the GFW), switch to the newer style:

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:p="http://www.springframework.org/schema/p" 
    xmlns:c="http://www.springframework.org/schema/c"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
    http://www.springframework.org/schema/beans/spring-beans-4.1.xsd
    http://www.springframework.org/schema/context
    http://www.springframework.org/schema/context/spring-context-4.1.xsd ">

fastjson and AWT

While simulating some JSON parsing in the REPL tonight, I noticed that calling fastjson made a Java application icon pop up in my dock, as shown below:

That normally only happens when a program uses awt/swing, so I was puzzled how fastjson could cause it. Starting the REPL with -J-verbose:class shows whether any AWT-related classes are actually being loaded:

It appears Wenshao (the fastjson author) anticipated this kind of parsing scenario and references those classes in the code.

If it bothers you, add java.awt.headless=true when starting the REPL.

Blocking caused by random numbers, revisited

Java's random number implementations have plenty of pitfalls. Here is a record of a problem we ran into with SecureRandom.getInstanceStrong(), the strengthened implementation added in jdk1.8.

Back when I maintained ali-tomcat, I had already seen a poorly chosen JVM random number algorithm make Tomcat's session ID generation extremely slow; see the two earlier posts on JVM random numbers and entropy pool strategy, and on slow apache-tomcat startup in Docker. I did not dig very deep at the time and assumed that using -Djava.security.egd=file:/dev/./urandom was enough to avoid it. After random numbers blocked all threads again in this project, it became clear there are quite a few rules in this area.

This project runs on jdk1.8, and the startup parameters include:

-Djava.security.egd=file:/dev/./urandom

The random number API used is the one newly added in Java 8:

SecureRandom.getInstanceStrong();

When the incident occurred, threads were blocked at:

"DubboServerHandler-xxx:20880-thread-1789" #28440 daemon prio=5 os_prio=0 tid=0x0000000008ffd000 nid=0x5712 runnable [0x000000004cbd7000]
java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:246)
    at sun.security.provider.NativePRNG$RandomIO.readFully(NativePRNG.java:410)
    at sun.security.provider.NativePRNG$RandomIO.implGenerateSeed(NativePRNG.java:427)
    - locked <0x00000000c03a3c90> (a java.lang.Object)
    at sun.security.provider.NativePRNG$RandomIO.access$500(NativePRNG.java:329)
    at sun.security.provider.NativePRNG$Blocking.engineGenerateSeed(NativePRNG.java:272)
    at java.security.SecureRandom.generateSeed(SecureRandom.java:522)

Because this code path holds a lock, locked <0x00000000c03a3c90>, other threads reaching the same point wait on that lock:

"DubboServerHandler-xxx:20880-thread-1790" #28441 daemon prio=5 os_prio=0 tid=0x0000000008fff000 nid=0x5713 waiting for monitor entry [0x000000004ccd8000]
java.lang.Thread.State: BLOCKED (on object monitor)
    at sun.security.provider.NativePRNG$RandomIO.implGenerateSeed(NativePRNG.java:424)
    - waiting to lock <0x00000000c03a3c90> (a java.lang.Object)
    at sun.security.provider.NativePRNG$RandomIO.access$500(NativePRNG.java:329)
    at sun.security.provider.NativePRNG$Blocking.engineGenerateSeed(NativePRNG.java:272)
    at java.security.SecureRandom.generateSeed(SecureRandom.java:522)

Checking the NativePRNG$Blocking code, its documentation says:

A NativePRNG-like class that uses /dev/random for both seed and random material. Note that it does not respect the egd properties, since we have no way of knowing what those qualities are.

Oddly, -Djava.security.egd=file:/dev/./urandom had no effect and /dev/random was still used as the entropy source; over time, or under frequent calls, the entropy pool easily runs dry and blocks. So I looked at the documentation of SecureRandom.getInstanceStrong():

Returns a SecureRandom object that was selected by using the algorithms/providers specified in the securerandom.strongAlgorithms Security property.

It turns out it has its own algorithm selection, defined in jre/lib/security/java.security with the default:

securerandom.strongAlgorithms=NativePRNGBlocking:SUN

If this value is changed to NativePRNGNonBlocking:SUN, the logic in NativePRNG$NonBlocking is used instead, with /dev/urandom as the entropy source, and the blocking problem disappears. But this is a JDK system file, and modifying it or pointing to a replacement is awkward; ideally it would be settable via a system property, but unlike securerandom.source, which can be configured with -Djava.security.egd=xxx, there is simply no corresponding system property for it. So the fix was a code change: instead of the new SecureRandom.getInstanceStrong method, use SecureRandom.getInstance("NativePRNGNonBlocking").
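
For reference, a minimal sketch of that code change (the fallback handling is my own addition, not part of the original fix):

import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class NonBlockingRandom {
    public static SecureRandom create() {
        try {
            // Backed by NativePRNG$NonBlocking, i.e. /dev/urandom: never blocks on the entropy pool.
            return SecureRandom.getInstance("NativePRNGNonBlocking");
        } catch (NoSuchAlgorithmException e) {
            // Not available on every platform (e.g. Windows); fall back to the default.
            return new SecureRandom();
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        create().nextBytes(buf);
        System.out.println("got " + buf.length + " random bytes without blocking");
    }
}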

On the relationship between the two SecureRandom implementations, SHA1PRNG and NativePRNG, and the securerandom.source property, I found an article that explains it very clearly: Using the SecureRandom Class

On Linux:

1) when this value is “file:/dev/urandom” then the NativePRNG algorithm is registered by the Sun crypto provider as the default implementation; the NativePRNG algorithm then reads from /dev/urandom for nextBytes but /dev/random for generateSeed

2) when this value is “file:/dev/random” then the NativePRNG algorithm is not registered by the Sun crypto provider, but the SHA1PRNG system uses a NativeSeedGenerator which reads from /dev/random.

3) when this value is anything else then the SHA1PRNG is used with a URLSeedGenerator that reads from that source.

4) when the value is undefined, then SHA1PRNG is used with ThreadedSeedGenerator

5) when the code explicitly asks for “SHA1PRNG” and the value is either “file:/dev/urandom” or “file:/dev/random” then (2) also occurs

6) when the code explicitly asks for “SHA1PRNG” and the value is some other “file:” url, then (3) occurs

7) when the code explicitly asks for “SHA1PRNG” and the value is undefined then (4) occurs

As for why, in the SHA1PRNG case, urandom cannot be set directly as file:/dev/urandom but has to be written in the roundabout form file:///dev/urandom or file:/dev/./urandom, see here:

In SHA1PRNG, there is a SeedGenerator which does various things depending on the configuration.

  1. If java.security.egd or securerandom.source point to “file:/dev/random” or “file:/dev/urandom”, we will use NativeSeedGenerator, which calls super() which calls SeedGenerator.URLSeedGenerator(/dev/random). (A nested class within SeedGenerator.) The only things that changed in this bug was that urandom will also trigger use of this code path.

  2. If those properties point to another URL that exists, we’ll initialize SeedGenerator.URLSeedGenerator(url). This is why “file:///dev/urandom”, “file:/./dev/random”, etc. will work.
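
A quick way to see which of the cases above applies to a given JVM and egd setting is to print the algorithm the default SecureRandom ends up with (a small sketch of my own, not part of the quoted article):

import java.security.SecureRandom;

public class ShowRngAlgorithm {
    public static void main(String[] args) throws Exception {
        // On Linux with -Djava.security.egd=file:/dev/./urandom this typically
        // prints "NativePRNG"; with an unrecognized value it falls back to "SHA1PRNG".
        System.out.println("default:  " + new SecureRandom().getAlgorithm());
        System.out.println("explicit: " + SecureRandom.getInstance("SHA1PRNG").getAlgorithm());
    }
}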