
Job control and the foreground process group

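This post answers my earlier question, "Puzzled by the SIGTTIN signal?". Why anyone would use such an odd invocation was covered in another post, "Locating a command precisely in the shell", so the point here is not how to work around the problem but what mechanism causes SIGTTIN to be sent. A colleague was also curious and asked on stackoverflow; the answer pointed straight at the initialize_job_control function in jobs.c of the bash source and said the SIGTTIN comes from the logic there. If, like me, you have only a shallow understanding of the shell and of Linux system calls, that code is not easy to follow, so this post walks through the whole story in more detail.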

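When I first hit the problem, strace showed that the SIGTTIN signal was responsible: its default action is to stop the process, which is why ps reported the state T. The Linux Programming Interface (Chinese edition: 《Linux/UNIX系统编程手册》) says this about SIGTTIN: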

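Only processes in the foreground job may read from the controlling terminal. This restriction prevents multiple jobs from competing for terminal input. If a background job tries to read from the terminal, it is sent a SIGTTIN signal. The default action of SIGTTIN is to stop the job.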

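But our script has no background process, and neither of the two processes reads from the terminal, so this explanation did not fit. I could not find any other documented cause of SIGTTIN online either, and was stuck for a long time. Intuition said the problem had to do with job control, and indeed the script runs fine once job control is explicitly enabled in it: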

$ cat sleep.sh
#!/bin/bash
set -m
bash -ic 'sleep 3'
bash -ic 'sleep 2'
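Whether job control is active in a given shell can be checked from the special parameter $-, which contains the letter m when monitor mode (set -m) is on. The following is a minimal sketch; job_control_state and JC_STATE are hypothetical names used only for illustration.

```shell
# Report whether job control (monitor mode, "set -m") is on in this shell.
# job_control_state is a hypothetical helper name; it relies only on the
# standard special parameter $-, which lists the active single-letter options.
job_control_state() {
    case $- in
        *m*) JC_STATE=on ;;
        *)   JC_STATE=off ;;
    esac
    echo "job control: $JC_STATE"
}

job_control_state
```

In a plain non-interactive script this reports job control as off; placing the call after set -m should report it as on.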

So the cause had to be some inconsistency in process state. Last weekend I read through the strace log for the failing script:

#!/bin/bash
bash -ic 'sleep 3'
bash -ic 'sleep 2'

Running strace -f -e verbose=all -t ./sleep.sh 2>log gives a more detailed log:

...
03:39:06 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f997f03ca10) = 9897
...
[pid  9897] 03:39:06 execve("/usr/bin/bash", ["bash", "-ic", "sleep 3"], [/* 30 vars */]) = 0
...
[pid  9897] 03:39:06 open("/dev/tty", O_RDWR|O_NONBLOCK) = 3
[pid  9897] 03:39:06 getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid  9897] 03:39:06 fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
[pid  9897] 03:39:06 dup2(3, 255)       = 255
[pid  9897] 03:39:06 close(3)           = 0
[pid  9897] 03:39:06 ioctl(255, TIOCGPGRP, [9891]) = 0
[pid  9897] 03:39:06 setpgid(0, 9897)   = 0 // the first child changes its own process group ID
...
03:39:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f997f03ca10) = 9922
...
[pid  9922] 03:39:09 execve("/usr/bin/bash", ["bash", "-ic", "sleep 2"], [/* 30 vars */]) = 0
...
[pid  9922] 03:39:09 access("/usr/bin/bash", R_OK) = 0
[pid  9922] 03:39:09 getpgrp()          = 9891
[pid  9922] 03:39:09 ioctl(2, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffd356e49c0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  9922] 03:39:09 open("/dev/tty", O_RDWR|O_NONBLOCK) = 3
[pid  9922] 03:39:09 getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid  9922] 03:39:09 fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
[pid  9922] 03:39:09 dup2(3, 255)       = 255
[pid  9922] 03:39:09 close(3)           = 0
[pid  9922] 03:39:09 ioctl(255, TIOCGPGRP, [9897]) = 0
[pid  9922] 03:39:09 rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x7f912a22b650}, {SIG_IGN, [], SA_RESTORER, 0x7f912a22b650}, 8) = 0
[pid  9922] 03:39:09 kill(0, SIGTTIN)   = 0
[pid  9896] 03:39:09 <... wait4 resumed> 0x7ffc5b5e6800, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid  9922] 03:39:09 --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_USER, si_pid=9922, si_uid=1000} ---
[pid  9896] 03:39:09 --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_USER, si_pid=9922, si_uid=1000} ---
[pid  9922] 03:39:09 --- stopped by SIGTTIN ---
[pid  9896] 03:39:09 --- stopped by SIGTTIN ---

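This confirms that the SIGTTIN signal was sent by the second bash -ic 'sleep 2' process. kill(0, SIGTTIN) sends the signal to the caller's own process group, so every process in that group receives it, which is why both this process and its parent sleep.sh end up in the stopped state.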

Both bash -ic children in the script are shells themselves, and they run job-control logic while initializing. Compare with the code of initialize_job_control in jobs.c (called during shell startup):

/* We can only have job control if we are interactive. */
if (interactive == 0)
{
  job_control = 0;
  original_pgrp = NO_PID;
  shell_tty = fileno (stderr);
}
else
{
  shell_tty = -1;

  /* If forced_interactive is set, we skip the normal check that stderr
    is attached to a tty, so we need to check here.  If it's not, we
    need to see whether we have a controlling tty by opening /dev/tty,
    since trying to use job control tty pgrp manipulations on a non-tty
    is going to fail. */ 
    // the bash "-i" option sets forced_interactive
  if (forced_interactive && isatty (fileno (stderr)) == 0)
    shell_tty = open ("/dev/tty", O_RDWR|O_NONBLOCK);

  /* Get our controlling terminal.  If job_control is set, or
    interactive is set, then this is an interactive shell no
     matter where fd 2 is directed. */
   if (shell_tty == -1)
    shell_tty = dup (fileno (stderr));/* fd 2 */

  shell_tty = move_to_high_fd (shell_tty, 1, -1);

  /* Compensate for a bug in systems that compiled the BSD
 rlogind with DEBUG defined, like NeXT and Alliant. */
  if (shell_pgrp == 0)
{
  shell_pgrp = getpid ();
  setpgid (0, shell_pgrp);
  tcsetpgrp (shell_tty, shell_pgrp);
}

  while ((terminal_pgrp = tcgetpgrp (shell_tty)) != -1)
{
  if (shell_pgrp != terminal_pgrp)
    {
      SigHandler *ottin;

      ottin = set_signal_handler(SIGTTIN, SIG_DFL);
      kill (0, SIGTTIN); // the second bash -ic invocation ends up here
      set_signal_handler (SIGTTIN, ottin);
      continue;
    }
  break;
}
if (terminal_pgrp == -1)
t_errno = errno;

  /* Make sure that we are using the new line discipline. */
  if (set_new_line_discipline (shell_tty) < 0)
{
  sys_error (_("initialize_job_control: line discipline"));
  job_control = 0;
}
  else
{
  original_pgrp = shell_pgrp;
  shell_pgrp = getpid ();

  // the first bash -ic 'sleep 3' reaches this setpgid and changes its own process group
  if ((original_pgrp != shell_pgrp) && (setpgid (0, shell_pgrp) < 0))
    {
      sys_error (_("initialize_job_control: setpgid"));
      shell_pgrp = original_pgrp;
    }

  job_control = 1;

  /* If (and only if) we just set our process group to our pid,
     thereby becoming a process group leader, and the terminal
     is not in the same process group as our (new) process group,
     then set the terminal's process group to our (new) process
     group.  If that fails, set our process group back to what it
     was originally (so we can still read from the terminal) and
     turn off job control.  */
  if (shell_pgrp != original_pgrp && shell_pgrp != terminal_pgrp)
    {
      if (give_terminal_to (shell_pgrp, 0) < 0)
    {
      t_errno = errno;
      setpgid (0, original_pgrp);
      shell_pgrp = original_pgrp;
      job_control = 0;
    }
   }
...

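The key lies in two variables, shell_pgrp and terminal_pgrp. shell_pgrp is the process group of the current process, and terminal_pgrp is the process group that currently owns the controlling terminal (the foreground process group). Both can be observed with ps, in the PGID and TPGID columns, so we can trace them: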

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
12413 12410 12410 12410    -1 S    ?        sshd: hongjiang@pts/0
12414 12413 12414 12414 12580 Ss   pts/0     \_ -bash
12579 12414 12579 12414 12580 S    pts/0         \_ /bin/bash ./sleep.sh
12580 12579 12580 12414 12580 S+   pts/0             \_ sleep 3

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
12413 12410 12410 12410    -1 S    ?        sshd: hongjiang@pts/0
12414 12413 12414 12414 12414 Ss+  pts/0     \_ -bash
12579 12414 12579 12414 12414 T    pts/0         \_ /bin/bash ./sleep.sh
12607 12579 12579 12414 12414 T    pts/0             \_ bash -ic sleep 2

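When the first bash -ic 'sleep 3' runs, the parent sleep.sh first clones a bash child (pid 12580). Because -i forces this child to run interactively, it loads .bashrc and similar files under $HOME, which may fork/clone several more processes (this is why the pid of the second bash -ic 'sleep 2' is not consecutive with the first). Once the startup files are loaded, bash does not fork/clone to run sleep 3; it runs sleep 3 in the current process (12580). The crucial detail is that both PGID and TPGID equal this process's own pid rather than the parent's, unlike the state seen the second time.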

A script runs with job control off by default, so normally its children are not placed in process groups of their own. Take this script:

$ cat a.sh
#!/bin/bash
/usr/bin/sleep 10

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
12668 12665 12665 12665    -1 S    ?        sshd: hongjiang@pts/2
12669 12668 12669 12669 12736 Ss   pts/2     \_ -bash
12736 12669 12736 12669 12736 S+   pts/2         \_ /bin/bash ./a.sh
12737 12736 12736 12669 12736 S+   pts/2             \_ /usr/bin/sleep 10

While this script runs, the PGID and TPGID of the sleep child are those of its parent a.sh; the child has not been put into a separate process group.

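In sleep.sh the child does get its own process group, because -i forces bash -ic 'sleep 3', even though it is launched from a non-interactive script, into a separate process group (see the setpgid call in initialize_job_control). At the same time TPGID, which identifies the foreground process group, is changed to the process group ID of bash -ic 'sleep 3'.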

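So why does the following bash -ic 'sleep 2' child not behave the same way? This is the strangest part. The enclosing sleep.sh runs non-interactively and does not expect anything it runs to create a foreground process group different from its own, so when that foreground child group exits, nothing updates TPGID. The following script verifies this: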

$ cat wait.sh
#!/bin/bash
bash -ic 'sleep 5'
sleep 4
sleep 3 &  # keep wait.sh from exiting immediately
wait

In wait.sh the first child forcibly changes TPGID; neither the exit of that child nor the foreground commands run afterwards update it:

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
13867 13864 13864 13864    -1 S    ?        sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss   pts/0     \_ -bash
14137 13868 14137 13868 14138 S    pts/0         \_ /bin/bash ./wait.sh
14138 14137 14138 13868 14138 S+   pts/0             \_ sleep 5

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
13867 13864 13864 13864    -1 S    ?        sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss   pts/0     \_ -bash
14137 13868 14137 13868 14138 S    pts/0         \_ /bin/bash ./wait.sh
14165 14137 14137 13868 14138 S    pts/0             \_ sleep 4

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
13867 13864 13864 13864    -1 S    ?        sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss   pts/0     \_ -bash
14137 13868 14137 13868 14138 S    pts/0         \_ /bin/bash ./wait.sh
14168 14137 14137 13868 14138 S    pts/0             \_ sleep 3

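Back in sleep.sh: when the bash -ic 'sleep 2' child on the second line initializes, TPGID still holds the value set by the previous bash -ic 'sleep 3'. This child also runs interactively because of -i, but before it gets as far as setpgid it hits the SIGTTIN logic: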

if (shell_pgrp != terminal_pgrp)
{
    SigHandler *ottin;

    ottin = set_signal_handler(SIGTTIN, SIG_DFL);
    kill (0, SIGTTIN);
    set_signal_handler (SIGTTIN, ottin);
    continue;
}

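This code concludes that the terminal is held by some other foreground process group and sends SIGTTIN to the current process group. In this scenario that conclusion is simply a misunderstanding!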
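The comparison between shell_pgrp and terminal_pgrp can be mimicked from the shell using the PGID and TPGID columns of ps. This is only a rough sketch: pgrp_relation and show_pgrp_relation are hypothetical helper names, and the second one assumes a ps that accepts -o pgid=,tpgid= (procps on Linux, BSD ps on macOS).

```shell
# Classify a process by comparing its process group (PGID) with the
# terminal's foreground process group (TPGID), as printed by ps.
# pgrp_relation is a hypothetical helper name used only for illustration.
pgrp_relation() {
    pgid=$1
    tpgid=$2
    if [ "$tpgid" -le 0 ]; then
        echo "no-terminal"      # ps prints -1 (or 0) when there is no controlling tty
    elif [ "$pgid" -eq "$tpgid" ]; then
        echo "foreground"       # shell_pgrp == terminal_pgrp: no SIGTTIN
    else
        echo "background"       # shell_pgrp != terminal_pgrp: the SIGTTIN branch
    fi
}

# Look up PGID and TPGID of a pid (default: the current shell) and classify it.
# Assumes a ps that accepts "-o pgid=,tpgid=".
show_pgrp_relation() {
    set -- $(ps -o pgid=,tpgid= -p "${1:-$$}")
    pgrp_relation "$1" "$2"
}
```

With the numbers from the failing run, pgrp_relation 12579 12580 reports background: the second bash -ic is still in the group of sleep.sh (12579) while the terminal still points at the group left behind by the first child (12580).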

When job control is explicitly enabled in sleep.sh:

$ cat sleep.sh
#!/bin/bash
set -m
bash -ic 'sleep 3'
bash -ic 'sleep 2'

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
12668 12665 12665 12665    -1 S    ?        sshd: hongjiang@pts/2
12669 12668 12669 12669 12874 Ss   pts/2     \_ -bash
12873 12669 12873 12669 12874 S    pts/2         \_ /bin/bash ./sleep.sh
12874 12873 12874 12669 12874 S+   pts/2             \_ sleep 3

$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
  PID  PPID  PGID   SID TPGID STAT TT       COMMAND
12668 12665 12665 12665    -1 S    ?        sshd: hongjiang@pts/2
12669 12668 12669 12669 12901 Ss   pts/2     \_ -bash
12873 12669 12873 12669 12901 S    pts/2         \_ /bin/bash ./sleep.sh
12901 12873 12901 12669 12901 S+   pts/2             \_ sleep 2

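With job control on, the script puts every child into its own process group and, when each (foreground) process finishes, sets TPGID back to the parent's process group ID, which avoids the SIGTTIN path in initialize_job_control.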

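Many shell problems are related to job control; for another example see the analysis of a tomcat process exiting unexpectedly. Job control enables many advanced tricks, but it also adds a great deal of complexity to the shell. This case is a textbook example of what not to do: avoid calling bash -ic to run commands from a non-interactive script.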

SIGTTIN?

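I have a script that always ran fine on my own Mac, where zsh is the default shell. Today a colleague ran it under bash and it misbehaved. I tracked the problem down and reduced it to the following: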

#!/bin/bash
zsh -ic "which mvn"
zsh -ic "which mvn"

The script above runs without problems, but replacing zsh with bash breaks it:

#!/bin/bash
bash -ic "which mvn"
bash -ic "which mvn"

On the Mac, the first bash -ic "which mvn" succeeds, but the second one hangs:

 ➜  ./b.sh
/usr/local/bin/mvn
[1]  + 24649 suspended (tty input)  ./b.sh

/tmp/dd   [23:43:21]
[jobs:1] ➜    

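Then it occurred to me that which is a builtin in zsh but an external command in bash, which might make a difference, so I made zsh run the external command as well: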

#!/bin/bash
zsh -ic "/usr/bin/which mvn"
zsh -ic "/usr/bin/which mvn"

This also blocks on the second invocation, and even Ctrl-C does not stop the script.

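The problem is odd: when a shell is invoked interactively twice to run a command (it must be an external command), the first invocation succeeds and the second is suspended. My guess is that the first interactive run changes some state in its environment that makes the second one hang. Tracing the script with strace on Linux suggests the SIGTTIN signal is responsible, but I do not understand why. The code and the strace output are below, in the hope that someone who understands can explain.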

$ cat b.sh
#!/bin/bash
bash -ic "ls"
bash -ic "ls"

[hongjiang@localhost dd]$ strace ./b.sh
execve("./b.sh", ["./b.sh"], [/* 31 vars */]) = 0
brk(0)                                  = 0x1175000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa545ddf000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32094, ...}) = 0
mmap(NULL, 32094, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa545dd7000
close(3)                                = 0
open("/lib64/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\316\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=174520, ...}) = 0
mmap(NULL, 2268928, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa545995000
mprotect(0x7fa5459ba000, 2097152, PROT_NONE) = 0
mmap(0x7fa545bba000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7fa545bba000
close(3)                                = 0
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\16\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=19512, ...}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa545791000
mprotect(0x7fa545794000, 2093056, PROT_NONE) = 0
mmap(0x7fa545993000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fa545993000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2107760, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa545dd6000
mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa5453d0000
mprotect(0x7fa545586000, 2097152, PROT_NONE) = 0
mmap(0x7fa545786000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7fa545786000
mmap(0x7fa54578c000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa54578c000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa545dd4000
arch_prctl(ARCH_SET_FS, 0x7fa545dd4740) = 0
mprotect(0x7fa545786000, 16384, PROT_READ) = 0
mprotect(0x7fa545993000, 4096, PROT_READ) = 0
mprotect(0x7fa545bba000, 16384, PROT_READ) = 0
mprotect(0x6dc000, 4096, PROT_READ)     = 0
mprotect(0x7fa545de0000, 4096, PROT_READ) = 0
munmap(0x7fa545dd7000, 32094)           = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
open("/dev/tty", O_RDWR|O_NONBLOCK)     = 3
close(3)                                = 0
brk(0)                                  = 0x1175000
brk(0x1196000)                          = 0x1196000
brk(0)                                  = 0x1196000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=106065056, ...}) = 0
mmap(NULL, 106065056, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa53eea9000
close(3)                                = 0
brk(0)                                  = 0x1196000
getuid()                                = 1000
getgid()                                = 1000
geteuid()                               = 1000
getegid()                               = 1000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
open("/proc/meminfo", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa545dde000
read(3, "MemTotal:        1017160 kB\nMemF"..., 1024) = 1024
close(3)                                = 0
munmap(0x7fa545dde000, 4096)            = 0
rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7fa545405650}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7fa545405650}, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
uname({sys="Linux", node="localhost.localdomain", ...}) = 0
stat("/tmp/dd", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getpid()                                = 2945
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26254, ...}) = 0
mmap(NULL, 26254, PROT_READ, MAP_SHARED, 3, 0) = 0x7fa545dd8000
close(3)                                = 0
getppid()                               = 2942
getpgrp()                               = 2942
rt_sigaction(SIGCHLD, {0x441090, [], SA_RESTORER|SA_RESTART, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7fa545405650}, 8) = 0
getrlimit(RLIMIT_NPROC, {rlim_cur=3909, rlim_max=3909}) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
open("./b.sh", O_RDONLY)                = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff54edec40) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "#!/bin/sh\n\nbash -ic \"ls\"\nbash -i"..., 80) = 39
lseek(3, 0, SEEK_SET)                   = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
fcntl(255, F_GETFD)                     = -1 EBADF (Bad file descriptor)
dup2(3, 255)                            = 255
close(3)                                = 0
fcntl(255, F_SETFD, FD_CLOEXEC)         = 0
fcntl(255, F_GETFL)                     = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(255, {st_mode=S_IFREG|0775, st_size=39, ...}) = 0
lseek(255, 0, SEEK_CUR)                 = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "#!/bin/sh\n\nbash -ic \"ls\"\nbash -i"..., 39) = 39
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("/home/hongjiang/.local/bin/bash", 0x7fff54ede900) = -1 ENOENT (No such file or directory)
stat("/home/hongjiang/bin/bash", 0x7fff54ede900) = -1 ENOENT (No such file or directory)
stat("/data/program/scala/bin/bash", 0x7fff54ede900) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/qt-3.3/bin/bash", 0x7fff54ede900) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/bash", 0x7fff54ede900) = -1 ENOENT (No such file or directory)
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", X_OK)           = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", R_OK)           = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", X_OK)           = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", R_OK)           = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
lseek(255, -14, SEEK_CUR)               = 25
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa545dd4a10) = 2946
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e500, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
wait4(-1, b.sh
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2946
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2946, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7fff54ede450, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn()                          = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, {0x43e500, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "bash -ic \"ls\"\n", 39)      = 14
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", X_OK)           = 0
stat("/usr/bin/bash", {st_mode=S_IFREG|0755, st_size=960384, ...}) = 0
geteuid()                               = 1000
getegid()                               = 1000
getuid()                                = 1000
getgid()                                = 1000
access("/usr/bin/bash", R_OK)           = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa545dd4a10) = 2967
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e500, [], SA_RESTORER, 0x7fa545405650}, {SIG_DFL, [], SA_RESTORER, 0x7fa545405650}, 8) = 0
wait4(-1, 0x7fff54edea00, 0, NULL)      = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTTIN {si_signo=SIGTTIN, si_code=SI_USER, si_pid=2967, si_uid=1000} ---
--- stopped by SIGTTIN ---

Using strace to narrow down why a JVM exited

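Today I ran into a JVM exiting while tomcat was starting up. It was not a JVM crash. Because the user's logging was misconfigured, some information never showed up, and all that was visible was pandora running its shutdown hook. So either the startup logic triggered System.exit, or the process was killed by the system or by a person.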

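Based on past experience I ruled out the process being killed by the kernel, whether by the OOM killer or because of a badly chosen ulimit -t; OS-level kills of that kind normally leave the JVM no chance to run its shutdown hooks. That narrows the signal down to three candidates: SIGTERM, SIGINT and SIGHUP (see here).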

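Even with the candidates narrowed down, it is still unclear whether the code called System.exit or someone (or another process) killed the JVM. Going straight for the heavy weapon, systemtap, requires installing the kernel debuginfo package, and without the right permissions you also have to find someone to do that for you. If the problem is easy to reproduce, strace can first narrow it down further: did the JVM call System.exit internally, or was it killed from outside?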

Here a scala REPL stands in for the java process. Attach strace to the JVM process and watch; if it is killed from outside, you see something like this:

$ sudo strace -p 1947
Process 1947 attached - interrupt to quit
futex(0x7fb7635959d0, FUTEX_WAIT, 1948, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
futex(0x7fb762762360, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn(0x7fb762762360)            = 202
futex(0x7fb7635959d0, FUTEX_WAIT, 1948, NULLPANIC: attached pid 1947 exited with 143
 <unfinished ... exit status 143>

The key pieces of information are SIGTERM and exit status 143 (128 + 15, where 15 is the signal number of SIGTERM).
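An exit status like this follows the usual convention of 128 plus the signal number, so it can be decoded mechanically. The following is a minimal sketch: decode_exit_status is a hypothetical helper, and the number-to-name mapping comes from the shell's own kill -l, so it is only as portable as that builtin.

```shell
# Interpret a process exit status using the 128 + signal number convention:
# values above 128 conventionally mean "terminated by signal (status - 128)".
# decode_exit_status is a hypothetical helper name for illustration.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l NUMBER prints the signal name without the SIG prefix.
        echo "signal $signum ($(kill -l "$signum"))"
    else
        echo "exit code $status"
    fi
}

decode_exit_status 143
decode_exit_status 130
decode_exit_status 0
```

A status above 128 is only a convention, since a program can also return such a value on its own, so the result is a hint to be confirmed against the signal lines in the trace.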

If the REPL is terminated with kill -2 or Ctrl-C, the trace shows SIGINT instead:

$ sudo strace -p 1813
Process 1813 attached - interrupt to quit
futex(0x7fb24d15a9d0, FUTEX_WAIT, 1814, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
futex(0x7fb24c327360, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn(0x7fb24c327360)            = 202
futex(0x7fb24d15a9d0, FUTEX_WAIT, 1814, NULLPANIC: attached pid 1813 exited with 130
 <unfinished ... exit status 130>

If the JVM itself calls System.exit, for example:

scala> System.exit(0)

then the trace shows no signal at all:

$ sudo strace -p 2131
Process 2131 attached - interrupt to quit
futex(0x7fc14adb49d0, FUTEX_WAIT, 2132, NULLPANIC: attached pid 2131 exited with 0
 <unfinished ... exit status 0>

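At this point we can tell whether the exit was triggered from outside or from inside. If it was internal, there is no need to bother with Systemtap; the cause can be found in the source code.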

The JVM and system signals (4)

The stop and cont signals

These two signals work on a JVM too. For example, to stop a JVM process:

scala> var a=0;

scala>  while(true) { Thread.sleep(2000); println(a); a=a+1 }
0
1
2
3

$ ps -ostat -p `pidof java`
STAT
S+

$ kill -s stop `pidof java`

The JVM process is now suspended and its state changes to T (TASK_STOPPED or TASK_TRACED), meaning stopped or being traced:

$ ps -ostat -p `pidof java`
STAT
T+

Send the cont signal to resume the process:

$ kill -s cont `pidof java`

The REPL resumes its output.
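The same stop and continue cycle can be tried without a JVM. This is a minimal sketch that uses a background sleep as a stand-in for the java process; proc_state is a hypothetical helper, and it assumes a ps that accepts -o stat= -p PID (procps on Linux, BSD ps on macOS).

```shell
# Print the first letter of the ps STAT column for a pid (S, T, R, ...).
# proc_state is a hypothetical helper; it assumes "ps -o stat= -p PID" works.
proc_state() {
    ps -o stat= -p "$1" | tr -d ' ' | cut -c1
}

sleep 30 &                 # stand-in for the JVM process
pid=$!

kill -s STOP "$pid"
sleep 1                    # give the kernel a moment to stop the process
state_stopped=$(proc_state "$pid")

kill -s CONT "$pid"
sleep 1
state_resumed=$(proc_state "$pid")

kill "$pid"                # clean up the stand-in process
echo "after STOP: $state_stopped, after CONT: $state_resumed"
```

The one-second pauses are a crude way to let the state change become visible to ps; a real script would poll instead.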

The JVM and system signals (3)

Some related options

1) -Xrs

man java describes this option; in outline it says the following:

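The option was added in Java 1.3.1. rs stands for "reduce signals": it reduces the JVM's use of operating-system signals. Java 1.3.0 added the shutdown hook facility, which lets user cleanup code (such as closing database connections) run when the JVM shuts down. To support shutdown hooks when the JVM terminates abnormally, the Sun/Oracle JVM catches signals: it uses SIGHUP, SIGINT and SIGTERM to initiate the running of shutdown hooks.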

The JVM uses a similar mechanism to implement the pre-1.2 feature of dumping thread stacks for debugging purposes. The Sun/Oracle JVM uses SIGQUIT to perform thread dumps.

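Applications often need to trap SIGINT or SIGTERM themselves, which can interfere with the JVM's own signal handlers. The -Xrs command-line option was added in Java 1.3.1 to avoid this. When it is used on Sun's JVM, the JVM leaves SIGINT, SIGTERM, SIGHUP and SIGQUIT alone and installs no handlers for them.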

Using -Xrs has two consequences to keep in mind:

1) SIGQUIT no longer produces a thread dump
2) Shutdown hooks are no longer run when these signals arrive

Let's try the -Xrs option by starting a REPL with scala -J-Xrs:

$ scala -J-Xrs

$ kill -3 `pidof java` 

This time no thread dump is produced, and the java process exits.

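Interestingly, jstack still shows all the threads of the scala REPL; that is, jstack's request still reaches the java process even with -Xrs set. This may be related to the attach mechanism; I will go into the details another time.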

2) -XX:-AllowUserSignalHandlers

Linux and Solaris only; disabled by default. Allows signal handlers to be installed for the java process. For signal handling see the classes sun.misc.Signal and sun.misc.SignalHandler.

3) -XX:+UseAltSigs

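Solaris only; enabled by default. To avoid conflicts with other applications that send signals, it lets the JVM use alternative signals in place of SIGUSR1 and SIGUSR2.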

The JVM and system signals (2)

Core dumps and thread stack dumps

For how processes react to signals, see here; many signals, such as SIGILL and SIGSEGV, cause a core dump.

For example, sending SIGILL to a java process makes the process exit and, if core dumps are enabled, produce a core dump:

$ kill -s ILL `pidof java`

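The java process prints the following to its standard error and writes an error report file, hs_err_pid8385.log, to the home directory (in this run core dumps were disabled, so no core file was actually written):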

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007fff89d1ba1a, pid=8385, tid=1287
#
# JRE version: Java(TM) SE Runtime Environment (7.0_60-b15) (build 1.7.0_60-ea-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libsystem_kernel.dylib+0x11a1a]  mach_msg_trap+0xa
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/hongjiang/hs_err_pid8385.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
/data/tools/scala/bin/scala: line 21:  8385 Abort trap: 6           "$@"

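For SIGQUIT (kill -3), the JVM catches the signal and dumps the thread stacks to standard error without producing a core dump file. The handling of this signal is reserved by the VM and cannot be changed by the user: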

scala> import sun.misc._

scala>  Signal.handle(
 |          new Signal("QUIT"),
 |          new SignalHandler(){ 
 |              def handle(sig:Signal){ println("down") }
 |          }
 |      )
java.lang.IllegalArgumentException: Signal already used by VM or OS: SIGQUIT
at sun.misc.Signal.handle(Signal.java:166)
... 38 elided

$ kill -s QUIT `pidof java`

The REPL prints the thread stacks and heap information, and the java process does not exit.

Signals that users can handle

For signals such as SIGSEGV (kill -11), SIGFPE (kill -8), SIGILL (kill -4) and SIGUSR1 (kill -10), the user cannot install a handler:

scala> Signal.handle(new Signal("USR1"), 
 |              new SignalHandler(){ def handle(sig:Signal){ println("down") }}
 |      )

java.lang.IllegalArgumentException: Signal already used by VM or OS: SIGUSR1
    at sun.misc.Signal.handle(Signal.java:166)
    ... 32 elided

For SIGINT (kill -2), SIGTERM (kill -15), SIGUSR2 (kill -12), SIGBUS (kill -7) and SIGPIPE (kill -13), the user can install a handler:

scala> Signal.handle(new Signal("INT"), 
 |              new SignalHandler(){ def handle(sig:Signal){ println("down") }}
 |      )

$ kill -s INT `pidof java`

scala> down // caught; the process does not exit

Signals that trigger shutdown hooks

SIGTERM (plain kill), SIGINT (kill -2) and SIGHUP (kill -1) trigger the shutdown hooks.

scala>  Runtime.getRuntime().addShutdownHook(
            new Thread() { override def run() { println("ok") } })

$ kill -s INT `pidof java`

scala> ok 

The JVM and system signals (1)

Below is a partial list of signals (POSIX.1-1990):

SIGHUP 1 Terminal line hangup
SIGINT 2 Interrupt program
SIGQUIT 3 Quit program
SIGILL 4 Illegal instruction
SIGABRT 6 Abort
SIGFPE 8 Floating-point exception
SIGKILL 9 Kill program
SIGSEGV 11 Invalid memory reference
SIGPIPE 13 Write on a pipe with no one to read it
SIGALRM 14 Alarm clock
SIGTERM 15 Termination signal
SIGUSR1 30,10,16 User-defined signal 1
SIGUSR2 31,12,17 User-defined signal 2
SIGCHLD 20,17,18 Child stopped or terminated
SIGCONT 19,18,25 Continue if stopped
SIGSTOP 17,19,23 Stop process

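Where a signal has several values, the value depends on the hardware architecture (see here): generally alpha and sparc use the first value, x86 and arm use the middle one, and mips uses the third; a "-" means the signal is absent on the corresponding architecture.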

For example, SIGUSR1 has three values, 30, 10 and 16; on x86 the second one, 10, is used.

$ /bin/kill -L | xargs -n2 | grep USR1
10 USR1

POSIX.1-2001 later added or changed some signals:

SIGTRAP 5 Trace trap
SIGBUS 10,7,10 Bus error (bad memory access)
SIGSYS 12,31,12 Bad system call
SIGURG 16,23,21 Urgent condition on socket (4.2BSD)
SIGXCPU 24,24,30 CPU time limit exceeded (4.2BSD)

Now look at how the JVM handles signals, see here (the mention "optional" means that the signal is not necessary when the -Xrs option is specified):

Signal Description
SIGSEGV, SIGBUS, SIGFPE, SIGPIPE, SIGILL Used in the implementation for implicit null check, and so forth.
SIGQUIT Thread dump support: To dump Java stack traces at the standard error stream. (Optional.)
SIGTERM, SIGINT, SIGHUP Used to support the shutdown hook mechanism ( java.lang.Runtime.addShutdownHook) when the VM is terminated abnormally. (Optional.)
SIGUSR1 Used in the implementation of the java.lang.Thread.interrupt method. (Configurable.) Not used starting with Solaris 10 OS. Reserved on Linux.
SIGUSR2 Used internally. (Configurable.) Not used starting with Solaris 10 OS.
SIGABRT The HotSpot VM does not handle this signal. Instead it calls the abort function after fatal error handling. If an application uses this signal then it should terminate the process to preserve the expected semantics.