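Last weekend I gave a talk on shell signals and how the JVM handles them. The slides are available here (reaching them from mainland China requires a VPN).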
A stall caused by zsh string substitution
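On my Mac, the first time I open iTerm2 after each boot, oh-my-zsh visibly stalls during startup. Quitting iTerm2 and launching it again does not reproduce the stall; it only happens on the first launch. Starting zsh with the -xv options and watching the trace shows that the stall occurs on line 4 of the git_compare_version function.
Looking at that function, line 4 turns out not to be a git or network operation at all. It is a string operation that uses zsh's built-in parameter expansion (the (s/./) flag, which splits the value on "."): INSTALLED_GIT_VERSION=(${(s/./)INSTALLED_GIT_VERSION[3]})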
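For readers more at home in bash, here is a rough equivalent of what that line computes. It is only an illustration: the function name version_fields is made up for this sketch, and it is not what oh-my-zsh runs.

```bash
# Split the version word of `git --version` output on "." and print the fields.
# bash arrays are 0-based, so the third word is index 2 (zsh uses index 3).
version_fields() {
    local -a words parts
    read -r -a words <<< "$1"                  # split the line into words
    IFS=. read -r -a parts <<< "${words[2]}"   # split e.g. "2.8.4" on "."
    printf '%s\n' "${parts[*]}"                # join the fields with spaces
}

version_fields "git version 2.8.4 (Apple Git-73)"   # prints: 2 8 4
```

The result matches the ( 2 8 4 ) array that shows up in the zsh trace further down.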
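That a pure string operation stalls is quite counterintuitive (my first guess was that the stall came from blocked network I/O). To measure it, put the same operation in a script and look at the timing: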
$ cat zsh-test.sh
#!/usr/bin/env zsh -xv
export PS4=$'%D{%M%S%.} %N:%i> '
INSTALLED_GIT_VERSION=($(command git --version 2>/dev/null));
INSTALLED_GIT_VERSION=(${(s/./)INSTALLED_GIT_VERSION[3]});
echo "$INSTALLED_GIT_VERSION"
Then invoke this zsh script from a wrapper script, run.sh:
$ cat run.sh
#!/usr/bin/env zsh -xv
export PS4=$'%D{%M%S%.} %N:%i> '
./zsh-test.sh
Reboot the system, then run run.sh from bash:
INSTALLED_GIT_VERSION=($(command git --version 2>/dev/null));
5649865 ./zsh-test.sh:4> INSTALLED_GIT_VERSION=5649867 ./zsh-test.sh:4> git --version
5649865 ./zsh-test.sh:4> INSTALLED_GIT_VERSION=( git version 2.8.4 '(Apple' 'Git-73)' )
INSTALLED_GIT_VERSION=(${(s/./)INSTALLED_GIT_VERSION[3]});
5650999 ./zsh-test.sh:5> INSTALLED_GIT_VERSION=( 2 8 4 )
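The trace shows that line 5 of zsh-test.sh, the string operation, took a little over one second (the trace timestamps advance from 49.865 s to 50.999 s). On a second run it drops to a few milliseconds. It is a puzzling problem. I emailed zsh-works@zsh.org and got no reply for several weeks, so I am recording it on the blog for now and will follow up later. The zsh version is 5.2 (x86_64-apple-darwin16.0.0).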
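To compare the first run after a reboot with later runs without reading xtrace timestamps, one could time the whole script from bash. A minimal sketch, assuming bash is available; elapsed is a hypothetical helper name, not something from oh-my-zsh:

```bash
# Print the wall-clock seconds a command takes, discarding the command's own output.
elapsed() {
    local TIMEFORMAT='%3R'                   # real time only, three decimals
    { time "$@" >/dev/null 2>&1; } 2>&1      # bash's `time` keyword reports on stderr
}

elapsed sleep 0.1
```

Running elapsed ./zsh-test.sh once right after boot and once more afterwards should show the same contrast as the trace, under the assumption that the stall is reproducible outside iTerm2 startup.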
Job control and the foreground process group
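This post answers my earlier one, "A puzzle about the SIGTTIN signal?". Why anyone would use such an odd construct is covered in another post, "Precisely locating a command in the shell". The point here is not how to work around that problem but the mechanism that causes SIGTTIN to be sent. A colleague was also curious and posted the question on Stack Overflow, where the answer quoted the initialize_job_control function from jobs.c in the bash source and pointed out that SIGTTIN comes from the logic there. If, like me, you have only a shallow grasp of the shell and Linux system calls, that code is not easy to follow, so here I explain the whole story in more detail.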
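When I first hit the problem, strace showed that a SIGTTIN signal was responsible: its default action is to stop the process, which ps reports as state T. The Linux Programming Interface describes SIGTTIN as follows:
Only processes in the foreground job may read from the controlling terminal. This restriction prevents multiple jobs from competing for terminal input. If a background job tries to read from the terminal, it is sent a SIGTTIN signal. The default action of SIGTTIN is to stop the job.
But our script has no background processes, and neither of the two processes reads from the terminal, so this explanation does not fit. I could not find any other documented cause of SIGTTIN online and was stuck here for a long time. Intuition said the problem had to do with job control, and indeed the script runs fine when job control is enabled explicitly: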
$ cat sleep.sh
#!/bin/bash
set -m
bash -ic 'sleep 3'
bash -ic 'sleep 2'
So the cause must be some inconsistency in process state. Last weekend I read through the strace log for the failing script:
#!/bin/bash
bash -ic 'sleep 3'
bash -ic 'sleep 2'
Running strace -f -e verbose=all -t ./sleep.sh 2>log produces a more detailed log:
...
03:39:06 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f997f03ca10) = 9897
...
[pid 9897] 03:39:06 execve("/usr/bin/bash", ["bash", "-ic", "sleep 3"], [/* 30 vars */]) = 0
...
[pid 9897] 03:39:06 open("/dev/tty", O_RDWR|O_NONBLOCK) = 3
[pid 9897] 03:39:06 getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid 9897] 03:39:06 fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
[pid 9897] 03:39:06 dup2(3, 255) = 255
[pid 9897] 03:39:06 close(3) = 0
[pid 9897] 03:39:06 ioctl(255, TIOCGPGRP, [9891]) = 0
[pid 9897] 03:39:06 setpgid(0, 9897) = 0 // the first child changes its process group ID
...
03:39:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f997f03ca10) = 9922
...
[pid 9922] 03:39:09 execve("/usr/bin/bash", ["bash", "-ic", "sleep 2"], [/* 30 vars */]) = 0
...
[pid 9922] 03:39:09 access("/usr/bin/bash", R_OK) = 0
[pid 9922] 03:39:09 getpgrp() = 9891
[pid 9922] 03:39:09 ioctl(2, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffd356e49c0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 9922] 03:39:09 open("/dev/tty", O_RDWR|O_NONBLOCK) = 3
[pid 9922] 03:39:09 getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid 9922] 03:39:09 fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
[pid 9922] 03:39:09 dup2(3, 255) = 255
[pid 9922] 03:39:09 close(3) = 0
[pid 9922] 03:39:09 ioctl(255, TIOCGPGRP, [9897]) = 0
[pid 9922] 03:39:09 rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x7f912a22b650}, {SIG_IGN, [], SA_RESTORER, 0x7f912a22b650}, 8) = 0
[pid 9922] 03:39:09 kill(0, SIGTTIN) = 0
[pid 9896] 03:39:09 <... wait4 resumed> 0x7ffc5b5e6800, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 9922] 03:39:09 --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_USER, si_pid=9922, si_uid=1000} ---
[pid 9896] 03:39:09 --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_USER, si_pid=9922, si_uid=1000} ---
[pid 9922] 03:39:09 --- stopped by SIGTTIN ---
[pid 9896] 03:39:09 --- stopped by SIGTTIN ---
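This confirms that the SIGTTIN signal is sent by the second bash -ic 'sleep 2' process. kill(0, SIGTTIN) sends the signal to the caller's own process group, so every process in that group receives it, which is why both this process and its parent, sleep.sh, end up stopped.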
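A full strace -f log is long, so it can help to keep only the calls that matter here. The filter below is a small sketch; pgrp_events is a hypothetical helper name, and the pattern list is my own selection rather than anything prescribed by strace.

```bash
# Keep only process-group changes, foreground-group queries and updates,
# group-wide kills, and stop notifications from an strace log.
pgrp_events() {
    grep -E 'setpgid\(|getpgrp\(\)|TIOCGPGRP|TIOCSPGRP|kill\(0, |stopped by SIG' "${1:-log}"
}
```

Called as pgrp_events log, it reduces the log above to the setpgid, getpgrp, TIOCGPGRP, kill(0, SIGTTIN) and "stopped by SIGTTIN" lines.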
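Both bash -ic children started by the script are themselves shells, and they run job-control logic during initialization. Here is the relevant code from initialize_job_control in jobs.c, which is called when the shell initializes: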
/* We can only have job control if we are interactive. */
if (interactive == 0)
{
job_control = 0;
original_pgrp = NO_PID;
shell_tty = fileno (stderr);
}
else
{
shell_tty = -1;
/* If forced_interactive is set, we skip the normal check that stderr
is attached to a tty, so we need to check here. If it's not, we
need to see whether we have a controlling tty by opening /dev/tty,
since trying to use job control tty pgrp manipulations on a non-tty
is going to fail. */
// bash's "-i" option sets forced_interactive
if (forced_interactive && isatty (fileno (stderr)) == 0)
shell_tty = open ("/dev/tty", O_RDWR|O_NONBLOCK);
/* Get our controlling terminal. If job_control is set, or
interactive is set, then this is an interactive shell no
matter where fd 2 is directed. */
if (shell_tty == -1)
shell_tty = dup (fileno (stderr));/* fd 2 */
shell_tty = move_to_high_fd (shell_tty, 1, -1);
/* Compensate for a bug in systems that compiled the BSD
rlogind with DEBUG defined, like NeXT and Alliant. */
if (shell_pgrp == 0)
{
shell_pgrp = getpid ();
setpgid (0, shell_pgrp);
tcsetpgrp (shell_tty, shell_pgrp);
}
while ((terminal_pgrp = tcgetpgrp (shell_tty)) != -1)
{
if (shell_pgrp != terminal_pgrp)
{
SigHandler *ottin;
ottin = set_signal_handler(SIGTTIN, SIG_DFL);
kill (0, SIGTTIN); // the second bash -ic ends up here
set_signal_handler (SIGTTIN, ottin);
continue;
}
break;
}
if (terminal_pgrp == -1)
t_errno = errno;
/* Make sure that we are using the new line discipline. */
if (set_new_line_discipline (shell_tty) < 0)
{
sys_error (_("initialize_job_control: line discipline"));
job_control = 0;
}
else
{
original_pgrp = shell_pgrp;
shell_pgrp = getpid ();
// the first bash -ic 'sleep 3' reaches this setpgid, which changes its process group
if ((original_pgrp != shell_pgrp) && (setpgid (0, shell_pgrp) < 0))
{
sys_error (_("initialize_job_control: setpgid"));
shell_pgrp = original_pgrp;
}
job_control = 1;
/* If (and only if) we just set our process group to our pid,
thereby becoming a process group leader, and the terminal
is not in the same process group as our (new) process group,
then set the terminal's process group to our (new) process
group. If that fails, set our process group back to what it
was originally (so we can still read from the terminal) and
turn off job control. */
if (shell_pgrp != original_pgrp && shell_pgrp != terminal_pgrp)
{
if (give_terminal_to (shell_pgrp, 0) < 0)
{
t_errno = errno;
setpgid (0, original_pgrp);
shell_pgrp = original_pgrp;
job_control = 0;
}
}
...
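The key lies in the two variables shell_pgrp and terminal_pgrp. shell_pgrp is the current process's process group, and terminal_pgrp is the process group that currently owns the controlling terminal (the foreground process group). Both can be observed with ps, so we can follow them: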
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
PID PPID PGID SID TPGID STAT TT COMMAND
12413 12410 12410 12410 -1 S ? sshd: hongjiang@pts/0
12414 12413 12414 12414 12580 Ss pts/0 \_ -bash
12579 12414 12579 12414 12580 S pts/0 \_ /bin/bash ./sleep.sh
12580 12579 12580 12414 12580 S+ pts/0 \_ sleep 3
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
PID PPID PGID SID TPGID STAT TT COMMAND
12413 12410 12410 12410 -1 S ? sshd: hongjiang@pts/0
12414 12413 12414 12414 12414 Ss+ pts/0 \_ -bash
12579 12414 12579 12414 12414 T pts/0 \_ /bin/bash ./sleep.sh
12607 12579 12579 12414 12414 T pts/0 \_ bash -ic sleep 2
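On the first call, bash -ic 'sleep 3', the parent sleep.sh clones a bash child (pid 12580). Because -i forces this bash child to run interactively, it loads .bashrc and similar files under $HOME, which may fork/clone several children along the way (this is why the PID of the second bash -ic 'sleep 2' process is not consecutive with the first). Once the startup files are loaded, it does not fork/clone to run sleep 3; it runs sleep 3 directly in the current process (12580). The crucial detail is that both PGID and TPGID equal this process's own PID rather than the parent's, unlike the state seen the second time.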
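The PGID/TPGID comparison read off the ps output above can also be scripted. This is a minimal sketch; fg_state is a hypothetical helper, and it assumes a ps that supports the pgid and tpgid output fields (procps and the BSD/macOS ps both do).

```bash
# Report whether a process belongs to its terminal's foreground process group.
# Prints "foreground", "background", or "no-tty" (no controlling terminal).
fg_state() {
    local pid=${1:-$$} pgid tpgid
    pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
    tpgid=$(ps -o tpgid= -p "$pid" | tr -d ' ')
    if [ -z "$pgid" ]; then
        echo "no such process: $pid" >&2
        return 1
    elif [ "${tpgid:--1}" -le 0 ]; then
        echo "no-tty"                 # ps reports no foreground group for this process
    elif [ "$pgid" = "$tpgid" ]; then
        echo "foreground"             # PGID == TPGID, like "sleep 3" above
    else
        echo "background"             # PGID != TPGID, like sleep.sh above
    fi
}

fg_state
```

In the first ps listing, fg_state 12580 would report foreground and fg_state 12579 background, which is the S+ versus S distinction in the STAT column.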
Because job control is off by default in scripts, a child process is normally not placed in its own process group. Take this script:
$ cat a.sh
#!/bin/bash
/usr/bin/sleep 10
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
PID PPID PGID SID TPGID STAT TT COMMAND
12668 12665 12665 12665 -1 S ? sshd: hongjiang@pts/2
12669 12668 12669 12669 12736 Ss pts/2 \_ -bash
12736 12669 12736 12669 12736 S+ pts/2 \_ /bin/bash ./a.sh
12737 12736 12736 12669 12736 S+ pts/2 \_ /usr/bin/sleep 10
While this script runs, the sleep child's PGID and TPGID are both those of its parent, a.sh; it has not been placed in a separate process group.
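The same observation can be checked from inside a script instead of from a second terminal. A small sketch, meant to be run as a non-interactive script; pgid_of is a hypothetical helper name:

```bash
# Print the process group ID of a process (default: this shell).
pgid_of() {
    ps -o pgid= -p "${1:-$$}" | tr -d ' '
}

parent_pgid=$(pgid_of)
# $$ sits inside single quotes, so the child bash expands it: this is the child's own PGID.
child_pgid=$(bash -c 'ps -o pgid= -p $$' | tr -d ' ')

if [ "$parent_pgid" = "$child_pgid" ]; then
    echo "child shares the script's process group"
else
    echo "child has its own process group"
fi
```

Replacing bash -c with bash -ic in the child command is the variation this post is about, and is better tried in a throwaway terminal.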
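The reason the child in sleep.sh does get its own process group is the -i option: it forces bash -ic 'sleep 3', even though it runs inside a non-interactive script, into a separate process group (see the setpgid call in initialize_job_control). At the same time TPGID, which identifies the foreground process group, is changed to the process group ID of bash -ic 'sleep 3'.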
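So why does the following bash -ic 'sleep 2' child not behave like the first? This is the strangest part. The sleep.sh script that contains them runs non-interactively, and it does not expect a foreground process group different from its own to appear while it runs. So when that foreground child's process group ends, the script does not update TPGID. The following script verifies this: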
$ cat wait.sh
#!/bin/bash
bash -ic 'sleep 5'
sleep 4
sleep 3 & # keep wait.sh from exiting immediately
wait
In wait.sh the first child forcibly changes TPGID. Neither that child's exit nor the foreground processes run afterwards update it:
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
PID PPID PGID SID TPGID STAT TT COMMAND
13867 13864 13864 13864 -1 S ? sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss pts/0 \_ -bash
14137 13868 14137 13868 14138 S pts/0 \_ /bin/bash ./wait.sh
14138 14137 14138 13868 14138 S+ pts/0 \_ sleep 5
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
PID PPID PGID SID TPGID STAT TT COMMAND
13867 13864 13864 13864 -1 S ? sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss pts/0 \_ -bash
14137 13868 14137 13868 14138 S pts/0 \_ /bin/bash ./wait.sh
14165 14137 14137 13868 14138 S pts/0 \_ sleep 4
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/0/"
PID PPID PGID SID TPGID STAT TT COMMAND
13867 13864 13864 13864 -1 S ? sshd: hongjiang@pts/0
13868 13867 13868 13868 14138 Ss pts/0 \_ -bash
14137 13868 14137 13868 14138 S pts/0 \_ /bin/bash ./wait.sh
14168 14137 14137 13868 14138 S pts/0 \_ sleep 3
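Back in sleep.sh: when the bash -ic 'sleep 2' child on the second line initializes, TPGID still holds the value set by the previous process, bash -ic 'sleep 3'. The bash -ic 'sleep 2' child also makes itself interactive because of -i, but before it ever reaches setpgid it hits the SIGTTIN logic: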
if (shell_pgrp != terminal_pgrp)
{
SigHandler *ottin;
ottin = set_signal_handler(SIGTTIN, SIG_DFL);
kill (0, SIGTTIN);
set_signal_handler (SIGTTIN, ottin);
continue;
}
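This code concludes that the terminal is held by some other foreground process group and sends SIGTTIN to the current process group. In this scenario, that conclusion is a misunderstanding!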
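The mismatch that triggers this branch is visible in the strace log itself: the process group a child reports via getpgrp() differs from the foreground group returned by the TIOCGPGRP ioctl. The awk sketch below pulls that out of a log produced with strace -f; pgrp_mismatch is a hypothetical helper, and it assumes the "[pid N]" line prefix shown in the log earlier.

```bash
# For each traced child, compare its own process group (getpgrp) with the
# terminal's foreground process group (TIOCGPGRP) and report differences.
pgrp_mismatch() {
    awk '
        $1 != "[pid" { next }                 # only lines tagged with a child pid
        { pid = $2; sub(/\]/, "", pid) }      # "9922]" -> "9922"
        /getpgrp\(\) = / { own[pid] = $NF }   # remember the group the child is in
        /TIOCGPGRP, \[/ {
            fg = $0
            sub(/.*TIOCGPGRP, \[/, "", fg)    # strip everything before the group id
            sub(/\].*/, "", fg)               # strip everything after it
            if ((pid in own) && own[pid] != fg)
                printf "pid %s: own pgrp %s, terminal foreground pgrp %s\n", pid, own[pid], fg
        }
    ' "${1:-log}"
}
```

On the log excerpt above this reports pid 9922 with its own group 9891 and terminal foreground group 9897, which are the shell_pgrp and terminal_pgrp values that differ in the C code.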
When we explicitly enable job control in sleep.sh:
$ cat sleep.sh
#!/bin/bash
set -m
bash -ic 'sleep 3'
bash -ic 'sleep 2'
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
PID PPID PGID SID TPGID STAT TT COMMAND
12668 12665 12665 12665 -1 S ? sshd: hongjiang@pts/2
12669 12668 12669 12669 12874 Ss pts/2 \_ -bash
12873 12669 12873 12669 12874 S pts/2 \_ /bin/bash ./sleep.sh
12874 12873 12874 12669 12874 S+ pts/2 \_ sleep 3
$ ps xfo pid,ppid,pgid,sid,tpgid,stat,tty,command | awk "NR==1||/pts\/2/"
PID PPID PGID SID TPGID STAT TT COMMAND
12668 12665 12665 12665 -1 S ? sshd: hongjiang@pts/2
12669 12668 12669 12669 12901 Ss pts/2 \_ -bash
12873 12669 12873 12669 12901 S pts/2 \_ /bin/bash ./sleep.sh
12901 12873 12901 12669 12901 S+ pts/2 \_ sleep 2
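With job control on, the script puts each child in its own process group and, when each foreground process ends, resets TPGID to the parent's process group ID, which avoids the SIGTTIN path in initialize_job_control.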
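Many shell problems are related to job control; for another example see "Analyzing the unexpected exit of a tomcat process". Job control enables plenty of advanced tricks, but it also adds a great deal of complexity to the shell. This case is a textbook example of what not to do: avoid calling bash -ic to run commands from a non-interactive script.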
Precisely locating a command in the shell
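My earlier post "SIGTTIN?" raised the problem without giving the background. The reason for running "which mvn" in an interactive shell is to determine more accurately which mvn command the user's current environment actually uses. If the user has aliased mvn, the alias takes priority; otherwise the command found on $PATH is used. The full function is: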
function get_mvn_cmd() {
if [[ "$OSTYPE" == *cygwin* ]];then
ppid=$( ps -ef -p $$ | awk 'NR==2{print $3}' )
user_shell=$( ps -p $ppid | awk 'NR==2{print $8}' )
# on Cygwin, Ctrl-C cannot terminate the script unless job control is on
set -m
else
ppid=$( ps -oppid= $$ )
user_shell=$( ps -ocomm= -p $ppid )
fi
# a login shell shows up as -bash rather than bash
if [[ "$user_shell" == "-"* ]];then
user_shell=${user_shell:1}
fi
mvn=$( $user_shell -ic "alias mvn" 2>/dev/null | cut -d'=' -f2 | sed "s/'//g" )
if [ -z "$mvn" ];then
$user_shell -ic "which mvn" >/dev/null
if [ $? -eq 0 ];then
mvn=$( $user_shell -ic "which mvn" | head -1 )
fi
fi
if [ -z "$mvn" ]; then
echo "mvn command not found" >&2
kill -s TERM $TOP_PID
else
echo $mvn
fi
}
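The function first determines the user's shell (the parent process of the current script), then runs alias mvn in it interactively to see whether the user has aliased mvn; if not, it falls back to which mvn to locate the command.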
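The precedence itself (alias first, then $PATH) can be shown in isolation. The sketch below is not a replacement for get_mvn_cmd: resolve_cmd is a hypothetical helper that only sees aliases defined in the shell it runs in, which is exactly the limitation that made the original spawn the user's interactive shell.

```bash
# Resolve a command name: an alias wins, otherwise the first match on $PATH.
resolve_cmd() {
    local name=$1 def
    if def=$(alias "$name" 2>/dev/null); then
        def=${def#*=}        # drop the leading "alias name="
        def=${def#\'}        # strip the surrounding single quotes
        def=${def%\'}
        printf '%s\n' "$def"
    elif def=$(type -P "$name"); then
        printf '%s\n' "$def"
    else
        echo "$name: command not found" >&2
        return 1
    fi
}
```

After alias mvn='/opt/maven/bin/mvn' (an invented path), resolve_cmd mvn prints /opt/maven/bin/mvn; without the alias it prints whatever type -P mvn finds on $PATH.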
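It is this verbose because it tries to work across several user environments. On Linux, or wherever GNU which is available, a few GNU which options can search "aliases, functions, commands" in priority order in one go: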
$ (alias; declare -f) | gwhich --read-alias --read-functions mvn
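This uses GNU which's --read-alias and --read-functions options to search first in the output of the preceding alias and declare -f, and only then in $PATH. In fact, in bash on CentOS 7 the which command is already aliased this way: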
alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
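Where only bash needs to be supported, the builtin type gives a similar answer without GNU which. This is an alternative I am suggesting, not something the original function uses; lookup_kind, ll and greet are names invented for the sketch.

```bash
shopt -s expand_aliases      # alias expansion is off by default in scripts

# Print how bash would interpret a name: alias, keyword, function, builtin or file.
lookup_kind() {
    type -t "$1"
}

alias ll='ls -l'
greet() { echo hello; }

lookup_kind ll       # alias
lookup_kind greet    # function
lookup_kind cd       # builtin
```

type -a name lists every match rather than just the first, which is closer to what the aliased which above prints. As with the GNU which pipeline, it only knows about aliases and functions defined in the shell where it runs.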