欧阳亮的博客

编程不止是一份工作,还是一种乐趣!!!

线上服务CPU100%问题快速定位

步骤一:找到最耗CPU的进程

执行top -c,显示进程运行信息列表,并键入P(大写),进程按照CPU使用率排序

top - 23:35:34 up 44 min,  2 users,  load average: 0.38, 0.67, 1.39
Tasks: 103 total,   2 running, 101 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.9 us, 28.0 sy,  0.0 ni, 48.2 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem :  1865988 total,  1551680 free,   124868 used,   189440 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  1556884 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  1612 john      20   0 2812968  37232  11880 S  88.7  2.0   0:17.87 java Test
  1225 john      20   0  147844   2424   1084 S  13.0  0.1   4:11.92 sshd: john@pts/1
    44 root      20   0       0      0      0 S   0.7  0.0   0:08.67 [kworker/0:1]
    13 root      20   0       0      0      0 S   0.3  0.0   0:00.65 [ksoftirqd/1]
  1376 root      20   0       0      0      0 R   0.3  0.0   0:03.59 [kworker/1:0]
  1627 john      20   0  157704   2172   1520 R   0.3  0.1   0:00.01 top -c
     1 root      20   0  128168   6844   4080 S   0.0  0.4   0:02.15 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 [kthreadd]
     3 root      20   0       0      0      0 S   0.0  0.0   0:00.02 [ksoftirqd/0]
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kworker/0:0H]
     6 root      20   0       0      0      0 S   0.0  0.0   0:00.01 [kworker/u256:0]
     7 root      rt   0       0      0      0 S   0.0  0.0   0:00.48 [migration/0]
     8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 [rcu_bh]
     9 root      20   0       0      0      0 S   0.0  0.0   0:00.32 [rcu_sched]
    10 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 [watchdog/0]
    11 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 [watchdog/1]
    12 root      rt   0       0      0      0 S   0.0  0.0   0:00.39 [migration/1]
    15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kworker/1:0H]
    17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 [kdevtmpfs]
    18 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [netns]
    19 root      20   0       0      0      0 S   0.0  0.0   0:00.00 [khungtaskd]
    20 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [writeback]
    21 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kintegrityd]
    22 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [bioset]
    23 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kblockd]
    24 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [md]
    30 root      20   0       0      0      0 S   0.0  0.0   0:00.00 [kswapd0]
    31 root      25   5       0      0      0 S   0.0  0.0   0:00.00 [ksmd]

由top命令可得到进程1612最耗CPU。


步骤二:找到最耗CPU的线程

执行top -Hp 1612,显示一个进程的线程运行信息列表,键入P(大写),线程按照CPU使用率排序

top - 23:36:28 up 45 min,  2 users,  load average: 1.56, 0.96, 1.45
Threads:  14 total,   1 running,  13 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.3 us, 27.5 sy,  0.0 ni, 48.0 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
KiB Mem :  1865988 total,  1543940 free,   132608 used,   189440 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  1549144 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  1624 john      20   0 2812968  45080  11880 R 90.3  2.4   1:04.99 java
  1625 john      20   0 2812968  45080  11880 S  0.3  2.4   0:00.01 java
  1612 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.00 java
  1613 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.06 java
  1614 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.04 java
  1615 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.04 java
  1616 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.15 java
  1617 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.00 java
  1618 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.00 java
  1619 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.00 java
  1620 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.14 java
  1621 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.02 java
  1622 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.00 java
  1623 john      20   0 2812968  45080  11880 S  0.0  2.4   0:00.03 java

由top命令可得到进程1612中最耗CPU的线程是1624。

也可以通过pssort命令得到最耗CPU的线程,如下:

[john@server-1 ~]$ ps -mp 1612 -o THREAD,tid | sort -k 2 -rn
john     89.4   -    - -         -      -      -
john     88.9  19    - -         -      -   1624
john      0.2  19    - futex_    -      -   1616
john      0.0  19    - skb_wa    -      -   1642
john      0.0  19    - futex_    -      -   1625
john      0.0  19    - futex_    -      -   1623
john      0.0  19    - futex_    -      -   1622
john      0.0  19    - futex_    -      -   1621
john      0.0  19    - futex_    -      -   1620
john      0.0  19    - futex_    -      -   1619
john      0.0  19    - futex_    -      -   1618
john      0.0  19    - futex_    -      -   1617
john      0.0  19    - futex_    -      -   1615
john      0.0  19    - futex_    -      -   1614
john      0.0  19    - futex_    -      -   1613
john      0.0  19    - futex_    -      -   1612
USER     %CPU PRI SCNT WCHAN  USER SYSTEM    TID

输出的第一行是进程1612的信息,结果根据第二列进行排序,所以第二行的最后一列就是我们要询找的,最耗CPU的线程ID。


步骤三:将线程PID转化为16进制

$ printf "0x%x\n" 1624
0x658


步骤四:查看堆栈,找到线程在干嘛

[john@server-1 ~]$ jstack 1612 | grep 0x658 -A 20
"Thread-0" #8 prio=5 os_prio=0 tid=0x00007f164c0e8000 nid=0x658 runnable [0x00007f1636af1000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	- locked <0x00000000e3814d78> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:482)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000e3804cf0> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.newLine(PrintStream.java:546)
	- eliminated <0x00000000e3804d30> (a java.io.PrintStream)
	at java.io.PrintStream.println(PrintStream.java:807)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at Test$Thread1.run(Test.java:15)
	at java.lang.Thread.run(Thread.java:748)


根据jstack的输出,结合源码很容易就可以发现问题:

import java.util.concurrent.TimeUnit;

public class Test {

    public static void main(String[] args) throws InterruptedException {
        new Thread(new Thread1()).start();
        new Thread(new Thread2()).start();
    }

    static class Thread1 implements Runnable {

        @Override
        public void run() {
            for (;;) {
                System.out.println("hello");
            }
        }

    }

    static class Thread2 implements Runnable {

        @Override
        public void run() {
            for (;;) {
                System.out.println("hello");

                try {
                    TimeUnit.SECONDS.sleep(2);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

    }
}


脚本化

线上定位问题都是争分夺秒,按照上面的步骤来会比较慢,可以把这些步骤写进一个脚本。

#!/bin/bash
pid=$1
tid=$(ps -mp $pid -o THREAD,tid | sort -k 2 -rn | awk 'NR==2 {printf "0x%x",$8}')
jstack $pid | grep $tid -A 20


现在只要将进程id作为参数转给脚本就可以了:

[john@server-1 ~]$ ./show-thread.sh 1612
"Thread-0" #8 prio=5 os_prio=0 tid=0x00007f164c0e8000 nid=0x658 runnable [0x00007f1636af1000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	- locked <0x00000000e3814d78> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:482)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000e3804cf0> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.newLine(PrintStream.java:546)
	- eliminated <0x00000000e3804d30> (a java.io.PrintStream)
	at java.io.PrintStream.println(PrintStream.java:807)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at Test$Thread1.run(Test.java:15)
	at java.lang.Thread.run(Thread.java:748)


实际上通过pssort命令,也可以计算出最耗CPU的进程,所以可以优化一下脚本:

#!/bin/bash
pid=$(ps -aux | sort -k 3 -rn | awk 'NR==1 {print $2}')
tid=$(ps -mp $pid -o THREAD,tid | sort -k 2 -rn | awk 'NR==2 {printf "0x%x",$8}')
jstack $pid | grep $tid -A 20


现在直接执行脚本即可看到线程栈信息了:

[john@server-1 ~]$ ./show-thread.sh
"Thread-0" #8 prio=5 os_prio=0 tid=0x00007f164c0e8000 nid=0x658 runnable [0x00007f1636af1000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	- locked <0x00000000e38169b0> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:482)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000e3804cf0> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.newLine(PrintStream.java:546)
	- eliminated <0x00000000e3804d30> (a java.io.PrintStream)
	at java.io.PrintStream.println(PrintStream.java:807)
	- locked <0x00000000e3804d30> (a java.io.PrintStream)
	at Test$Thread1.run(Test.java:15)
	at java.lang.Thread.run(Thread.java:748)