As Linux/Unix system administrators, we often get complaints that systems are responding slowly or that application performance has degraded. One of the main reasons for this kind of degradation is a server that is heavily loaded with processes. When troubleshooting a performance-related issue, we have to consider the following elements:

 

 

  • CPU utilization
  • Memory usage
  • Disk utilization
  • Network utilization

 

CPU Performance check:

Check the system/CPU load and load average. Running uptime gives a quick snapshot of the current load average, but this document suggests also looking at the SAR file for the history of the CPU load.

 

# uptime
10:31:23 up 50 days, 18:02, 10 users, load average: 0.24, 0.15, 0.09

 

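In the above output, the load average values 0.24, 0.15 and 0.09 are averages calculated over the last 1, 5 and 15 minutes respectively. A high load average implies that the system is overloaded: many processes are waiting for CPU time (on Linux, processes blocked in uninterruptible I/O are counted as well). How do we know whether the load average is high or low? We have to compare it with the number of logical CPUs on the server. On x86 systems the command “cat /proc/cpuinfo | grep -i cores” prints one “cpu cores” line per logical processor, so the number of lines printed is the logical CPU count; the “nproc” command prints that count directly.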

 

# cat /proc/cpuinfo | grep -i cores
cpu cores       : 1
cpu cores       : 1

 

The example above prints two lines, one per logical processor, so this server has 2 logical CPUs (each reporting a single core), and load average values below 1 are well within the normal range. As a rule of thumb, if a server has 2 logical CPUs and the load average stays above 2, the load is higher than normal.
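The comparison between load average and CPU count can be scripted. The following is a minimal sketch, not a definitive check: the function name `load_per_core` and the threshold of 1.0 per CPU are assumptions made for illustration, and the numbers passed in are illustrative values.

```shell
#!/bin/sh
# load_per_core LOAD CPUS
# Hypothetical helper: divides a load average by the number of logical CPUs
# and prints the ratio with a simple verdict (above 1.0 per CPU = overloaded).
load_per_core() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        ratio = load / cpus
        verdict = (ratio > 1) ? "overloaded" : "ok"
        printf "%.2f %s\n", ratio, verdict
    }'
}

# Illustrative values: a load of 0.24 and a load of 3.0 on 2 logical CPUs.
load_per_core 0.24 2
load_per_core 3.0 2

# On a live Linux system the real numbers can be passed in like this:
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A sustained ratio above 1 suggests that more processes want to run than the CPUs can serve; short spikes are usually harmless.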

 

To find load average statistics from SAR, execute the command below.

 

# sar -q -f /var/log/sa/sa22

 

07:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
07:40:01 AM         0       220      0.00      0.01      0.05         0
07:50:01 AM         1       215      0.01      0.02      0.05         0
08:00:01 AM         1       213      0.01      0.02      0.05         0
08:10:01 AM         0       212      0.00      0.01      0.05         0
10:20:01 AM         0       234      0.03      0.02      0.05         0
10:30:01 AM         0       235      0.00      0.03      0.05         0
10:40:01 AM         0       238      0.00      0.05      0.06         0
Average:            0       217      0.09      0.10      0.13         0
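A long SAR report is easier to read when only the busy intervals are shown. The sketch below filters `sar -q` output for samples whose 1-minute load exceeds the CPU count. It assumes the column layout shown above (the last four columns being ldavg-1, ldavg-5, ldavg-15 and blocked); the function name `high_load_intervals` and the sample row with a load of 2.50 are made up for illustration.

```shell
#!/bin/sh
# high_load_intervals CPUS
# Hypothetical helper: reads `sar -q` text on stdin and prints the time and
# ldavg-1 of every sample whose 1-minute load is above CPUS.
# Assumes ldavg-1 is the fourth column from the right, as in the layout above.
high_load_intervals() {
    awk -v cpus="$1" '
        /^Average/ { next }                      # skip the summary row
        NF >= 6 && $(NF-3) ~ /^[0-9.]+$/ && $(NF-3) + 0 > cpus {
            print $1, $(NF-3)
        }'
}

# Demo with one real-looking row and one made-up busy row:
high_load_intervals 2 <<'EOF'
07:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
07:40:01 AM         0       220      0.00      0.01      0.05         0
07:50:01 AM         3       215      2.50      1.20      0.40         0
EOF

# Live usage:
# sar -q -f /var/log/sa/sa22 | high_load_intervals "$(nproc)"
```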

 

  • If the load average is high, we have to check the CPU usage with the “top” command.

Some useful top shortcuts:

  • Pressing ‘z’ while top is running displays the processes in color, which may help you identify running processes more easily.
  • Pressing ‘c’ while top is running displays the full command line of each process, including its absolute path.
  • Pressing Shift+P sorts the processes by CPU utilization. See screenshot below.

 

 

 

  • Running top with the ‘-u’ option displays the processes of a specific user only.

# top -u oracle    ← If we want to see user-wise utilization

 

top - 11:00:10 up 14 days, 26 min,  1 user,  load average: 0.01, 0.02, 0.05
Tasks: 264 total,   1 running, 263 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  6.3 sy,  0.0 ni, 92.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16414112 total,   270024 free,   703324 used, 15440764 buff/cache
KiB Swap: 16773116 total, 16723204 free,    49912 used. 11117340 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
80898 oracle    20   0 4494916  31912  29868 S   0.0  0.2   1:40.87 ora_pmon_sprt
80900 oracle    20   0 4494916  17776  15748 S   0.0  0.1   2:53.83 ora_psp0_sprt
80903 oracle    -2   0 4494916  16980  14952 S   0.0  0.1 190:59.25 ora_vktm_sprt
80907 oracle    20   0 4494916  18340  16312 S   0.0  0.1   1:09.39 ora_gen0_sprt
80909 oracle    20   0 4494916 281240 279212 S   0.0  1.7   0:29.05 ora_mman_sprt
80913 oracle    20   0 4494916  16728  14696 S   0.0  0.1   0:53.20 ora_diag_sprt

 

Use the SAR command to get CPU usage statistics.

# sar -u 2 5   ← %CPU utilization sampled every 2 seconds, 5 consecutive times.
Linux 3.10.0-862.6.3.el7.x86_64 (prlsshap001)   10/22/2018      _x86_64_        (2 CPU)
11:18:58 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:19:00 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
11:19:02 AM     all      0.25      0.00      0.25      0.00      0.00     99.50
11:19:04 AM     all      0.00      0.00      0.00      0.00      0.00    100.00

 

 

# sar -u -f /var/log/sa/sa22  ← %CPU utilization statistics at 10-minute intervals

07:30:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:50:01 AM     all      0.21      0.00      0.23      0.00      0.00     99.55
11:00:01 AM     all      0.23      0.00      0.28      0.00      0.00     99.50
11:10:01 AM     all      0.24      0.00      0.26      0.00      0.00     99.50
11:20:01 AM     all      0.22      0.00      0.27      0.00      0.00     99.51
Average:        all      4.47      0.00      0.56      0.00      0.00     94.97

 

If CPU utilization is high, it is time to find the processes that are consuming the most CPU cycles. To do that, we can use either the “top” command with the shortcuts described above or the “ps” commands below.

# ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head
# ps aux --sort -pcpu | head -n 5

 

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     118413  0.3  0.0 153228  5416 ?        Ss   13:29   0:00 sshd: user_name [priv]
root        750  0.1  0.1 927444 14220 ?        Sl   Oct16  14:42 falcon-sensor
gsplunk   27317  0.1  1.6 229792 136144 ?       Sl   Oct16  12:22 splunkd -p 8089 restart
root          1  0.0  0.0  54352  6724 ?        Ss   Oct16   2:45 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

 

“iostat” utility to check CPU utilization:

# iostat       ← Displays CPU and I/O statistics of all partitions as shown below.
# iostat -c    ← iostat with the -c argument displays only CPU statistics as shown below.

 

 

 

 

 

Once we have identified the process, the same command output also shows the user ID that is running the CPU-hungry process(es). If it is an application process, we reach out to the application owner/team so they can clean up the process and bring the utilization back to normal.
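When many processes belong to the same owner, it can help to total the CPU share per user before contacting a team. The sketch below sums the %CPU column of `ps -eo user,%cpu` per user. The function name `cpu_by_user` and the sample rows are made up for illustration.

```shell
#!/bin/sh
# cpu_by_user
# Hypothetical helper: reads "USER %CPU" rows on stdin (first row is the
# header), sums %CPU per user and prints the users with the highest total first.
cpu_by_user() {
    awk 'NR > 1 { cpu[$1] += $2 }
         END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
        LC_ALL=C sort -k2,2nr
}

# Demo with made-up rows:
cpu_by_user <<'EOF'
USER     %CPU
root      0.3
oracle    1.5
root      0.1
oracle    2.0
EOF

# Live usage:
# ps -eo user,%cpu | cpu_by_user | head -n 5
```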

Memory Utilization checks:

There are several ways to find out the memory utilization.

 

“free” command line utility:

# free -k      ← show memory in kilobytes
# free -m      ← show memory in megabytes
# free -g      ← show memory in gigabytes
# watch -n 1 free -g  ← prefixing free with the watch utility refreshes the memory utilization every second
# free -m | grep "Mem" | awk '{Mem=($2-$7)/$2 * 100} END {print Mem "% Memory utilized"}'  ← percentage of memory in use (total minus available); assumes a version of free whose seventh column is “available”
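The percentage calculation can be wrapped in a small function so it is easy to reuse in monitoring scripts. This is a sketch under one assumption: the `free` output uses the newer layout (total, used, free, shared, buff/cache, available), where the seventh field of the "Mem:" row is the available memory. The function name `mem_used_pct` and the sample numbers are made up for illustration.

```shell
#!/bin/sh
# mem_used_pct
# Hypothetical helper: reads `free` output on stdin and prints the share of
# memory in use, computed as (total - available) / total * 100.
# Assumes the "Mem:" row has "available" as its seventh field.
mem_used_pct() {
    awk '$1 == "Mem:" { printf "%.1f\n", ($2 - $7) / $2 * 100 }'
}

# Demo with made-up numbers (16000 MB total, 12000 MB available):
mem_used_pct <<'EOF'
              total        used        free      shared  buff/cache   available
Mem:          16000         700         300         100       15000       12000
Swap:         16379          48       16331
EOF

# Live usage:
# free -m | mem_used_pct
```

Using the "available" column instead of "free" avoids counting reclaimable page cache as used memory.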

 

“top” utility:

 

# top -b -o +%MEM | head -n 22

 

“SAR” utility:

 

# sar -r 2 4       ← %Memory utilization sampled every 2 seconds, 4 consecutive times

Linux 3.10.0-862.el7.x86_64 (prlpacap011)       10/22/2018      _x86_64_        (4 CPU)

01:49:44 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
01:49:46 PM   7162416    994200     12.19      2232    578276    479116      2.90    602460    171604       176
01:49:48 PM   7162448    994168     12.19      2232    578276    479116      2.90    602460    171604       176
01:50:00 PM   7162608    994008     12.19      2232    578276    479116      2.90    602504    171604       180
Average:      7162270    994346     12.19      2232    578278    479118      2.90    602501    171600       411

 

# sar -r -f /var/log/sa/sa26    ← %Memory utilization statistics at 10-minute intervals

07:40:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
07:50:01 AM   7164276    992340     12.17      2232    567956    497020      3.00    596840    169176      1140
11:40:01 AM   7160620    995996     12.21      2232    573200    495076      2.99    603256    169312       776
11:50:01 AM   7157608    999008     12.25      2232    573424    499632      3.02    604808    169316       916
01:40:01 PM   7157164    999452     12.25      2232    578052    483316      2.92    604300    171380      1160
01:50:01 PM   7155772   1000844     12.27      2232    578532    492428      2.98    606712    171604      1204
Average:      7165560    991056     12.15      2232    568003    504602      3.05    593245    173533      1095

 

“vmstat” command line utility:

vmstat – Summary information of Memory, Processes, Paging etc.

 

# vmstat -s       ← the -s switch displays a summary of various event counters and memory statistics
# vmstat 2 6      ← reports every two seconds and stops automatically after six reports
# vmstat -t 1 5   ← the -t parameter adds a timestamp to every line printed

 

Top 10 memory-utilizing processes:

# ps -eo size,pid,user,command --sort -size | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' | head -10
0.00 Mb COMMAND
756.09 Mb falcon-sensor
602.46 Mb /usr/sbin/nscd
439.52 Mb /usr/lib/polkit-1/polkitd --no-debug
297.56 Mb /usr/bin/python -Es /usr/sbin/tuned -l -P
225.45 Mb /usr/sbin/lvmetad -f
164.20 Mb splunkd -p 8089 restart
145.36 Mb /usr/sbin/rsyslogd -n
73.32 Mb /usr/bin/vmtoolsd
72.37 Mb /sbin/audispd

 

Once we have identified the top processes, the command output can also show the user ID that is running the memory-hungry process. If it is an application process, we reach out to the application owner and team so they can clean up the process and bring the utilization back to normal.

 

Disk utilization Check:

 

Note: If there are too many read/write requests on a single hard disk drive, it becomes slow, and we may have to upgrade to a faster drive (higher RPM and more cache). The alternative is to split the load across multiple drives by spreading the data with RAID. To identify I/O issues, use the following commands:

 

# iostat -d        ← displays only disk I/O statistics for all devices
# iostat -p sda    ← displays I/O statistics for the sda device and its partitions only
# iostat -N        ← displays registered device mapper names, which is useful for viewing LVM statistics
# sar -d 2 5
# sar -b -f /var/log/sa/sa22  ← -b reports I/O and transfer rate statistics
# sar -d -f /var/log/sa/sa22

 

tps – Transactions per second (this includes both read and write)

rtps – Read transactions per second

wtps – Write transactions per second

bread/s – Blocks read per second (a block is 512 bytes)

bwrtn/s – Blocks written per second (a block is 512 bytes)
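According to the sar manual page, bread/s and bwrtn/s are counted in blocks of 512 bytes, so a conversion is needed before comparing them with figures quoted in KB/s. A minimal sketch of that arithmetic follows; the function name `blocks_to_kb` is made up for illustration.

```shell
#!/bin/sh
# blocks_to_kb BLOCKS
# Hypothetical helper: converts a count of 512-byte blocks to kilobytes.
blocks_to_kb() {
    awk -v blocks="$1" 'BEGIN { printf "%.1f\n", blocks * 512 / 1024 }'
}

# 2048 blocks per second correspond to 1024.0 KB per second.
blocks_to_kb 2048
```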

 

Investigate Network Utilization:

 

# netstat -s
# netstat -i                   ← Check the network TX/RX drops
# ethtool -S eth0              ← Verify packet discards/drops etc.
# ethtool -g eth0              ← Verify RX ring buffer current and maximum settings
# ip -s -s link                ← Verify packet loss
# ping -c 10 -i 0.2 -w 3 8.8.8.8  ← Test network latency
# ifconfig <interface_name> | grep -i drop
# ss -s                        ← Socket summary statistics
# sar -n DEV                   ← SAR statistics of network utilization.

 

02:00:01 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
02:50:01 PM      eth0      1.58      2.22      0.20      0.48      0.00      0.00      0.00
02:50:01 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:00:01 PM      eth0     34.86     26.76      3.44      3.15      0.00      0.00      0.00
03:00:01 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0     48.54     27.70    235.44      3.99      0.00      0.00      0.00
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00

# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500 65997003      0  60907 0      58607470      0      0      0 BMRU
lo       65536   108738      0      0 0        108738      0      0      0 LRU

# ifconfig eth0 | grep -i drop
RX errors 0  dropped 60907  overruns 0  frame 0
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
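A raw drop counter is hard to judge without the packet count next to it. The sketch below turns the `netstat -i` table into a drop percentage per interface, taking RX-DRP as the fifth column and RX-OK as the third, as in the output above. The function name `rx_drop_pct` is made up for illustration, and what counts as an acceptable percentage is left to the reader.

```shell
#!/bin/sh
# rx_drop_pct
# Hypothetical helper: reads `netstat -i` output on stdin and prints, for each
# interface, RX-DRP as a percentage of RX-OK.
# Rows whose third field is not a positive number (titles, headers) are skipped.
rx_drop_pct() {
    awk '$3 ~ /^[0-9]+$/ && $3 > 0 { printf "%s %.3f\n", $1, $5 / $3 * 100 }'
}

# Demo with the table from the output above:
rx_drop_pct <<'EOF'
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500 65997003      0  60907 0      58607470      0      0      0 BMRU
lo       65536   108738      0      0 0        108738      0      0      0 LRU
EOF

# Live usage:
# netstat -i | rx_drop_pct
```

For the eth0 counters above this gives roughly 0.09 percent of received packets dropped.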

 

 

 

We can use the “ping” utility as a quick and dirty tool to measure network latency.

# ping -c 10 -i 0.2 -w 3 8.8.8.8

→ -c 10 tells ping to issue a count of 10 requests.
→ -i 0.2 tells ping to issue its requests at intervals of 0.2 seconds.
→ -w 3 sets a deadline of 3 seconds, after which ping exits regardless of how many replies have arrived.

Thus, 10 pings should take about 2 seconds.
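To track latency over time, the average round-trip time can be pulled out of the summary line that ping prints at the end. The sketch below assumes the Linux iputils summary format ("rtt min/avg/max/mdev = ... ms"); the function name `ping_avg_ms` and the sample values are made up for illustration.

```shell
#!/bin/sh
# ping_avg_ms
# Hypothetical helper: reads ping output on stdin and prints the average
# round-trip time in milliseconds.
# Assumes a summary line like "rtt min/avg/max/mdev = a/b/c/d ms"; splitting
# that line on "/" leaves the average in the fifth field.
ping_avg_ms() {
    awk -F'/' '/^rtt / { print $5 }'
}

# Demo with a made-up summary line:
echo 'rtt min/avg/max/mdev = 9.876/10.123/11.456/0.512 ms' | ping_avg_ms

# Live usage:
# ping -c 10 -i 0.2 -w 3 8.8.8.8 | ping_avg_ms
```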

 

 

Check utilization from vCenter:

If the Linux/Unix server is hosted on VMware vSphere, you can also check its utilization from vCenter.

vCenter → select the VM name → click the Monitor tab → click the Performance tab

You can use the “Time Range” drop-down list to select the time range that the graph should cover.