We found a strange phenomenon on a production server. The “free” command showed no free memory on the server, but when we added up the memory used by all processes:

ps aux|awk '{s+=$6} END{print s}'

it showed that all processes together used only about 60GB of memory (the sum is of the RSS column, which ps reports in KiB). The server has 126GB of physical memory.
Where is the lost memory?
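Because ps reports RSS in KiB, the raw sum is hard to compare with the 126GB total. A minimal sketch of the same sum converted to GiB follows; the function name rss_total_gib is made up for illustration. Note that RSS counts shared pages once per process, so the sum can overstate real usage.

```shell
# Sum the RSS column (field 6 of `ps aux`, in KiB) and print the total in GiB.
# NR > 1 skips the header line. Reads `ps aux` style output on stdin.
rss_total_gib() {
    awk 'NR > 1 { kib += $6 } END { printf "%.2f\n", kib / 1024 / 1024 }'
}

# Usage: ps aux | rss_total_gib
```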
First we unmounted the tmpfs, but that made no difference. Then we ran:

cat /proc/meminfo

and soon noticed that “Slab” accounted for more than 10GB of memory. Slab is the Linux kernel's allocator for its own internal data structures. If “Slab” is very high in “/proc/meminfo”, the kernel may have allocated too many resources on behalf of user-space programs. But what type of resource? Finally, we ran:

ss -tn | awk '{if ($2>0) print $2;}' | awk '{sum+=$1} END {print "Sum = ", sum}'

it showed that the Recv-Q of all TCP connections together held more than 20GB of memory. That uncovered the root cause: too many TCP connections (more than one hundred thousand) had been created. Possible solutions:

  • Reduce the size of the TCP Recv-Q
    sudo bash -c 'echo "4096 4096 4096" > /proc/sys/net/ipv4/tcp_rmem'
            
  • Modify the user program to reduce the number of TCP connections
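
The checks used in this investigation can be collected into a few small shell functions, which makes it easier to compare the numbers before and after a change. This is only a sketch: the function names are made up for illustration, recvq_summary assumes the column layout of "ss -tn" (State, Recv-Q, Send-Q, ...), and describe_tcp_rmem assumes the three tcp_rmem fields are the minimum, default, and maximum receive buffer sizes in bytes.

```shell
# Hypothetical helpers. Each reads its input on stdin, so it also works on saved output.

# Print the Slab lines of /proc/meminfo (values are in kB) converted to GiB.
slab_summary() {
    awk '$1 == "Slab:" || $1 == "SReclaimable:" || $1 == "SUnreclaim:" {
        printf "%s %.2f GiB\n", $1, $2 / 1024 / 1024
    }'
}

# Count TCP connections and total the bytes waiting in Recv-Q.
# Assumes `ss -tn` layout: field 1 = State, field 2 = Recv-Q. NR > 1 skips the header.
recvq_summary() {
    awk 'NR > 1 { conns++; if ($2 > 0) { queued++; bytes += $2 } }
         END { printf "connections=%d with_data=%d recvq_bytes=%.0f\n", conns, queued, bytes }'
}

# Label the three fields of tcp_rmem (min, default, max, in bytes).
describe_tcp_rmem() {
    awk '{ printf "min=%s default=%s max=%s\n", $1, $2, $3 }'
}

# Usage:
#   slab_summary < /proc/meminfo
#   ss -tn | recvq_summary
#   describe_tcp_rmem < /proc/sys/net/ipv4/tcp_rmem
```

A value written under /proc/sys does not survive a reboot, so it is worth rerunning describe_tcp_rmem after a restart to see which setting is actually in effect.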