How sysctl Broke the Network

Rimantas Ragainis
4 min read · Apr 27, 2020

Incident

This story begins with Nagios, one of our monitoring systems, reporting unexpected issues late in the evening:

dsp-graphite-monitoring/NTP process is CRITICAL:
CRITICAL - Plugin timed out while executing system call
dsp-graphite-monitoring/NTP time is UNKNOWN:
UNKNOWN: check_ntp_time: Invalid hostname/address - us.pool.ntp.org
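
Before anything else, the failing plugin can be run by hand to reproduce the alert; a quick sketch, assuming the standard Monitoring Plugins path on CentOS 7 (your path may differ):

# run the NTP time check directly, bypassing Nagios/NRPE scheduling
/usr/lib64/nagios/plugins/check_ntp_time -H us.pool.ntp.org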

At this point, the message itself does not yet indicate that the issues are network-related. As you might know, Nagios checks may fail because the service itself has failed or stopped, or because of an SNMP misconfiguration; so the digging starts.

Digging

As both of the alerts above are handled by NRPE and run via SNMP, those services were the first to check. Strangely, they were OK, up and running without any issue. Another thing that caught the eye was the check’s inability to resolve the NTP hostname, which did point to a network issue, and the GCP Console UI suggested the same.
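
For reference, verifying the agents themselves boils down to something like this (unit names assumed for a typical CentOS 7 setup):

# confirm the monitoring agents are alive before blaming the network
systemctl status nrpe snmpd --no-pager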

Before running simple checks with ping and dig, I double-checked the /etc/resolv.conf file to see whether the config looked OK:

# Generated by NetworkManager
search europe-west2-c.c.<project_id>.internal c.<project_id>.internal google.internal
nameserver 169.254.169.254

Everything was in the right place, configured by the book (https://cloud.google.com/compute/docs/internal-dns), and yet neither ping nor dig worked as expected:

# ping us.pool.ntp.org
ping: us.pool.ntp.org: Name or service not known
# dig us.pool.ntp.org

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> us.pool.ntp.org
;; global options: +cmd
;; connection timed out; no servers could be reached

Then I tried to ping the specific IP directly, having resolved the NTP hostname from another server (us.pool.ntp.org <-> 204.11.201.12), and got proper responses without any packet loss:

# ping 204.11.201.12 -c 4
PING 204.11.201.12 (204.11.201.12) 56(84) bytes of data.
64 bytes from 204.11.201.12: icmp_seq=1 ttl=48 time=130 ms
64 bytes from 204.11.201.12: icmp_seq=2 ttl=48 time=130 ms
64 bytes from 204.11.201.12: icmp_seq=3 ttl=48 time=130 ms
64 bytes from 204.11.201.12: icmp_seq=4 ttl=48 time=130 ms
--- 204.11.201.12 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 130.307/130.415/130.638/0.132 ms
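
So raw IP connectivity was fine while name resolution was not. One more way to confirm that split is to resolve through the system resolver library rather than dig’s own logic; a small check, assuming glibc’s getent is available:

# resolve via NSS, i.e. the same path applications use
getent hosts us.pool.ntp.org || echo "libc resolver fails too"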

At this point, it was clear that something was wrong with DNS itself. I even tried different DNS name servers explicitly:

# dig @169.254.169.254 us.pool.ntp.org

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @169.254.169.254 us.pool.ntp.org
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

# dig @8.8.8.8 us.pool.ntp.org

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @8.8.8.8 us.pool.ntp.org
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

# dig @8.8.4.4 us.pool.ntp.org

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @8.8.4.4 us.pool.ntp.org
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

# dig @1.1.1.1 us.pool.ntp.org

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @1.1.1.1 us.pool.ntp.org
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
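
For completeness, the same round of queries can be scripted instead of running dig four times; a minimal sketch with short timeouts:

# query each resolver once with a 2-second timeout
for ns in 169.254.169.254 8.8.8.8 8.8.4.4 1.1.1.1; do
  echo "== ${ns} =="
  dig @"${ns}" +time=2 +tries=1 +short us.pool.ntp.org
done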

None of them worked. So my last resort was to capture network packets with ngrep and analyze them in Wireshark, hoping I would be lucky enough to find some solid clues:

# terminal 1: re-run the failing query
dig @169.254.169.254 us.pool.ntp.org

# terminal 2: capture packets coming from the metadata DNS server
# (empty match expression; BPF filter on the source address;
#  matched packets are also dumped to the file "tcpdump")
ngrep -q -d any -O tcpdump '' src 169.254.169.254
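
The dump file can then be read back for closer inspection, or opened in Wireshark; a quick look with tcpdump, assuming the capture file from above:

# read the capture back and inspect DNS traffic specifically
tcpdump -nn -r tcpdump 'udp port 53'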

It seemed the VM was getting proper responses over the network, but they vanished somewhere along the way. I even tried the wild shot of restarting the instance. Nothing new. As the resolution was not clear at that moment and I’m not good at solving networking issues, I decided to launch a new VM via an Ansible playbook and leave the old one for deeper analysis. For the latter I had solid help from Antonio Messina, a TSE at Google Cloud Support, who was able to discover the actual root cause and has shared his interesting findings in his post: https://cloud.google.com/blog/topics/inside-google-cloud/google-cloud-support-engineer-solves-a-tough-dns-case.

Resolution

So, as we’ve learned, DNS was not the one to blame; sysctl was. A kernel misconfiguration led to network packets being dropped, which is why several services broke at once: DNS, SNMP, NTP. Was this our fault? Yes and no. “Yes”, because we had used ambitious values for certain kernel parameters, setting net.core.rmem_default to 2GB(!), which is way too high. “No”, because due to a kernel bug no error was raised when the changes were applied, so we never knew.
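
For illustration, inspecting and reverting such a parameter looks roughly like this; the numbers below are example values (212992 bytes is a common kernel default), not our exact configuration:

# show the current receive-buffer defaults
sysctl net.core.rmem_default net.core.rmem_max

# revert to a conservative default at runtime
sysctl -w net.core.rmem_default=212992

# persist across reboots (file name is arbitrary)
echo 'net.core.rmem_default = 212992' > /etc/sysctl.d/90-rmem.conf
sysctl --system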
