[Dnsmasq-discuss] Dnsmasq stops caching for a while on receive of failed or retried lookup?

Tue Jun 12 16:59:57 BST 2018

On 12/06/18 12:21, Mark Fermor, HolidayExtras.com wrote:
> Hello,
> 
> Running dnsmasq with these options:
> /usr/sbin/dnsmasq -k --cache-size=50 --log-facility=- --user=nobody
> --group=nobody --no-hosts --neg-ttl=60 --max-ttl=240 --max-cache-ttl=300
> 
> No local dnsmasq config file so that's literally all the config other
> than defaults applied by dnsmasq
> 
> dnsmasq -v
> Dnsmasq version 2.78  Copyright (c) 2000-2017 Simon Kelley
> Compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6
> no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
> 
> This is something running in Kubernetes, they run as sidekick
> containers to the main application container. I have multiple of the
> same deployment running in the cluster, so they're all at the same
> versions and receiving equal amounts of traffic via load balancing. They
> all talk to the same endpoints outbound and do the same work load etc.
> I've sent a SIGUSR1 signal to each of the pods individually (all pods have
> been running for approx 48 hours bar pod4 which has been running for fewer
> hours):
> 
> pod1
> I0608 15:11:34.998127       1 nanny.go:116] dnsmasq[19]: time 1528470694
> I0608 15:11:34.998169       1 nanny.go:116] dnsmasq[19]: cache size 50,
> 0/2267416 cache insertions re-used unexpired cache entries.
> I0608 15:11:34.998175       1 nanny.go:116] dnsmasq[19]: queries
> forwarded 3218560, queries answered locally 3182486
> I0608 15:11:34.998180       1 nanny.go:116] dnsmasq[19]: queries for
> authoritative zones 0
> I0608 15:11:34.998184       1 nanny.go:116] dnsmasq[19]: server
> 10.227.240.10#53: queries sent 3218560, retried or failed 16
> 
> pod2
> I0608 15:11:35.909168       1 nanny.go:116] dnsmasq[18]: time 1528470695
> I0608 15:11:35.909206       1 nanny.go:116] dnsmasq[18]: cache size 50,
> 0/197465 cache insertions re-used unexpired cache entries.
> I0608 15:11:35.909211       1 nanny.go:116] dnsmasq[18]: queries
> forwarded 240843, queries answered locally 6159789
> I0608 15:11:35.909216       1 nanny.go:116] dnsmasq[18]: queries for
> authoritative zones 0
> I0608 15:11:35.909219       1 nanny.go:116] dnsmasq[18]: server
> 10.227.240.10#53: queries sent 240843, retried or failed 4
> 
> pod3
> I0608 15:11:36.948015       1 nanny.go:116] dnsmasq[20]: time 1528470696
> I0608 15:11:36.948083       1 nanny.go:116] dnsmasq[20]: cache size 50,
> 0/63648 cache insertions re-used unexpired cache entries.
> I0608 15:11:36.948138       1 nanny.go:116] dnsmasq[20]: queries
> forwarded 46004, queries answered locally 6347223
> I0608 15:11:36.948188       1 nanny.go:116] dnsmasq[20]: queries for
> authoritative zones 0
> I0608 15:11:36.948219       1 nanny.go:116] dnsmasq[20]: server
> 10.227.240.10#53: queries sent 46004, retried or failed 1
> 
> pod4
> I0608 15:11:38.032330       1 nanny.go:116] dnsmasq[24]: time 1528470698
> I0608 15:11:38.032374       1 nanny.go:116] dnsmasq[24]: cache size 50,
> 0/1359727 cache insertions re-used unexpired cache entries.
> I0608 15:11:38.032382       1 nanny.go:116] dnsmasq[24]: queries
> forwarded 1939395, queries answered locally 742411
> I0608 15:11:38.032388       1 nanny.go:116] dnsmasq[24]: queries for
> authoritative zones 0
> I0608 15:11:38.032394       1 nanny.go:116] dnsmasq[24]: server
> 10.227.240.10#53: queries sent 1939395, retried or failed 7
> 
> 
> The problem I notice is that some pods (pod1, pod2, pod4) are forwarding far
> more requests than the other pods (pod3 is representative of those
> others). I'm not sure what's caused this seeing as the application is
> doing the same across all pods. The only thing I notice here, is that
> pods 1/2/4 all have a number of "retried or failed", which the other
> pods don't. Therefore I wonder if the reason that those pods have sent
> so many more requests upstream instead of hitting the cache, is because
> of something to do with a "retried or failed", which then stops the
> cache from working for a decent period of time? I've not been able to
> find anything (google) relating to this scenario but it's the only thing
> that makes sense to me right now. I can accept a couple of failures for
> lookup here and there, but one failure (if I'm onto something, that is)
> seems to then cause no cache hits for a large period of time?
> 

There's no reason why a retry would affect the cache, so I think you're
jumping to conclusions there. The "retried or failed" counts are in the
noise: don't forget that DNS uses UDP transport by default, so those
numbers arise from very rare dropped UDP packets - nothing to worry about.
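
To put a scale on "in the noise", here is a minimal sketch that works out, per pod, the fraction of forwarded queries that were retried or failed, next to the fraction of all queries answered locally. The numbers are copied from the SIGUSR1 output quoted above; the helper names (retry_fraction, local_fraction) are made up for illustration and are not part of dnsmasq.

```python
# Per pod: (queries forwarded, queries answered locally, retried or failed),
# copied from the SIGUSR1 statistics quoted above.
stats = {
    "pod1": (3218560, 3182486, 16),
    "pod2": (240843, 6159789, 4),
    "pod3": (46004, 6347223, 1),
    "pod4": (1939395, 742411, 7),
}


def retry_fraction(forwarded, retried):
    """Fraction of forwarded queries counted as 'retried or failed'."""
    return retried / forwarded


def local_fraction(forwarded, local):
    """Fraction of all queries answered locally rather than forwarded."""
    return local / (forwarded + local)


for pod, (forwarded, local, retried) in stats.items():
    print(
        f"{pod}: answered locally {local_fraction(forwarded, local):.2%}, "
        f"retried or failed {retry_fraction(forwarded, retried):.6%}"
    )
```

On these figures every pod's retried-or-failed fraction is below 0.01% of what it forwarded, and pod3, which answers the largest share locally, also has the largest retried-or-failed fraction of the four.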

As for the differences in cache hit rates - it could be subtle
synchronisation effects, especially as all the dnsmasq instances point to
the same upstream. If one pod hits a particular DNS name first, then
that name will end up with a longer TTL in that pod than the others,
which get the record later from the upstream after some of the TTL has
run down. If something makes that happen a lot, that would affect the stats.
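
The TTL run-down described above can be sketched as follows. This is only an illustration under stated assumptions: the upstream is assumed to be a caching resolver that hands out the remaining TTL of a record, and the 30-second TTL, the timings and the function name remaining_ttl are all hypothetical. It shows the TTL each pod would receive, not the effect on the statistics.

```python
def remaining_ttl(original_ttl, cached_upstream_at, queried_at):
    """TTL in seconds that a caching upstream would hand out at queried_at.

    Assumes the upstream counts the TTL down from the moment it cached
    the record, and never returns a negative value.
    """
    age = queried_at - cached_upstream_at
    return max(0, original_ttl - age)


ORIGINAL_TTL = 30  # hypothetical TTL of the record, in seconds

# The first pod asks at the moment the upstream fetches the record.
first_pod = remaining_ttl(ORIGINAL_TTL, cached_upstream_at=0, queried_at=0)

# Another pod asks for the same name 20 seconds later.
later_pod = remaining_ttl(ORIGINAL_TTL, cached_upstream_at=0, queried_at=20)

print(f"first pod caches the name with TTL {first_pod}s")
print(f"later pod caches the name with TTL {later_pod}s")
```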

TL;DR

There doesn't appear to be a problem.

Simon.