[Dnsmasq-discuss] Dnsmasq stops caching for a while on receive of failed or retried lookup?
Simon Kelley
simon at thekelleys.org.uk
Tue Jun 12 16:59:57 BST 2018
On 12/06/18 12:21, Mark Fermor, HolidayExtras.com wrote:
> Hello,
>
> Running dnsmasq with these options:
> /usr/sbin/dnsmasq -k --cache-size=50 --log-facility=- --user=nobody
> --group=nobody --no-hosts --neg-ttl=60 --max-ttl=240 --max-cache-ttl=300
>
> No local dnsmasq config file so that's literally all the config other
> than defaults applied by dnsmasq
>
> dnsmasq -v
> Dnsmasq version 2.78 Copyright (c) 2000-2017 Simon Kelley
> Compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6
> no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
>
> This is something running running in Kubernetes, they run as sidekick
> containers to the main application container. I have multiple of the
> same deployment running in the cluster, so they're all at the same
> versions and receiving equal amounts of traffic via load balancing. They
> all talk to the same endpoints outbound and do the same work load etc.
> I've sent sigusr1 signal to all of the pods individually (all pods have
> been running for approx 48 hours bar pod4 which has been running less
> hours):
>
> pod1
> I0608 15:11:34.998127 1 nanny.go:116] dnsmasq[19]: time 1528470694
> I0608 15:11:34.998169 1 nanny.go:116] dnsmasq[19]: cache size 50,
> 0/2267416 cache insertions re-used unexpired cache entries.
> I0608 15:11:34.998175 1 nanny.go:116] dnsmasq[19]: queries
> forwarded 3218560, queries answered locally 3182486
> I0608 15:11:34.998180 1 nanny.go:116] dnsmasq[19]: queries for
> authoritative zones 0
> I0608 15:11:34.998184 1 nanny.go:116] dnsmasq[19]: server
> 10.227.240.10#53: queries sent 3218560, retried or failed 16
>
> pod2
> I0608 15:11:35.909168 1 nanny.go:116] dnsmasq[18]: time 1528470695
> I0608 15:11:35.909206 1 nanny.go:116] dnsmasq[18]: cache size 50,
> 0/197465 cache insertions re-used unexpired cache entries.
> I0608 15:11:35.909211 1 nanny.go:116] dnsmasq[18]: queries
> forwarded 240843, queries answered locally 6159789
> I0608 15:11:35.909216 1 nanny.go:116] dnsmasq[18]: queries for
> authoritative zones 0
> I0608 15:11:35.909219 1 nanny.go:116] dnsmasq[18]: server
> 10.227.240.10#53: queries sent 240843, retried or failed 4
>
> pod3
> I0608 15:11:36.948015 1 nanny.go:116] dnsmasq[20]: time 1528470696
> I0608 15:11:36.948083 1 nanny.go:116] dnsmasq[20]: cache size 50,
> 0/63648 cache insertions re-used unexpired cache entries.
> I0608 15:11:36.948138 1 nanny.go:116] dnsmasq[20]: queries
> forwarded 46004, queries answered locally 6347223
> I0608 15:11:36.948188 1 nanny.go:116] dnsmasq[20]: queries for
> authoritative zones 0
> I0608 15:11:36.948219 1 nanny.go:116] dnsmasq[20]: server
> 10.227.240.10#53: queries sent 46004, retried or failed 1
>
> pod4
> I0608 15:11:38.032330 1 nanny.go:116] dnsmasq[24]: time 1528470698
> I0608 15:11:38.032374 1 nanny.go:116] dnsmasq[24]: cache size 50,
> 0/1359727 cache insertions re-used unexpired cache entries.
> I0608 15:11:38.032382 1 nanny.go:116] dnsmasq[24]: queries
> forwarded 1939395, queries answered locally 742411
> I0608 15:11:38.032388 1 nanny.go:116] dnsmasq[24]: queries for
> authoritative zones 0
> I0608 15:11:38.032394 1 nanny.go:116] dnsmasq[24]: server
> 10.227.240.10#53: queries sent 1939395, retried or failed 7
>
>
> The problem I notice, is some pods (pod1, pod2, pod4) are forwarding far
> more requests than the other pods (an indication of what other is, would
> be pod3). I'm not sure what's caused this seeing as the application is
> doing the same across all pods. The only thing I notice here, is that
> pods 1/2/4 all have a number of "retried or failed", which the other
> pods don't. Therefore I wonder if the reason that those pods have sent
> so many more requests upstream instead of hitting the cache, is because
> of something to do with a "retried or failed", which then stops the
> cache from working for a decent period of time? I've not been able to
> find anything (google) relating to this scenario but it's the only thing
> that makes sense to me right now. I can accept a couple of failures for
> lookup here and there, but one failure (if i'm onto something that is),
> seems to then cause no cache hits for a large period of time?
>
There's no reason why a retry would affect the cache, so I think you're
jumping to conclusions there. The "retried or failed" counts are in the
noise: don't forget that DNS uses UDP transport by default, so those
numbers arise from very rare dropped UDP packets - nothing to worry about.
As for the differences in cache hit rates - it could be subtle
synchronisation effects, especialy as all the dnsmasq instances point to
the same upstream. If one pod hits a particular DNS name first, then
that name will end up with a longer TTL in that pod than the others,
which get the record later from the upstream after some of the TTL has
run down. If something makes that happen a lot, that would affect the stats.
TL;DR
There doesn't appear to be a problem.
Simon.
More information about the Dnsmasq-discuss
mailing list