[Dnsmasq-discuss] [PATCH] Retry queries only after giving the upstream server some time to respond

Mon Apr 5 19:38:57 UTC 2021

On 05/04/2021 16:46, Dominik Derigs wrote:
> Hey all,
> 
> I've been seeing a notable increase in upstream traffic with the current
> dnsmasq release candidate. Some investigations have revealed that the
> reason for this is the modified forwarding philosophy that *always*
> triggers a retry whenever a query is received before the upstream was
> able to answer (which may take long on slow networks).
> 
> This patch adds a timeout to stop such forward destination flooding.
> Before the timeout is reached, identical queries are just put on the
> list from where they will get replied to when the response to the first
> forwarded query arrives. The difference added by the patch is that such
> queries do not trigger another forwarding within the configured
> interval.
> If we still received nothing, the next query *after* the timeout is
> again forwarded to avoid hanging because the original query got lost.
> 
> The default for this interval is 3 seconds; it can be changed using a
> setting and even be disabled (by setting to zero) which restores the
> behavior we have right now. The default of 3 seconds has been chosen
> such that we will retry when other software considers this a good idea
> (retry timeout is 5 seconds in Linux, see RES_TIMEOUT in <resolv.h>).
> 
> I confirmed the intended effect in my local tests: Reduced unnecessary
> forwarding traffic without the danger of failing when the first query
> is lost (or whatever).
> 
> Let me know if you need something more/else. It should be easy to
> review this one.
> 
> Best,
> Dominik
> 
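
For reference, the timeout rule described in the quoted message can be
sketched roughly as below. This is only an illustration, not the patch
itself: struct inflight, should_forward_again() and the retry_interval
parameter are hypothetical names, and treating "exactly at the interval"
as expired is an assumption.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical bookkeeping for one query that is awaiting an upstream
   reply. This is not dnsmasq's real data structure. */
struct inflight {
  time_t last_forward;   /* when the query was last sent upstream */
};

/* Decide whether a duplicate query arriving at 'now' should be forwarded
   upstream again. Within retry_interval seconds of the last forward the
   duplicate is only queued for the pending reply. A retry_interval of
   zero disables suppression, so every duplicate is forwarded. */
bool should_forward_again(struct inflight *f, time_t now,
                          unsigned int retry_interval)
{
  if (retry_interval != 0 &&
      difftime(now, f->last_forward) < (double)retry_interval)
    return false;        /* still waiting: do not forward again */

  f->last_forward = now; /* forward again and restart the interval */
  return true;
}
```

With retry_interval set to 3, duplicates arriving one or two seconds
after the first forward are only queued, and the first duplicate at
three seconds or later is forwarded again.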

The analysis here doesn't quite ring true. Pre-2.83, a retry of a query
from the same source would cause a retry to all servers, whilst a second
identical query from a different source would be treated completely
independently.

Post 2.83, the second query would be combined with the first, which
can only reduce upstream traffic. The change in 2.85 is that the second
query triggers a retry, so closer to the original situation. BUT the
retry is sent to all servers.

So pre-2.83 the same query from two sources would be forwarded twice, to
a single server each time. In 2.85 the second query would trigger a
broadcast to all servers, in the same way as a repeated query from the
same source. That's where the extra upstream traffic is coming from.

Given that, the fix is much easier: in the case of a repeated query from
a second source, forward it upstream to a single server. Only forward to
all servers when the same query arrives twice from the same source. That
should restore the amount of upstream traffic to the pre-2.83 level.
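
As a rough sketch of that rule (not the actual patch; the types, the
fixed-size source table and decide_forward() are made up for
illustration, and sources are identified by plain strings to keep it
short):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SOURCES 8   /* arbitrary limit for this sketch */

/* Hypothetical result of the forwarding decision. */
enum forward_action {
  FORWARD_SINGLE,   /* send to one upstream server */
  FORWARD_ALL       /* send to every configured upstream server */
};

/* Hypothetical record of which sources have asked an identical,
   still unanswered query. Not dnsmasq's real structure. */
struct pending_query {
  const char *sources[MAX_SOURCES];  /* caller keeps the strings alive */
  size_t nsources;
};

static bool seen_source(const struct pending_query *q, const char *source)
{
  for (size_t i = 0; i < q->nsources; i++)
    if (strcmp(q->sources[i], source) == 0)
      return true;
  return false;
}

/* A query repeated by a source we have already seen is a genuine retry,
   so it goes to all servers. A query from a new source is forwarded to a
   single server, as it would have been pre-2.83. */
enum forward_action decide_forward(struct pending_query *q, const char *source)
{
  if (seen_source(q, source))
    return FORWARD_ALL;

  /* If the table is full the source is not recorded, so its later
     retries are treated as coming from a new source. */
  if (q->nsources < MAX_SOURCES)
    q->sources[q->nsources++] = source;
  return FORWARD_SINGLE;
}
```

With this, the same query arriving from 192.0.2.1 and then 192.0.2.2
gives FORWARD_SINGLE both times, and only a repeat from 192.0.2.1 gives
FORWARD_ALL.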

I think this is safer, since it avoids the possibility of throwing away
query retries in the expectation that another one will be along later,
when that's not a guarantee.

I'll put up a patch in the next hour or so. Dominik, please could you
see if it improves the upstream traffic rate? If I've misunderstood or
mis-analysed this, I'll certainly look at your approach.

Simon.