[Dnsmasq-discuss] [PATCH] Retry queries only after giving the upstream server some time to respond

Mon Apr 5 20:31:23 UTC 2021

On 05/04/2021 21:16, Dominik Derigs wrote:
> Hey Simon,
> 
> On Mon, 2021-04-05 at 20:38 +0100, Simon Kelley wrote:
>> Post 2.83, the second query would be combined with the first, which
>> can only reduce upstream traffic. The change in 2.85 is that the second
>> query triggers a retry, so closer to the original situation. BUT the
>> retry is sent to all servers.
> 
> Yeah, sorry for not being precise enough: my comparison was 2.84 to
> 2.85 (rc2), not to the pre-2.83 era. I much appreciated the reduction in
> upstream traffic in 2.83 + 2.84 and hoped we can keep this up.
> 
> In my situation your proposed change wouldn't make any difference as
> there is only one upstream server that is a local unbound recursive
> resolver.
> 
> On Mon, 2021-04-05 at 20:38 +0100, Simon Kelley wrote:
>> Only forward to
>> all servers when the same query arrives twice from the same source.
> 
> This is the issue I'm concerned about. Some clients send the same query
> multiple times (they don't seem to have a local cache). In addition,
> other clients happen to send a query at the same time. All of this
> triggers re-forwarding. As I understand your idea, they would all still
> produce forwarded queries. This is what I wanted to prevent with my
> patch. Basically, v2.84-like behavior, but with a reduced likelihood of
> failing because we eventually allow a retry if the first query died.
> Just not immediately.
> 
> Hope that makes my idea clearer.
> 

It does. But I think your fix is fragile, consider a client which sends
a query, and the query or the answer gets dropped. After a timeout it
sends the query again, with a different ID or source port. No answer to
that and it gives up. If the client's timeout is shorter than dnsmasq's
timeout, then the second query will just be dropped and the whole thing
will fail. Sure, dnsmasq's timeout will expire, but dnsmasq doesn't have
the option to retry the query then, since it doesn't have a copy of it:
it has to wait for the next retry, which will never come.
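To make that failure mode concrete, here is a minimal Python simulation. It is a hypothetical model, not dnsmasq code: it assumes a forwarder that, like the patch under discussion, only re-forwards a repeated query once a hold-off delay has passed since the last upstream send (called `retry_delay` here, a made-up name, not the patch's actual option), and that keeps no copy of the query. It also assumes the first upstream send is lost and any later send is answered before the client's next attempt.

```python
from dataclasses import dataclass, field


@dataclass
class HoldOffForwarder:
    """Hypothetical forwarder: a repeat of an in-flight query is only sent
    upstream again if at least `retry_delay` seconds have passed since the
    last upstream send. No copy of the query is kept, so the forwarder
    cannot resend on its own when that delay expires."""
    retry_delay: float
    last_sent: dict = field(default_factory=dict)  # question -> time of last send

    def on_query(self, question, now):
        """Return True if the query is sent upstream, False if it is merged
        into the in-flight query (and so effectively dropped)."""
        sent_at = self.last_sent.get(question)
        if sent_at is None or now - sent_at >= self.retry_delay:
            self.last_sent[question] = now
            return True
        return False


def client_gets_answer(client_timeout, client_attempts, retry_delay):
    """Client sends the same question `client_attempts` times, `client_timeout`
    seconds apart (new ID or source port each time). The first upstream send
    is lost; any later upstream send is answered."""
    fwd = HoldOffForwarder(retry_delay)
    question = ("example.com", "A")
    upstream_sends = 0
    for attempt in range(client_attempts):
        if fwd.on_query(question, now=attempt * client_timeout):
            upstream_sends += 1
            if upstream_sends > 1:  # a send after the lost one succeeds
                return True
    return False  # client has given up


# Client retries after 1 s and gives up after its second try.
print(client_gets_answer(client_timeout=1.0, client_attempts=2, retry_delay=2.0))  # False
print(client_gets_answer(client_timeout=1.0, client_attempts=2, retry_delay=0.5))  # True
```

With a hold-off longer than the client's retry interval, the client's second attempt is swallowed and nothing ever resends the lost query; with a shorter hold-off the retry goes through.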

The new retry timeout parameter in your patch joins the set of settings
which people don't understand but which have to be set right, or subtle
breakage happens. That set should be as small as possible.

To do this properly, dnsmasq needs to store the query, and be able to
autonomously retry it. That would be good, but it's a bigger patch and
has resource implications. Also, the time before a retry has to be
shorter than any conceivable client's timeout, to avoid the
client-gives-up-before-server
scenario. That's not a recipe for reduced upstream traffic.
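The store-and-retry alternative can be sketched the same way. Again this is a hypothetical Python model, not dnsmasq code: `RetryingForwarder`, `retry_interval` and the millisecond clock are assumptions. It illustrates the traffic cost: if the upstream is merely slow rather than lossy, each retry interval that elapses before the answer arrives costs one more upstream send.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    packet: bytes    # stored copy of the query, so it can be resent later
    next_retry: int  # time (ms) at which to resend if still unanswered
    sends: int = 1   # upstream sends so far, including the first


@dataclass
class RetryingForwarder:
    """Hypothetical forwarder that keeps every forwarded query and resends
    it itself every `retry_interval` ms until an answer arrives."""
    retry_interval: int
    pending: dict = field(default_factory=dict)  # query id -> Pending

    def forward(self, qid, packet, now):
        # The first upstream send would happen here; we only record it.
        self.pending[qid] = Pending(packet, now + self.retry_interval)

    def tick(self, now):
        """Resend every stored query whose retry time has passed."""
        for p in self.pending.values():
            if now >= p.next_retry:
                p.sends += 1  # the resend of p.packet would happen here
                p.next_retry = now + self.retry_interval

    def answer(self, qid):
        """An answer arrived: drop the stored copy, return total sends."""
        return self.pending.pop(qid).sends


def upstream_sends(retry_interval, upstream_latency):
    """Upstream is slow but not lossy: it answers `upstream_latency` ms after
    the first send. Count how many sends the forwarder made by then."""
    fwd = RetryingForwarder(retry_interval)
    fwd.forward(qid=1, packet=b"<query>", now=0)
    for now in range(upstream_latency):  # 1 ms timer ticks
        fwd.tick(now)
    return fwd.answer(1)


print(upstream_sends(retry_interval=1000, upstream_latency=2500))  # 3
print(upstream_sends(retry_interval=4000, upstream_latency=2500))  # 1
```

A retry interval short enough to beat every client's timeout is also short enough to fire repeatedly during a slow cold lookup, which is the opposite of reducing upstream traffic.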

Simon.