[Dnsmasq-discuss] Client retries broken in 2.84

Thu Mar 11 13:13:19 UTC 2021

Hey all,

I support Petr's idea of putting the responsibility for upstream
retries into dnsmasq's hands. All of his points seem valid.

I also agree that a lower limit is necessary to prevent users from
configuring dnsmasq to retry every, say, 5 msec (unless they really
want it and modify the source). I'm saying this because I've seen more
than one user hammering their dnsmasq because they had set the retry
timeout to something as low as 10 msec... Clearly a bad idea.
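
To make the lower limit concrete, here is a minimal sketch of such a
clamp. It is not dnsmasq code: the function name and the 100 msec floor
are assumptions picked purely for illustration.

```python
# Hypothetical floor (assumption); the real value would need discussion.
MIN_RETRY_INTERVAL_MS = 100


def effective_retry_interval_ms(configured_ms: int) -> int:
    """Return the retry interval to use, never going below the floor."""
    if configured_ms < 0:
        raise ValueError("retry interval must not be negative")
    # Anything below the floor is silently raised to the floor.
    return max(configured_ms, MIN_RETRY_INTERVAL_MS)


print(effective_retry_interval_ms(5))     # raised to the floor
print(effective_retry_interval_ms(5000))  # left unchanged
```

Whether a too-low value should be clamped silently, as here, or
rejected at startup with an error is a separate design question.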

I also want to point out that the default retry timeout needs careful
thought. In my experience, 200 msec is too short. Imagine a local
recursive resolver that has to walk the entire path when you request
some a.b.c.d domain for the first time and have never even queried the
d TLD before. This can easily take a few seconds and is still
perfectly normal. Retrying would be of no value here.

`man resolv.conf` tells us that Linux will typically wait for 5 seconds
before retrying. Our default should be similar to avoid unnecessary
traffic.
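
A rough back-of-the-envelope sketch of why the interval matters. The
helper name and the 3-second cold-lookup time are illustrative
assumptions, not measurements:

```python
def redundant_retries(answer_after_ms: int, retry_interval_ms: int) -> int:
    """Count retransmissions sent strictly before the upstream answer arrives.

    Retries are assumed to fire at retry_interval_ms, 2 * retry_interval_ms,
    and so on after the first query. Each one is redundant if the upstream
    was simply still busy resolving.
    """
    if answer_after_ms <= 0 or retry_interval_ms <= 0:
        raise ValueError("both durations must be positive")
    return (answer_after_ms - 1) // retry_interval_ms


# Assumed scenario: the recursive resolver needs 3 seconds for a cold lookup.
for interval_ms in (10, 200, 5000):
    print(interval_ms, redundant_retries(3000, interval_ms))
```

Under these assumptions a 10 msec interval produces 299 redundant
packets for a single query, 200 msec produces 14, and 5 seconds
produces none.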

Best regards,
Dominik

On Thu, 2021-03-11 at 12:19 +0100, Petr Menšík wrote:
> Hi Simon and Nicholas,
> 
> I think having dnsmasq rely on clients to drive retries is not a great
> design. When clients start bombarding dnsmasq with requests, dnsmasq
> will in turn bombard the upstream server(s) too. That seems unnecessary
> to me, and even wrong.
> 
> I think dnsmasq should drive retries itself, periodically checking
> each existing frec and comparing the timestamp of the last request
> sent. Every few milliseconds, let's say every 200 ms, it would try
> sending a new packet. After no reply for long enough, it should send a
> SERVFAIL reply to all clients that requested that query.
> 
> It would also solve fairness between clients. It might, however, count
> retries per client request, so that each client waits a similar time.
> That would handle the case where client 1 gets its reply after 5
> retries while client 2 sent its request just after the 3rd retry:
> dnsmasq could also wait for 5 retries before returning a failure to
> the second client. The frec could be cleared once no more clients are
> waiting, freeing the resources held by the failed query. Currently it
> is recycled only when the forward records limit is reached, and the
> client is not notified of the timeout.
> 
> No patch (yet), still just an idea. I think some minimal time between
> queries should be imposed on clients. If a client floods dnsmasq, it
> should not flood the upstream the same way.
> 
> Cheers,
> Petr
> 
> On 2/23/21 12:20 AM, Simon Kelley wrote:
> > On 22/02/2021 23:04, Nicholas Mu wrote:
> > > Hi Simon,
> > > 
> > > The commit fixes all the issues we were seeing. Thanks for getting
> > > the fix out so quickly.
> > 
> > Excellent. I just pushed my tree, which has a further small update to
> > this. I've been dogfooding it here and all seems well.
> > 
> > > I had one follow-up. It now seems that, for all clients, retries
> > > will use the same SP/QID. Would it be possible to have a way/config
> > > to vary the SP on retries, or are we stuck with a single SP due to
> > > the CVE? The reason we'd prefer varying SPs is mostly flow hashing.
> > > Say dnsmasq is configured with a single upstream nameserver. That
> > > means any retries will use the same 5-tuple and will follow the same
> > > network path. If some paths in the network have an outage, then we
> > > are stuck on that path for all retries. In general, we find better
> > > DNS availability when the SP varies across retries and we can
> > > traverse different paths on the network. Wondering if you had any
> > > thoughts on this?
> > > 
> > 
> > Rock and hard place. The CVE aims to avoid exactly what you want,
> > because the more different SP/QID combinations that are valid for a
> > given DNS query, the easier it is for an attacker to get an answer
> > accepted into the cache when spraying large numbers of random SP/QID
> > combos at the DNS server. Using different SP/QID on retries allows
> > the attacker to send lots of identical queries/retries and therefore
> > make his life easier.
> > 
> > <thinks> I guess you could assign a new SP for retries, IF you
> > stopped accepting answers on the old one. There are interesting
> > fairness problems there:
> > 
> > client1 asks for example.com, forwarded upstream, reply about to
> > return, when client2 asks for example.com, that gets forwarded
> > upstream, the SP/QID used with client1 gets abandoned and client1 has
> > to await client2's reply. In the meantime client3 asks for
> > example.com......
> > 
> > 
> > Cheers,
> > 
> > Simon.
> > 
> > 
> > > Thanks,
> > > Nick
> > > 
> > > On Wed, Feb 17, 2021 at 4:03 PM Simon Kelley
> > > <simon at thekelleys.org.uk> wrote:
> > > 
> > >     On 16/02/2021 00:42, Nicholas Mu wrote:
> > >     > Hi, 
> > >     >
> > >     > I noticed a low level increase in DNS errors after upgrading
> > >     > to 2.84. After doing some packet diving, it seems that retries
> > >     > behave differently in the new version. For my testing, I'm
> > >     > using dnspython, but I believe this issue would affect any
> > >     > client that uses different source ports and query ids for
> > >     > retries. As a result, dnspython will attempt retries for up to
> > >     > 30 seconds and will eventually time out, as only a single
> > >     > packet is ever sent and retries are rendered ineffective.
> > >     >
> > >     > On 2.82, multiple packets are sent as dnspython retries. Note
> > >     > the retries are using different source ports and query ids:
> > >     >
> > >     > [ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-2.82
> > >     > 19:59:03.826638 IP 172.31.44.29.44547 > 172.31.0.2.53: 51880+ NS? somedomain. (64)
> > >     > 19:59:05.928335 IP 172.31.44.29.33363 > 172.31.0.2.53: 41382+ NS? somedomain. (64)
> > >     > 19:59:08.130620 IP 172.31.44.29.21177 > 172.31.0.2.53: 36073+ NS? somedomain. (64)
> > >     > 19:59:10.532792 IP 172.31.44.29.57223 > 172.31.0.2.53: 50309+ NS? somedomain. (64)
> > >     >
> > >     > On 2.84, only a single packet is sent:
> > >     >
> > >     > [ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-2.84
> > >     > 19:53:12.189849 IP 172.31.44.29.5335 > 172.31.0.2.53: 826+ NS? somedomain. (64)
> > >     >
> > >     > I also tested using dig, nslookup, and host, which all use the
> > >     > same source port and query id on retries. The behavior works
> > >     > as intended on both versions. I suspect the following commit
> > >     > is responsible for this behavior change:
> > >     >
> > >     >       Handle multiple identical near simultaneous DNS queries better.
> > >     >       Previously, such queries would all be forwarded
> > >     >       independently. This is, in theory, inefficent but in practise
> > >     >       not a problem, _except_ that is means that an answer for any
> > >     >       of the forwarded queries will be accepted and cached.
> > >     >       An attacker can send a query multiple times, and for each repeat,
> > >     >       another {port, ID} becomes capable of accepting the answer he is
> > >     >       sending in the blind, to random IDs and ports. The chance of a
> > >     >       succesful attack is therefore multiplied by the number of repeats
> > >     >       of the query. The new behaviour detects repeated queries and
> > >     >       merely stores the clients sending repeats so that when the
> > >     >       first query completes, the answer can be sent to all the
> > >     >       clients who asked. Refer: CVE-2020-25686.
> > >     >
> > >     > Is this intended? It seems to me that any clients with retry
> > >     > behavior similar to dnspython are now broken. Clients will
> > >     > hang until their configured timeouts are reached on any single
> > >     > DNS failure.
> > >     >
> > >     > Thanks,
> > >     >
> > >     > Nick
> > >     >
> > > 
> > > 
> > >     Your analysis is spot-on.
> > > 
> > >     I think it's possible to satisfy both the security and robustness
> > >     requirements here.
> > > 
> > >     Pre 2.84, a retry for the same query with different query-id
> > >     and/or source port would be treated as an independent query and
> > >     forwarded again, with a new source-port and query-id. This gives
> > >     the attacker the ability to increase the attack surface for cache
> > >     pollution by sending many repeat queries.
> > > 
> > >     In 2.84 a repeat with the same SP/QID gets treated as it always
> > >     has: as a retry, and the query gets forwarded again, and this time
> > >     to all available servers. The same query but with different SP/QID
> > >     now gets piggy-backed onto the existing query, as you noted.
> > > 
> > >     The solution here is for a repeat query with different SP/QID to
> > >     trigger the same retry behaviour as a repeat query with the same
> > >     SP/QID, but also to still piggy back the existing query, so that
> > >     no new SP/QID tuples are generated going upstream.
> > > 
> > >     I just pushed
> > >     http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=141a26f979b4bc959d8e866a295e24f8cf456920
> > >     which should implement this. Please test!
> > > 
> > > 
> > >     Cheers,
> > > 
> > >     Simon.