[Dnsmasq-discuss] Client retries broken in 2.84

Thu Mar 11 11:19:51 UTC 2021

Hi Simon and Nicholas,

I don't think having dnsmasq rely on clients to drive retries is a great
design. When clients start bombarding dnsmasq with requests, dnsmasq
will in turn bombard the upstream server(s) too. That seems unnecessary
to me, and even wrong.

I think dnsmasq should drive retries itself, periodically checking each
existing frec and comparing the timestamp of its last sent request. At a
fixed interval, let's say every 200 ms, it would send a new packet
upstream. After no reply for long enough, it should send a SERVFAIL
reply to all clients that requested that query.
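
To make the idea concrete, here is a rough sketch in Python rather than
dnsmasq's C. Nothing in it is existing dnsmasq code: PendingQuery is a
hypothetical stand-in for a frec, GIVE_UP_AFTER is an assumed overall
timeout, and send_upstream/reply_servfail are placeholder callbacks.

```python
from dataclasses import dataclass, field

RETRY_INTERVAL = 0.2  # seconds between upstream retries (the 200 ms suggested above)
GIVE_UP_AFTER = 2.0   # assumed overall timeout; not an existing dnsmasq setting


@dataclass
class PendingQuery:
    """Hypothetical stand-in for a forward record (frec)."""
    name: str
    first_sent: float
    last_sent: float
    clients: list = field(default_factory=list)


def tick(pending, now, send_upstream, reply_servfail):
    """Run from a periodic timer; returns the queries that are still waiting."""
    still_waiting = []
    for q in pending:
        if now - q.first_sent >= GIVE_UP_AFTER:
            # No reply for long enough: fail every waiting client and drop the record.
            for client in q.clients:
                reply_servfail(client, q.name)
        else:
            if now - q.last_sent >= RETRY_INTERVAL:
                # The timer, not the client, decides when to resend.
                send_upstream(q.name)
                q.last_sent = now
            still_waiting.append(q)
    return still_waiting
```

The point of the sketch is that the upstream packet rate depends only on
RETRY_INTERVAL, no matter how often clients repeat the query.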

It would also solve fairness between clients. It might, however, count
retries per client request, so that each client waits a similar time.
That would handle the case where client 1 is answered after 5 retries
but client 2 asked just after the 3rd retry: dnsmasq could also wait for
5 retries before returning failure to the second client. The frec could
be cleared once no more clients are waiting, freeing the resources held
by the failed query. Currently it is recycled only when the forward
records limit is reached, and the client is not notified on timeout.
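
The per-client counting could look roughly like this. Again, this is
only an illustration in Python: SharedQuery, MAX_RETRIES and the method
names are made up for the example and do not exist in dnsmasq.

```python
MAX_RETRIES = 5  # assumed per-client budget, matching the "5 retries" example


class SharedQuery:
    """Hypothetical forward record shared by all clients asking the same question."""

    def __init__(self, name):
        self.name = name
        self.retries_seen = {}  # client -> upstream retries since that client joined

    def add_client(self, client):
        # A late client starts counting from zero, so it gets the full budget.
        self.retries_seen.setdefault(client, 0)

    def on_retry(self):
        """Call once per upstream retry; returns clients that should now get SERVFAIL."""
        expired = []
        for client in list(self.retries_seen):
            self.retries_seen[client] += 1
            if self.retries_seen[client] >= MAX_RETRIES:
                expired.append(client)
                del self.retries_seen[client]
        return expired

    def can_free(self):
        # The record can be released as soon as nobody is waiting on it.
        return not self.retries_seen
```

With this, a client that joins after the 3rd retry is failed after the
8th upstream retry overall, not after the 5th.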

No patch (yet), this is still just an idea. I think some minimum time
between queries should be imposed on clients. If a client floods
dnsmasq, dnsmasq should not flood the upstream in the same way.
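
Even without a timer, a minimum gap between upstream sends would already
decouple the two rates. A small sketch, where UpstreamGate and
MIN_INTERVAL are hypothetical names and the 200 ms value is only an
assumption:

```python
MIN_INTERVAL = 0.2  # assumed minimum gap between upstream sends, in seconds


class UpstreamGate:
    """Decides whether a client query may trigger another upstream packet."""

    def __init__(self):
        self.last_forward = {}  # query name -> time of the last upstream send

    def allow(self, name, now):
        last = self.last_forward.get(name)
        if last is not None and now - last < MIN_INTERVAL:
            return False  # too soon: piggy-back on the packet already in flight
        self.last_forward[name] = now
        return True
```

A client sending the same query 100 times within 100 ms would then
cause a single upstream packet instead of 100.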

Cheers,
Petr

On 2/23/21 12:20 AM, Simon Kelley wrote:
> On 22/02/2021 23:04, Nicholas Mu wrote:
>> Hi Simon,
>>
>> The commit fixes all the issues we were seeing. Thanks for getting the
>> fix out so quickly. 
> 
> Excellent. I just pushed my tree, which has a further small update to
> this. I've been dogfooding it here and all seems well.
> 
>>
>> I had one follow up. So now it seems that for all clients retries will
>> use the same SP/QID. Would it be possible to have a way/config to vary
>> SP on retries or are we stuck with a single SP due to the CVE? The
>> reason we'd prefer varying SPs is mostly due to flow hashing. Say
>> dnsmasq is configured with a single upstream nameserver. That means any
>> retries will use the same 5-tuple and retries will follow the same
>> network path. If some paths in the network have an outage then we are
>> stuck on that path for all retries.  In general, we find better DNS
>> availability when SP varies across retries and we can traverse different
>> paths on the network. Wondering if you had any thoughts on this? 
>>
> 
> Rock and hard place. The CVE aims to avoid exactly what you want,
> because the more different SP/QID combinations that are valid for a
> given DNS query, the easier it is for an attacker to get an answer
> accepted into the cache when spraying large numbers of random SP/QID
> combos at the DNS server. Using different SP/QID on retries  allows the
> attacker to send lots of identical queries/retries and therefore make
> his life easier.
> 
> <thinks> I guess you could assign a new SP for retries, IF you stopped
> accepting answers on the old one. There are interesting fairness
> problems there:
> 
> client1 asks for example.com, forwarded upstream, reply about to return,
> when client2 asks for example.com, that gets forwarded upstream, the
> SP/QID used with client1 gets abandoned and client1 has to await
> client2s reply. In the meantime client3 ask for example.com......
> 
> 
> Cheers,
> 
> Simon.
> 
> 
>> Thanks,
>> Nick
>>
>> On Wed, Feb 17, 2021 at 4:03 PM Simon Kelley <simon at thekelleys.org.uk
>> <mailto:simon at thekelleys.org.uk>> wrote:
>>
>>     On 16/02/2021 00:42, Nicholas Mu wrote:
>>     > Hi, 
>>     >
>>     > I noticed a low level increase in DNS errors after upgrading to 2.84.
>>     > After doing some packet diving, it seems that retries behave differently
>>     > in the new version. For my testing, I'm using dnspython but I believe
>>     > this issue would affect any client that uses different source ports and
>>     > query ids for retries. As a result, dnspython will attempt retries for
>>     > up to 30 seconds and will eventually timeout as only a single packet is
>>     > ever sent and retries are rendered ineffective.
>>     >
>>     > On 2.82, multiple packets are sent as dnspython retries. Note the
>>     > retries are using different source ports and query ids:
>>     >
>>     > [ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-2.82
>>     > 19:59:03.826638 IP 172.31.44.29.44547 > 172.31.0.2.53: 51880+ NS? somedomain. (64)
>>     > 19:59:05.928335 IP 172.31.44.29.33363 > 172.31.0.2.53: 41382+ NS? somedomain. (64)
>>     > 19:59:08.130620 IP 172.31.44.29.21177 > 172.31.0.2.53: 36073+ NS? somedomain. (64)
>>     > 19:59:10.532792 IP 172.31.44.29.57223 > 172.31.0.2.53: 50309+ NS? somedomain. (64)
>>     >
>>     > On 2.84, only a single packet is sent:
>>     >
>>     > [ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-2.84
>>     > 19:53:12.189849 IP 172.31.44.29.5335 > 172.31.0.2.53: 826+ NS? somedomain. (64)
>>     >
>>     > I also tested using dig, nslookup, and host which all use the same
>>     > source port and query id on retries. The behavior works as intended on
>>     > both versions. I would suspect the following commit is responsible for
>>     > this behavior change:
>>     >
>>     >       Handle multiple identical near simultaneous DNS queries better.
>>     >       Previously, such queries would all be forwarded
>>     >       independently. This is, in theory, inefficent but in practise
>>     >       not a problem, _except_ that is means that an answer for any
>>     >       of the forwarded queries will be accepted and cached.
>>     >       An attacker can send a query multiple times, and for each repeat,
>>     >       another {port, ID} becomes capable of accepting the answer he is
>>     >       sending in the blind, to random IDs and ports. The chance of a
>>     >       succesful attack is therefore multiplied by the number of repeats
>>     >       of the query. The new behaviour detects repeated queries and
>>     >       merely stores the clients sending repeats so that when the
>>     >       first query completes, the answer can be sent to all the
>>     >       clients who asked. Refer: CVE-2020-25686.
>>     >
>>     > Is this intended? Seems to me any clients with retry behavior similar to
>>     > dnspython are now broken. Clients will hang until their configured
>>     > timeouts are reached on any single DNS failure.
>>     >
>>     > Thanks,
>>     >
>>     > Nick
>>     >
>>
>>
>>     Your analysis is spot-on.
>>
>>     I think it's possible to satisfy both the security and robustness
>>     requirements here.
>>
>>     Pre 2.84, a retry for the same query with different query-id and/or
>>     source port would be treated as an independent query and forwarded
>>     again, with a new source-port and query-id. This gives the attacker the
>>     ability to increase the attack surface for cache pollution by sending
>>     many repeat queries.
>>
>>     In 2.84 a repeat with the same SP/QID gets treated as it always has: as
>>     a retry, and the query gets forwarded again, and this time to all
>>     available servers.  The same query but with different SP/QID now gets
>>     piggy-backed onto the existing query, as you noted.
>>
>>     The solution here is for a repeat query with different SP/QID to trigger
>>     the same retry behaviour as a repeat query with the same SP/QID, but
>>     also to still piggy back the existing query, so that no new SP/QID
>>     tuples are generated going upstream.
>>
>>     I just pushed
>>     http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=141a26f979b4bc959d8e866a295e24f8cf456920
>>     which should implement this. Please test!
>>
>>
>>     Cheers,
>>
>>     Simon.

-- 
Petr Menšík
Software Engineer
Red Hat, http://www.redhat.com/
email: pemensik at redhat.com
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB
