[Dnsmasq-discuss] Client retries broken in 2.84
Petr Menšík
pemensik at redhat.com
Thu Mar 11 16:09:24 UTC 2021
On 3/11/21 2:13 PM, Dominik wrote:
> Hey all,
>
> I support Petr's idea of putting the responsibility for upstream
> retires into dnsmasq's hands. All of his points seem very valid.
>
> I also agree about the necessity of a lower limit to prevent users from
> configuring dnsmasq for retrying every, say, 5 msec (unless they really
> want it and modify the source). I'm saying this because I've seen more
> than one user heavily hitting their dnsmasq because they set the retry
> timeout to something as low as 10 msec... Clearly a bad idea.
>
> I also want to bring up that a default retry timeout should be well
> thought about. In my experience, 200 msec is too short. Imagine a local
> recursive resolver that needs to walk an entire path if you request
> some a.b.c.d domain for the first time and haven't even requested the d
> TLD before. This may easily take up to a few seconds and is still
> perfectly expected. Retrying would not be of any value here.
Sure, I am aware 200 ms is short interval. It might be adaptive, first
retry should be shorter, then retries with longer interval. Both
resolv.conf and dig defaults to 5 seconds before retry, so that might be
good starting point. Okay, timeout in miliseconds is too short. Bad
idea, sorry for that number.
>
> `man resolv.conf` tells us that Linux will typically wait for 5 seconds
> before retrying. Our default should be similar to avoid unnecessary
> traffic.
Okay, I thought default is about 1 second, thanks for correction. We
should stick to the best practices. It seems common is 15 seconds in
total for name resolution, initial and two retries, each with 5 sec
delay. We may make it a bit shorter, for example each 4.5 seconds, to
deliver SERVFAIL to end clients, before it stops listening. I will try
to find, how unbound or bind calculate maximum time and retry times.
Client timing might not be relevant for a server retries. I think bigger
servers always deliver some reply, when the problem lies in their forwarder.
>
> Best regards,
> Dominik
>
> On Thu, 2021-03-11 at 12:19 +0100, Petr Menšík wrote:
>> Hi Simon and Nicholas,
>>
>> I think dnsmasq relying on driving retries by clients is not great
>> design. When clients starts bombarding dnsmasq with requests, dnsmasq
>> will in turn bombard upstream server(s) too. It seems unnecessary to
>> me.
>> And even wrong.
>>
>> I think dnsmasq should drive retries itself, periodically checking
>> existing frec, comparing last sent request timestamp. Every few
>> miliseconds try sending a new packet, let's say each 200 ms. After no
>> reply for long enough, it should send SERVFAIL reply to all clients
>> requested that query.
>>
>> It would also solve fairness between clients. It might however count
>> each retry per client request, so they each wait similar time. That
>> would solve reply for client 1 after 5 retries, when client 2
>> requested
>> it just after 3rd retry. It might wait also 5 retried before
>> returning
>> failure to the second. Frec could be cleared once no more clients are
>> waiting, freeing resources for the failed query. Now it is recycled
>> only
>> when forward records limit is reached, but client is not notified on
>> timeout.
>>
>> No patch (yet), still just an idea. I think some minimal time between
>> queries should be imposed to clients. If client floods dnsmasq, it
>> should not flood the upstream the same way.
>>
>> Cheers,
>> Petr
>>
>> On 2/23/21 12:20 AM, Simon Kelley wrote:
>>> On 22/02/2021 23:04, Nicholas Mu wrote:
>>>> Hi Simon,
>>>>
>>>> The commit fixes all the issues we were seeing. Thanks for
>>>> getting the
>>>> fix out so quickly.
>>>
>>> Excellent. I just pushed my tree, which has a further small update
>>> to
>>> this. I've been dogfooding it here and all seems well.
>>>
>>>> I had one follow up. So now it seems that for all clients retries
>>>> will
>>>> use the same SP/QID. Would it be possible to have a way/config to
>>>> vary
>>>> SP on retries or are we stuck with a single SP due to the CVE?
>>>> The
>>>> reason we'd prefer varying SPs is mostly due to flow hashing. Say
>>>> dnsmasq is configured with a single upstream nameserver. That
>>>> means any
>>>> retries will use the same 5-tuple and retries will follow the
>>>> same
>>>> network path. If some paths in the network have an outage then we
>>>> are
>>>> stuck on that path for all retries. In general, we find better
>>>> DNS
>>>> availability when SP varies across retries and we can traverse
>>>> different
>>>> paths on the network. Wondering if you had any thoughts on this?
>>>>
>>>
>>> Rock and hard place. The CVE aims to avoid exactly what you want,
>>> because the more different SP/QID combinations that are valid for a
>>> given DNS query, the easier it is for an attacker to get an answer
>>> accepted into the cache when spraying large numbers of random
>>> SP/QID
>>> combos at the DNS server. Using different SP/QID on retries allows
>>> the
>>> attacker to send lots of identical queries/retries and therefore
>>> make
>>> his life easier.
>>>
>>> <thinks> I guess you could assign a new SP for retries, IF you
>>> stopped
>>> accepting answers on the old one. There are interesting fairness
>>> problems there:
>>>
>>> client1 asks for example.com, forwarded upstream, reply about to
>>> return,
>>> when client2 asks for example.com, that gets forwarded upstream,
>>> the
>>> SP/QID used with client1 gets abandoned and client1 has to await
>>> client2s reply. In the meantime client3 ask for example.com......
>>>
>>>
>>> Cheers,
>>>
>>> Simon.
>>>
>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> On Wed, Feb 17, 2021 at 4:03 PM Simon Kelley <
>>>> simon at thekelleys.org.uk
>>>> <mailto:simon at thekelleys.org.uk>> wrote:
>>>>
>>>> On 16/02/2021 00:42, Nicholas Mu wrote:
>>>> > Hi,
>>>> >
>>>> > I noticed a low level increase in DNS errors after
>>>> upgrading to 2.84.
>>>> > After doing some packet diving, it seems that retries
>>>> behave
>>>> differently
>>>> > in the new version. For my testing, I'm using dnspython but
>>>> I believe
>>>> > this issue would affect any client that uses different
>>>> source
>>>> ports and
>>>> > query ids for retries. As a result, dnspython will attempt
>>>> retries for
>>>> > up to 30 seconds and will eventually timeout as only a
>>>> single
>>>> packet is
>>>> > ever sent and retries are rendered ineffective.
>>>> >
>>>> > On 2.82, multiple packets are sent as dnspython retries.
>>>> Note the
>>>> > retries are using different source ports and query ids:
>>>> >
>>>> > |[ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-
>>>> 2.82
>>>> > 19:59:03.826638 IP 172.31.44.29.44547 > 172.31.0.2.53:
>>>> 51880+ NS?
>>>> > somedomain. (64)
>>>> > 19:59:05.928335 IP 172.31.44.29.33363 > 172.31.0.2.53:
>>>> 41382+ NS?
>>>> > somedomain. (64)
>>>> > 19:59:08.130620 IP 172.31.44.29.21177 > 172.31.0.2.53:
>>>> 36073+ NS?
>>>> > somedomain. (64)
>>>> > 19:59:10.532792 IP 172.31.44.29.57223 > 172.31.0.2.53:
>>>> 50309+ NS?
>>>> > somedomain. (64)|
>>>> > |
>>>> > |
>>>> > |On 2.84, only a single packet is sent:|
>>>> > |
>>>> > |
>>>> > |[ec2-user at ip-172-31-44-29 src]$ grep cell-1 /tmp/dnsmasq-
>>>> 2.84
>>>> > 19:53:12.189849 IP 172.31.44.29.5335 > 172.31.0.2.53: 826+
>>>> NS?
>>>> > somedomain. (64)|
>>>> > |
>>>> > |
>>>> > I also tested using dig, nslookup, and host which all use
>>>> the same
>>>> > source port and query id on retries. The behavior works as
>>>> intended on
>>>> > both versions. I would suspect the following commit is
>>>> responsible for
>>>> > this behavior change:
>>>> >
>>>> > Handle multiple identical near simultaneous DNS
>>>> queries better.
>>>> > Previously, such queries would all be forwarded
>>>> > independently. This is, in theory, inefficent but in
>>>> practise
>>>> > not a problem, _except_ that is means that an answer
>>>> for any
>>>> > of the forwarded queries will be accepted and cached.
>>>> > An attacker can send a query multiple times, and for
>>>> each
>>>> repeat,
>>>> > another {port, ID} becomes capable of accepting the
>>>> answer he is
>>>> > sending in the blind, to random IDs and ports. The
>>>> chance of a
>>>> > succesful attack is therefore multiplied by the
>>>> number of
>>>> repeats
>>>> > of the query. The new behaviour detects repeated
>>>> queries and
>>>> > merely stores the clients sending repeats so that
>>>> when the
>>>> > first query completes, the answer can be sent to all
>>>> the
>>>> > clients who asked. Refer: CVE-2020-25686.
>>>> >
>>>> > Is this intended? Seems to me any clients with retry
>>>> behavior
>>>> similar to
>>>> > dnspython are now broken. Clients will hang until their
>>>> configured
>>>> > timeouts are reached on any single DNS failure.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Nick
>>>> >
>>>>
>>>>
>>>> Your analysis is spot-on.
>>>>
>>>> I think it's possible to satisfy both the security and
>>>> robustness
>>>> requirements here.
>>>>
>>>> Pre 2.84, a retry for the same query with different query-id
>>>> and/or
>>>> source port would be treated as an independent query and
>>>> forwarded
>>>> again, with a new source-port and query-id. This gives the
>>>> attacker the
>>>> ability to increase the attack surface for cache pollution by
>>>> sending
>>>> many repeat queries.
>>>>
>>>> In 2.84 a repeat with the same SP/QID gets treated as it
>>>> always has: as
>>>> a retry, and the query gets forwarded again, and this time to
>>>> all
>>>> available servers. The same query but with different SP/QID
>>>> now gets
>>>> piggy-backed onto the existing query, as you noted.
>>>>
>>>> The solution here is for a repeat query with different SP/QID
>>>> to trigger
>>>> the same retry behaviour as a repeat query with the same
>>>> SP/QID, but
>>>> also to still piggy back the existing query, so that no new
>>>> SP/QID
>>>> tuples are generated going upstream.
>>>>
>>>> I just pushed
>>>>
>>>> http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=141a26f979b4bc959d8e866a295e24f8cf456920
>>>> which should implement this. Please test!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Simon.
>>
>> _______________________________________________
>> Dnsmasq-discuss mailing list
>> Dnsmasq-discuss at lists.thekelleys.org.uk
>> https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss
>
>
> _______________________________________________
> Dnsmasq-discuss mailing list
> Dnsmasq-discuss at lists.thekelleys.org.uk
> https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss
>
--
Petr Menšík
Software Engineer
Red Hat, http://www.redhat.com/
email: pemensik at redhat.com
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 495 bytes
Desc: OpenPGP digital signature
URL: <http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/attachments/20210311/54a531bc/attachment.sig>
More information about the Dnsmasq-discuss
mailing list