[Dnsmasq-discuss] dnsmasq 2.86 seems to stop reading from one of its dns sockets after a period of time under load

Mon May 16 17:34:43 UTC 2022

Hi Simon,

Thanks for your response.  I don't have the detailed logs, but it's a noisy
QA wireless environment where clients are coming and going a lot.  For
example, in syslog I could see instances where we would get a DHCP request
and then an L2 wireless disassociate message would appear immediately
afterwards.  That response isn't going to be deliverable as unicast
(although for DHCP it might fall back to broadcast eventually).

As we know, DNS isn't logged in such a manner, but you can imagine the same
scenario unfolding: we get a bunch of DNS requests, the client drops off
immediately afterwards, and the responses can't be delivered.  When there
are a lot of requests or a lot of clients, you can see how the socket send
buffer would fill.

Increasing the socket buffers as I described below allowed the test to run
for the required 96 hours; without it we weren't making it past the 48-hour
mark.
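
For anyone who wants to experiment with a per-socket equivalent of the
wmem_default change, here is a minimal sketch in Python (not dnsmasq code,
which is C).  The helper name set_send_buffer and the 1 MiB request are
assumptions for illustration only.  On Linux the kernel doubles the
requested value for bookkeeping and caps the request at
/proc/sys/net/core/wmem_max, so the value read back may differ from what
was asked for.

```python
import socket


def set_send_buffer(sock: socket.socket, requested: int) -> int:
    """Ask for a send buffer of `requested` bytes; return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    # A plain UDP socket stands in for a DNS or DHCP socket here.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        before = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        after = set_send_buffer(s, 1 << 20)  # hypothetical 1 MiB request
        print(f"SO_SNDBUF before={before} after={after}")
```

Reading the value back matters because a request above the system limit is
silently reduced rather than rejected.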

A dynamic solution might work, provided it was carefully bounded to prevent
DoS.  If you have something you'd like us to test I can probably arrange a
time slot, although it's a busy setup that needs lots of hardware.
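
To make the "bounded" part concrete, here is a rough sketch of what I have
in mind, again in Python rather than dnsmasq's C.  Every name in it
(BoundedSendBuffer, next_sndbuf_size, INITIAL_SNDBUF, MAX_SNDBUF) and both
size limits are hypothetical.  It assumes a non-blocking UDP socket, doubles
the requested SO_SNDBUF when a send fails with EAGAIN or ENOBUFS, and stops
growing at a fixed ceiling so that a flood of undeliverable replies cannot
consume unbounded kernel memory.

```python
import errno
import socket

# Hypothetical limits, chosen only for illustration.
INITIAL_SNDBUF = 256 * 1024
MAX_SNDBUF = 4 * 1024 * 1024


def next_sndbuf_size(current: int, ceiling: int) -> int:
    """Double the requested size, but never exceed the ceiling."""
    return min(current * 2, ceiling)


class BoundedSendBuffer:
    """Tracks the size we asked for, since the kernel may report a different one."""

    def __init__(self, sock, initial=INITIAL_SNDBUF, ceiling=MAX_SNDBUF):
        self.sock = sock
        self.requested = initial
        self.ceiling = ceiling
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, initial)

    def grow(self) -> bool:
        """Try to enlarge the buffer; return False once the ceiling is reached."""
        target = next_sndbuf_size(self.requested, self.ceiling)
        if target <= self.requested:
            return False
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, target)
        self.requested = target
        return True

    def send(self, payload: bytes, addr) -> bool:
        """Send on a non-blocking socket; on a full queue, grow once and retry once."""
        for attempt in range(2):
            try:
                self.sock.sendto(payload, addr)
                return True
            except OSError as exc:
                if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS):
                    raise
                if attempt == 1 or not self.grow():
                    return False  # still full or at the ceiling: drop the reply
        return False
```

Dropping the reply once the ceiling is hit keeps the failure mode the same
as today; it just takes longer to reach.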

Thanks,
Tom Keddie

PS: this is a controlled environment (as much as you can control Wi-Fi);
there are no malicious actors or malicious intent in this scenario.  It's a
soak test with a large variety of clients all doing busy work like video
streaming.

On Fri, May 13, 2022 at 12:48 PM Simon Kelley <simon at thekelleys.org.uk>
wrote:

>
>
> On 10/05/2022 16:40, Tom Keddie via Dnsmasq-discuss wrote:
> > Hi All,
> >
> >     I think you're saying that it's not surprising that dnsmasq is not
> >     reading from the socket because the send queue is also full.
> >
> >
> > As per this thread on netdev
> > (https://lore.kernel.org/netdev/CABUuw65R3or9HeHsMT_isVx1f-7B6eCPPdr+bNR6f6wbKPnHOQ@mail.gmail.com/)
> > it seems we were consuming the socket send buffer with pending packets
> > waiting for ARP responses that were never coming.  This was causing
> > failures sending to devices that were still live.
> >
> > As per that thread we increased the /proc/sys/net/core/wmem_default
> > value so all sockets will have larger send buffers (the device has very
> > few sockets in use). It might be useful to add dnsmasq config options to
> > increase SO_SNDBUF on the DHCP and DNS sockets to allow more granular
> > control.
> >
> > Thanks, Tom Keddie
>
> So queries are being received, and answered, but the reply is being
> dropped by the kernel because the send queue is full of replies to dead
> hosts? If the hosts are dead, where are the queries coming from to
> generate these blocked replies?
>
> It might be sensible to automatically increase the send queue length
> when a packet send gets EAGAIN, at least the first time, but I'd like to
> understand exactly what's going on first.
>
>
> Simon.