[Dnsmasq-discuss] dnsmasq 2.86 seems to stop reading from one of its dns sockets after a period of time under load

Tue May 17 14:08:02 UTC 2022

> What value did you use?

I went brute force and used 1M.  The default on this ARM-based device was
also 212992.

root@MH7601:~# cat /proc/sys/net/core/wmem_default
1048576

I agree that is a lot, but given that the ARP queue length is 101 entries,
that is a lot of packets (especially if that might mean 101 hosts - I'm not
sure whether the arp/neigh queue is per host or per request).

root@MH7601:~# cat /proc/sys/net/ipv4/neigh/default/unres_qlen
101

This is a very controlled environment; there are only about 30 sockets open
at any time.  This approach won't suit most people, but it saved me from
crafting a patch into openwrt.
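
For anyone who would rather raise the send buffer per socket than
system-wide, here is a minimal sketch of the SO_SNDBUF idea.  It is not
dnsmasq code: the helper name request_send_buffer is made up for
illustration, and Python is used only to keep it short.  On Linux the
kernel doubles the requested value for bookkeeping and caps the request at
/proc/sys/net/core/wmem_max, so the granted size may differ from what was
asked for.

```python
import socket


def request_send_buffer(sock, requested_bytes):
    """Ask the kernel for a send buffer of requested_bytes on this socket.

    Returns the size the kernel reports afterwards, which may be larger
    (Linux doubles the value) or smaller (capped by wmem_max) than requested.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    # A UDP socket, like the DNS and DHCP sockets discussed in this thread.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        before = udp.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        after = request_send_buffer(udp, 1048576)
        print("send buffer before:", before, "after:", after)
```

If the reported size stays well below the request, wmem_max is probably the
limit and would need raising as well.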

Thanks,
Tom

On Mon, May 16, 2022 at 10:52 AM Simon Kelley <simon at thekelleys.org.uk>
wrote:

> What value did you use?
>
> On my Ubuntu desktop, /proc/sys/net/core/wmem_default and wmem_max are
> both 212992 which is a fair few DNS replies.
>
>
> Simon.
>
>
> On 16/05/2022 18:34, Tom Keddie wrote:
> > Hi Simon,
> >
> > Thanks for your response.  I don't have the detailed logs but it's a
> > noisy QA wireless environment where clients are coming and going a lot.
> > E.g. in syslog I could see instances where we would get a DHCP request
> > and then an L2 wireless disassociate message would appear immediately
> > afterwards; that response isn't going to be deliverable as unicast
> > (although for DHCP it might fall back to broadcast eventually).
> >
> > As we know, DNS isn't logged in such a manner but you could see the same
> > scenario unfolding where we get a bunch of dns requests, the client
> > drops off immediately afterwards and the responses can't be delivered.
> > When there's a lot of requests or a lot of clients you can see how the
> > socket buffer would fill.
> >
> > Increasing the socket buffers as I described below allowed the test to
> > run for the required 96 hours; without it we weren't making it past the
> > 48 hour mark.
> >
> > A dynamic solution might work provided it was carefully bounded to
> > prevent DoS.  If you have something you'd like us to test I can probably
> > arrange a time slot; it's a busy setup that needs lots of hardware though.
> >
> > Thanks,
> > Tom Keddie
> >
> > ps. this is a controlled environment (as much as you can control wifi),
> > there are no malicious actors nor intent in this scenario.  It's a soak
> > test with a large variety of clients all doing busy work like video
> > streaming etc.
> >
> >
> > On Fri, May 13, 2022 at 12:48 PM Simon Kelley <simon at thekelleys.org.uk> wrote:
> >
> >
> >
> >     On 10/05/2022 16:40, Tom Keddie via Dnsmasq-discuss wrote:
> >      > Hi All,
> >      >
> >      >     I think you're saying that it's not surprising that dnsmasq is
> >      >     not reading from the socket because the send queue is also full.
> >      >
> >      >
> >      > As per this thread on netdev
> >      > (https://lore.kernel.org/netdev/CABUuw65R3or9HeHsMT_isVx1f-7B6eCPPdr+bNR6f6wbKPnHOQ@mail.gmail.com/)
> >      > it seems we were consuming the socket send buffer with pending
> >      > packets waiting for ARP responses that were never coming.  This
> >      > was causing failures sending to devices that were still live.
> >      >
> >      > As per that thread we increased the /proc/sys/net/core/wmem_default
> >      > value so all sockets will have larger send buffers (the device has
> >      > very few sockets in use). It might be useful to add dnsmasq config
> >      > options to increase SO_SNDBUF on the dhcp and dns sockets to allow
> >      > more granular control.
> >      >
> >      > Thanks, Tom Keddie
> >
> >     So queries are being received, and answered, but the reply is being
> >     dropped by the kernel because the send queue is full of replies to
> >     dead hosts? If the hosts are dead, where are the queries coming from
> >     to generate these blocked replies?
> >
> >     It might be sensible to automatically increase the send queue length
> >     when a packet send gets EAGAIN, at least the first time, but I'd
> >     like to understand exactly what's going on first.
> >
> >
> >     Simon.
> >
>