[Dnsmasq-discuss] dnsmasq + Cloudflare's DoH service + netflix = stalled

Simon Kelley simon at thekelleys.org.uk
Mon Apr 16 22:45:17 BST 2018


On 16/04/18 14:29, Dominik DL6ER wrote:
> Hey Simon,
> 
> I hope you have also seen my other mail (providing some more details
> about the TCP issue). It' actually likely that the debugger test was
> with "-d" to prevent dnsmasq from forking. Would running it with "-k" be
> better in the debugger?

Yes, -k doesn't inhibit forking, which makes it behave better, but it's
more difficult to debug, because you have to follow the action into
subsidiary processes.
> 
> The main concern is that dnsmasq even hangs when invoked via a standard
> "service dnsmasq start" and not running at all in a debugger or with
> debugging options set anywhere. The hint that dnsmasq will forward TCP
> queries also via TCP seems very interesting. Could it be that there is
> any place in here where blocking of dnsmasq could happen when the
> upstream doesn't answer properly? I will try the dig +vc ... test soon
> and report back.

What is supposed to happen is this:


The main dnsmasq process creates listening sockets for UDP and TCP. When
a TCP connection comes in it calls accept() to create the TCP
connection, and then forks. The forked process then reads queries from
the TCP connection and answers them until either TCP_MAX_QUERIES have
been answered, or CHILD_LIFETIME is reached or the client closes the
connection. By default TCP_MAX_QUERIES is 100 and CHILD_LIFETIME is
150s. These can be altered in src/config.h and recompiled.

Note that there's another tunable: MAX_PROCS, which defaults to 20. Once
20 child TCP processes exist, the main process will cease accepting new
TCP connections until one of them exits. If your client is making TCP
connections and then leaving them open, after it's done that 20 times
within 150 seconds, it won't be able to make any more connections.

If the upstream doesn't answer properly, then the child process will be
stuck for 150s (and after 20 get stuck, new client connections will be
blocked) at no time should the other functions of dnsmasq be affected:
it will still answer DNS over UDP, DHCP, TFTP etc.


Looking at your backtrace, the process is blocked reading a query from
the client. That's what I'd expect if the client doesn't close the TCP
connection when it's done: dnsmasq hangs around for 150s waiting for
another query which never comes. In this case you used the -d flag, so
that blocks everything else, but it shouldn't block everything without
the -d flag.


Now looking at the log you posted: note that the process-id is always
the same: dnsmasq is not forking. If that log came from the production
instance and not the debugging one, I'd say someone has used -d instead
of -k to start the dnsmasq daemon. They wouldn't be the first to confuse
the two. That, and the propensity of your client to keep TCP connections
open, would be enough to explain what you're seeing.


Cheers,

Simon.








> 
> Best regards,
> Dominik
> 
> 
> On 16.04.2018 14:06, Simon Kelley wrote:
>> When you're running under the debugger, are you starting dnsmasq with
>> the -d flag? If so, then I'd expect it to stop serving any other
>> requests whilst it's blocked doing TCP stuff. Normally, dnsmasq spawns a
>> new process for each TCP connection, but that is suppressed by the -d
>> flag, to make debugging easier.
>>
>> Dnsmasq will always use TCP connections to upstream servers to get
>> answers to queries it receives over TCP, Have you tried the query direct
>> to your upstream server over TCP, to see how it copes (dig +vc command)?
>>
>> Cheers,
>>
>> Simon.
>>
>>
>> On 08/04/18 14:21, Dominik wrote:
>>> Dear dnsmasq enthusiasts,
>>>
>>> we encountered a rather strange issue with dnsmasq, the new Cloudflare
>>> DNS over HTTPS service
>>> (https://bendews.com/posts/implement-dns-over-https/) and Netflix (only
>>> in this combination).
>>>
>>> Once Netflix is started, it queries ichnaea.netflix.com. This query gets
>>> answered by dnsmasq, but dnsmasq becomes altogether unresponsive
>>> thereafter. These are the last lines logged (extra logging enabled):
>>>
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 query[A]
>>> api-global.netflix.com from 192.168.0.6
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 forwarded
>>> api-global.netflix.com to 127.0.0.1
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.netflix.com is <CNAME>
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.geo.netflix.com is <CNAME>
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.5.237.4
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.44.197.215
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.45.38.64
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.45.118.37
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.54.15.52
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.54.22.121
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.54.174.212
>>> Apr  8 13:30:49 dnsmasq[16025]: 3 192.168.0.6/53524 reply
>>> api-global.latency.prodaa.netflix.com is 52.54.242.21
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 query[A]
>>> api-global.netflix.com from 192.168.0.6
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.netflix.com is <CNAME>
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.geo.netflix.com is <CNAME>
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.54.242.21
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.54.174.212
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.54.22.121
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.54.15.52
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.45.118.37
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.45.38.64
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.44.197.215
>>> Apr  8 13:30:49 dnsmasq[16025]: 4 192.168.0.6/50363 cached
>>> api-global.latency.prodaa.netflix.com is 52.5.237.4
>>>
>>> For seeing where it hangs, we compiled dnsmasq 2.79 from source and
>>> added debug symbols (-g3). The crash is the same. When I interrupt
>>> dnsmasq while it is stalled, I see this backtrace:
>>>
>>> #0  0x00007ffff7b15700 in __read_nocancel () at
>>> ../sysdeps/unix/syscall-template.S:84
>>> #1  0x0000555555566533 in read_write (fd=fd at entry=13,
>>> packet=packet at entry=0x7fffffffe334 "\002[\001", size=size at entry=1,
>>> rw=rw at entry=1) at util.c:654
>>> #2  0x000055555557029c in tcp_request (confd=13, now=1523191598,
>>> local_addr=0x7fffffffe410, netmask=..., auth_dns=0) at forward.c:1713
>>> #3  0x0000555555575999 in check_dns_listeners (now=1523191598) at
>>> dnsmasq.c:1745
>>> #4  0x000055555555d019 in main (argc=<optimized out>, argv=<optimized
>>> out>) at dnsmasq.c:1061
>>>
>>> Please tell me if you would like to see any further information for
>>> debugging.
>>>
>>> Best regards,
>>> Dominik
>>>
>>>
>>> _______________________________________________
>>> Dnsmasq-discuss mailing list
>>> Dnsmasq-discuss at lists.thekelleys.org.uk
>>> http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
>>>
>> _______________________________________________
>> Dnsmasq-discuss mailing list
>> Dnsmasq-discuss at lists.thekelleys.org.uk
>> http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
> 




More information about the Dnsmasq-discuss mailing list