[Dnsmasq-discuss] CPU spinning bug, possibly related to SSHFP queries

Geert Stappers stappers at stappers.nl
Fri Nov 29 21:40:40 GMT 2019


On Fri, Nov 29, 2019 at 09:11:45PM +0100, Tore Anderson wrote:
> * Geert Stappers
> 
> >> Here are the corresponding log lines from Dnsmasq:
> >>
> >> nov. 29 07:15:53.964856 sloth.fud.no dnsmasq[48069]: query[A] l1-g9-osl2.n.bitbit.net from 127.0.0.1
> >> nov. 29 07:15:53.965060 sloth.fud.no dnsmasq[48069]: forwarded l1-g9-osl2.n.bitbit.net to 87.238.33.1
> >> nov. 29 07:15:53.965155 sloth.fud.no dnsmasq[48069]: query[AAAA] l1-g9-osl2.n.bitbit.net from 127.0.0.1
> >> nov. 29 07:15:53.965273 sloth.fud.no dnsmasq[48069]: forwarded l1-g9-osl2.n.bitbit.net to 87.238.33.1
> >> nov. 29 07:15:54.039407 sloth.fud.no dnsmasq[48069]: reply l1-g9-osl2.n.bitbit.net is <CNAME>
> >> nov. 29 07:15:54.039461 sloth.fud.no dnsmasq[48069]: reply eth0.l1-g9-osl2.n.bitbit.net is 10.20.120.102
> >> nov. 29 07:15:54.039666 sloth.fud.no dnsmasq[48069]: reply l1-g9-osl2.n.bitbit.net is <CNAME>
> >> nov. 29 07:15:54.039700 sloth.fud.no dnsmasq[48069]: reply eth0.l1-g9-osl2.n.bitbit.net is NODATA-IPv6
> >> nov. 29 07:15:54.964042 sloth.fud.no dnsmasq[48069]: query[type=44] l1-g9-osl2.n.bitbit.net from 127.0.0.1
> >> (CPU starts spinning at this point, no further log lines appear)
> > 
> > The pcap file says / shows
> > 
> > 07:15:53.964639 IP localhost.58896 > localhost.domain: 26883+ [1au] A?
> > l1-g9-osl2.n.bitbit.net. (52)
> > 07:15:53.964696 IP localhost.58896 > localhost.domain: 9742+ [1au] AAAA?
> > l1-g9-osl2.n.bitbit.net. (52)
> 
> 1+2: A and AAAA queries from libc stub resolver to Dnsmasq
> 
> > 07:15:53.964955 IP 100.66.5.6.46462 > dns.i.bitbit.net.domain: 45004+
> > [1au] A? l1-g9-osl2.n.bitbit.net. (52)
> > 07:15:53.965208 IP 100.66.5.6.46462 > dns.i.bitbit.net.domain: 42294+
> > [1au] AAAA? l1-g9-osl2.n.bitbit.net. (52)
> 
> 3+4: Dnsmasq forwarding the queries in packets 1+2 to upstream DNS server
> 
> > 07:15:54.039175 IP dns.i.bitbit.net.domain > 100.66.5.6.46462: 45004*
> > 2/3/7 CNAME eth0.l1-g9-osl2.n.bitbit.net., A 10.20.120.102 (284)
> > 07:15:54.039273 IP dns.i.bitbit.net.domain > 100.66.5.6.46462: 42294*
> > 1/1/1 CNAME eth0.l1-g9-osl2.n.bitbit.net. (145)
> 
> 5+6: Upstream DNS server answers Dnsmasq's queries in packets 3+4
> 
> > 07:15:54.039525 IP localhost.domain > localhost.58896: 26883* 2/3/7
> > CNAME eth0.l1-g9-osl2.n.bitbit.net., A 10.20.120.102 (284)
> > 07:15:54.039735 IP localhost.domain > localhost.58896: 9742* 1/1/1 CNAME
> > eth0.l1-g9-osl2.n.bitbit.net. (145)
> 
> 7+8: Answers to queries 1+2 from Dnsmasq to libc stub resolver
> 
> (At this point, the SSH client proceeds to connect to the SSH server, and approx. 1 second later it has received a host key that wants to verify.)
> 
> > 07:15:54.963758 IP localhost.49114 > localhost.domain: 35496+ [1au]
> > SSHFP? l1-g9-osl2.n.bitbit.net. (52)
> 
> 9: SSHFP query from libc stub resolver to Dnsmasq (unanswered but logged). Dnsmasq breaks and starts spinning on the CPU.
> 
> > 07:15:58.440050 IP localhost.60721 > localhost.domain: 17459+ [1au] A?
> > mattermost.redpill-linpro.com. (58)
> > 07:15:58.440064 IP localhost.60721 > localhost.domain: 44236+ [1au]
> > AAAA? mattermost.redpill-linpro.com. (58)
> > 07:15:59.968059 IP localhost.49114 > localhost.domain: 35496+ [1au]
> > SSHFP? l1-g9-osl2.n.bitbit.net. (52)
> > 07:16:02.394821 IP localhost.56180 > localhost.domain: 61011+ AAAA?
> > fedoraproject.org. (35)
> > 07:16:03.445058 IP localhost.60721 > localhost.domain: 17459+ [1au] A?
> > mattermost.redpill-linpro.com. (58)
> > 07:16:03.445071 IP localhost.60721 > localhost.domain: 44236+ [1au]
> > AAAA? mattermost.redpill-linpro.com. (58)
> > 07:16:07.399014 IP localhost.56180 > localhost.domain: 61011+ AAAA?
> > fedoraproject.org. (35)
> 
> 10-16: Various queries from libc stub resolver to Dnsmasq (unanswered and not logged)
> 
> > The 07:15:54.039273 packet is last where source host and destination
> > host differ.  After that packet are the remaining packets only between
> > localhost and localhost.
> > 
> > Looking closer reveals that the first two packets are also between
> > localhost.
> > 
> > Strange, but probably not the problem.
> 
> I am not sure how this is strange?
 
I'm not used to see computers talking to them self.

The netwerksniffs I do are at the wire.

I glad learned about the "-i  any"  option.   Thanks.


> To me, the packet capture annotated above is exactly what I'd expect,
> given the symptom of «SSHFP query makes Dnsmasq break».
> 
> > | $ host l1-g9-osl2.n.bitbit.net
> > | l1-g9-osl2.n.bitbit.net is an alias for eth0.l1-g9-osl2.n.bitbit.net.
> > | eth0.l1-g9-osl2.n.bitbit.net has address 10.20.120.102
> > | $ host -t unknowntype l1-g9-osl2.n.bitbit.net
> > | host: invalid type: unknowntype
> > |
> > | $ host -t sshfp l1-g9-osl2.n.bitbit.net
> > | l1-g9-osl2.n.bitbit.net is an alias for eth0.l1-g9-osl2.n.bitbit.net.
> > | $ dig -t sshfp  l1-g9-osl2.n.bitbit.net
> > |
> > | ; <<>> DiG 9.11.5-P4-5.1-Debian <<>> -t sshfp l1-g9-osl2.n.bitbit.net
> > | ;; global options: +cmd
> > | ;; Got answer:
> > | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13365
> > | ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
> > |
> > | ;; OPT PSEUDOSECTION:
> > | ; EDNS: version: 0, flags:; udp: 4096
> > | ; COOKIE: 905e71c3631f8577e2d927615de0d63aaa0a6211eb0ec2e1 (good)
> > | ;; QUESTION SECTION:
> > | ;l1-g9-osl2.n.bitbit.net.    IN    SSHFP
> > |
> > | ;; ANSWER SECTION:
> > | l1-g9-osl2.n.bitbit.net. 86103    IN    CNAME    eth0.l1-g9-osl2.n.bitbit.net.
> > |
> > | ;; AUTHORITY SECTION:
> > | n.bitbit.net.        3314    IN    SOA    ns-foo.i.bitbit.net.  hostmaster.redpill-linpro.com. 2018033152 10800 3600 2419200 3600
> > |
> > | ;; Query time: 0 msec
> > | ;; SERVER: 192.168.42.24#53(192.168.42.24)
> > | ;; WHEN: vr nov 29 09:26:34 CET 2019
> > | ;; MSG SIZE  rcvd: 173
> > |
> > | $
> > 
> > 
> > I think the problem is not getting a "sshfp response" upon a "sshfp query".
> 
> I don't think that is possible, given the fact that Dnsmasq never
> gets around to forwarding the SSHFP query to the upstream DNS server
> in the first place.
> 
> In other words,
> it doesn't get an SSHFP response because it doesn't issue an SSHFP query.
> 
> It breaks immediately after receiving (and logging)
> the *incoming* SSHFP query from the libc stub resolver.

All true.    (and sorry for not having that so clear as O.P. has)

Could caching be involved?  That we are seeing only when it fails,
not seeing in the libpcap file what led to the fail?


Groeten
Geert Stappers

P.S.
My earlier test today on `host -t sshfp l1-g9-osl2.n.bitbit.net` was not
on Dnsmasq DNS.  A test on an old Dnsmasq shows that several `host -t
sshfp l1-g9-osl2.n.bitbit.net` are possible.
-- 
Leven en laten leven



More information about the Dnsmasq-discuss mailing list