[Dnsmasq-discuss] dnsmasq using 100% of cpu

Geert Stappers stappers at stappers.nl
Mon Mar 2 21:58:36 GMT 2020


On Thu, Feb 20, 2020 at 10:49:35PM +0000, Simon Kelley wrote:
> On 17/02/2020 14:37, Geert Stappers wrote:
> > On 17-02-2020 14:31, Donald Sharp wrote:
> > 
> >> Running:
> >>
> >> sharpd@eva:~/dnsmasq$ /sbin/dnsmasq --version
> >> Dnsmasq version 2.80  Copyright (c) 2000-2018 Simon Kelley
> > 
> > 2018, and no short git hash or similar indicator of the source version.
> > 
> > 
> >> Compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua
> >> TFTP conntrack ipset auth DNSSEC loop-detect inotify dumpfile
> >> ----
> >>
> >> When I install several hundred thousand routes into the kernel and
> >> remove them (or some variation thereof), dnsmasq eventually ends up
> >> running at 100% CPU:
> >>
> >> top - 18:45:18 up 1 day,  7:44,  1 user,  load average: 2.70, 2.65, 2.34
> >> Tasks: 424 total,   3 running, 421 sleeping,   0 stopped,   0 zombie
> >> %Cpu(s): 12.1 us,  6.9 sy,  0.0 ni, 80.2 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
> >> MiB Mem :  32131.3 total,  19483.6 free,   6620.3 used,   6027.4 buff/cache
> >> MiB Swap:  32718.0 total,  31693.0 free,   1025.0 used.  24698.2 avail Mem
> >>
> >>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >>  293183 nobody    20   0   11040   2040   1688 R  99.7   0.0 148:48.40 dnsmasq
> > 
> > 
> > The "100% CPU" made me run `git log` and search it for 'CPU'.  I found:
> > 
> > 
> > commit df6636bff61aa53ed7ad4b34d940805193c0bc74
> > Author: Florent Fourcot <florent.fourcot at wifirst.fr>
> > Date:   Mon Feb 11 17:04:44 2019 +0100
> > 
> >     lease: prune lease as soon as expired
> >    
> >     We detected a performance issue on a dnsmasq running many dhcp sessions
> >     (more than 10 000). At the end of the day, the server was only releasing
> >     old DHCP leases but was consuming a lot of CPU.
> >    
> >     It looks like current dhcp pruning:
> >      1) it's pruning old sessions (iterate on all current leases). It's
> >      important to note that it's only pruning session expired since more
> >      than one second
> >      2) it's looking for next lease to expire (iterate on all current leases
> >      again)
> >      3) it launches an alarm to catch next expiration found in step 2). This
> >      value can be zero for leases just expired (but not pruned).
> >    
> >     So, for a second, dnsmasq could fall in a "prune loop" by doing:
> >      * Not pruning anything, since difftime() is not > 0
> >      * Run alarm again with zero as argument
> >    
> >     On a server with very large number of leases and releasing often
> >     sessions, that can waste a very big CPU time.
> >    
> >     Signed-off-by: Florent Fourcot <florent.fourcot at wifirst.fr>
> > 
> > 
> > 
> > 
> >>
> >> strace output:
> >>
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5,
> >> events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8,
> >> events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLERR}])
> >>     ....
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5,
> >> events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8,
> >> events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLERR}])
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5,
> >> events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8,
> >> events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=PO^Cstrace: Process
> >> 293183 detached
> >>
> >> I can pretty much make this happen at will.  What can I provide to
> >> help debug this?
> > 
> > Start with stating how recent the source is that you are using.
> > 
> > 
> >>
> >> As a side note, I was not placing these routes into the default linux
> >> routing table.  Does dnsmasq need to be paying attention to these routes?
> > 
> > Side notes in a separate thread  please.
> > 
> > 
> >>
> >> donald
> >>
> > 
> > Regards
> > 
> > Geert Stappers
> > 
> 
> Geert, you're confusing things.

Sorry for matching  CPU load  with CPU load.


> It's perfectly clear that the process is
> running 100% CPU because the poll() calls are returning an error which
> the code is not expecting and doesn't handle. It just calls poll()
> again, and because the error wasn't cleared, poll returns immediately
> again, rinse and repeat.
> 
> The solution is to handle the error (it's not obvious to me how to do
> that) or to avoid creating the error condition in the first place.
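
To make "handle the error" a bit more concrete, here is a minimal sketch,
not taken from the dnsmasq source. It assumes Linux and assumes that
fetching the pending error with getsockopt(SO_ERROR) is an acceptable
reaction; whether that is the right handling for dnsmasq is exactly what
is still open. The helper name drain_poll_errors() is hypothetical.

```c
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, not dnsmasq code.
   For every descriptor that poll() flagged with POLLERR, fetch the
   pending socket error with SO_ERROR and log it. On Linux, reading
   SO_ERROR also resets the pending error on the socket.
   Returns the number of descriptors that had POLLERR set. */
static int drain_poll_errors(const struct pollfd *fds, nfds_t nfds)
{
  int count = 0;
  nfds_t i;

  for (i = 0; i < nfds; i++)
    {
      int err = 0;
      socklen_t len = sizeof(err);

      if (!(fds[i].revents & POLLERR))
        continue;

      count++;
      if (getsockopt(fds[i].fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        perror("getsockopt(SO_ERROR)");
      else
        fprintf(stderr, "fd %d: pending error %d (%s)\n",
                fds[i].fd, err, strerror(err));
    }

  return count;
}
```

Called right after poll() returns, something like this would at least put
the errno value in the log instead of spinning silently.
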
> 
> To get further, we need to know which socket is erroring. It's file
> descriptor four in the strace, but is that the netlink socket, a DHCP
> socket, or a socket used to talk DNS upstream or downstream? We
> don't know without further information.


Geert Stappers
-- 
Silence is hard to parse


