[Dnsmasq-discuss] CPU at 100%

Simon Kelley simon at thekelleys.org.uk
Wed Jan 18 14:37:46 GMT 2012


On 18/01/12 14:12, Christopher Moore ( Linux Epos) wrote:
> -----Original message-----
> To:	Christopher Moore ( Linux Epos)<chris at linuxepos.com>;
> CC:	dnsmasq-discuss at lists.thekelleys.org.uk;
> From:	Simon Kelley<simon at thekelleys.org.uk>
> Sent:	Wed 18-01-2012 14:00
> Subject:	Re: [Dnsmasq-discuss] CPU at 100%
>> On 18/01/12 13:35, Christopher Moore ( Linux Epos) wrote:
>>
>>>
>>> Thanks for the quick reply.
>>>
>>> Dnsmasq is being started via:
>>>
>>> nice -n 0 initlog -q -c /usr/local/sbin/dnsmasq --cache-size=500 --dns-forward-max=150
>>>
>>> Here's the output of lsof -c dnsmasq (This output was taken when the process is using 100% of the CPU) :
>>>
>>> COMMAND   PID   USER   FD   TYPE     DEVICE SIZE/OFF     NODE NAME
>>> dnsmasq 16818 nobody  cwd    DIR        9,5     4096        2 /
>>> dnsmasq 16818 nobody  rtd    DIR        9,5     4096        2 /
>>> dnsmasq 16818 nobody  txt    REG        9,5   164296    22152 /usr/local/sbin/dnsmasq
>>> dnsmasq 16818 nobody  mem    REG        9,5    42496   775855 /lib/libnss_files-2.10.1.so
>>> dnsmasq 16818 nobody  mem    REG        9,5  1327456   775734 /lib/libc-2.10.1.so
>>> dnsmasq 16818 nobody  mem    REG        9,5   117348   355541 /lib/ld-2.10.1.so
>>> dnsmasq 16818 nobody    0u   CHR        1,3      0t0    19603 /dev/null
>>> dnsmasq 16818 nobody    1u   CHR        1,3      0t0    19603 /dev/null
>>> dnsmasq 16818 nobody    2u   CHR        1,3      0t0    19603 /dev/null
>>> dnsmasq 16818 nobody    3u  IPv4   20851151      0t0      UDP *:domain
>>> dnsmasq 16818 nobody    4u  IPv4   20851152      0t0      TCP *:domain (LISTEN)
>>> dnsmasq 16818 nobody    5r  FIFO        0,6      0t0 20851159 pipe
>>> dnsmasq 16818 nobody    6w  FIFO        0,6      0t0 20851159 pipe
>>> dnsmasq 16818 nobody    7u  unix 0xf6ed9a80      0t0 20851162 socket
>>>
>>>
>>> Dnsmasq configuration is :
>>>
>>> domain-needed
>>> bogus-priv
>>> resolv-file=/var/igaware/local/nameservers
>>> user=nobody
>>> group=nobody
>>> interface=eth0
>>> interface=eth1
>>> interface=eth2
>>> no-dhcp-interface=eth0
>>> no-dhcp-interface=eth1
>>> no-dhcp-interface=eth2
>>> cache-size=500
>>> local-ttl=3600
>>>
>>>
>>> I have just realised that the eth2 interface doesn't actually exist on the machine, would that cause a problem?
>>>
>>
>>
>> Lack of eth2 won't cause a problem.
>>
>> It would be useful to see the output of lsof _before_ the 100% CPU
>> phase. What's obvious from the information we already have is that:
>>
>> 1) The netlink socket which should be open, isn't.
>>
>> 2) dnsmasq believes that the netlink socket is open and that it's file
>> descriptor zero.
>>
>> File descriptor zero is actually open to /dev/null (that's OK). My guess
>> is that something, somewhere on the machine writing to /dev/null is
>> enough to make it ready for reading in select() and that's when the 100%
>> CPU thing starts.
>>
>> I assume because it's in /usr/local/sbin that this is a locally-compiled
>> binary. It's not locally-modified code, is it?
>>
>>
>> Cheers,
>>
>> Simon.
>
> Hi,
>
> Here's lsof when things are OK:
>
> COMMAND   PID   USER   FD      TYPE     DEVICE SIZE/OFF     NODE NAME
> dnsmasq 12073 nobody  cwd       DIR        9,5     4096        2 /
> dnsmasq 12073 nobody  rtd       DIR        9,5     4096        2 /
> dnsmasq 12073 nobody  txt       REG        9,5   164296    22152 /usr/local/sbin/dnsmasq
> dnsmasq 12073 nobody  mem       REG        9,5    42496   775855 /lib/libnss_files-2.10.1.so
> dnsmasq 12073 nobody  mem       REG        9,5  1327456   775734 /lib/libc-2.10.1.so
> dnsmasq 12073 nobody  mem       REG        9,5   117348   355541 /lib/ld-2.10.1.so
> dnsmasq 12073 nobody    0u      CHR        1,3      0t0    19603 /dev/null
> dnsmasq 12073 nobody    1u      CHR        1,3      0t0    19603 /dev/null
> dnsmasq 12073 nobody    2u      CHR        1,3      0t0    19603 /dev/null
> dnsmasq 12073 nobody    3u  netlink                 0t0 20962043 ROUTE
> dnsmasq 12073 nobody    4u     IPv4   20962047      0t0      UDP *:domain
> dnsmasq 12073 nobody    5u     IPv4   20962048      0t0      TCP *:domain (LISTEN)
> dnsmasq 12073 nobody    6r     FIFO        0,6      0t0 20962055 pipe
> dnsmasq 12073 nobody    7w     FIFO        0,6      0t0 20962055 pipe
> dnsmasq 12073 nobody    8u     unix 0xe2ca4a40      0t0 20962058 socket
>
> ls -l /dev/null
>
> crwxrwxrwx 1 root root 1, 3 Jul 18  2001 /dev/null
>
> The binary is locally compiled, but the source isn't modified.
>

OK, that has the netlink socket, on file descriptor 3.

  dnsmasq 12073 nobody    3u  netlink                 0t0 20962043 ROUTE

It's very difficult to see how a process in that state could get to the
broken state: if the netlink socket had been closed, then file
descriptor 3 would simply be missing. Instead, the UDP and TCP listen
sockets have shifted down from 4 and 5 to 3 and 4. That implies
that, in the broken process, the netlink socket never existed.
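
To make that comparison concrete, here is a small Python sketch. The two
tables are copied from the lsof listings in this thread (numeric
descriptors only); the helper names closed_later and never_opened are
made up for the illustration and have nothing to do with dnsmasq's code.

```python
# Descriptor tables copied from the two lsof listings in this thread.
# Only numbered descriptors are listed; cwd/rtd/txt/mem rows are left out.
BROKEN = {   # PID 16818, the process using 100% CPU
    0: "/dev/null", 1: "/dev/null", 2: "/dev/null",
    3: "UDP *:domain", 4: "TCP *:domain (LISTEN)",
    5: "pipe (read)", 6: "pipe (write)", 7: "unix socket",
}
HEALTHY = {  # PID 12073, the process behaving normally
    0: "/dev/null", 1: "/dev/null", 2: "/dev/null",
    3: "netlink ROUTE",
    4: "UDP *:domain", 5: "TCP *:domain (LISTEN)",
    6: "pipe (read)", 7: "pipe (write)", 8: "unix socket",
}


def closed_later(table, fd):
    """Table after `fd` is closed in a running process: the slot is simply
    empty, and every other descriptor keeps its number."""
    return {n: what for n, what in table.items() if n != fd}


def never_opened(table, fd):
    """Table if `fd` had never been allocated at start-up: everything opened
    after it lands one slot lower."""
    shifted = {}
    for n, what in sorted(table.items()):
        if n == fd:
            continue
        shifted[n - 1 if n > fd else n] = what
    return shifted


if __name__ == "__main__":
    print("matches 'closed later': ", closed_later(HEALTHY, 3) == BROKEN)
    print("matches 'never opened': ", never_opened(HEALTHY, 3) == BROKEN)
```

Only the "never opened" table lines up with what lsof shows for the
broken process.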

In addition, dnsmasq's idea of which socket is the netlink socket has
been reset from three to zero. Memory corruption in the dnsmasq process
could explain that, but not the file descriptor changes: the descriptor
table is all in the kernel.
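
For anyone wondering why a netlink descriptor of zero would matter when
descriptor zero is /dev/null: as far as I know, Linux reports /dev/null
as readable on every select() call, and a read from it returns
end-of-file at once, so an event loop watching it never gets to sleep.
A minimal Python sketch of that behaviour (this is not dnsmasq code, and
count_wakeups and its parameters are names invented here):

```python
import os
import select


def count_wakeups(path="/dev/null", rounds=1000):
    """Poll `path` with select() `rounds` times and count how often it is
    reported readable. Hypothetical helper, not part of dnsmasq."""
    fd = os.open(path, os.O_RDONLY)
    wakeups = 0
    try:
        for _ in range(rounds):
            # Timeout 0 asks "is it ready right now?" without blocking.
            readable, _w, _x = select.select([fd], [], [], 0)
            if fd in readable:
                os.read(fd, 4096)  # /dev/null gives end-of-file immediately
                wakeups += 1
    finally:
        os.close(fd)
    return wakeups


if __name__ == "__main__":
    rounds = 1000
    print(count_wakeups(rounds=rounds), "wakeups out of", rounds, "polls")
```

A loop that blocks in select() on such a descriptor comes straight back
every time, which is what 100% CPU looks like from the outside.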


Are you sure that the model of "dnsmasq processes start with a netlink
socket and then change state to lose it and start using 100% CPU" is
correct? An alternative is that "dnsmasq processes sometimes start
without a netlink socket, and those that do then go on to use 100% CPU".

The simplest explanation for all the data we have so far is that the
socket() call which creates the netlink socket sometimes fails to do so
and returns zero (not -1, which is the error case and is trapped).
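
As background, and not as a diagnosis of this machine: zero is not an
error value for socket(). POSIX hands out the lowest-numbered unused
descriptor, so a check for -1 will never fire on a zero return, and a
process whose slot zero happened to be free would get zero back quite
legitimately. The Python sketch below shows the lowest-free-slot rule
using a higher slot as a stand-in for slot zero, so that it is safe to
run anywhere. It assumes a single-threaded process, and
reuse_lowest_free_slot is a name invented for the example.

```python
import os
import socket


def reuse_lowest_free_slot():
    """Free one descriptor slot, create a socket, and return the pair
    (freed_fd, socket_fd). Hypothetical helper for illustration only."""
    # Each open() takes the lowest free slot, so after these three calls
    # every slot below held[0] is in use and held[0] is the smallest.
    held = [os.open("/dev/null", os.O_RDONLY) for _ in range(3)]
    freed = held.pop(0)
    os.close(freed)  # this slot is now the lowest free one
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        got = sock.fileno()  # the number socket(2) returned
    finally:
        sock.close()
        for fd in held:
            os.close(fd)
    return freed, got


if __name__ == "__main__":
    freed, got = reuse_lowest_free_slot()
    print("freed slot", freed, "- new socket landed on slot", got)
```

The new socket lands on whichever slot was just freed. If that slot were
zero, the caller would see a return value of zero with nothing having
gone wrong in the call itself.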

What kernel are you using?

Cheers,

Simon.








