[Dnsmasq-discuss] CPU at 100%
Simon Kelley
simon at thekelleys.org.uk
Wed Jan 18 14:37:46 GMT 2012
On 18/01/12 14:12, Christopher Moore ( Linux Epos) wrote:
> -----Original message-----
> To: Christopher Moore ( Linux Epos)<chris at linuxepos.com>;
> CC: dnsmasq-discuss at lists.thekelleys.org.uk;
> From: Simon Kelley<simon at thekelleys.org.uk>
> Sent: Wed 18-01-2012 14:00
> Subject: Re: [Dnsmasq-discuss] CPU at 100%
>> On 18/01/12 13:35, Christopher Moore ( Linux Epos) wrote:
>>
>>>
>>> Thanks for the quick reply.
>>>
>>> Dnsmasq is being started via:
>>>
>>> nice -n 0 initlog -q -c /usr/local/sbin/dnsmasq --cache-size=500 --dns-forward-max=150
>>>
>>> Here's the output of lsof -c dnsmasq (this output was taken when the
>>> process is using 100% of the CPU):
>>>
>>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
>>> dnsmasq 16818 nobody cwd DIR 9,5 4096 2 /
>>> dnsmasq 16818 nobody rtd DIR 9,5 4096 2 /
>>> dnsmasq 16818 nobody txt REG 9,5 164296 22152 /usr/local/sbin/dnsmasq
>>> dnsmasq 16818 nobody mem REG 9,5 42496 775855 /lib/libnss_files-2.10.1.so
>>> dnsmasq 16818 nobody mem REG 9,5 1327456 775734 /lib/libc-2.10.1.so
>>> dnsmasq 16818 nobody mem REG 9,5 117348 355541 /lib/ld-2.10.1.so
>>> dnsmasq 16818 nobody 0u CHR 1,3 0t0 19603 /dev/null
>>> dnsmasq 16818 nobody 1u CHR 1,3 0t0 19603 /dev/null
>>> dnsmasq 16818 nobody 2u CHR 1,3 0t0 19603 /dev/null
>>> dnsmasq 16818 nobody 3u IPv4 20851151 0t0 UDP *:domain
>>> dnsmasq 16818 nobody 4u IPv4 20851152 0t0 TCP *:domain (LISTEN)
>>> dnsmasq 16818 nobody 5r FIFO 0,6 0t0 20851159 pipe
>>> dnsmasq 16818 nobody 6w FIFO 0,6 0t0 20851159 pipe
>>> dnsmasq 16818 nobody 7u unix 0xf6ed9a80 0t0 20851162 socket
>>>
>>>
>>> Dnsmasq configuration is :
>>>
>>> domain-needed
>>> bogus-priv
>>> resolv-file=/var/igaware/local/nameservers
>>> user=nobody
>>> group=nobody
>>> interface=eth0
>>> interface=eth1
>>> interface=eth2
>>> no-dhcp-interface=eth0
>>> no-dhcp-interface=eth1
>>> no-dhcp-interface=eth2
>>> cache-size=500
>>> local-ttl=3600
>>>
>>>
>>> I have just realised that the eth2 interface doesn't actually exist on
>>> the machine, would that cause a problem?
>>>
>>
>>
>> Lack of eth2 won't cause a problem.
>>
>> It would be useful to see the output of lsof _before_ the 100% CPU
>> phase. What's obvious from the information we already have is that:
>>
>> 1) The netlink socket which should be open, isn't.
>>
>> 2) dnsmasq believes that the netlink socket is open and that it's file
>> descriptor zero.
>>
>> File descriptor zero is actually open to /dev/null (that's OK). My guess
>> is that something, somewhere on the machine writing to /dev/null is
>> enough to make it ready for reading in select() and that's when the 100%
>> CPU thing starts.
>>
>> I assume because it's in /usr/local/sbin that this is a locally-compiled
>> binary. It's not locally-modified code, is it?
>>
>>
>> Cheers,
>>
>> Simon.
>
> Hi,
>
> Here's lsof when things are OK:
>
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> dnsmasq 12073 nobody cwd DIR 9,5 4096 2 /
> dnsmasq 12073 nobody rtd DIR 9,5 4096 2 /
> dnsmasq 12073 nobody txt REG 9,5 164296 22152 /usr/local/sbin/dnsmasq
> dnsmasq 12073 nobody mem REG 9,5 42496 775855 /lib/libnss_files-2.10.1.so
> dnsmasq 12073 nobody mem REG 9,5 1327456 775734 /lib/libc-2.10.1.so
> dnsmasq 12073 nobody mem REG 9,5 117348 355541 /lib/ld-2.10.1.so
> dnsmasq 12073 nobody 0u CHR 1,3 0t0 19603 /dev/null
> dnsmasq 12073 nobody 1u CHR 1,3 0t0 19603 /dev/null
> dnsmasq 12073 nobody 2u CHR 1,3 0t0 19603 /dev/null
> dnsmasq 12073 nobody 3u netlink 0t0 20962043 ROUTE
> dnsmasq 12073 nobody 4u IPv4 20962047 0t0 UDP *:domain
> dnsmasq 12073 nobody 5u IPv4 20962048 0t0 TCP *:domain (LISTEN)
> dnsmasq 12073 nobody 6r FIFO 0,6 0t0 20962055 pipe
> dnsmasq 12073 nobody 7w FIFO 0,6 0t0 20962055 pipe
> dnsmasq 12073 nobody 8u unix 0xe2ca4a40 0t0 20962058 socket
>
> ls -l /dev/null
>
> crwxrwxrwx 1 root root 1, 3 Jul 18 2001 /dev/null
>
> The binary is locally compiled, but the source isn't modified.
>
OK, that has the netlink socket, on file descriptor 3.

    dnsmasq 12073 nobody 3u netlink 0t0 20962043 ROUTE

It's very difficult to see how a process in that state could get to the
broken state: if the netlink socket had been closed, file descriptor 3
would simply be missing. Instead, the UDP and TCP listen sockets have
moved down from 4 and 5 to 3 and 4. That suggests that the netlink
socket never existed.

In addition, dnsmasq's idea of which descriptor is the netlink socket
has been reset from three to zero. Memory corruption in the dnsmasq
process could explain that, but not the file descriptor changes: those
are all in the kernel.
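
The point about descriptor numbering can be checked directly. The following is a minimal sketch in Python, not dnsmasq code (the function name is made up for illustration). It assumes a single-threaded process on a POSIX system, where a new descriptor always gets the lowest free number. It shows that closing a descriptor leaves a hole and does not renumber the ones opened after it.

```python
import os


def fd_numbers_after_close():
    """Open three descriptors, close the middle one, and report what is left.

    Illustrative only: /dev/null stands in for the netlink, UDP and TCP
    sockets seen in the lsof output.
    """
    # Each open() receives the lowest descriptor number that is free.
    fds = [os.open(os.devnull, os.O_RDONLY) for _ in range(3)]
    first, middle, last = fds

    os.close(middle)

    # The two remaining descriptors keep their original numbers.
    still_open = []
    for fd in (first, last):
        os.fstat(fd)  # would raise OSError if fd were no longer open
        still_open.append(fd)

    # The closed number is simply missing from the table.
    try:
        os.fstat(middle)
        middle_open = True
    except OSError:
        middle_open = False

    # Only a later open() fills the hole, reusing the freed number.
    reused = os.open(os.devnull, os.O_RDONLY)

    for fd in (first, last, reused):
        os.close(fd)
    return fds, still_open, middle_open, reused
```

A process that had closed descriptor 3 would therefore show a gap at 3 in lsof, with the listen sockets still on 4 and 5, which is not what the broken process shows.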

Are you sure that the model of "dnsmasq processes start with a netlink
socket and then change state to lose it and start using 100% CPU" is
correct? An alternative is that "dnsmasq processes sometimes start
without a netlink socket, and those that do then go on to use 100% CPU".

The simplest explanation for all the data we have so far is that the
socket() call which creates the netlink socket sometimes fails to do so
and returns zero (rather than -1, which is the error case and is
trapped).
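
To illustrate why a mistaken descriptor number of zero would be enough to cause the spin, here is a small sketch in Python (again not dnsmasq code, and the function name is hypothetical). It assumes Linux, where select() generally reports /dev/null as readable straight away and a read from it returns end-of-file, so a loop that waits on such a descriptor never actually sleeps.

```python
import os
import select


def count_immediate_wakeups(iterations=1000):
    """Count how often select() reports /dev/null readable without waiting.

    Assumption: Linux behaviour, where /dev/null is treated as always
    ready. The descriptor here plays the role of the bogus "netlink
    socket" on descriptor zero.
    """
    fd = os.open(os.devnull, os.O_RDONLY)
    wakeups = 0
    try:
        for _attempt in range(iterations):
            # A zero timeout makes select() return at once, so this only
            # reports whether the descriptor is ready right now.
            readable, _w, _x = select.select([fd], [], [], 0)
            if fd in readable:
                os.read(fd, 512)  # /dev/null yields an empty read (EOF)
                wakeups += 1
    finally:
        os.close(fd)
    return wakeups
```

If every call reports the descriptor as ready, an event loop built on select() would go round continuously doing no useful work, which matches the 100% CPU symptom.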

What kernel are you using?

Cheers,

Simon.