[Dnsmasq-discuss] crash on double free

Mon Sep 20 18:30:53 BST 2010

Simon Kelley <simon at thekelleys.org.uk> writes:

> Ferenc Wagner wrote:
>
>> Simon Kelley <simon at thekelleys.org.uk> writes:
>> 
>>> On 15/09/10 12:07, Ferenc Wagner wrote:
>>>
>>>> However, I also got a different crash with the original binary.  I hope
>>>> it's a different realisation of the same problem, can you confirm?
>>>
>>> I can't see any other reason for this problem, I'm pretty sure it's
>>> down to heap corruption from an earlier double-free.
>> 
>> It's a rather narrow chance, as I was running under electric fence...
>
> It was late..... I'll try again :-)
>
> At the point of the crash, 0xb7184f8c had already been freed and
> therefore mapped out by efence. Hence when 0xb7184f8c gets dereferenced
> by memcpy, it segfaults. This is consistent with the known and fixed bug.

Yes, if the segfault comes from the first byte moved, not from a later
out-of-range one, caused by the bogus value of "len".  Why, I've still
got the core, let's check...

(gdb) x/i $eip
0xb7599d5a <memcpy+26>:	movsw  %ds:(%esi),%es:(%edi)
(gdb) p/x $edi
$1 = 0xb7184f8c

You're right, it's the first access.
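
The distinction between "fault on the first byte" and "fault on a later
out-of-range byte" can be reproduced outside dnsmasq. What follows is only a
sketch of the page-protection idea, not Electric Fence's or dnsmasq's code: the
function name first_fault_offset() and its parameters are made up for
illustration, and it assumes a POSIX system with mmap(), mprotect() and
SA_SIGINFO. It places a destination "live" bytes before an inaccessible page,
copies "n" bytes one at a time, and reports the offset at which the copy
faulted.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

static sigjmp_buf fault_jmp;
static void *volatile fault_addr;

/* Record the faulting address and jump back out of the copy loop. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_addr = info->si_addr;
    siglongjmp(fault_jmp, 1);
}

/*
 * Hypothetical helper: write n bytes to a destination that has `live`
 * accessible bytes followed by a PROT_NONE page.
 * live == 0 models a buffer that was freed and mapped out entirely;
 * live > 0 with a too-large n models an overrun from a bogus length.
 * Returns the offset of the first faulting byte, -1 if the copy completed,
 * or -2 if the mapping could not be set up.
 */
long first_fault_offset(size_t live, size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa, old_segv, old_bus;
    unsigned char *map, *dst;
    long result = -1;

    assert(live <= page);   /* the accessible part must fit in one page */

    map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -2;
    if (mprotect(map + page, page, PROT_NONE) != 0) {
        munmap(map, 2 * page);
        return -2;
    }
    dst = map + page - live;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);   /* some systems raise SIGBUS here */

    if (sigsetjmp(fault_jmp, 1) == 0) {
        /* Byte-wise volatile stores keep the access order predictable. */
        volatile unsigned char *p = dst;
        size_t i;
        for (i = 0; i < n; i++)
            p[i] = 0xAA;
    } else {
        result = (long)((unsigned char *)fault_addr - dst);
    }

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(map, 2 * page);
    return result;
}
```

With live = 0 and n = 14 the fault is at offset 0, which is the shape of the
core above: the faulting address equals the start of the destination. With
live = 14 and a huge n the fault would instead land at offset 14, past the
start, which is what a genuinely bogus "len" on a still-valid buffer would
look like.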

> The value of "len" must be an optimisation artifact, there is no way that
> add_extradata_opt() could generate that value.

I've never seen such an artifact (pretty much nothing but a value being
optimized out), but my C is admittedly rusty.

(gdb) x $esp+0xc
0xbfda0e98:	0x0000000e

So memcpy() was called for 14 bytes, nothing like that crazy number.

>>>> I'm continuing testing the fix.  It usually took me tens of minutes to
>>>> reproduce the crash, but with the change it already survived more than
>>>> an hour.  Unfortunately, it isn't fully automatic (because of other bugs
>>>> in other software).
>>>
>>> To trigger this bug, there needs to be a dhcp-script, obviously. But
>>> also the rate of DHCP transactions needs to be fast enough and/or the
>>> script needs to be slow enough so that a second DHCP transaction
>>> happens on a lease before the first one has been sent to the
>>> DHCP-script. This is pretty rare, hence no-one has seen this bug, as
>>> far as I know, even though it has been lurking for some time (years).
>> 
>> Well, this doesn't fully match my test setup, which contained a single
>> netbooted Linux continuously rebooting in Qemu.  The exotic part is that
>> the PXE ROM used the network interface natively, while the Linux system
>> used it with an added 802.1q tag.  So a single lease was ping-ponging between
>> two different subnets.
>
> How much work was your dhcp-script doing?

It's nothing but a call to an SGE utility to add the new host to a
hostlist.  The script itself is nothing, and qconf shouldn't take long
either.  Occasionally it encounters a DNS problem (unable to resolve
host, cf. other thread), but even that's fast, not some 5 sec timeout.
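
To put rough numbers on the trigger condition quoted above (a second
transaction on a lease arriving before the first one has been sent to the
script), here is a toy timing model. It is not dnsmasq code: struct txn,
count_overlaps() and the assumption that the script helper handles queued
events strictly one at a time, in arrival order, are all mine. It only counts
how often the condition would be met for a given arrival pattern and script
run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of one DHCP transaction, times in milliseconds. */
struct txn {
    long arrival_ms;
    int lease_id;
};

/*
 * Count transactions that arrive on a lease while an earlier transaction
 * on the same lease is still waiting to be handed to the script.
 * Assumes t[] is sorted by arrival time and that the helper runs one
 * script at a time, each run taking script_ms.
 */
size_t count_overlaps(const struct txn *t, size_t n, long script_ms)
{
    long *sent_ms;
    long helper_free_ms;
    size_t i, j, hits = 0;

    assert(script_ms >= 0);
    if (n == 0)
        return 0;

    sent_ms = malloc(n * sizeof *sent_ms);
    if (sent_ms == NULL)
        abort();

    /* An event is sent once it has arrived and the helper is idle. */
    helper_free_ms = t[0].arrival_ms;
    for (i = 0; i < n; i++) {
        sent_ms[i] = t[i].arrival_ms > helper_free_ms
                         ? t[i].arrival_ms
                         : helper_free_ms;
        helper_free_ms = sent_ms[i] + script_ms;
    }

    /* Transaction j is a hit if some earlier same-lease event is unsent. */
    for (j = 1; j < n; j++) {
        for (i = 0; i < j; i++) {
            if (t[i].lease_id == t[j].lease_id &&
                sent_ms[i] > t[j].arrival_ms) {
                hits++;
                break;
            }
        }
    }

    free(sent_ms);
    return hits;
}
```

In this model, four transactions on one lease at 0, 10, 20 and 30 ms with a
1000 ms script give two hits, while transactions spaced seconds apart with a
100 ms script give none. That only restates the condition in numbers; it says
nothing about what actually happened in my setup.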

> By coincidence I had another report of this bug yesterday which
> triggered only when the DHCP transaction rate is high.

Lucky you! :)
-- 
Cheers,
Feri.