[Dnsmasq-discuss] Why is dnsmasq handing out the same IP to different MACs?

Simon Kelley simon at thekelleys.org.uk
Mon Apr 12 20:51:54 BST 2010


Paul Smith wrote:
> Hi guys.  I have a strange problem.  I'm running Red Hat EL 5.3 with
> dnsmasq 2.45 (Red Hat's package dnsmasq-2.45-1.el5_2.1 to be precise) on
> a server to which a lot of blades are attached: there are 96 blades with
> 2 NICs per blade, for a total of 192 different IP addresses.  I've got a
> dnsmasq config like:
> 
> dhcp-lease-max=255
> dhcp-range=10.0.0.17,10.0.15.254,infinite
> 
> 
> There's a very odd thing happening when I stop dnsmasq, remove my leases
> file, then restart dnsmasq, then I restart all the blades at once: I'm
> seeing dnsmasq hand out the same IP address to >1 different MAC address.
> For example:
> 
> Apr 12 12:18:18 NZ80123-H1 dnsmasq[14036]: started, version 2.45 cachesize 150
> Apr 12 12:18:18 NZ80123-H1 dnsmasq[14036]: compile time options: IPv6 GNU-getopt no-ISC-leasefile no-DBus no-I18N TFTP
> Apr 12 12:18:18 NZ80123-H1 dnsmasq[14036]: DHCP, IP range 10.0.2.0 -- 10.0.15.254, lease time infinite
>     ...
> Apr 12 12:20:05 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:08:05 
> Apr 12 12:20:05 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:08:05 
>     ...
> Apr 12 12:20:11 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b 
> Apr 12 12:20:11 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:02:0b
>     ...
> Apr 12 12:20:14 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:06:07
>     ...
> Apr 12 12:20:15 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:0a:03
>     ...
> Apr 12 12:20:15 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:0c:01
>     ...
> Apr 12 12:20:15 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:04:09
> 
> 
> Note that it's offered 10.0.5.15 to six different MAC addresses... there
> are plenty of IPs in the range so why do we overload them like this?
> 
> Then, later on, we get a problem when the first one registers, which
> works fine, but then all the others get a NAK:
> 
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:08:05
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPACK(bond2) 10.0.5.15 00:06:72:00:08:05
>     ...
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:02:0b
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:02:0b address in use
>     ...
> Apr 12 12:20:23 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:06:07
> Apr 12 12:20:23 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:06:07 address in use
>     ...
> Apr 12 12:20:23 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:0a:03
> Apr 12 12:20:23 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:0a:03 address in use
>     ...
> Apr 12 12:20:24 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:0c:01
> Apr 12 12:20:24 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:0c:01 address in use
>     ...
> Apr 12 12:20:24 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:04:09
> Apr 12 12:20:24 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:04:09 address in use
> 
> After this, the other interfaces re-acquire a new IP address, but these
> also end up being used already, until finally we get one that works for
> us; for example here's the 02:0b MAC:
> 
> Apr 12 12:20:11 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:20:11 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:02:0b
> Apr 12 12:20:16 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:20:16 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:02:0b
> Apr 12 12:20:18 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:20:18 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.15 00:06:72:00:02:0b
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.15 00:06:72:00:02:0b
> Apr 12 12:20:22 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.15 00:06:72:00:02:0b address in use
>     ...
> Apr 12 12:20:51 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:20:51 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.16 00:06:72:00:02:0b
> Apr 12 12:20:55 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.16 00:06:72:00:02:0b
> Apr 12 12:20:55 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.16 00:06:72:00:02:0b address in use
>     ...
> Apr 12 12:21:39 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:21:39 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.17 00:06:72:00:02:0b
> Apr 12 12:22:36 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:22:36 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.17 00:06:72:00:02:0b
> Apr 12 12:22:51 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.17 00:06:72:00:02:0b
> Apr 12 12:22:51 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.17 00:06:72:00:02:0b address in use
>     ...
> Apr 12 12:22:51 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:22:51 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.18 00:06:72:00:02:0b
> Apr 12 12:22:55 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.18 00:06:72:00:02:0b
> Apr 12 12:22:55 NZ80123-H1 dnsmasq[14036]: DHCPNAK(bond2) 10.0.5.18 00:06:72:00:02:0b address in use
>     ...
> Apr 12 12:23:03 NZ80123-H1 dnsmasq[14036]: DHCPDISCOVER(bond2) 00:06:72:00:02:0b
> Apr 12 12:23:03 NZ80123-H1 dnsmasq[14036]: DHCPOFFER(bond2) 10.0.5.19 00:06:72:00:02:0b
> Apr 12 12:23:07 NZ80123-H1 dnsmasq[14036]: DHCPREQUEST(bond2) 10.0.5.19 00:06:72:00:02:0b
> Apr 12 12:23:07 NZ80123-H1 dnsmasq[14036]: DHCPACK(bond2) 10.0.5.19 00:06:72:00:02:0b
> 
> Finally we find a free one... but note that this takes us 3 minutes!!
> 
> By that time the monitor programs that I use to verify that all 96
> blades are up have timed out: they're not expecting to have to wait all
> this extra time for the DHCP to complete.
> 
> Note that once the system is up, then if I stop and restart it it works
> fine, since the leases file gives a valid set of IP addresses.  But if I
> delete my leases file and start from scratch, the problem re-occurs.
> 
> Is this expected behavior?  Although I suppose it does not violate the
> standard since until both sides accept either side can change its mind
> (IIRC from the DHCP handshake spec) it seems... sub-optimal :-).  Is
> this a known issue with this version of dnsmasq?  Is it resolved in
> newer versions?  Is there something I can do to work around it without
> rolling my own newer version of dnsmasq?
> 
> Unfortunately for reasons that are too complicated to go into, I run
> into the above situation a good bit during my internal testing and it's
> causing heartburn.
> 
> 

You've hit an unfortunate set of circumstances. What happens is that,
for the first phase of the DHCP transaction (DHCPDISCOVER/DHCPOFFER),
dnsmasq picks an address to offer based on a hash of the MAC address.
That means it doesn't have to record any information about which
addresses it is offering to which host. This makes the database simpler,
but an individual host will still (almost always) be offered the same IP
address. Once a host completes the DHCPREQUEST/DHCPACK phase, the IP
address goes into the database and is claimed; it won't then be offered
to another host. Rare hash collisions are handled by the DHCPNAK
mechanism you observed.

You are seeing problems because you are running lots of hosts through
the address-acquisition process simultaneously and their MAC addresses
are all very similar because they have the same manufacturer. This is
causing the rather unsophisticated hash function to generate lots of
collisions.

All non-ancient versions of dnsmasq use the same hash function, so you
can't improve things by changing versions. I'm not clear why you need to
delete the lease database: the problem would be fixed by leaving it in
place and using long leases. You would only need to take the pain of
address allocation once, and if you batched the blades you could
probably avoid the collisions.

You could try fiddling with the hash: it's in address_allocate() in
src/dhcp.c and the current code looks like:

/* hash hwaddr */
for (j = 0, i = 0; i < hw_len; i++)
  j += hwaddr[i] + (hwaddr[i] << 8) + (hwaddr[i] << 16);

The value of j is later used modulo the size of the address range.
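
To see why these particular blades collide, here is a minimal standalone
sketch, not the dnsmasq source itself. It wraps the loop above in a
function and applies it to the six MAC addresses from the log. The names
hash_hwaddr, range_offset and log_macs are made up for illustration, and
the range size of 3583 assumes the range 10.0.2.0 -- 10.0.15.254 logged at
startup, with both ends counted.

```c
#include <assert.h>

/* Same arithmetic as the loop quoted above: every octet is effectively
   multiplied by 0x10101, so the result depends only on the sum of the
   octets, not on their order. */
unsigned int hash_hwaddr(const unsigned char *hwaddr, int hw_len)
{
  unsigned int j = 0;
  int i;

  for (i = 0; i < hw_len; i++)
    j += hwaddr[i] + (hwaddr[i] << 8) + (hwaddr[i] << 16);

  return j;
}

/* Hypothetical helper: reduce the hash to an offset into the range.
   range_size is an assumption supplied by the caller. */
unsigned int range_offset(unsigned int hash, unsigned int range_size)
{
  assert(range_size > 0);
  return hash % range_size;
}

/* The six MAC addresses that were all offered 10.0.5.15 in the log. */
const unsigned char log_macs[6][6] = {
  {0x00, 0x06, 0x72, 0x00, 0x08, 0x05},
  {0x00, 0x06, 0x72, 0x00, 0x02, 0x0b},
  {0x00, 0x06, 0x72, 0x00, 0x06, 0x07},
  {0x00, 0x06, 0x72, 0x00, 0x0a, 0x03},
  {0x00, 0x06, 0x72, 0x00, 0x0c, 0x01},
  {0x00, 0x06, 0x72, 0x00, 0x04, 0x09},
};
```

The octets of all six addresses sum to 133, so hash_hwaddr() returns the
same value for each one, and range_offset(hash, 3583) gives 783 every
time. If that offset is added to the start of the range (an assumption
about the surrounding code), 10.0.2.0 + 783 is 10.0.5.15, the address
seen in the log.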

Looking at that code, it really only adds up the values of the octets of
the MAC address, so any two addresses whose octets sum to the same value
collide. Weighting each octet by its position, something like

for (j = 0, i = 0; i < hw_len; i++)
  j += i * hwaddr[i] + (hwaddr[i] << 8) + (hwaddr[i] << 16);

might be enough to fix things.
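
As a rough check of that suggestion, the following standalone sketch
(again not dnsmasq code) runs the position-weighted loop over the six MAC
addresses from the log and counts how many distinct offsets they map to.
The names hash_hwaddr_weighted, sample_macs and count_distinct_offsets
are hypothetical, and the range size passed in is an assumption; 3583
corresponds to the range 10.0.2.0 -- 10.0.15.254 shown in the startup log.

```c
#include <assert.h>

/* Position-weighted variant: octet i also contributes i * hwaddr[i], so
   moving value from one octet to another changes the result. */
unsigned int hash_hwaddr_weighted(const unsigned char *hwaddr, int hw_len)
{
  unsigned int j = 0;
  int i;

  for (i = 0; i < hw_len; i++)
    j += i * hwaddr[i] + (hwaddr[i] << 8) + (hwaddr[i] << 16);

  return j;
}

/* The six MAC addresses that were all offered 10.0.5.15 in the log. */
const unsigned char sample_macs[6][6] = {
  {0x00, 0x06, 0x72, 0x00, 0x08, 0x05},
  {0x00, 0x06, 0x72, 0x00, 0x02, 0x0b},
  {0x00, 0x06, 0x72, 0x00, 0x06, 0x07},
  {0x00, 0x06, 0x72, 0x00, 0x0a, 0x03},
  {0x00, 0x06, 0x72, 0x00, 0x0c, 0x01},
  {0x00, 0x06, 0x72, 0x00, 0x04, 0x09},
};

/* Hypothetical helper: count how many distinct range offsets the six
   sample addresses map to for a given range size. */
int count_distinct_offsets(unsigned int range_size)
{
  unsigned int offsets[6];
  int distinct = 0;
  int a, b;

  assert(range_size > 0);

  for (a = 0; a < 6; a++) {
    int seen = 0;

    offsets[a] = hash_hwaddr_weighted(sample_macs[a], 6) % range_size;
    for (b = 0; b < a; b++)
      if (offsets[b] == offsets[a])
        seen = 1;
    if (!seen)
      distinct++;
  }

  return distinct;
}
```

With a range size of 3583, count_distinct_offsets() returns 6, so these
six blades no longer share an offset. The resulting offsets are only two
apart, though, so this is still a weak hash and other sets of MAC
addresses could well collide.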

If you can fiddle with that, and get something which works better, I'd
be interested to see the patch.


HTH

Simon.








