[Dnsmasq-discuss] Slow DHCP Performance with 7500 Static Reservations

Simon Kelley simon at thekelleys.org.uk
Wed Jun 15 09:29:13 BST 2011


Mike Ruiz wrote:
> Hello All,
> 
> We use dnsmasq as a very effective replacement for the ISC software,
> thus far with great success. However, we have run into occasional
> performance problems at scale.
> 
> These seem to manifest as a general slowdown of the request -> reply
> process, which can sometimes exceed 90 seconds from request to reply.
> 
> The problems only cause issues when there is a penalty for missing a
> request, e.g. when we boot an entire rack of servers. In the default
> configuration, many servers will not attempt to retry the boot sequence,
> forcing us to detect hung machines and issue a remote reboot. We often
> have to do this several times in the frequent case of building more than
> a few hundred servers in a single batch.
> 
> Here are the symptoms:
> 
> 1. We never have any issue, even with mass installation, as long as the
> configuration contains fewer than a few thousand static entries and a few
> hundred subnets.
> 
> 2. At some point, dhcp will slow down: this sequence was taken with 7383
> static host entries defined:
> 
> ---------------------------------------------------------------------------
>   TIME: 19:43:47.480189
>     IP: > (00:01:e8:92:fd:41) >  (00:21:9b:a2:e4:47)
>     OP: 1 (BOOTPREQUEST)
>  HTYPE: 1 (Ethernet)
>   HLEN: 6
>   HOPS: 1
>    XID: 2472c541
> 
> <SNIP>
> 
> ---------------------------------------------------------------------------
>   TIME: 19:45:35.963209
>     IP: > (00:21:9b:a2:e4:47) >  (00:00:5e:00:01:8d)
>     OP: 2 (BOOTPREPLY)
>  HTYPE: 1 (Ethernet)
>   HLEN: 6
>   HOPS: 1
>    XID: 2472c541
> 
> <SNIP>
> 
> 3. If I reduce the number of static entries to, for example, 2408,
> response time returns to sub-second.
> 
> General configuration notes:
> 
> This request is handled by a relay, and we have the following config
> options in play:
> 
> no-ping
> no-hosts
> no-resolv
> cache-size=0
> dhcp-lease-max=20000
> dhcp-authoritative
> conf-dir=/etc/dnsmasq.d
> domain=sekret.zynga.com
> port=0   
> 
> Fast:
> 
> # ls -la /etc/dnsmasq.conf 
> -rw-r--r-- 1 root root 408222 Jun 14 20:20 /etc/dnsmasq.conf
> # grep dhcp-range /etc/dnsmasq.conf | wc
>      80      80    6340
> # grep dhcp-host /etc/dnsmasq.conf | wc
>    2408    2410  209798
> 
> Slow:
> 
> # ls -la /etc/dnsmasq.conf 
> -rw-r--r-- 1 root root 1205132 Jun 14 20:31 /etc/dnsmasq.conf
> # grep dhcp-range /etc/dnsmasq.conf | wc
>      80      80    6340
> # grep dhcp-host /etc/dnsmasq.conf | wc
>    7383    7387  650118
> 
> Are there any obvious inflection points that would cause the server to
> drop in performance by a few orders of magnitude with lots of hosts
> defined? 

Not obviously. Most of the code is O(number of dhcp-hosts), or at worst
O(dhcp-hosts * dhcp-ranges). There is one nested loop which is
O(dhcp-hosts squared), but that only runs once at startup.
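
As a rough sanity check on that scaling, here is a minimal sketch that
plugs the dhcp-host and dhcp-range counts quoted above into those growth
rates. It is only arithmetic on the reported numbers: it assumes cost is
exactly proportional to the stated terms, which the real code will only
approximate, and the helper name growth() is made up for illustration.

```python
# Counts taken from the "Fast" and "Slow" configurations quoted above.
FAST_HOSTS = 2408
SLOW_HOSTS = 7383
RANGES = 80  # identical in both configurations


def growth(fast, slow, exponent):
    """Predicted slowdown if cost scales as hosts**exponent (an assumption)."""
    return (slow / fast) ** exponent


# Worst-case per-request work, O(dhcp-hosts * dhcp-ranges), in abstract "steps".
fast_steps = FAST_HOSTS * RANGES
slow_steps = SLOW_HOSTS * RANGES

print("per-request steps:", fast_steps, "->", slow_steps)
print("linear slowdown:    %.2fx" % growth(FAST_HOSTS, SLOW_HOSTS, 1))
print("quadratic slowdown: %.2fx (startup only)" % growth(FAST_HOSTS, SLOW_HOSTS, 2))
```

Roughly three times as many hosts predicts roughly three times the work
per request, which is nowhere near the jump from sub-second to 90 seconds
or more.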


Are you sure that the effect is as clear-cut as you are describing, and
has no input from the number of active _leases_ and/or the rate at which
hosts are hitting the server? The most obvious mechanism for this is
that the packet queue in the kernel hits its limit and packets get
dropped. That slows things right down while waiting for the client to
time out and retransmit.
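
To illustrate that mechanism, here is a toy simulation; it is not dnsmasq
or kernel code. It assumes every client sends at t=0, the server answers
one request at a time, the queue limit is counted in packets (the real
kernel limit is in bytes), and a dropped client retries after one fixed
timeout. All names and numbers are hypothetical.

```python
import heapq
from collections import deque


def simulate(clients, queue_limit, service_time, retry_timeout):
    """Return (time of last reply, number of dropped requests)."""
    # Every client sends its first request at t=0.
    arrivals = [(0.0, c) for c in range(clients)]
    heapq.heapify(arrivals)
    queue = deque()          # enqueue times of requests waiting for the server
    server_free_at = 0.0
    drops = 0
    last_reply = 0.0

    while arrivals or queue:
        if queue and (not arrivals or server_free_at <= arrivals[0][0]):
            # The server picks up the oldest queued request and answers it.
            enqueued_at = queue.popleft()
            start = max(server_free_at, enqueued_at)
            server_free_at = start + service_time
            last_reply = server_free_at
        else:
            t, client = heapq.heappop(arrivals)
            if len(queue) >= queue_limit:
                # Queue full: the packet is lost and the client retries later.
                drops += 1
                heapq.heappush(arrivals, (t + retry_timeout, client))
            else:
                queue.append(t)

    return last_reply, drops
```

Comparing a run with a small queue_limit against one where the limit
exceeds the number of clients shows how drops and retry timeouts, rather
than per-request cost alone, can stretch the time to the last reply.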

Jan's valgrind suggestion is a good one, as is using netstat to look at
the size of socket receive queues.
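
For watching the receive queue over time, the same figure that netstat
reports as Recv-Q can be read from /proc/net/udp on Linux. The following
is a minimal sketch: the function name is made up, it assumes the DHCP
server socket is bound to UDP port 67, and it assumes a kernel whose
/proc/net/udp ends each row with a "drops" column.

```python
def parse_udp_queues(text, port=67):
    """Return [(rx_queue_bytes, drops)] for UDP sockets bound to `port`.

    `text` is the content of /proc/net/udp. Ports and queue sizes are
    hexadecimal in that file; the trailing drops counter is decimal.
    """
    results = []
    for line in text.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) < 13:                    # no drops column: skip row
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        rx_queue = int(fields[4].split(":")[1], 16)   # field is tx_queue:rx_queue
        drops = int(fields[-1])
        results.append((rx_queue, drops))
    return results
```

Feeding it open("/proc/net/udp").read() once a second during a mass boot
would show whether the receive queue stays near its limit and whether the
drops counter climbs.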

Simon.




> Are there any recommendations for tuning, beyond introducing
> more dnsmasq servers?
> 


We need to get to the root of the problem first, I think.


Simon.


