[Dnsmasq-discuss] Dnsmasq with Gigantic hosts file

Wed Jan 11 15:09:57 GMT 2012

On 11/01/12 14:23, Jan Seiffert wrote:

>> If you're curious, the second file is a copy of the NIS hosts file,
>
> NIS ...
> hmmm, all hosts of the same domain?
>
> Simon, i already forgot how the cache works, could this be a massive
> hash collision?

It's possible, but I have another hypothesis: the performance fixes back 
in 2007 are predicated on most entries in massive host files for the 
same IP address. That's reasonable because all of the really big 
hostsfiles we were seeing were for ad-blocking, and mapped nearly all 
the domains to 127.0.0.1 or 0.0.0.0

Part of the read-in process does a reverse lookup on the previously-read 
entries. This is needed because only the first occurrence of an IP 
address is used for reverse mapping, so

domain1.com 1.2.3.4
domain2.com 1.2.3.4

both domain1.com and domian2.com forward-map to 1.2.3.4, but 1.2.3.4 
only backwards maps to domain1.com

To implement this, as each line is read in a host file, the address is 
looked up in the cache to determine if it's been seen before. As the 
cache is NOT hashed for by-address lookups, this is an O(n) operation, 
and reading a hostfile is O(n^2) worst-case.

The 2007 speedup simply checks to see if the address on this line is the 
same as the address in the previous line. If so we know it's a duplicate 
in O(1) time and avoid the O(n) lookup. Works brilliantly when almost 
all addresses are the same, but won't help for a file full of distinct 
addresses.
>
> [snip]
>> (It's a
>> new high-end laptop with tons of memory, fast CPU, and an SSD, so it I
>> shouldn't hit any issues there.)

memory and disk are almost certainly not relevant, and any CPU in the 
universe will be overwhelmed eventually by an O(n^2) algorithm.

>>  From what I've read, including a thread with the same subject line in the
>> archives, the startup time should be just a second or two, not a minute,
>> though that was five years ago:
>> http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2007q1/001134.html
>>

>>
>> Any suggestions for how to fix this?

Try commenting out the code around line 650 in cache.c which starts with 
a big comment block explaining the tweak I mention above, starting

/* Ensure there is only one address -> name mapping (first one trumps)

If goes fast then I've guessed right. Assuming I have, there are various 
solutions: add hashing for by-adddress cache lookups, stipulate that big 
hostfiles should be sorted by address so that we can rely on the 
adjacent-line test without falling back to the slow exhaustive search, 
abandon the "first-appearance-of-address is used for reverse lookups" 
semantics, declare it a non-problem.

We can argue about that once we know it's the correct issue.

HTH

Simon.