[Dnsmasq-discuss] Dnsmasq with Gigantic hosts file
Simon Kelley
simon at thekelleys.org.uk
Wed Jan 11 15:09:57 GMT 2012
On 11/01/12 14:23, Jan Seiffert wrote:
>> If you're curious, the second file is a copy of the NIS hosts file,
>
> NIS ...
> hmmm, all hosts of the same domain?
>
> Simon, i already forgot how the cache works, could this be a massive
> hash collision?
It's possible, but I have another hypothesis: the performance fixes back
in 2007 are predicated on most entries in massive host files for the
same IP address. That's reasonable because all of the really big
hostsfiles we were seeing were for ad-blocking, and mapped nearly all
the domains to 127.0.0.1 or 0.0.0.0
Part of the read-in process does a reverse lookup on the previously-read
entries. This is needed because only the first occurrence of an IP
address is used for reverse mapping, so
domain1.com 1.2.3.4
domain2.com 1.2.3.4
both domain1.com and domian2.com forward-map to 1.2.3.4, but 1.2.3.4
only backwards maps to domain1.com
To implement this, as each line is read in a host file, the address is
looked up in the cache to determine if it's been seen before. As the
cache is NOT hashed for by-address lookups, this is an O(n) operation,
and reading a hostfile is O(n^2) worst-case.
The 2007 speedup simply checks to see if the address on this line is the
same as the address in the previous line. If so we know it's a duplicate
in O(1) time and avoid the O(n) lookup. Works brilliantly when almost
all addresses are the same, but won't help for a file full of distinct
addresses.
>
> [snip]
>> (It's a
>> new high-end laptop with tons of memory, fast CPU, and an SSD, so it I
>> shouldn't hit any issues there.)
memory and disk are almost certainly not relevant, and any CPU in the
universe will be overwhelmed eventually by an O(n^2) algorithm.
>> From what I've read, including a thread with the same subject line in the
>> archives, the startup time should be just a second or two, not a minute,
>> though that was five years ago:
>> http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2007q1/001134.html
>>
>>
>> Any suggestions for how to fix this?
Try commenting out the code around line 650 in cache.c which starts with
a big comment block explaining the tweak I mention above, starting
/* Ensure there is only one address -> name mapping (first one trumps)
If goes fast then I've guessed right. Assuming I have, there are various
solutions: add hashing for by-adddress cache lookups, stipulate that big
hostfiles should be sorted by address so that we can rely on the
adjacent-line test without falling back to the slow exhaustive search,
abandon the "first-appearance-of-address is used for reverse lookups"
semantics, declare it a non-problem.
We can argue about that once we know it's the correct issue.
HTH
Simon.
More information about the Dnsmasq-discuss
mailing list