[Dnsmasq-discuss] Dnsmasq with Gigantic hosts file

Tue Jan 30 16:40:53 GMT 2007

Jan 'RedBully' Seiffert wrote:
> Simon Kelley wrote:
>> Jan 'RedBully' Seiffert wrote:
>>
>>> While it is common that most of these entries will point to one address
>>> (be it 127.0.0.1 or a local LAN machine) and F_REVERSE not set for them,
>>> it's a pity it will still slow down reverse lookups...
>> Reverse lookups currently iterate over every cache entry, including (in
>> the case) the 700k from /etc/hosts which don't have F_REVERSE set. It
>> will ignore them, of course.
>>
>> Actually the observation that most of these entries don't set F_REVERSE
>> is a valuable insight; it's also case that they will have the F_IMMORTAL
>> bit set.
>>
> At least for the "common" 10k+ entrys from /etc/hosts case. Maybe a
> large installation can also generate a lot of reverse entries, but that
> should be more in the 100'th, which a server should be able to search.
> Everything else could be simply over engineering.

Agreed.

> 
>> This allows a very simple optimisation: just put the reverse entries at
>> the start of the hash-chain. Reverse lookups still need to check every
>> hash chain, but only the first few entries which have F_REVERSE set, all
>> the thousands of entries from /etc/hosts further down the list can just
>> be skipped.
>>
>> The IMMORTAL bit allows a similar optimisation for garbage-collection
>> (the cache_scan_free() function.) IMMORTAL entries never need to be
>> garbage collected, so  by putting them at the _end_ of the hash chain,
>> iterating over every entry to find expired ones is no longer impacted by
>> having lots of /etc/hosts names.
>>
> Hmmmm, OK, thats a impressive simple way to go.
> This (re)sorting (last hit gets placed in front of chain?) and the
> garbage-collection are things which scared me away not to touch the
> complete rest of the code ;)
> 
>> I've not tested this yet, but I think the following patch should be enough.
> [snip - patch]
> 
> But doesn't this sorting clash with the resorting of cache_find_by_name
> (according to line 487)?

The cache entries are on two different linked lists, the
least-recently-used one doubly-linked on ->next and ->prev and the
hash-list, linked on ->hash_next. You are thinking about the LRU list
but I'm talking about the hash-list.

This code has evolved slowly over a long time and has all sorts of scary
stuff like that. I'd do it very differently from scratch but what's
there 1) works well and fast. 2) had the bugs shaken out long ago. so I
don't feel inclined to toss it.

> Is it necessary to traverse the lists to the end of the "region" on
> insert?

Yes, so there's an extra cost on insert, but that scales with
hash-bucket size, not total cache size, and it isn't affected by huge
/etc/hosts files since almost all the entries from those will be in the
third "immortal" region, which never needs to be traversed.

> 
> I will also test it, time to update to the last version...
> 

Thanks. 2.36 didn't change cache.c, AFAIR, so the patch should apply to
2.35 or 2.36.

> I only tried to make it least intrusive because the lack of in deep
> knowledge of the original cache code (and how every little bit works
> together in every corner case). Also this way you can switch off the
> tree to save like 2k RAM for memory constrained system like the OpenWRT.
> Thats maybe not the optimal approach (and may not work this way) but as
> i said, i tried to make it least intrusive, and from my look over the
> code, these places looked like the best place to hook in.

I'd do it the same way in the same situation.

(but I need to worry on concurrent access, coding on dnsmasq is
> mind relaxing in this regard ;)

a select-loop is a wonderful thing ;-)

Cheers,

Simon.