[Dnsmasq-discuss] Partial denial of service with dnsmasq on resource constrained systems

Thu Apr 1 07:40:13 UTC 2021

Hey Tony,

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> You're right that text segments are fairly small and shared; memory usage
> was dominated by storage for blocklists read from file. This makes the
> problem more general than just tiny systems, since people tend to size
> their blocklists proportional to system memory size.

I wouldn't say this. Users also try to squeeze in files that are too large
even when they do not have enough memory for them...

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> You're also right that actual memory footprint increases only minimally
> with each fork() thanks to copy-on-write; I'm certain these OOM systems
> aren't really exhausting memory. But I do think there's confusion around
> memory usage optimizations like COW vs. memory accounting used for OOM.

OOM is just severely broken IMO. As a concept. Linux should likely not
allow overcommitment at all: software has no way to cope with memory that
it successfully allocated some time ago turning out not to be available.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> I recall looking at dnsmasq process statistics on OOM invocation, and
> noticed their VM set sizes were usually close to total system memory,
> i.e. COW wasn't relevant. And from a dnsmasq proc memory map, the large
> segment storing the blocklist was marked read-write. I suspect that
> despite COW, since that memory is *potentially* writable it's being
> accounted for at fork() time.

Technically, fork() needs to allocate as much memory as the program is
currently using, but /proc/[pid]/maps won't tell you whether the memory is
copy-on-write or not. The mapping is certainly read-write: otherwise the
forked child would be sent SIGSEGV as soon as it wrote to it. Instead, a
write to a copy-on-write page triggers a page fault, the page is
duplicated, and you can continue happily as if nothing had happened. The
"p" (private) flag doesn't help much here either, because it only
distinguishes the mapping from "s" (shared).
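
To make the copy-on-write behaviour concrete, here is a minimal sketch (not
dnsmasq code; the function name is made up for illustration). The child
writes to a buffer inherited across fork(), and the parent then checks that
its own copy is unchanged:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the parent's buffer is untouched after
 * the child wrote to its own copy, 0 if it changed, -1 on error. */
static int cow_write_stays_private(size_t len)
{
    assert(len > 0);

    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'A', len);              /* touch every page before forking */

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: this write faults, and the kernel gives the child its
         * own copy of the page. No SIGSEGV is delivered. */
        buf[0] = 'B';
        _exit(buf[0] == 'B' ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_untouched = (buf[0] == 'A');
    free(buf);

    return child_ok ? parent_untouched : -1;
}
```

Calling cow_write_stays_private(1 << 20) should return 1 on any system
with a working fork(). It shows the semantics only and says nothing about
how the kernel accounts for the memory.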

It *should* be possible to extract the relevant information from
/proc/[pid]/pagemap and then check the details of the page(s) in
/proc/kpageflags for KPF_SWAPBACKED (page is backed by swap/RAM). This is
the only way I'm aware of to check if this is a copy-on-write page existing
in multiple places.
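
A rough sketch of the first half of that lookup, reading one entry from
/proc/self/pagemap and decoding it. The bit positions follow the kernel's
pagemap documentation as far as I know it; the struct and function names
are made up for illustration. Note that the PFN reads as zero without
CAP_SYS_ADMIN on recent kernels, so the follow-up read from
/proc/kpageflags (one 64-bit word per PFN) effectively needs root:

```c
#define _DEFAULT_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical decoded view of one 64-bit pagemap entry. */
struct pagemap_entry {
    int      present;    /* bit 63: page is in RAM */
    int      swapped;    /* bit 62: page is in swap */
    int      exclusive;  /* bit 56: page is exclusively mapped (Linux >= 4.2) */
    uint64_t pfn;        /* bits 0-54: page frame number, if present */
};

static struct pagemap_entry pagemap_decode(uint64_t raw)
{
    struct pagemap_entry e;
    e.present   = (int)((raw >> 63) & 1);
    e.swapped   = (int)((raw >> 62) & 1);
    e.exclusive = (int)((raw >> 56) & 1);
    e.pfn       = e.present ? (raw & ((UINT64_C(1) << 55) - 1)) : 0;
    return e;
}

/* Look up the pagemap entry for an address in the calling process.
 * Returns 0 on success, -1 if pagemap cannot be read. */
static int pagemap_lookup(const void *addr, struct pagemap_entry *out)
{
    assert(out != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    uint64_t raw = 0;
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)page) * (off_t)sizeof raw;
    ssize_t n = pread(fd, &raw, sizeof raw, off);
    close(fd);
    if (n != (ssize_t)sizeof raw)
        return -1;

    *out = pagemap_decode(raw);
    return 0;
}
```

With a non-zero pfn, the flags word would be read the same way from
/proc/kpageflags at offset pfn * 8. The "exclusive" bit may already be
enough to tell whether a page is still shared after a fork, but I have
not verified that on a running dnsmasq.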

If you know a simpler way to do this, I'd be happy to learn.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> A possible fix I'd suggest is to update dnsmasq's memory handling. IIRC,
> we use the same cache structure and memory allocation for both DNS cache
> and storing static server lists read from file. Perhaps use a separate,
> page-aligned memory pool to store these lists, then after initialization
> (and before forking) use mprotect() to set the region as read-only.
> 
> Assuming it works, this would have the advantage of being a no-knobs
> solution vs. setting kludgey process or connection limits.

I like the idea of splitting the cache in two parts, say a static and a
dynamic cache. Using mprotect() shouldn't even be necessary, but it helps
to ensure we're not writing to the static part of the cache anywhere in
the code.
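
A minimal sketch of what such a static pool could look like, assuming a
plain anonymous mmap() region that is filled during initialization and
then sealed read-only. All names here are hypothetical and none of this
reflects dnsmasq's actual cache structures:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical pool for static list data read from file. */
struct static_pool {
    void  *base;   /* page-aligned start of the mapping */
    size_t size;   /* mapping length, rounded up to whole pages */
    size_t used;   /* bytes handed out so far */
};

/* Map a writable, page-aligned region of at least `want` bytes. */
static int static_pool_init(struct static_pool *p, size_t want)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || want == 0)
        return -1;

    size_t pg = (size_t)page;
    size_t size = (want + pg - 1) / pg * pg;
    void *m = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    p->base = m;
    p->size = size;
    p->used = 0;
    return 0;
}

/* Copy a name into the pool; returns NULL when the pool is full. */
static const char *static_pool_add(struct static_pool *p, const char *name)
{
    assert(p->used <= p->size);

    size_t len = strlen(name) + 1;
    if (len > p->size - p->used)
        return NULL;

    char *dst = (char *)p->base + p->used;
    memcpy(dst, name, len);
    p->used += len;
    return dst;
}

/* After initialization (and before forking), make the pool read-only.
 * Any later write to it then fails with SIGSEGV instead of going
 * unnoticed. */
static int static_pool_seal(struct static_pool *p)
{
    return mprotect(p->base, p->size, PROT_READ);
}
```

The sealing step catches stray writes to the static part. Whether it also
changes what the kernel accounts for at fork() is a separate question that
would need testing; the sketch does not show that.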

KSM (kernel samepage merging) comes to my mind as well, but this seems to
be the wrong tool for the job. Figured I should mention it nonetheless.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> One other thing I saw while testing with large blocklists was a
> noticeable latency increase, likely related to lookup times. I recall
> some discussion on the ML where you mentioned work on a hash/tree
> solution was in progress. Were those changes completed?

Yes, dnsmasq uses hash buckets to minimize the amount of memory it has to
loop over when trying to find a name.
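
For illustration only, the general idea of bucketed lookup looks roughly
like the sketch below. This is not dnsmasq's implementation: the table
size, the FNV-1a hash and all names are assumptions picked to keep the
example short.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define NBUCKETS 1024u   /* illustrative size only */

struct name_entry {
    struct name_entry *next;   /* next entry in the same bucket */
    char name[];               /* NUL-terminated domain name */
};

struct name_table {
    struct name_entry *bucket[NBUCKETS];   /* must start zeroed */
};

/* FNV-1a over the lowercased bytes, so lookups are case-insensitive. */
static unsigned name_hash(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (uint32_t)tolower((unsigned char)*s);
        h *= 16777619u;
    }
    return (unsigned)(h % NBUCKETS);
}

/* Insert a copy of `name` at the head of its bucket. */
static int name_table_add(struct name_table *t, const char *name)
{
    assert(t != NULL && name != NULL);

    size_t len = strlen(name) + 1;
    struct name_entry *e = malloc(sizeof *e + len);
    if (e == NULL)
        return -1;
    memcpy(e->name, name, len);

    unsigned b = name_hash(name);
    e->next = t->bucket[b];
    t->bucket[b] = e;
    return 0;
}

/* Only the entries in one bucket are compared, not the whole list. */
static int name_table_contains(const struct name_table *t, const char *name)
{
    const struct name_entry *e;
    for (e = t->bucket[name_hash(name)]; e != NULL; e = e->next)
        if (strcasecmp(e->name, name) == 0)
            return 1;
    return 0;
}

static void name_table_free(struct name_table *t)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        struct name_entry *e = t->bucket[i];
        while (e != NULL) {
            struct name_entry *next = e->next;
            free(e);
            e = next;
        }
        t->bucket[i] = NULL;
    }
}
```

With N names spread evenly over the buckets, a lookup compares against
roughly N / NBUCKETS entries instead of all N.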

Best,
Dominik