[Dnsmasq-discuss] [PATCH] fix for netlink ENOBUF problem

Simon Kelley simon at thekelleys.org.uk
Thu Jul 7 20:16:52 BST 2016


Great, many thanks. Is this patch on top of the original one, or an
alternative?

Once it's all resolved, I'm happy to take the final patch.

Cheers,

Simon.

On 07/07/16 19:23, Ivan Kokshaysky wrote:
> On Thu, Jul 07, 2016 at 03:32:11PM +0100, Simon Kelley wrote:
>> On 06/07/16 14:55, Ivan Kokshaysky wrote:
>>> On Mon, Jul 04, 2016 at 01:58:43PM -0400, wkitty42 at gmail.com wrote:
>>>> On 07/04/2016 11:29 AM, Ivan Kokshaysky wrote:
>>>>> To fix that we need to purge the netlink buffer on ENOBUF error. With the
>>>>> appended patch dnsmasq is running flawlessly for about a month.
>>>>
>>>> why are the messages not removed from the buffer when they are processed? or are 
>>>> they and there's simply too many messages coming in to handle?
>>>
>>> Good questions, thanks.
>>>
>>> It's certainly possible to drop these messages during normal processing -
>>> if the message process ID is correct, but sequence number is wrong,
>>> the message must be dropped (instead of putting it on async queue
>>> like it happens now, and it's a real bug, BTW). 
>>
>> So that real bug relates to this:
>>
>>    if (h->nlmsg_seq != seq ||
>>        h->nlmsg_pid != netlink_pid ||
>>        h->nlmsg_type == NLMSG_ERROR)
>> 	  {
>> 	    /* May be multicast arriving async */
>> 	    nl_async(h);
>> 	  }
>>
>> Which should be something like
>>
>> if (h->nlmsg_pid != netlink_pid ||
>>     h->nlmsg_type == NLMSG_ERROR)
>>      {
>>         /* May be multicast arriving async */
>>         nl_async(h);
>>      }
>> else if (h->nlmsg_seq != seq)
>>      {
>>         /* drop message */
>>      }
>> else
>>      /* process message */
> 
> Yes, precisely.
> 
>> Also, I wonder if the loop which drops messages should check for async
>> stuff and handle it - that won't get re-tried by the kernel, will it?
> 
> Indeed that won't, I guess...
> 
>> Also, also, you still have the sleep(1) in there. Is that actually
>> necessary? Handling things this way looks better than just sleeping and
>> hoping?
> 
> I tend to agree. Since I don't need debugging anymore, this way looks simpler
> and cleaner.
> 
> Well, I've just patched dnsmasq like this, and restarted it on one of
> our routers. I'm pretty certain it works, but I'd like to let it run, say,
> until monday, just to be absolutely sure. If something goes wrong,
> I'll know immediately - we've set a cron job that checks dnsmasq CPU load
> every ten minutes, and if it exceeds 80% I'll get email with a backtrace.
> In either case I'll let you know.
> 
> Thanks,
> Ivan
> 
> diff --git a/src/netlink.c b/src/netlink.c
> index 049247b..8cd51af 100644
> --- a/src/netlink.c
> +++ b/src/netlink.c
> @@ -188,11 +188,17 @@ int iface_enumerate(int family, void *parm, int (*callback)())
>  	}
>  
>        for (h = (struct nlmsghdr *)iov.iov_base; NLMSG_OK(h, (size_t)len); h = NLMSG_NEXT(h, len))
> -	if (h->nlmsg_seq != seq || h->nlmsg_pid != netlink_pid || h->nlmsg_type == NLMSG_ERROR)
> +	if (h->nlmsg_pid != netlink_pid || h->nlmsg_type == NLMSG_ERROR)
>  	  {
>  	    /* May be multicast arriving async */
>  	    nl_async(h);
>  	  }
> +	else if (h->nlmsg_seq != seq)
> +	  {
> +	    /* May be part of incomplete response to previous request after
> +	       ENOBUFS. Drop it. */
> +	    continue;
> +	  }
>  	else if (h->nlmsg_type == NLMSG_DONE)
>  	  return callback_ok;
>  	else if (h->nlmsg_type == RTM_NEWADDR && family != AF_UNSPEC && family != AF_LOCAL)
> 




More information about the Dnsmasq-discuss mailing list