[Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Albert ARIBAUD albert.aribaud at free.fr
Mon Oct 24 21:55:40 BST 2016


Hi John,

Yes, you can submit patches to the list.

However, 2.55 is quite old with respect to the current release of
dnsmasq, which is 2.76 IIRC.

Best regards,
Albert.

On Mon, 24 Oct 2016 17:57:03 +0000,
John Knight <John.Knight at belkin.com> wrote:

> Hi Albert,
> 
> I have finished making my changes to dnsmasq 2.55 and I have a patch
> file.  However, I am not sure how to submit it... do I send it to the
> discussion list?
> 
> Thanks,
> 
> John Knight
> 
> 
> -----Original Message-----
> From: John Knight
> Sent: Wednesday, October 19, 2016 12:57 PM
> To: 'Albert ARIBAUD'
> Cc: dnsmasq-discuss at lists.thekelleys.org.uk
> Subject: RE: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an
> hour
> 
> Hi Albert,
> 
> My comments inline.
> 
> John
> 
> > Hi All,  
> 
> > The main while(1) loop uses select() to determine if it has work to
> > do.  In most cases, it appears to use a timeout of 0, which I believe
> > means just wait indefinitely for work on the file descriptors.
> > Other times, it appears that the timeout is set to a quarter second
> > when doing a tftp transfer or polling the dbus.
> >
> > Now what concerns me is that when a "retry later" condition occurs,
> > we may get stuck on the select() for a long period of time.  Alas,
> > I do not know how frequent one might expect to see work arrive on
> > the file descriptors that select is watching, so I don't really
> > know if this is a long time or not.  It seems though that in this
> > failure scenario, the poll_resolv() function does NOT get called
> > very often at all.  
> 
> Albert:  Actually, if dnsmasq does not receive any request from
> clients, it does not need to poll servers, so I would ask: does the
> select() include descriptors for client requests (either UDP
> datagrams received, or TCP connections opened)? If so, I think it
> will exit just when necessary and no timeout is needed; otherwise,
> you are right that a timeout is required.
> 
> Albert: Also, it may be improbable that select() does not return for
> a whole hour; but then, is every return from select() followed by a
> resolv file poll, or can select() return and then be entered again
> without polling the resolv files? I am thinking, for instance, about
> cached answers which do not need servers if their TTL is long enough.
> 
> John: I have made a simple change that provides a one second timeout
> for select.  I have found that dnsmasq is much more responsive now to
> changes made to /etc/resolv.conf.  The code that calls poll_resolv
> rate-limits the calls to once every two seconds, which I believe
> is fine and responsive enough.
> 
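A minimal sketch of the rate limiting John describes, assuming a change to the resolv file is detected by comparing its modification time. The names resolv_watch, resolv_changed() and POLL_INTERVAL are hypothetical and are not taken from dnsmasq's poll_resolv(); the two-second interval is the figure quoted above.

```c
#include <sys/stat.h>
#include <time.h>

#define POLL_INTERVAL 2 /* seconds; assumed from the description above */

/* Hypothetical bookkeeping for one watched file, not a dnsmasq structure. */
struct resolv_watch {
  const char *path;   /* e.g. "/etc/resolv.conf" */
  time_t last_poll;   /* time of the last stat() call, 0 if never polled */
  time_t last_mtime;  /* modification time seen at that poll */
};

/* Returns 1 if the file's mtime differs from the one last seen,
 * 0 if it is unchanged or if the call was skipped by the rate limit,
 * -1 if stat() failed. */
int resolv_changed(struct resolv_watch *w, time_t now)
{
  struct stat st;

  /* Rate limit: do nothing if we polled less than POLL_INTERVAL ago. */
  if (w->last_poll != 0 && difftime(now, w->last_poll) < POLL_INTERVAL)
    return 0;

  w->last_poll = now;

  if (stat(w->path, &st) == -1)
    return -1;

  if (st.st_mtime == w->last_mtime)
    return 0;

  w->last_mtime = st.st_mtime;
  return 1;
}
```

With a one-second select() timeout, a function like this would be reached at least once a second but would only touch the file every other call. Since st_mtime has one-second resolution, two edits within the same second would look like a single change.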
> John:  Given I am testing this in a lab situation and just me on the
> console and one idle PC connected to the router, there is little use
> of DNS.  In my experience since the initial failure, I believe I did
> see poll_resolv polled in one case at an interval of about 20
> minutes.  I don’t think this poll interval should be driven by how
> active the users are and how much they use DNS; just my personal
> feeling about that.
> 
> John: It should be noted that if I had been doing a tftp transfer,
> the code would set the select timeout for 250ms.  I am not sure why
> the tftp transfer being active would warrant the much quicker
> timeout?  Anyhow, what I did was an else statement... if tftp
> transfer, set timeout to 250ms else set timeout to 1 second.
> 
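The if/else John describes could look roughly like this. It is a sketch under assumptions, not his actual patch: pick_timeout() and the tftp_active flag are hypothetical names, and only the 250 ms and one-second values come from the message above.

```c
#include <sys/time.h>

/* Hypothetical helper, not taken from the patch under discussion.
 * Fill in *t with the wake-up interval for the next select() call:
 * 250 ms while a TFTP transfer is active, one second otherwise.
 * Always returns t, so the caller never passes NULL to select(). */
struct timeval *pick_timeout(int tftp_active, struct timeval *t)
{
  if (tftp_active)
    {
      t->tv_sec = 0;
      t->tv_usec = 250000;
    }
  else
    {
      t->tv_sec = 1;
      t->tv_usec = 0;
    }
  return t;
}
```

The timeout has to be filled in again before every select() call, because select() is allowed to modify the struct timeval it is given (Linux overwrites it with the time remaining).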
> John: I don't know dnsmasq well enough to answer your other questions
> about select and what all of the file descriptors are associated
> with.  Perhaps someone more knowledgeable can chime in.  My change
> was made in response to the situation where a "retry later" condition
> was pending and poll_resolv was not getting called again
> in a reasonable time period to do the retry.
> 
> John: I believe on our router, DHCP entries have a one-hour TTL and we
> do use dnsmasq for DHCP.  On an idle PC, would it have any reason to
> initiate a dnsmasq query?  Occasionally if the browser is up and
> running, I do see the browser query the address of its update server,
> but generally speaking I haven't had my browser running on the PC
> while doing my dnsmasq testing.  So it seems to me that the two
> possible sources of dnsmasq activity (i.e. browser and DHCP) may
> be idle for at least an hour... so it seems possible that
> poll_resolv() is not getting called in this scenario for a long
> time.
> 
> > My gut feeling is that there always needs to be a timeout on the
> > select call as the poll_resolv() should be called fairly frequently.
> > The code that exists today where poll_resolv() normally is called
> > from this loop suggests a poll rate of about once a second.  This
> > definitely does not happen today.  By just adding a my_syslog()
> > message to the top of poll_resolv(), it is very clear from the
> > logfile that it is not called often, and way too infrequently to
> > resolve the "retry later" condition in a timely manner.  
> 
> Albert: Can you compare when poll_resolv() is called wrt when the
> select() is exited -- and for what reason?
> 
> John: What I did to see relative times between select and calls to
> poll_resolv was to add calls to my_syslog() before the select and at
> the top of poll_resolv().  The timestamp in the dnsmasq logfile was
> used to see how much time between calls.  I don't know what the
> reason for exiting select is... indeed, for what I was doing, I
> really didn't care... I just needed to know when poll_resolv() was
> getting called and how often.
> 
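The measurement John describes, logging at two points and comparing timestamps, can be sketched as follows. In dnsmasq he used my_syslog(); to keep this example self-contained it writes to stderr instead, and log_interval() is a hypothetical helper rather than anything in the dnsmasq source.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper, not part of dnsmasq.
 * Print a timestamped, tagged line and return the number of seconds
 * elapsed since the previous call that used the same *last
 * (0 on the first call, when *last is still 0). */
double log_interval(const char *tag, time_t now, time_t *last)
{
  double gap = (*last == 0) ? 0.0 : difftime(now, *last);

  fprintf(stderr, "%ld %s: %.0f s since previous call\n",
          (long)now, tag, gap);
  *last = now;
  return gap;
}
```

Keeping one time_t for the point just before select() and another for the top of poll_resolv() would show both how often the loop wakes up and how often a wake-up actually leads to a poll, which is the comparison Albert asks about.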
> > Going forward, as the next thing for me to try, I am going to add a
> > timeout for the select... perhaps a modest once a second or two.  
> 
> Albert: I would personally investigate further on a gut feeling
> without changing the code behavior, because my changes might have
> unwanted effects which can actually hide the root cause I am looking
> for -- but to each his/her own.
> 
> John: My boss is on my case to get this resolved ASAP.  Based on
> my trial of the select timeout, this appears to have solved at least
> part of the problem... poll_resolv() not getting called back in a
> reasonable timeframe after a "retry later" issue.  I need to keep
> moving forward; not sure I have the time for in-depth investigation.
> I do know other code does set a select timeout, so this code
> path is not unprecedented, and the risk should be low.
> 
> > But I would like to know what you all think of this... does this
> > make sense to do?  Is there ever a case where we might not get any
> > work on the files select is monitoring for nearly an hour?  I am
> > trying to make sense of this issue.  
> 
> Albert: Not entirely sure what you mean with "Is there ever a case
> where we might not get any work on the files select is monitoring for
> nearly an hour"; I will assume you mean "Is there a normal case where
> dnsmasq would not poll for changes in resolv files for an hour". If
> so, then I would say it depends on how much traffic dnsmasq receives
> and how much of it can be answered from cache.
> 
> John: Your interpretation is correct. Thanks for the info and your
> help Albert.  I am glad I have someone listening.  When I am done, I
> will forward the diffs for the changes I have made to dnsmasq for
> your review.
> 
> > Thanks,
> >
> > John Knight  
> 
> Best regards,
> --
> Albert.


