[Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

John Knight John.Knight at belkin.com
Mon Oct 24 18:57:03 BST 2016


Hi Albert,

I have finished making my changes to dnsmasq 2.55 and I have a patch file.  However, I am not sure how to submit it... do I send it to the discussion list?

Thanks,

John Knight


-----Original Message-----
From: John Knight
Sent: Wednesday, October 19, 2016 12:57 PM
To: 'Albert ARIBAUD'
Cc: dnsmasq-discuss at lists.thekelleys.org.uk
Subject: RE: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Hi Albert,

My comments inline.

John

> Hi All,

> The main while(1) loop uses select() to determine if it has work to
> do.  In most cases, it appears to use timeout of 0, which I believe
> means just wait indefinitely for work on the file descriptors.  Other
> times, it appears that the timeout is set to a quarter second when
> doing a tftp transfer or polling the dbus.
>
> Now what concerns me is that when a "retry later" condition occurs, we
> may get stuck on the select() for a long period of time.  Alas, I do
> not know how frequent one might expect to see work arrive on the file
> descriptors that select is watching, so I don't really know if this is
> a long time or not.  It seems though that in this failure scenario,
> the poll_resolv() function does NOT get called very often at all.
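
For reference, POSIX select() distinguishes three timeout cases: a NULL timeout pointer blocks until a descriptor becomes ready, a timeval set to zero returns immediately, and a non-zero timeval bounds the wait. If the "timeout of 0" above refers to a null pointer, the loop can indeed sleep for as long as the descriptors stay idle. The sketch below illustrates the three cases; wait_readable() and demo_idle_then_ready() are hypothetical helpers written for this illustration and are not dnsmasq code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper, not dnsmasq code: wait until fd is readable.
 *   timeout_ms <  0 : NULL timeout, block until there is activity
 *   timeout_ms == 0 : zeroed timeval, check once and return immediately
 *   timeout_ms >  0 : wake up after at most that many milliseconds
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, long timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    struct timeval *tp = NULL;   /* NULL means "no timeout at all" */
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tp = &tv;
    }

    FD_ZERO(&rset);
    FD_SET(fd, &rset);

    n = select(fd + 1, &rset, NULL, NULL, tp);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &rset)) ? 1 : 0;
}

/* Usage sketch: an idle pipe times out; once a byte is written,
 * even a wait with no timeout returns at once.
 * Returns 0 if both observations hold, 1 if not, -1 on setup failure. */
int demo_idle_then_ready(void)
{
    int p[2];
    int idle, ready;

    if (pipe(p) != 0)
        return -1;
    idle = wait_readable(p[0], 100);       /* nothing written yet */
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ready = wait_readable(p[0], -1);       /* data pending, no blocking */
    close(p[0]);
    close(p[1]);
    return (idle == 0 && ready == 1) ? 0 : 1;
}
```

With a negative timeout_ms and no data pending, the call would not return until something is written, which is the behaviour the concern above describes.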

Albert:  Actually, if dnsmasq does not receive any request from clients, it does not need to poll servers, so I would ask: does the select() include descriptors for client requests (either UDP datagrams received, or TCP connections opened)? If so, I think it will exit just when necessary and no timeout is needed; otherwise, you are right that a timeout is required.

Albert: Also, it may be improbable that select() does not return for a whole hour; but then, is every return from select() followed by a resolv file poll, or can select() return and then be entered again without polling the resolv files? I am thinking, for instance, about cached answers which do not need servers if their TTL is long enough.

John: I have made a simple change that provides a one-second timeout for select().  I have found that dnsmasq is now much more responsive to changes made to /etc/resolv.conf.  The code that calls poll_resolv() rate-limits those calls to once every two seconds, which I believe is fine and responsive enough.
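
A minimal sketch of the rate limiting described here, assuming the main loop wakes up about once a second and only polls when enough time has passed since the previous poll. The function name poll_due() and its parameters are hypothetical and are not taken from the dnsmasq source or from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical rate limiter (illustrative names, not dnsmasq code).
 * Returns 1 and records `now` in *last when at least min_gap seconds
 * have passed since *last, or when no poll has happened yet
 * (*last == 0). Otherwise returns 0 and leaves *last untouched. */
int poll_due(time_t now, time_t *last, double min_gap)
{
    assert(last != NULL);

    if (*last == 0 || difftime(now, *last) >= min_gap) {
        *last = now;
        return 1;
    }
    return 0;
}
```

Called on every pass through the loop with min_gap set to 2.0, this lets a one-second select() timeout wake the loop regularly while the resolv file check itself runs at most once every two seconds.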

John:  Given that I am testing this in a lab, with just me on the console and one idle PC connected to the router, there is little use of DNS.  In my experience since the initial failure, I believe I did see poll_resolv() called in one case at an interval of about 20 minutes.  I don't think this poll interval should be driven by how active the users are and how much they use DNS; that is just my personal feeling.

John: It should be noted that if I had been doing a tftp transfer, the code would set the select timeout to 250 ms.  I am not sure why an active tftp transfer would warrant the much quicker timeout.  Anyhow, what I did was add an else statement: if a tftp transfer is active, set the timeout to 250 ms; else set the timeout to 1 second.
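
The if/else described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual patch: next_select_timeout() and the tftp_active flag are hypothetical names standing in for whatever the daemon uses to know a transfer is in progress.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper (not the actual patch): fill in *tv with the
 * timeout to use for the next select() call and return tv.
 * tftp_active is an assumed flag meaning "a TFTP transfer is running". */
struct timeval *next_select_timeout(int tftp_active, struct timeval *tv)
{
    assert(tv != NULL);

    if (tftp_active) {
        tv->tv_sec = 0;
        tv->tv_usec = 250000;   /* 250 ms while a transfer is active */
    } else {
        tv->tv_sec = 1;         /* otherwise wake up once a second */
        tv->tv_usec = 0;
    }
    return tv;
}
```

On Linux, select() may modify the timeval to reflect the time not slept, so a helper like this would need to be called again before every select() rather than once outside the loop.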

John: I don't know dnsmasq well enough to answer your other questions about select and what all of the file descriptors are associated with.  Perhaps someone more knowledgeable can chime in.  My change was made in response to the situation where a "retry later" condition was pending and poll_resolv() was not getting called again within a reasonable time period to do the retry.

John: I believe on our router, DHCP entries have an hour TTL and we do use dnsmasq for DHCP.  Would an idle PC have any reason to initiate a DNS query to dnsmasq?  Occasionally, if the browser is up and running, I do see the browser query the address of its update server, but generally speaking I haven't had my browser running on the PC while doing my dnsmasq testing.  So it seems to me that the two possible sources of dnsmasq activity (i.e. browser and DHCP) may be idle for at least an hour, which makes it possible that poll_resolv() is not getting called for a long time in this scenario.

> My gut feeling is that there always needs to be a timeout on the
> select call as the poll_resolv() should be called fairly frequently.
> The code that exists today where poll_resolv() normally is called from
> this loop suggests a poll rate of about once a second.  This
> definitely does not happen today.  By just adding a my_syslog()
> message to the top of poll_resolv(), it is very clear from the logfile
> that it is not called often, and way too infrequently to resolve the
> "retry later" condition in a timely manner.

Albert: Can you compare when poll_resolv() is called wrt when the select() is exited -- and for what reason?

John: What I did to see relative times between select and calls to poll_resolv was to add calls to my_syslog() before the select and at the top of poll_resolv().  The timestamp in the dnsmasq logfile was used to see how much time between calls.  I don't know what the reason for exiting select is... indeed, for what I was doing, I really didn't care... I just needed to know when poll_resolv() was getting called and how often.
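
A sketch of this kind of instrumentation, assuming a small helper that reports the time elapsed since the previous call at a given trace point. trace_gap() is a hypothetical name; in dnsmasq the formatted message would be handed to my_syslog(), whose exact signature is not shown in this thread, so the sketch only formats the text into a buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical tracer (not dnsmasq code). Writes a message of the form
 * "<tag>: N s since previous call" into buf, updates *prev to now, and
 * returns N. On the first call (*prev == 0) it writes
 * "<tag>: first call" and returns -1. */
long trace_gap(const char *tag, time_t now, time_t *prev,
               char *buf, size_t len)
{
    long gap;

    assert(tag != NULL && prev != NULL && buf != NULL && len > 0);

    gap = (*prev == 0) ? -1 : (long)difftime(now, *prev);
    *prev = now;

    if (gap < 0)
        snprintf(buf, len, "%s: first call", tag);
    else
        snprintf(buf, len, "%s: %ld s since previous call", tag, gap);
    return gap;
}
```

Keeping one `prev` variable per trace point (one before select(), one at the top of poll_resolv()) gives the interval for each directly, without having to subtract logfile timestamps by hand.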

> Going forward, as the next thing for me to try, I am going to add a
> timeout for the select... perhaps a modest once a second or two.

Albert: I would personally investigate further on a gut feeling without changing the code behavior, because my changes might have unwanted effects which can actually hide the root cause I am looking for -- but to each his/her own.

John: My boss is on my case to get this resolved ASAP.  Based on my trial of the select timeout, this appears to have solved at least part of the problem: poll_resolv() not getting called back in a reasonable timeframe after a "retry later" condition.  I need to keep moving forward; I am not sure I have the time for an in-depth investigation.  I do know other code sets a select timeout, so this code path is not unprecedented and the risk should be low.

> But I would like to know what you all think of this... does this
> make sense to do?  Is there ever a case where we might not get any
> work on the files select is monitoring for nearly an hour?  I am
> trying to make sense of this issue.

Albert: Not entirely sure what you mean with "Is there ever a case where we might not get any work on the files select is monitoring for nearly an hour"; I will assume you mean "Is there a normal case where dnsmasq would not poll for changes in resolv files for an hour". If so, then I would say it depends on how much traffic dnsmasq receives and how much of it can be answered from cache.

John: Your interpretation is correct. Thanks for the info and your help Albert.  I am glad I have someone listening.  When I am done, I will forward the diffs for the changes I have made to dnsmasq for your review.

> Thanks,
>
> John Knight

Amicalement,
--
Albert.
