<div dir="ltr">Screenshot for my previous post:<div><a href="https://smirnov.la/Screenshot%202024-09-05%20at%2009.42.38.png" target="_blank">https://smirnov.la/Screenshot%202024-09-05%20at%2009.42.38.png</a><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 5, 2024 at 9:52 AM Danil Smirnov <<a href="mailto:danil.smirnov@gmail.com" target="_blank">danil.smirnov@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi,</div><div><br></div>So I managed to isolate and reproduce the issue quite reliably.<div><br></div><div>Every day at exactly <span style="font-family:"Helvetica Neue";font-size:13px">06:10 UTC, my dnsmasq container stops responding. During the event, I can successfully query my external DNS servers but not dnsmasq:</span></div><div><font face="Helvetica Neue"><br></font></div><div>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal"><font face="monospace">dig domain.tld @<a href="http://172.18.0.250" target="_blank">172.18.0.250</a></font></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;min-height:15px"><font face="monospace"><br></font></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal"><font face="monospace">; <<>> DiG 9.16.23-RH <<>> domain.tld @<a href="http://172.18.0.250" target="_blank">172.18.0.250</a></font></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal"><font face="monospace">;; global options: +cmd</font></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal"><font face="monospace">;; connection timed out; no servers could be reached</font></p>
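<p style="margin:0px;font-size:13px;line-height:normal">The comparison above (external servers answer, dnsmasq times out) could also be scripted, for example from a watchdog. What follows is only a minimal Python sketch, assuming plain UDP on port 53 and treating any reply at all as "alive"; build_query and probe are hypothetical helper names, not part of dnsmasq or dig:</p>
<pre>
```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Encode a minimal DNS query: one question, class IN, recursion desired."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte terminates the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=2.0, port=53):
    """Return True if any UDP datagram comes back before the timeout.

    The reply is not parsed or validated; this only distinguishes
    "something answered" from "nothing answered".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            sock.recvfrom(512)
            return True
        except OSError:  # covers timeouts and unreachable-network errors
            return False
```
</pre>
<p style="margin:0px;font-size:13px;line-height:normal">With this sketch, probe("172.18.0.250", "domain.tld") would be expected to return False during the 06:10 event, while the same call against one of the external DNS servers should return True.</p>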
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">I see hundreds of errors like this in the system log:</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">
</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal"><font face="monospace">Sep 05 06:10:58 <a href="http://mm4.lax.icann.org" target="_blank">mm4.lax.icann.org</a> dockerd[1150]: time="2024-09-05T06:10:58.464185887Z" level=error msg="[resolver] failed to query external DNS server" client-addr="udp:<a href="http://172.18.0.4:48552" target="_blank">172.18.0.4:48552</a>" dns-server="udp:<a href="http://172.18.0.250:53" target="_blank">172.18.0.250:53</a>" error="read udp 172.18.0.4:48552-><a href="http://172.18.0.250:53" target="_blank">172.18.0.250:53</a>: i/o timeout" question=";_dmarc.domain.tld.\tIN\t TXT"</font></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">However, there is nothing suspicious in /var/log/messages or /var/log/cron that might explain what happened.</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><p 
style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">Before the container restarted at 06:15, I tried to collect stats via the "kill --signal=USR1" command, but the stats never appeared in the logs. Apparently dnsmasq was so stuck that it couldn't even process the signal. (However, I don't think the stats would be helpful, since the time of the event doesn't change even if I restart dnsmasq between the 06:10 events.)</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">Resource-wise, dnsmasq's memory consumption increased when the issue started and then spiked in the middle of it (the time shown is 3 hours ahead of UTC):</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><img src="cid:ii_m0oxh4qv0" alt="Screenshot 2024-09-05 at 09.42.38.png" width="562" height="406"><br><p 
style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"> </p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px"><br></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font-size:13px;line-height:normal;font-family:"Helvetica Neue";min-height:15px">I'm using <a href="https://github.com/dockur/dnsmasq/blob/master/entry.sh#L14" target="_blank">these params</a> plus "fast-dns-retry". I also tried adding "<span style="font-family:Arial,Helvetica,sans-serif;font-size:small">no-negcache" and "</span><span style="font-family:Arial,Helvetica,sans-serif;font-size:small">all-servers", but neither fixed the issue.</span></p><div><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">Any ideas on where to continue the investigation?</div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">Sincerely,</div><div style="color:rgb(34,34,34)">Danil Smirnov</div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Aug 25, 2024 at 7:45 PM Danil Smirnov <<a href="mailto:danil.smirnov@gmail.com" target="_blank">danil.smirnov@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br clear="all"><div><div dir="ltr" 
class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)">Hi Dimitry,</div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Aug 25, 2024 at 7:36 PM Dimitry Andric <<a href="mailto:dimitry@unified-streaming.com" target="_blank">dimitry@unified-streaming.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Is there any way to reproduce this issue reliably? That is, some recipe that says: run this particular docker container, run some script that queries it, observe hang after N minutes?<br></blockquote><div><br></div><div>For now, I have set up a watchdog in my environment that restarts the container when it freezes and collects some stats. I'm going to monitor the issue for one more week (I have already spent a week debugging it). Once I have some useful data, I'll try to reproduce it.</div><div><br></div><div><div>Sincerely,</div><div>Danil Smirnov</div></div></div></div>
</blockquote></div>
</blockquote></div>