[Dnsmasq-discuss] [PATCH] Refuse to start with EADDRINUSE in --bind-dynamic mode
Simon Kelley
simon at thekelleys.org.uk
Mon Nov 27 23:18:03 UTC 2023
On 25/11/2023 16:51, Petr Menšík wrote:
> Yes, the problem is 3) has a condition we wait until it changes then
> retry. But for a lot (most?) of errors we lack any indication from the
> system it has changed.
>
> For example insufficient memory or insufficient file descriptors. It may
> change, but unlike watching up and down interfaces, there is no hook
> which would retry listener creation. It fails once and then just maybe
> retries on explicit reload. That is why I think it is absolutely
> necessary to log any failure we pass somewhere, unless we know we will
> retry later.
You're right, the only error from bind() that should be ignored is
EADDRNOTAVAIL. everything else should be a fatal error during startup or
logged once the daemon is running.
I've just pushed a patch to that effect.
Cheers,
Simon.
>
> More below...
>
> On 11/23/23 13:47, Simon Kelley wrote:
>> That's a good point, but I don't think there needs to be any non-fatal
>> error logging. There are three situations during startup.
>>
>> 1) bind() succeeds.
>> 2) bind fails for a reason which won't change - fatal error.
>> 3) bind fails for a reason which may change - startup and wait until
>> it does change and try again.
>>
>> The canonical example of 3) is the one I gave before,
>> --listen-address=1.2.3.4 but not local interface has address 1.2.3.4.
>> The intention is that when a new interface comes up with address
>> 1.2.3.4 then a new socket will be created and bound. This is long
>> after startup, so the only option if it fails then is to log the event.
> Of course, this is very special case somehow well handled. I agree there
> is not much else to do. We could only make the error fatal, but I don't
> think that is desired.
>>
>> If the only situation where we want to wait is the one above, then the
>> solution to to make EADDRNOTAVAIL at startup the only one where we
>> keep waiting, and all the others are fatal. I think when I originally
>> wrote this I wasn't sure if that was the only non-fatal error which is
>> why the code is as it is.
>>
>> This is not a complete solution to your original problem of enforcing
>> only one dnsmasq daemon process in any case. For example if you
>> configure a single listen-address which doesn't exist on the machine,
>> then you can start as many dnsmasq processes as you like and they'll
>> all start up and be waiting for the interface with that address to be
>> created. Once it is, all will try and bind it, and all but one will
>> fail, but they'll all still exist. Managing daemon processes is really
>> the job of sysvinit or systemd, but the authors of the bug seem to
>> sant protection from just running the binary from the command line.
>
> We at Fedora support only services managed by systemd. But even for
> that, it needs to get some feedback of failure. If the process
> terminates with non-zero status code, unit will be marked failed. We
> *need* that. Alternative might be support for libsystemd with notify
> socket, which would work with Type=notify services. Now it will report
> failed startup only with Type=forking. Later failure is logged only as a
> warning regardless of type of the error. I think we want unexpected
> error types to be logged as errors, especially for insufficient
> resources errors like ENOMEM. Or made them fatal. With systemd unit
> Restart=on-failure, it might be able to recover from memory leaks if
> such errors were fatal. Not sure we want that, might break a lot of
> deployments, but also fix some.
>
>>
>> TLDR;
>>
>> We either pick a set of errors which are Ok to continue
>> (EADDRNOTAVAIL, what others?) and fail fatally at startup for all
>> others, or we pick a set of errors to fail fatally at startup
>> (EADDRINUSE, EACCESS, what others?) and continue for all others.
>>
>>
>> Cheers,
>>
>> Simon.
>
> I would say safer would be to fatal error everything except explicitly
> waived, for now just EADDRNOTAVAIL and EINTR? I think most of these
> errors means incomplete degraded service anyway, without reliable
> self-repair code present. If it had repeat timer with exponentially
> increasing time of retry (with some upper bound), then we might want it
> to start anyway. But I think it is safer to prevent half-initialized
> service. Systemd can provide autorecovery with smart settings. Do we
> have a way to specify I do not require TCP listening socket for DNS? It
> should be clearly discouraged, but for some kinds of tests it might be
> acceptable.
>
> Cheers,
> Petr
>
>>
>>
>> On 23/11/2023 11:13, Petr Menšík wrote:
>>> To fix problem with multiple instances correctly refusing running on
>>> the same machine and namespaces, yes, it would be sufficient.
>>>
>>> But I think part of the problem is hiding all problems during startup
>>> and not showing them at all, in any source. I think that is okay for
>>> EADDRNOTAVAIL to not be printed. But I think in other cases we want
>>> at least warning somewhere. This way you also get exact error message
>>> printed. For example selinux policy hardening may prevent your
>>> process to listen on port 53, even though it has NET_BIND_SERVICE.
>>>
>>> With my modification it will print errors for listeners used. Note
>>> 10.1.2.3 is hidden at that phase. You would not know it without
>>> strace analysis. I expect there can be different errors, for example
>>> running out of file descriptors or memory. Hiding something
>>> non-standard happened during startup is quite a bad design. Only some
>>> kinds of errors are okay during startup.
>>>
>>> $ sudo -u nobody fedora-like/dnsmasq -d --bind-dynamic
>>> --listen-address=10.1.2.3
>>> dnsmasq: failed to create listening socket for 127.0.0.1: Permission
>>> denied
>>> dnsmasq: failed to create listening socket for 127.0.0.1: Permission
>>> denied
>>> dnsmasq: failed to create listening socket for ::1: Permission denied
>>> dnsmasq: failed to create listening socket for ::1: Permission denied
>>>
>>> dnsmasq: process is missing required capability NET_BIND_SERVICE
>>>
>>> # Compare this with:
>>>
>>> $ sudo -u nobody fedora-like/dnsmasq -d --bind-interfaces
>>> --listen-address=10.1.2.3
>>> dnsmasq: failed to create listening socket for 127.0.0.1: Permission
>>> denied
>>>
>>> I think we want any errors printed, even if they are not made fatal.
>>> Except carefully chosen type of errors, which are expected and would
>>> raise just false alarms. Not sure how to trigger other types of
>>> errors, but I am sure I would like to see them, even if they did not
>>> cause the process to die. That is why I have used more complicated
>>> approach, which should print everything unexpected, even when dnsmasq
>>> is not stopped. In order to investigate you first have to know
>>> something unusual has happened.
>>>
>>> On 23. 11. 23 0:29, Simon Kelley wrote:
>>>> Isn't this sufficient to fix the problem?
>>>>
>>>> Not calling die() when bind-dynamic is set is intended to handle the
>>>> case that bind returns EADDRNOTAVAIL because you've configured
>>>> --listen-address=1.2.3.4 but there's not a local interface with that
>>>> address. dnsmasq runs anyway in the expectation that such an
>>>> interface will appear in the future and a socket will be bound then.
>>>>
>>>> I don't think there's a die()/syslog() conflict at all.
>>>>
>>>>
>>>> diff --git a/src/network.c b/src/network.c
>>>> index ca9fada..db1d528 100644
>>>> --- a/src/network.c
>>>> +++ b/src/network.c
>>>> @@ -924,7 +924,7 @@ static int make_sock(union mysockaddr *addr, int
>>>> type, int dienow)
>>>> {
>>>> /* failure to bind addresses given by --listen-address at
>>>> this point
>>>> is OK if we're doing bind-dynamic */
>>>> - if (!option_bool(OPT_CLEVERBIND))
>>>> + if (!option_bool(OPT_CLEVERBIND) || errno == EADDRINUSE)
>>>> die(s, daemon->addrbuff, EC_BADNET);
>>>> }
>>>> else
>>>>
>>>> Cheers,
>>>>
>>>> Simon.
>>>>
>>>> On 22/11/2023 19:27, Petr Menšík wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I have received error report RHEL-16398 [1], which I think makes
>>>>> sense to fix even in the lastest version. I believe it allows
>>>>> non-intentional another instance running without error. What is
>>>>> worse, it does not even show any warning that initialization is
>>>>> incomplete.
>>>>>
>>>>> Of course the problem at start is those errors happen in time when
>>>>> no log is available. I think that can be fixed easily by using
>>>>> stderr at that time. That is patch #1.
>>>>>
>>>>> Second makes EADDRNOTAVAIL bind errors still hidden, but prints all
>>>>> other errors at least to stderr. On a system with systemd that
>>>>> should make it present in journalctl -u dnsmasq anyway. EADDRINUSE
>>>>> is made fatal, because that would not be usually handled by new
>>>>> addresses added later. If there is a need to start another dnsmasq
>>>>> instance without TCP listeners, I think that should be specified
>>>>> more explicitly. Makes EADDRINUSE fatal the same way as with
>>>>> --bind-interfaces.
>>>>>
>>>>> Would you find any other errors, which should be hidden or made
>>>>> fatal? What would you think of those changes?
>>>>>
>>>>> 1. https://issues.redhat.com/browse/RHEL-16398
>>>
More information about the Dnsmasq-discuss
mailing list