[Dnsmasq-discuss] A (possibly bad) idea: failover in dnsmasq

Simon Kelley simon at thekelleys.org.uk
Sat May 26 13:01:39 BST 2012

On 26/05/12 12:26, Vincent Cadet wrote:
> --- On Sat 26.5.12, Simon Kelley wrote : ...
>>> What if there be a heartbeat link in dnsmasq through which the active
>>> dnsmasq would stream changes (or the whole block of data) to the
>>> passive instance along with keep-alive probes?
>> That has attractions: Both dnsmasq instances could provide DNS
>> service at all times, and whichever was "master" could provide
>> DHCP, whilst the "slave" just keeps its database up-to-date. The
>> main problem with this is the "split brain" scenario, where both
>> instances are up, but they can't talk to each other because the
>> network between them is partitioned. In that case both acting as
>> masters for their half of the network is fine, the problem comes
>> when connectivity returns and the lease databases have to be
>> reconciled....
> Hmmm... a failed dnsmasq could request all the changes that occurred
> since its last failure from its peer(s). Newer records overwrite
> older ones. Expired leases and records are to be removed [or
> overwritten according to the received data block that was
> requested].
> Since machines with a lease send their requests to only one dnsmasq
> instance, lease and record reconciliation should be rather
> straightforward IMHO and all records from all dnsmasq peers can be merged in
> decreasing order of expiry date.
> That would also suggest each dnsmasq instance maintains a "dirty"
> state flag until its database is completely in sync with others.
> What needs to be done, I guess, is that the "dirty" dnsmasq instance
> that regains its connection to its other peers must immediately switch
> to non-authoritative mode and return to passive mode, handing over
> (or forwarding) its [live] DNS requests to the "master" instance. No
> DHCP requests should be answered.
> If the network connectivity is restored before the failed dnsmasq
> instance runs again then the latter switches to "dirty" state and
> non-authoritative mode, syncing its database with its other peers.
> This implies that a non master dnsmasq should still be able to
> receive DNS requests. There's a choice here. Either reply directly or
> forward them to the new dnsmasq master. Could be a mix of both:
> directly answer requests, which the slave knows aren't yet replicated
> with the master.
> The complete handshake protocol would require that a dnsmasq instance
> notifies the requesting peer that the sync is complete so that it can
> switch to "non-dirty and passive" state.
> I haven't thought thoroughly, it's just a rough idea for the moment.
OK, here's my back-of-envelope suggestion, with minimal reference to yours.

Dnsmasq instances can be configured as either primary or secondary.

Primary behaviour:

Work pretty much as usual except that we accept connections from 
secondaries. When a secondary connects, it sends its current idea of
the lease database to the primary. The primary merges that with its own 
lease database and sends the result back to the secondary. It then 
serves DHCP requests as normal and sends incremental changes to the 
lease database to any connected secondary.
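
The merge step on the primary could be sketched roughly as below. This is
only an illustration: the Lease fields, the merge_leases name, and the rule
itself (one record per IP address, latest expiry wins, expired leases
dropped, primary wins ties) are assumptions borrowed loosely from the
"newer records overwrite older ones" suggestion quoted above. The actual
reconciliation rule is still an open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    # Hypothetical record layout, not dnsmasq's real lease structure.
    mac: str      # client hardware address
    ip: str       # leased address
    expires: int  # expiry time, seconds since the epoch


def merge_leases(primary, secondary, now):
    """Merge two lease lists into one, keeping one lease per IP address.

    Assumed rule: expired leases are dropped, and when both sides hold a
    lease for the same IP the one with the later expiry wins. On a tie the
    primary's record is kept, because it is visited first.
    """
    merged = {}
    for lease in list(primary) + list(secondary):
        if lease.expires <= now:
            continue  # already expired, nothing to reconcile
        current = merged.get(lease.ip)
        if current is None or lease.expires > current.expires:
            merged[lease.ip] = lease
    # Sort by the IP string purely to get a stable, repeatable order.
    return sorted(merged.values(), key=lambda lease: lease.ip)
```

The primary would send the returned list back to the secondary, so both
ends hold the same database before incremental updates start. Note that a
rule like this silently discards one of two conflicting leases for the same
address, which is exactly the split-brain case that needs more thought.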

Secondary behaviour:

At start up, load the lease database from local disk as usual, then 
attempt to connect to our configured primary. If this succeeds, do the 
lease database swap described above then enter secondary-passive mode 
where DNS queries are answered but not DHCP requests. If the primary
connection cannot be established or fails, enter secondary-active mode
where DHCP requests are answered. Try to contact the primary at regular
intervals. When the link to the primary comes back, do the 
lease-database exchange, and then go back to secondary-passive mode.
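
The secondary's mode switching can be sketched as a two-state machine. The
names here (Mode, poll_primary, answers_dhcp, try_sync) are made up for
illustration, and try_sync stands in for "connect to the primary and do the
lease-database exchange". It is assumed to return True on success and to
return False or raise OSError on failure.

```python
from enum import Enum


class Mode(Enum):
    PASSIVE = "secondary-passive"  # primary reachable: DNS only, no DHCP
    ACTIVE = "secondary-active"    # primary unreachable: DHCP is answered


def poll_primary(try_sync):
    """Run one contact attempt and return the mode the secondary should be in.

    try_sync is a hypothetical callable that connects to the primary and
    performs the lease-database exchange.
    """
    try:
        ok = bool(try_sync())
    except OSError:
        ok = False  # connection refused, reset, timed out, ...
    return Mode.PASSIVE if ok else Mode.ACTIVE


def answers_dhcp(mode):
    """Only a secondary that has lost its primary serves DHCP."""
    return mode is Mode.ACTIVE
```

A real implementation would call something like poll_primary at start up,
whenever the link drops, and at regular intervals while in secondary-active
mode.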

The secondary-primary connections will be over TCP, or possibly SCTP.

Configuration on a primary looks like

--failover-listen=<port no>

Configuration on a secondary looks like

--failover-master=<IP of primary>,<port on primary>

Need to wonder about security, since anyone who can connect to the primary
can tamper with its lease database.

This only works with one primary and one secondary: if there are 
multiple secondaries they'll all become active when the primary dies, 
which is wrong.


