[Dnsmasq-discuss] A (possibly bad) idea: failover in dnsmasq
simon at thekelleys.org.uk
Sat May 26 13:01:39 BST 2012
On 26/05/12 12:26, Vincent Cadet wrote:
> --- On Sat 26.5.12, Simon Kelley wrote: ...
>>> What if there were a heartbeat link in dnsmasq through which the active
>>> dnsmasq would stream changes (or the whole block of data) to the
>>> passive instance along with keep-alive probes?
>> That has attractions: Both dnsmasq instances could provide DNS
>> service at all times, and whichever was "master" could provide
>> DHCP, whilst the "slave" just keeps its database up-to-date. The
>> main problem with this is the "split brain" scenario, where both
>> instances are up, but they can't talk to each other because the
>> network between them is partitioned. In that case both acting as
>> masters for their half of the network is fine, the problem comes
>> when connectivity returns and the lease databases have to be merged.
> Hmmm... a failed dnsmasq could request from its peer(s) all the changes
> that occurred since its last failure. Newer records overwrite older
> ones. Expired leases and records are to be removed [or overwritten
> according to the received data block that was
> Since machines with a lease send their requests to only one dnsmasq
> instance, lease and record reconciliation should be rather
> straightforward IMHO, and all records from all dnsmasq peers can be merged in
> decreasing order of expiry date.
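A minimal sketch of the merge rule described above, assuming each lease is keyed by the client's MAC address and carries an absolute expiry time. The `Lease` type and `merge_leases` function are hypothetical names for illustration, not dnsmasq code. For each client it keeps the record with the latest expiry, drops leases that have already expired, and returns the survivors in decreasing order of expiry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    mac: str      # hypothetical key: the client's hardware address
    ip: str
    expires: int  # absolute expiry time, seconds since the epoch


def merge_leases(peer_dbs, now):
    """Merge lease lists from several peers (hypothetical helper).

    For each client keep the lease with the latest expiry, drop leases
    that have already expired, and return the result sorted in
    decreasing order of expiry.
    """
    best = {}
    for db in peer_dbs:
        for lease in db:
            if lease.expires <= now:
                continue  # expired leases are removed, not merged
            current = best.get(lease.mac)
            if current is None or lease.expires > current.expires:
                best[lease.mac] = lease
    return sorted(best.values(), key=lambda lease: lease.expires, reverse=True)
```

This sketch does not handle two different clients holding the same IP address on different peers; a real reconciliation would have to resolve that conflict as well.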
> That would also suggest each dnsmasq instance maintains a "dirty"
> state flag until its database is completely in sync with others.
> What needs to be done, I guess, is that the "dirty" dnsmasq instance
> that regains its connection to its peers must immediately switch
> to non-authoritative mode and return to passive mode, handing over
> (or forwarding) its [live] DNS requests to the "master" instance. No
> DHCP requests should be answered.
> If network connectivity is restored before the failed dnsmasq
> instance runs again, then the latter switches to "dirty" state and
> non-authoritative mode when it starts, syncing its database with its peers.
> This implies that a non-master dnsmasq should still be able to
> receive DNS requests. There's a choice here: either reply directly or
> forward them to the new dnsmasq master. Could be a mix of both:
> directly answer requests for records which the slave knows haven't
> yet been replicated to the master.
> The complete handshake protocol would require that a dnsmasq instance
> notifies the requesting peer that the sync is complete so that it can
> switch to "non-dirty and passive" state.
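The dirty-flag scheme above can be sketched as a small state model. Every name here (`Peer`, `Mode`, `on_reconnect`, `on_sync_complete`, `route_dns`) is hypothetical; the sketch assumes the peer tracks the set of names it knows the master hasn't received yet, and it takes the "mix of both" option for DNS.

```python
from enum import Enum, auto


class Mode(Enum):
    ACTIVE = auto()   # authoritative: answers DHCP and DNS
    PASSIVE = auto()  # non-authoritative: no DHCP answers


class Peer:
    """Hypothetical model of the dirty-flag / passive-mode idea."""

    def __init__(self):
        self.mode = Mode.ACTIVE
        self.dirty = False
        self.unreplicated = set()  # names this peer knows the master lacks

    def on_reconnect(self):
        # Connectivity to the other peers is back (or was back before we
        # started): stop being authoritative until databases are reconciled.
        self.dirty = True
        self.mode = Mode.PASSIVE

    def on_sync_complete(self):
        # The master has signalled that the sync is finished, so the
        # master now holds everything: "non-dirty and passive".
        self.dirty = False
        self.unreplicated.clear()

    def answers_dhcp(self):
        return self.mode is Mode.ACTIVE

    def route_dns(self, name):
        """Return "answer" or "forward" for a DNS query for `name`."""
        if self.mode is Mode.ACTIVE or name in self.unreplicated:
            return "answer"
        return "forward"
```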
> I haven't thought thoroughly, it's just a rough idea for the moment.
OK, here's my back-of-envelope suggestion, with minimal reference to yours.
Dnsmasq instances can be configured as either primary or secondary.
A primary works pretty much as usual, except that it accepts connections from
secondaries. When a secondary connects, it sends its current idea of
the lease database to the primary. The primary merges that with its own
lease database and sends the result back to the secondary. It then
serves DHCP requests as normal and sends incremental changes to the
lease database to any connected secondary.
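A sketch of the primary's side, assuming leases are held as a dict mapping a client MAC to `(ip, expiry)` and that each connected secondary is represented by a callable that delivers one message. The proposal doesn't say how conflicts are resolved during the merge; this assumes the lease with the later expiry wins. `Primary` and its methods are hypothetical.

```python
class Primary:
    """Hypothetical primary: merges a secondary's leases, then streams changes."""

    def __init__(self, leases):
        self.leases = dict(leases)  # mac -> (ip, expiry)
        self.secondaries = []       # one send callable per connected secondary

    def on_secondary_connect(self, their_leases, send):
        # Merge the secondary's idea of the database into ours
        # (assumed rule: later expiry wins).
        for mac, (ip, expiry) in their_leases.items():
            mine = self.leases.get(mac)
            if mine is None or expiry > mine[1]:
                self.leases[mac] = (ip, expiry)
        # Send the merged result back, then remember the secondary
        # so it receives incremental updates from now on.
        send(("full", dict(self.leases)))
        self.secondaries.append(send)

    def on_lease_change(self, mac, ip, expiry):
        # Serving DHCP as normal changed a lease: push the delta.
        self.leases[mac] = (ip, expiry)
        for send in self.secondaries:
            send(("update", mac, ip, expiry))
```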
On a secondary, at start-up, load the lease database from local disk as usual,
then attempt to connect to the configured primary. If this succeeds, do the
lease database swap described above then enter secondary-passive mode
where DNS queries are answered but not DHCP requests. If the primary
connection cannot be established or fails, enter secondary-active mode
where DHCP requests are answered. Try to contact the primary at regular
intervals. When the link to the primary comes back, do the
lease-database exchange, and then go back to secondary-passive mode.
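The secondary's behaviour is a two-state machine. This sketch assumes a hypothetical `connect` callable that tries the primary, sends the local leases and returns the merged database, or `None` if the primary can't be reached; a timer would call `on_retry_timer` at regular intervals. `Secondary` and `State` are illustrative names only.

```python
from enum import Enum


class State(Enum):
    PASSIVE = "secondary-passive"  # answers DNS only
    ACTIVE = "secondary-active"    # answers DNS and DHCP


class Secondary:
    """Hypothetical secondary following the start-up and retry rules above."""

    def __init__(self, local_leases, connect):
        self.leases = dict(local_leases)  # loaded from local disk as usual
        self.connect = connect
        self.state = State.ACTIVE
        self.try_primary()

    def try_primary(self):
        merged = self.connect(self.leases)
        if merged is None:
            self.state = State.ACTIVE   # primary unreachable: serve DHCP
        else:
            self.leases = merged        # the lease-database swap
            self.state = State.PASSIVE

    def on_link_lost(self):
        self.state = State.ACTIVE

    def on_retry_timer(self):
        # Only retry while active; a passive secondary is already linked.
        if self.state is State.ACTIVE:
            self.try_primary()

    def answers_dhcp(self):
        return self.state is State.ACTIVE
```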
The secondary-primary connections will be over TCP, or possibly SCTP.
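TCP delivers a byte stream rather than messages, so the lease exchange and incremental updates would need some framing; SCTP preserves message boundaries, which is one reason it might appeal here. No wire format is specified, so this sketch assumes a 4-byte big-endian length prefix followed by a JSON payload, purely for illustration.

```python
import json
import struct

HEADER = struct.Struct("!I")  # assumed: 4-byte big-endian payload length


def encode(msg):
    """Frame one message for sending over a TCP connection."""
    payload = json.dumps(msg).encode("utf-8")
    return HEADER.pack(len(payload)) + payload


class Decoder:
    """Reassemble framed messages from arbitrary-sized TCP reads."""

    def __init__(self):
        self.buf = b""

    def feed(self, data):
        self.buf += data
        out = []
        while len(self.buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self.buf)
            if len(self.buf) < HEADER.size + length:
                break  # wait for the rest of this message
            body = self.buf[HEADER.size:HEADER.size + length]
            self.buf = self.buf[HEADER.size + length:]
            out.append(json.loads(body))
        return out
```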
Configuration on a primary looks like
--failover-listen=<port no>
Configuration on a secondary looks like
--failover-master=<IP of primary>,<port on primary>
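The two proposed options don't exist in dnsmasq today; if they did, their values could be parsed roughly as below. `parse_failover_listen` and `parse_failover_master` are hypothetical helpers. Splitting on the last comma keeps IPv6 addresses intact, since they contain colons but no commas.

```python
import ipaddress


def parse_failover_listen(value):
    """Parse the <port no> of the proposed --failover-listen option."""
    port = int(value.strip())
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_failover_master(value):
    """Parse '<IP of primary>,<port on primary>' for --failover-master."""
    host, sep, port = value.rpartition(",")
    if not sep:
        raise ValueError("expected <IP>,<port>")
    return ipaddress.ip_address(host.strip()), parse_failover_listen(port)
```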
Need to think about security, since connections to the primary can mess
with its lease database.
This only works with one primary and one secondary: if there are
multiple secondaries they'll all become active when the primary dies,
which is wrong.