[Dnsmasq-discuss] A (possibly bad) idea: failover in dnsmasq

Fri May 25 16:26:25 BST 2012

On 25/05/12 12:17, Jan-Piet Mens wrote:
> Starting just a few days before the day the machine running dnsmasq in
> my SOHO died, I was giving some thought to how I'd go about ensuring
> a backup copy of dnsmasq could take over if my only running instance
> died. Needless to say, the death of the machine left my small network in
> shambles, because I couldn't connect to anything to fix things without
> first configuring temporary static addresses; sans DHCP, stuff fails... :)
> 
> I'm anything but a DHCP specialist, but I want to bounce this idea off
> you anyway, even if you mind. ;-)
> 
> The trick, as I understand it, in setting up more than a single dnsmasq
> instance in a network, is to ensure that it uses --dhcp-script to STORE
> the leases and --leasefile-ro to force the script to produce a list of
> current leases ("init") from which a launching dnsmasq obtains its data
> before going on its usual business.
> 
> If we were able to ensure the "data store" (i.e. lease database) were
> available on two machines A and B (and up to date on both of course) the
> solution would be easy, except for the fact that dnsmasq does not LOOKUP
> (i.e. query) for a lease in the data store except upon startup.
> 
> I'm thinking along the lines of having a function lease_query() in
> lease.c which dnsmasq invokes to determine whether a lease exists before
> issuing a new lease for a device.
> 
> Being very lightweight, dnsmasq must not be bloated by having a huge
> MySQL or other database attached to it. I've been searching the
> Internets and finally landed upon Tokyo Tyrant [1], which I discussed a
> long time ago [2].
> 
> What I'm basically getting at is providing dnsmasq with an optional very
> lightweight replicating server which it (optionally) uses to ensure the
> lease database can be propagated to a second (or third or fourth)
> dnsmasq instance. The reason I'm suggesting Tyrant is that it, too, is
> lightweight and offers multi-master setups.
> 
>      +------------+                       +-------------+
>      |   dnsmasq  |                       |  dnsmasq    |
>      |     A      |                       |     B       |
>      +-----+------+                       +-------------+
>            |                                     |
>            |                                     |
>            |                                     |
>      +-----v-------+                      +------v-------+
>      |   Tyrant    |                      |   Tyrant     |
>      |     A       |+--------------------->     B        |
>      |             |<---------------------+              |
>      +-------------+                      +--------------+
> 
>      +-------------+                      +---------------+
>      |   leases    |                      |    leases     |
>      |-------------|                      |---------------|
>      +-------------+                      +---------------+
> 
> In other words, dnsmasq (A) reads/writes leases from/to Tyrant (A) and
> dnsmasq (B) reads/writes from/to Tyrant (B). If Tyrant (A) and (B) can
> speak to each other, the database is replicated, irrespective of which
> dnsmasq (A) or (B) has last written a lease.
> 
> I'll stop here, before boring you even more, but I'll gladly send you
> snippets of code and a short "howto" for setting up a multi-master system. Most
> important IMO is to keep things very light-weight in the spirit of
> dnsmasq.
> 
> Best regards,
> 
>         -JP

It's necessary to decide what you're trying to achieve for failover. If
you want a system which just transparently keeps working when a DHCP
server fails, then the ISC server is the best bet, without a doubt.
Let's assume you don't want that, but don't want to be dead in the water
when a machine running dnsmasq fails.

The first thing to note is that DHCP sort of keeps working anyway. Even
if the server goes down and the lease database is lost, the clients will
continue to work until the leases expire. What's more, if they get
towards the end of the lease period without contacting the DHCP server
that gave them a lease, they'll broadcast and accept a renewal from any
server. This works now. If you set the lease time to 2 days and then
take down the dnsmasq server, you have at least a day to bring up dnsmasq
on another machine before any client loses network connectivity, and once
that second server is up, its lease database will gradually populate
with all the clients that were in the old database,
_at_the_same_IP_addresses_.
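
To put rough numbers on that window, here is a minimal sketch. It
assumes clients use the RFC 2131 default timers (renew at 50% of the
lease, rebind by broadcast at 87.5%); a server can override both, so
treat the fractions as assumptions. The function name failover_window
is made up for this illustration.

```python
from datetime import timedelta

# RFC 2131 defaults (assumed here): renew (T1) at 50% of the lease,
# rebind by broadcast (T2) at 87.5% of the lease.
T1_FRACTION = 0.5
T2_FRACTION = 0.875


def failover_window(lease: timedelta, since_renewal: timedelta) -> dict:
    """Timings for one client, measured from the moment the server dies.

    since_renewal is how long before the failure the client last renewed.
    A healthy client renews at T1, so it can be at most T1.
    """
    t1 = lease * T1_FRACTION
    t2 = lease * T2_FRACTION
    if since_renewal > t1:
        raise ValueError("a client with a working server renews by T1")
    return {
        # When the client gives up on the old server and broadcasts.
        "starts_broadcasting_in": t2 - since_renewal,
        # When the client loses its address if nobody answers.
        "lease_expires_in": lease - since_renewal,
    }


if __name__ == "__main__":
    lease = timedelta(days=2)
    # Worst case: the client was just about to renew when the server died.
    print(failover_window(lease, lease * T1_FRACTION))
    # Best case: the client renewed immediately before the failure.
    print(failover_window(lease, timedelta(0)))
```

With a 2-day lease the worst-placed client still holds its address for
a day after the failure, which is where the "at least a day" comes from.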

The problem with this is that until a client talks to the new server
and appears in the new lease database, it effectively disappears from
the DNS. That's what will break things and why preserving a copy of the
lease database is useful.

The above applies to active-passive. Active-active, as you suggest, is
more complex, because either server can talk to a client, so things like
lease times have to be co-ordinated. This is what the ISC failover
protocol does, I believe.

For dnsmasq, I can see that active-passive is easy to do. Take your
diagram above, and delete dnsmasq B. dnsmasq A keeps the Tyrant instance
A up-to-date with the lease database and that gets replicated to Tyrant
B. If dnsmasq A fails, then dnsmasq B is started, initialises its lease
database from Tyrant B and is there for clients as they fail to talk
to dnsmasq A and start to broadcast. More importantly, dnsmasq B can
provide a DNS service with all the clients in it straight away.
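
As an illustration of the hook side of this, below is a minimal sketch
of a --dhcp-script that mirrors leases into an external store and
replays them on "init" when dnsmasq runs with --leasefile-ro. Several
things here are assumptions: a local JSON file stands in for Tyrant
(replication is left out entirely), STORE_PATH is a made-up location,
and the argument order, the DNSMASQ_LEASE_EXPIRES and DNSMASQ_CLIENT_ID
variables and the "expiry MAC IP hostname client-id" output format
follow my reading of the dnsmasq man page, so check them against your
version.

```python
#!/usr/bin/env python3
"""Hypothetical --dhcp-script hook: mirror dnsmasq leases into a JSON file."""
import json
import os
import sys

STORE_PATH = "/var/lib/misc/dnsmasq-leases.json"  # assumed location


def load(path):
    """Read the lease store; an absent file means no leases yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save(path, leases):
    """Write via a temporary file so readers never see a partial store."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(leases, f)
    os.replace(tmp, path)


def apply_event(leases, action, mac, ip, hostname, expires, client_id):
    """Apply one add/old/del event to the in-memory lease table."""
    if action in ("add", "old"):
        leases[ip] = {"mac": mac, "hostname": hostname,
                      "expires": expires, "client_id": client_id}
    elif action == "del":
        leases.pop(ip, None)
    return leases


def init_lines(leases):
    """Render the store in lease-file format for the 'init' call."""
    return [f"{l['expires']} {l['mac']} {ip} {l['hostname']} {l['client_id']}"
            for ip, l in sorted(leases.items())]


def main(argv, environ):
    action = argv[1]
    leases = load(STORE_PATH)
    if action == "init":
        for line in init_lines(leases):
            print(line)
        return
    if action not in ("add", "old", "del"):
        return  # ignore any other action dnsmasq may pass
    mac, ip = argv[2], argv[3]
    hostname = argv[4] if len(argv) > 4 else "*"  # "*" means unknown
    # Falling back to "0" is a placeholder choice for this sketch only.
    expires = environ.get("DNSMASQ_LEASE_EXPIRES", "0")
    client_id = environ.get("DNSMASQ_CLIENT_ID", "*")
    save(STORE_PATH, apply_event(leases, action, mac, ip,
                                 hostname, expires, client_id))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv, os.environ)
```

Both the active and the standby machine would run dnsmasq with
--leasefile-ro and --dhcp-script pointing at such a hook; only the
storage behind load() and save() needs to be replicated.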

This active-passive scheme shouldn't need any dnsmasq changes, and
arranging to monitor server instances and start a new one when an
existing one goes down is a solved problem: it's exactly what heartbeat
does.
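
For readers who have not used heartbeat, the decision it automates can
be sketched in a few lines. This is not heartbeat's implementation or
configuration, only an illustration of the "start the standby after
several missed probes" logic; the Monitor class and the max_missed
threshold are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Decide when a standby dnsmasq should be started (illustrative only)."""
    max_missed: int = 3          # consecutive failed probes tolerated
    missed: int = 0
    standby_active: bool = False

    def observe(self, primary_alive: bool) -> bool:
        """Record one probe result; return True if the standby starts now."""
        if self.standby_active:
            return False         # already failed over, nothing more to do
        if primary_alive:
            self.missed = 0      # any success resets the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.standby_active = True
            return True
        return False


if __name__ == "__main__":
    m = Monitor(max_missed=3)
    probes = [True, False, False, True, False, False, False]
    for alive in probes:
        if m.observe(alive):
            print("primary considered dead: start dnsmasq B")
```

Requiring several consecutive misses avoids starting a second DHCP
server because of a single lost probe.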

Building a heartbeat harness to run dnsmasq active-passive with a
replicated Tyrant (or another database) sure looks like a useful thing
to try, IMHO.

Simon.