[Dnsmasq-discuss] RA support in dnsmasq

Sat Dec 1 17:19:39 GMT 2012

On 12/01/2012 10:08 AM, Simon Kelley wrote:
> On 01/12/12 14:54, Gene Czarcinski wrote:
>> On 11/30/2012 05:23 PM, Gene Czarcinski wrote:
>>> On 11/30/2012 04:18 PM, Simon Kelley wrote:
>>>> On 30/11/12 21:03, Gene Czarcinski wrote:
>>>>> On 11/30/2012 12:45 PM, Simon Kelley wrote:
>>>>>> On 30/11/12 17:20, Gene Czarcinski wrote:
>>>>>>> On 11/30/2012 11:32 AM, Simon Kelley wrote:
>>>>>>>> On 30/11/12 15:54, Gene Czarcinski wrote:
>>>>>>>>> On 11/29/2012 04:18 PM, Simon Kelley wrote:
>>>>>>>>>> On 29/11/12 20:31, Gene Czarcinski wrote:
>>>>>>>>>>
>>>>>>>>>>> I spoke too quickly.
>>>>>>>>>>>
>>>>>>>>>>> The cause of the problem is libvirt-related, but I am not sure what
>>>>>>>>>>> just yet.
>>>>>>>>>>>
>>>>>>>>>>> I was running a libvirt that had a lot of "stuff" on it but seemed to
>>>>>>>>>>> work OK. Then, earlier today, I updated to a point that appears to be
>>>>>>>>>>> somewhat beyond the leading edge and, although I was not getting any
>>>>>>>>>>> RTR-ADVERT messages, it turned out that there were/are big-time
>>>>>>>>>>> problems running qemu-kvm. So, I backed off/downgraded to the
>>>>>>>>>>> previous version. Qemu-kvm now works but the RTR-ADVERT messages are
>>>>>>>>>>> back.
>>>>>>>>>>>
>>>>>>>>>>> This may be a bit time-consuming to debug!
>>>>>>>>>>>
>>>>>>>>>> Are you seeing the new log message in netlink.c?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> The good news is that libvirt is working again (I must have done a
>>>>>>>>> git-pull in the middle of an update). Thus, I am not seeing the large
>>>>>>>>> numbers of RTR-ADVERT.
>>>>>>>>>
>>>>>>>>> Yes, I am seeing the new log message and I have a question about that.
>>>>>>>>> Every time a new virtual network interface is started, something must
>>>>>>>>> be doing some type of broadcast, because all of the dnsmasq instances
>>>>>>>>> (the new one and all the "old" ones) suddenly wake up and issue a
>>>>>>>>> flurry of RA packets and related syslog messages. To kick the flurry
>>>>>>>>> off, there is one of the new "unsolicited" syslog messages from each
>>>>>>>>> dnsmasq instance.
>>>>>>>>>
>>>>>>>>> Is this something you would expect? Is this "normal"? The libvirt
>>>>>>>>> folks say they are not doing it.
>>>>>>>> I'd expect it. The code you instrumented gets run whenever a "new
>>>>>>>> address" event happens, which is whenever an address is added to an
>>>>>>>> interface. "Every time a new virtual network interface is started" is
>>>>>>>> a good proxy for that.
>>>>>>>>
>>>>>>>> The dnsmasq code isn't very discriminating: it updates its idea of
>>>>>>>> which interfaces have which addresses, and then does a minute of fast
>>>>>>>> advertisements on all of them. It might be possible to only do the
>>>>>>>> fast advertisements on new interfaces, but implementing that isn't
>>>>>>>> totally trivial.
>>>>>>>>
>>>>>>>>
>>>>>>> Yes, I doubt very much if it would be trivial. However, I do not
>>>>>>> believe that this is the basic problem.
>>>>>>>
>>>>>>> When the problem occurs, one of the networks "suddenly" attempts to
>>>>>>> work with the real NIC rather than the virtual one defined in its
>>>>>>> config file. I slightly changed the IPv4 and IPv6 addresses defined
>>>>>>> for this network and the problem went away. I have also "just" seen
>>>>>>> the problem happen on another system which also had that virtual
>>>>>>> address defined.
>>>>>>>
>>>>>>> BTW, these configurations all use interface= and bind-dynamic rather
>>>>>>> than the "old" bind-interfaces with listen-address= specified for
>>>>>>> each IPv4 and IPv6 address. I had not noticed the problem previously.
>>>>>>> Why it occurs at all with just this specific address is puzzling.
>>>>>>>
>>>>>>> The configuration which causes problems is:
>>>>>>> ------------------------------------------
>>>>>>> # dnsmasq conf file created by libvirt
>>>>>>> strict-order
>>>>>>> domain-needed
>>>>>>> domain=net6
>>>>>>> expand-hosts
>>>>>>> local=/net6/
>>>>>>> pid-file=/var/run/libvirt/network/net6.pid
>>>>>>> bind-dynamic
>>>>>>> interface=virbr11
>>>>>>> dhcp-range=192.168.6.128,192.168.6.254
>>>>>>> dhcp-no-override
>>>>>>> dhcp-leasefile=/var/lib/libvirt/dnsmasq/net6.leases
>>>>>>> dhcp-lease-max=127
>>>>>>> dhcp-hostsfile=/var/lib/libvirt/dnsmasq/net6.hostsfile
>>>>>>> addn-hosts=/var/lib/libvirt/dnsmasq/net6.addnhosts
>>>>>>> dhcp-range=fd00:beef:10:6::1,ra-only
>>>>>>> -------------------------------------------------
>>>>>>>
>>>>>>> When I changed all the "6"s to "160", the problem disappeared. And
>>>>>>> there is another network defined almost the same, with "8" instead of
>>>>>>> "6", and I have had no problems with it.
>>>>>>>
>>>>>>> The real NIC is configured as a DHCP client for both IPv4 and IPv6.
>>>>>>> It is assigned "nailed" addresses of 192.168.17.2/24 and
>>>>>>> fd00:dead:beef:17::2.
>>>>>>>
>>>>>>> And I just discovered why crazy stuff is happening (but I do not know
>>>>>>> what causes it) ... the p33p1 NIC has:
>>>>>>>    inet6 fd00:beef:10:6:3285:a9ff:fe8f:e982/64 scope global dynamic
>>>>>>
>>>>>> Is that the "real NIC"?
>>>>>>
>>>>> Yes, p33p1 is the real NIC.  This is going to be a real PITA to debug
>>>>> because I believe part of the problem is a race condition.
>>>>> NetworkManager has this really long dance it goes through to bring up
>>>>> the IPv6 interface.
>>>>>
>>>>> But, I do not have any proof of that and, as I just proved to myself,
>>>>> getting things to repeat is going to be difficult.
>>>>>
>>>>> At this point I am not sure that bind-dynamic was related. I went
>>>>> through the syslogs I still have and the first occurrence was on 8
>>>>> November. That is well before bind-dynamic was integrated.
>>>>>
>>>>> Attached are some limited copies of syslogs that I thought you might
>>>>> find of interest. The "strangeness" seems to happen right after I
>>>>> update libvirt and libvirtd is restarted, which then gets dnsmasq
>>>>> started.
>>>>>
>>>>> If I cannot get this figured out and "fixed", I will need to disable
>>>>> use of dnsmasq for RA service and fall back on radvd.
>>>>>
>>>>> Frustrating .. so close and yet so far!
>>>>>
>>>>
>>>> I wonder if the virbr* interfaces are bridged to the "real" NICs,
>>>> such that when a prefix is advertised on the virbr interface, it
>>>> causes the real interface to add an address for that prefix. Because
>>>> dnsmasq is configured to advertise the prefix, that then causes the
>>>> advertisements via the real NIC.
>>>>
>>>> Just a thought.
>>>>
>>> If I had not done the ip addr to get the above, I would still be
>>> scratching my head.
>>>
>>> Anyway, here is ip addr:
>>> -----------------------------------------------
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     inet 127.0.0.1/8 scope host lo
>>>     inet6 ::1/128 scope host
>>>        valid_lft forever preferred_lft forever
>>> 2: p33p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
>>>     link/ether 30:85:a9:8f:e9:82 brd ff:ff:ff:ff:ff:ff
>>>     inet 192.168.17.2/24 brd 192.168.17.255 scope global p33p1
>>>     inet6 fd00:dead:beef:17:1::2/128 scope global
>>>        valid_lft forever preferred_lft forever
>>>     inet6 fe80::3285:a9ff:fe8f:e982/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 10: virbr11: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
>>>     link/ether 52:54:00:0b:84:5c brd ff:ff:ff:ff:ff:ff
>>>     inet 192.168.6.1/24 brd 192.168.6.255 scope global virbr11
>>>     inet6 fd00:beef:10:6::1/64 scope global
>>>        valid_lft forever preferred_lft forever
>>>     inet6 fe80::5054:ff:fe0b:845c/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 11: virbr11-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr11 state DOWN qlen 500
>>>     link/ether 52:54:00:0b:84:5c brd ff:ff:ff:ff:ff:ff
>>> ------------------------------------
>>>
>>> And here is brctl show:
>>> -----------------------------------------
>>> bridge name    bridge id        STP enabled    interfaces
>>> virbr11        8000.5254000b845c    yes        virbr11-nic
>>> ---------------------------------------
>>>
>>> I think I will give it a rest until tomorrow!
>>>
>> I ran yet another test and this time it happened.  I am getting a little
>> more info such as:
>> Dec  1 05:52:36 falcon dnsmasq-dhcp[23358]: ra_start_unsolicted(), len=60, type=14, flags=0, pid=5b5e
>>
>> where most look like:
>> Dec  1 05:52:37 falcon dnsmasq-dhcp[23394]: ra_start_unsolicted(), len=64, type=14, flags=0, pid=0
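If the instrumented ra_start_unsolicted() line prints its fields in hexadecimal (an assumption, though pid=5b5e can only be hex), the values decode against <linux/rtnetlink.h>: type=14 would be 0x14 = 20, which is RTM_NEWADDR, the "new address" event, and pid=5b5e would be 23390. As a heuristic, a pid of 0 usually means the kernel originated the event, while a non-zero value usually carries the netlink port id of the process whose request caused the change (by default that process's PID). The helper names below are hypothetical; this is a decoding sketch, not dnsmasq code.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>
#include <stdlib.h>

/* Assumption: the debug line prints type= and pid= in hexadecimal. */
static unsigned long log_hex_field(const char *text)
{
    return strtoul(text, NULL, 16);
}

/* RTM_NEWADDR (20, i.e. 0x14) is the "new address" event that, per this
   thread, starts dnsmasq's burst of unsolicited RAs. */
static int is_new_address_event(unsigned long type)
{
    return type == RTM_NEWADDR;
}

/* Heuristic only: 0 usually means the kernel originated the event; a
   non-zero value is usually the requesting process's netlink port id. */
static int triggered_by_process(unsigned long pid)
{
    return pid != 0;
}
```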
>>
>>
>> 1. Is there other information I can/should print out?
>
> No, it's clear what's happening, I think.
>>
>> 2. Is there anything I can do to identify why dnsmasq is "suddenly"
>> using interface p33p1 when it was specifically configured to use
>> interface virbr11? I thought that bind-interfaces and bind-dynamic
>> were supposed to lock that dnsmasq instance into only servicing the
>> specified interface.
>>
>
> I think that's known: it's because p33p1 gets an address on the
> fd00:beef:10:6:: network, which dnsmasq is configured to advertise.
> The difficult question is why it's getting that address.
>
> (The fd00:beef:10:6:: address on p33p1 is not shown in your latest
> dump, but it is shown in earlier ones. The existence of that address
> on p33p1 should be a big red flag when you're trying to diagnose this.)
Yes, it is not in the syslog and I did not capture it with the Dec 1
occurrence. However, it was the same as I specified above: an
fd00:beef:10:6:: address formed by SLAAC. I wonder what would have
happened if that dnsmasq with fd00:beef:10:6 had been doing DHCPv6?

From above: inet6 fd00:beef:10:6:3285:a9ff:fe8f:e982/64 scope global dynamic

You will notice the resemblance to:
inet6 fe80::3285:a9ff:fe8f:e982/64 scope link

So, why is an fd00:beef:10:6:: RA packet being sent out of that interface?
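The resemblance is what SLAAC produces by default on Ethernet: the low 64 bits of the autoconfigured address are the modified EUI-64 identifier built from the MAC (RFC 4291, Appendix A: flip the universal/local bit of the first octet and insert ff:fe in the middle). The sketch below checks an address against p33p1's MAC 30:85:a9:8f:e9:82 from the ip addr output; the function names are hypothetical. A match on fd00:beef:10:6:3285:a9ff:fe8f:e982 suggests p33p1 autoconfigured that address itself after receiving an RA carrying fd00:beef:10:6::/64.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Modified EUI-64 interface identifier (RFC 4291, Appendix A):
   flip the universal/local bit of the first MAC octet and
   insert ff:fe between the third and fourth octets. */
static void mac_to_eui64(const uint8_t mac[6], uint8_t iid[8])
{
    iid[0] = mac[0] ^ 0x02;
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xff;
    iid[4] = 0xfe;
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];
}

/* Hypothetical helper: 1 if the low 64 bits of addr_text are the EUI-64
   identifier derived from mac, 0 if not, -1 if addr_text is not IPv6. */
static int addr_uses_mac(const char *addr_text, const uint8_t mac[6])
{
    struct in6_addr addr;
    uint8_t iid[8];

    if (inet_pton(AF_INET6, addr_text, &addr) != 1)
        return -1;
    mac_to_eui64(mac, iid);
    return memcmp(&addr.s6_addr[8], iid, sizeof iid) == 0;
}
```

The same check also matches virbr11's link-local fe80::5054:ff:fe0b:845c against its MAC 52:54:00:0b:84:5c, which is a quick sanity test of the rule.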
>
>> Previously, libvirt's parameters to dnsmasq were bind-interfaces and
>> listen-address=. These have now been replaced with bind-dynamic and
>> interface= to fix a serious problem. So, my question is whether the
>> "correct" configuration should be bind-interfaces with interface=?
>>
>> The objective is that, no matter how it is specified, dnsmasq should
>> ONLY service the networks defined on a specific interface and ignore
>> anything from other interfaces, whether they have the same network
>> defined or not.
>
> The RA code uses address matching to decide where to advertise. This 
> could arguably be augmented by filtering on --interface, but it isn't 
> at the moment.
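Address matching of this kind can be sketched as a plain prefix comparison. This is an illustrative model, not dnsmasq's actual code: should_advertise() and its allowed_ifname parameter (standing in for the --interface filter that, as described above, is not applied today) are hypothetical, and the /64 length is an assumption for the dhcp-range=fd00:beef:10:6::1,ra-only line in the net6 configuration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* 1 if addr falls inside prefix/plen, 0 otherwise. */
static int in6_prefix_match(const struct in6_addr *addr,
                            const struct in6_addr *prefix, unsigned plen)
{
    unsigned full = plen / 8, rem = plen % 8;

    /* plen is checked first so memcmp never reads past 16 bytes. */
    if (plen > 128 || memcmp(addr->s6_addr, prefix->s6_addr, full) != 0)
        return 0;
    if (rem == 0)
        return 1;

    /* Compare only the leading rem bits of the next byte. */
    unsigned char mask = (unsigned char)(0xff << (8 - rem));
    return (addr->s6_addr[full] & mask) == (prefix->s6_addr[full] & mask);
}

/* Hypothetical RA target check. allowed_ifname == NULL models pure
   address matching; a non-NULL name models an --interface filter. */
static int should_advertise(const char *ifname, const char *if_addr_text,
                            const char *range_text, unsigned plen,
                            const char *allowed_ifname)
{
    struct in6_addr addr, prefix;

    if (inet_pton(AF_INET6, if_addr_text, &addr) != 1 ||
        inet_pton(AF_INET6, range_text, &prefix) != 1)
        return 0;
    if (!in6_prefix_match(&addr, &prefix, plen))
        return 0;
    return allowed_ifname == NULL || strcmp(ifname, allowed_ifname) == 0;
}
```

With no name filter, p33p1's SLAAC address passes the same test as virbr11's own fd00:beef:10:6::1, which is consistent with the advertisements appearing on p33p1; with a filter naming virbr11, p33p1 would be rejected.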
The question is: does it need to check? This is not something that
occurs every time, but I seem to have the "Midas touch" to make it happen
often if not always.
>
>
>>
>> Question: what do you think would happen if two different processes
>> on two different hardware platforms sharing a common network fabric
>> were to run stateful RA on one system and stateless RA on the other
>> system? Could that be happening here and, if so, why?
>
> I don't understand this, what is stateful RA? Do you mean stateful DHCP?
Sorry. Stateful: DHCPv6 address, RA for route & prefix; stateless:
SLAAC address, RA for route & prefix.
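In RA terms (RFC 4861), the distinction comes down to two bits: the M (managed) flag in the RA header tells hosts to get addresses from DHCPv6, and the A (autonomous) flag in each Prefix Information option tells them they may form a SLAAC address from that prefix. The classifier below is a hypothetical sketch to make the two cases concrete; which bits each dnsmasq mode sets is not stated in this thread. An autoconfigured address like p33p1's fd00:beef:10:6:... one is normally the result of an RA for that prefix with A set.

```c
#include <stdint.h>

/* Flag bits defined in RFC 4861: section 4.2 (RA header flags byte)
   and section 4.6.2 (Prefix Information option flags byte). */
#define RA_FLAG_MANAGED     0x80 /* M: addresses available via DHCPv6 */
#define RA_FLAG_OTHER       0x40 /* O: other configuration via DHCPv6 */
#define PIO_FLAG_ONLINK     0x80 /* L: prefix is on-link */
#define PIO_FLAG_AUTONOMOUS 0x40 /* A: hosts may autoconfigure (SLAAC) */

/* Hypothetical labels for the cases described above. */
enum ra_mode {
    RA_MODE_NO_ADDRESS, /* route/prefix info only, no address from this RA */
    RA_MODE_STATEFUL,   /* address from DHCPv6 */
    RA_MODE_STATELESS,  /* address from SLAAC */
    RA_MODE_BOTH        /* host may do both */
};

/* Classify by the M bit of the RA header and the A bit of one PIO. */
static enum ra_mode classify_ra(uint8_t ra_flags, uint8_t pio_flags)
{
    int managed = (ra_flags & RA_FLAG_MANAGED) != 0;
    int autonomous = (pio_flags & PIO_FLAG_AUTONOMOUS) != 0;

    if (managed && autonomous)
        return RA_MODE_BOTH;
    if (managed)
        return RA_MODE_STATEFUL;
    if (autonomous)
        return RA_MODE_STATELESS;
    return RA_MODE_NO_ADDRESS;
}
```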

The reason I am asking is that there is a dnsmasq providing stateful RA
for DHCPv6 on the "official" subnet on the p33p1 interface
(fd00:dead:beef:17::), and then there is this dnsmasq which is supposedly
using virtual interface virbr11 for fd00:beef:10:6::.

Going back to Laine Stump, CVE-2012-3411, and why the change was made to 
use bind-dynamic.

See https://bugzilla.redhat.com/show_bug.cgi?id=874702, but basically,
one of libvirt's dnsmasq instances was responding to queries from a
network with the same IP values but on another interface. I believe the
hope was that eliminating the listen-address= parameters would fix that.
Then, using bind-dynamic instead of bind-interfaces was intended to allow
some (re)configuration to take place after dnsmasq had started. IMO,
allowing dynamic configuration is very much a secondary consideration to
working correctly.

If only I could predictably produce one configuration that works and one
that does not, this might be easy, but just because it worked once does
not mean it will again.

Thus, I need some additional info inside dnsmasq.
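One piece of additional information that would help is the name of the interface each address event refers to, taken from the ifaddrmsg payload of the RTM_NEWADDR message. A minimal sketch, assuming the debug code already has a pointer to that payload: describe_ifaddr() is a hypothetical helper, while if_indextoname() is the standard <net/if.h> call. If the interface has already disappeared, the raw index is logged instead.

```c
#include <sys/socket.h>
#include <net/if.h>
#include <linux/rtnetlink.h>
#include <stdio.h>

/* Hypothetical diagnostic helper: summarize an RTM_NEWADDR payload as
   "if=<name> family=<af> prefixlen=<n>". Returns the snprintf() length,
   so out may be NULL when outlen is 0. */
static int describe_ifaddr(const struct ifaddrmsg *ifa, char *out, size_t outlen)
{
    char name[IF_NAMESIZE];
    const char *family = ifa->ifa_family == AF_INET6 ? "inet6"
                       : ifa->ifa_family == AF_INET  ? "inet"
                       : "other";

    /* Fall back to the raw index if the interface has already gone away. */
    if (if_indextoname(ifa->ifa_index, name) == NULL)
        snprintf(name, sizeof name, "#%u", (unsigned)ifa->ifa_index);

    return snprintf(out, outlen, "if=%s family=%s prefixlen=%u",
                    name, family, (unsigned)ifa->ifa_prefixlen);
}
```

Logged next to the existing ra_start_unsolicted() line, this would show whether an address event for p33p1 precedes the burst.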