[Dnsmasq-discuss] DHCPv6 doesn't work on Linux interfaces enslaved to a VRF

Simon Kelley simon at thekelleys.org.uk
Mon Oct 9 21:17:23 UTC 2023


On 09/10/2023 11:40, Luci Stanescu wrote:
> Hi Simon,
> 
> Thank you for your response and your openness to this issue! My thoughts 
> below, inline (and apologies for the rather long email).
> 
>> On 9 Oct 2023, at 01:05, Simon Kelley <simon at thekelleys.org.uk> wrote:
>> 1) Even if this is a kernel bug, kernel bug fixes take a long time to
>> spread, so working around them in dnsmasq is a good thing to do, as 
>> long as it doesn't leave us with long-term technical debt. This 
>> wouldn't be the first time a kernel bug has been worked around.
> 
> I agree, that's why I've opened this discussion here.
> 
>> 2) https://docs.kernel.org/networking/vrf.html says:
>>
>> Applications that are to work within a VRF need to bind their socket 
>> to the VRF device:
>> setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
>> or to specify the output device using cmsg and IP_PKTINFO.
>>
>> Which kind of implies that this might not be a kernel bug, rather 
>> we're just not doing what's required to work with VRF.
> 
> I'm not convinced this isn't a kernel bug. The VRF implementation has 
> been developed in stages over several years. It is indeed the case that 
> initially the sockets had to be bound to the VRF device or to specify it 
> via IP_PKTINFO/IPV6_PKTINFO. But then came support for 
> net.ipv4.*_l3mdev_accept sysctls (which confusingly also affect AF_INET6 
> sockets) as well as a series of patches in 2018 that allowed specifying 
> a VRF slave device for several operations. Before that series of 
> patches, it certainly made sense for sin6_scope_id in msg_name for 
> recvmsg() to be the VRF device (it had to be) – but I'm not convinced it 
> shouldn't have been changed after the rules for connect() and sendmsg() 
> were relaxed. The thing is, as it stands, the kernel code works well for 
> everything except IPv6 link-local communication, so it wouldn't be 
> surprising for this to be a simple oversight.
> 
> I tracked this down while trying to figure out what's going on here 
> and went into some detail in the kernel bug report, which you can find here:
> 
> https://lore.kernel.org/netdev/06798029-660D-454E-8628-3A9B9E1AF6F8@safebits.tech/T/#u
> 
>> Setting the device to send to using IP_PKTINFO rather than relying on 
>> the flowinfo field in the destination address would be quite possible, 
>> and the above implies that it will work.
> 
> Apologies for being pernickety, but it's the scope_id field which is 
> relevant here, rather than flowinfo. And since we're talking AF_INET6, 
> shouldn't it be IPV6_PKTINFO?

My bad. I meant scope_id. It was late :(
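
To spell out the distinction, since the two fields sit next to each
other in struct sockaddr_in6 and are easy to mix up: in Python's
AF_INET6 address tuples they are the third and fourth elements. A
made-up example, with an arbitrary interface index:

dest = ('ff02::1:2', 547, 0, 4)
#      (address, port, flowinfo, scope_id): scope_id is the ifindex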


> 
> I *think* it should work. I have been unable to find a situation where 
> the scope received in the IPV6_PKTINFO cmsg to recvmsg() cannot be used 
> to reliably send a response out the same interface (which I believe is 
> exactly what DHCPv6 code will always want to do), but my word is 
> certainly no guarantee. More about this towards the end of the email.
> 
> However, it'll only work as long as you either do NOT specify a scope in 
> the destination of the sendmsg() call, or the scope you specify is exactly 
> the same as in the IPV6_PKTINFO ancillary message. Specifically, you cannot 
> specify the VRF master device index. I've adapted my earlier scripts to 
> test this and I've pasted them at the end of this email.
> 
>> This brings us on to
>>
>> 3) IPv4. Does DHCPv4 work with VRF devices? It would be nice to test, 
>> and fix any similar problems in the same patch. Interestingly, the 
>> DHCPv4 code already sets the outgoing device via IP_PKTINFO (there 
>> being no flowinfo field in an IPv4 sockaddr) so it stands a chance of 
>> just working.
> 
> DHCPv4 works just fine. My dnsmasq configuration uses 'interface' to 
> specify the VRF slave interface (which in my case is a bridge) and 
> DHCPv4 messages are sent out correctly.
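> 
> (For reference, nothing VRF-specific is needed in the configuration; the 
> relevant fragment is just something like the following, with a 
> hypothetical interface name and address range:
> 
> interface=br0
> dhcp-range=192.168.1.50,192.168.1.150,12h
> 
> where br0 is the bridge enslaved to the VRF.)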
> 
>> Copying the interface index into the flowinfo of the destination or 
>> setting IP_PKTINFO are both easy patches to make and try. The 
>> difficult bit is being sure that they won't break existing installations.
> 
> My tests seem to imply that leaving the received scope_id field (which 
> is the VRF master device index) unchanged and setting IPV6_PKTINFO won't 
> work. Three options do seem to work:
> 1. Overwrite the scope_id of the source address from recvmsg() with the 
> interface index from the received IPV6_PKTINFO.
> 2. When performing the sendmsg(), set the scope_id of the destination 
> to 0 and add IPV6_PKTINFO with the empty address (since the received 
> IPV6_PKTINFO specifies the multicast address, and that won't do as a 
> source) and the interface index from the received IPV6_PKTINFO.
> 3. If the socket is bound to an L3 interface (not the VRF master 
> device), just set the scope_id in the destination to 0; IPV6_PKTINFO 
> is not required. I'm not sure this'll work for dnsmasq, but I thought of 
> including it for the sake of completeness.
> 

This is good information.

I've implemented option 1 here and it's currently running as dogfood on 
my home network. There are no VRF interfaces there: this is a test 
mainly to check that nothing breaks. So far, so good.
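
For anyone following along without the patch, the logic of option 1 is
easy to mimic in userspace. A Python sketch of the idea (not the actual
dnsmasq code, which does this in C):

import socket
import struct

def fix_source_scope(source, cmsg_list):
    # Option 1: replace the scope_id from recvmsg()'s msg_name (on a VRF
    # setup, the master device's ifindex) with the ifindex from the
    # IPV6_PKTINFO ancillary message, i.e. the L3 interface the packet
    # actually arrived on.
    address, port, flowinfo, scope_id = source
    for level, ctype, cmsg_data in cmsg_list:
        if level == socket.IPPROTO_IPV6 and ctype == socket.IPV6_PKTINFO:
            _, ifindex = struct.unpack('@16sI', cmsg_data)
            scope_id = ifindex
    return (address, port, flowinfo, scope_id)

Replying to the address returned by fix_source_scope() should then work
whether or not the receiving interface is enslaved to a VRF.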

The patch I used is attached. It would be interesting to see if it 
solves the problem for you.



> I've adapted my scripts slightly to allow easier testing of the 
> behaviour. The receiver socket now binds to the VRF device instead (you 
> can even skip binding to a device entirely and just set the 
> net.ipv4.udp_l3mdev_accept sysctl to 1). The interface configuration is 
> as before:
> 
> ip link add myvrf type vrf table 42
> ip link set myvrf up
> ip link add veth1 type veth peer name veth2
> ip link set veth1 master myvrf up
> ip link set veth2 up
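> 
> (If you'd rather rely on the sysctl than on SO_BINDTODEVICE, that is:
> 
> sysctl -w net.ipv4.udp_l3mdev_accept=1
> 
> which, as noted above, confusingly also affects AF_INET6 sockets.)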
> 
> The receiver now echoes back any data that it gets:
> 
> import socket
> import struct
> 
> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
> s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_RECVPKTINFO, 1)
> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b'myvrf')
> s.bind(('', 2000, 0, 0))
> mreq = struct.pack('@16sI', socket.inet_pton(socket.AF_INET6, 'ff02::1:2'),
>                    socket.if_nametoindex('veth1'))
> s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
> 
> while True:
>     data, cmsg_list, flags, source = s.recvmsg(4096, 4096)
>     dest_scope = None
>     for level, type, cmsg_data in cmsg_list:
>         if level == socket.IPPROTO_IPV6 and type == socket.IPV6_PKTINFO:
>             dest_address, dest_scope = struct.unpack('@16sI', cmsg_data)
>             dest_address = socket.inet_ntop(socket.AF_INET6, dest_address)
>             print("PKTINFO destination {}%{}".format(
>                 dest_address, socket.if_indextoname(dest_scope)))
>     source_address, source_port, source_flow, source_scope = source
>     print("received message from {}%{}".format(
>         source_address, socket.if_indextoname(source_scope)))
>     ipinfo_cmsg = struct.pack('@16sI',
>                               socket.inet_pton(socket.AF_INET6, '::'),
>                               dest_scope or 0)
>     s.sendmsg([data],
>               [(socket.IPPROTO_IPV6, socket.IPV6_PKTINFO, ipinfo_cmsg)],
>               0,
>               (source_address, source_port, source_flow, source_scope))
> 
> The sender now waits for a response:
> 
> import socket
> 
> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
> dest = ('ff02::1:2', 2000, 0, socket.if_nametoindex('veth2'))
> s.sendto(b'foo', dest)
> data, source = s.recvfrom(4096)
> source_address, source_port, source_flow, source_scope = source
> print("Received response from '{}%{}': {}".format(
>     source_address, socket.if_indextoname(source_scope), data))
> 
> If you run this example, the sendmsg() call from the receiver will fail 
> with EINVAL. That's because the IPV6_PKTINFO ancillary message specifies 
> the correct interface as its scope (the veth1 interface index), while 
> sendmsg() specifies the VRF master device as the scope of the destination 
> (as received from recvmsg()). You can get it to work in one of three ways:
> 1. Specify 'dest_scope' as the scope_id in the last argument to sendmsg(), 
> which corresponds to the sin6_scope_id field of msg_name. I guess 
> you can even specify 'dest_scope or source_scope' to have a fallback in 
> case (for whatever reason) recvmsg() doesn't get an IPV6_PKTINFO 
> ancillary message. You can also omit the IPV6_PKTINFO ancillary message 
> from the sendmsg() call here, since it's now redundant.
> 2. Specify '0' as the scope_id instead of source_scope, but now the 
> IPV6_PKTINFO ancillary message in sendmsg() becomes mandatory.
> 3. Bind the socket to 'veth1' instead of 'myvrf', don't include an 
> IPV6_PKTINFO ancillary message in sendmsg(), and specify '0' as the 
> scope_id instead of source_scope.
> 
> These three changes correspond to my earlier three options, respectively.
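> 
> For instance, fix 1 applied to the receiver's sendmsg() call is just the 
> following (a sketch; the ancillary message is dropped as redundant):
> 
> s.sendmsg([data], [], 0,
>           (source_address, source_port, source_flow,
>            dest_scope or source_scope))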
> 
>> The difficult bit is being sure that they won't break existing 
>> installations.
> 
> Indeed. I think this boils down to:
> 1. Knowing that DHCPv6 should always send a message on the same L3 
> interface on which it received a request. I believe this holds true, but 
> you'll definitely know more about the spec and real-world scenarios than 
> I do.
> 2. Finding authoritative information that the interface index from 
> IPV6_PKTINFO is always set to the L3 interface on which a datagram was 
> received. The kernel mailing list might be a start? I'd certainly be happy 
> to help think about and test various scenarios.
> 

I'm happy that 1. is true. Please enquire about 2.


Cheers,

Simon.

> Cheers,
> 
> Luci
> 
> 
> -- 
> Luci Stanescu
> Information Security Consultant
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vrf.patch
Type: text/x-patch
Size: 1267 bytes
Desc: not available
URL: <http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/attachments/20231009/024a0930/attachment.bin>

