Implementing a Tradfri gateway in RIOT-OS

April 28, 2019

This is a different twist on an old Tradfri hack. Instead of talking to a Tradfri gateway via its community-documented CoAP API, what I wanted to do was implement a Tradfri gateway's CoAP API. (Actually, as you'll see, the idea in some sense is to delete the gateway altogether.) As we know, the Tradfri gateway talks CoAP over DTLS. Riot talks CoAP and DTLS. What could go wrong? Well, a lot could go wrong. That's what makes it interesting. That plus this bit about wanting to evaluate how close we're getting to fundamentally deleting or at least redefining one of the roles that an IoT gateway traditionally has served. More on that below.

The setup consists of a Raspberry Pi connected by uart to a kw41z-mini running Riot's gnrc-border-router example. The uart connection hosts a SLIP link between the Riot node and the pi so that the Riot node gets a world-routable IPv6 address and is routed to my home network through the pi's ethernet connection. The Riot node is then the router for a network of 802.15.4 wireless Riot nodes, each with its own world-routable IPv6 address. The function of the pi is really only to give the Riot node an ethernet connection. An obvious substitution would be an esp32, but as long as I'm doing this for academic purposes, a pi running linux is more convenient. More convenient still would of course be a Riot board which itself has ethernet or wifi, but there are approximately two problems with making such a board in proper open fashion which I'll rant about some other time.

Anyway, here the trouble has already begun, because the Tradfri app doesn't seem to support IPv6. Further, the way it finds a Tradfri gateway is via link-local mDNS. Link-local meaning the multicast won't get past the pi to the network of Riot nodes without adding more unwanted intelligence to the pi. Finally, the app is only meant to connect to one gateway at a time, which here means only one Riot node at a time.

IoT without the gateway

It's that last part that's really unfortunate, or you could say it would become the limitation I set out to find. See, the idea is that we want all of our IoT devices to be IP-reachable by the devices from which we control them and with as little as possible happening in between. That doesn't mean directly that it's bad for a phone app to connect to only one gateway at a time. What it means is that we don't want any "special sauce" running on the gateway. The border router (gateway) should only do network stack things for the Riot nodes in the same way that a wifi router only does network stack things for laptops. It seems that it will take no less than this to get to a point where we have one type of border router that works with all 802.15.4 devices just as we now have one type of wifi router that works with all 802.11 devices. This ultimately is one of the major goals of moving IoT networks off of commercial protocols like Zigbee and Z-Wave, and onto IETF standard protocols like IPv6, UDP, 6lowpan, CoAP, RPL, and so on.

To that end, the link-local mDNS queries that Tradfri uses aren't the worst thing to encounter. We can rebroadcast that to the 6lowpan network and that's still fairly standard networking stuff. We can also deal with translating between IPv4 and IPv6 obviously. And in fact I did succeed in getting the Tradfri app to talk directly to a Riot node. But only one Riot node at a time, because the app expects to talk to one gateway, not directly to each light and dimmer switch. So the only way to have it work with multiple Riot node lights and switches would be to set up the border router Riot node (the one connected to the pi) to do the same thing the Tradfri gateway does which is manage the network of end devices using special sauce.

Actually, what I was hoping to find was that the Tradfri app was already talking to devices directly and that the Tradfri gateway indeed did only network stack things. After all, the Tradfri gateway can easily have enough resources to do heavier protocols than CoAP, so I thought why would they use CoAP unless they're talking directly to the end devices. But it isn't so. Their api requires that the gateway implement a CoAP endpoint that returns a complete list of its connected devices. Although you can probably multicast that request out to the 6lowpan network, there seems to be an unavoidable need for intelligence in the border router to cache and assemble the responses from the 6lowpan network, because the Tradfri app expects (as far as I can tell, although I haven't tried otherwise) a single CoAP response containing a list of all of the devices.

So then if that's the case, is it even useful to have the server side of Tradfri's gateway implementation in a Riot node? I dunno. Some things just beg to be done, and sometimes you have to do a thing just to see if you can. There's also the fact that I wasn't sure what limitations I would find until I found them, and finding them was actually the point anyway. Certainly it is always useful to test different network stacks against each other in any case. And sure enough, I found bugs to fix. Incidentally, I also found bugs in the Tradfri Android app (including one which makes it trivially easy for a bad player to crash any Tradfri app present on the local wifi) which I can't fix, because the Tradfri app is closed-source. I know, big surprise there, right?

nanocoap and tinydtls in Riot

Riot has libraries for CoAP and DTLS and this was a good opportunity to test their interoperability with a stack they probably haven't been used with yet. I found that nanocoap and tinydtls are both usable with Tradfri, which is really awesome. Essentially all I had to do was copy Riot's dtls-echo example and drop in a few lines from the nanocoap_server example and then populate the CoAP resources according to the community-documented Tradfri CoAP endpoints. That's ignoring bugs that had to be fixed first, but I'll get to those later.

loose ends on the pi

On the pi, I configured avahi-daemon to publish an mDNS service for the Tradfri app to find, and I used socat to route the DTLS traffic from the pi's IPv4 address on my wifi network to the IPv6 address of the Riot node attached to the pi's uart.

cat > /etc/avahi/services/coap.service <<EOF && systemctl daemon-reload && service avahi-daemon restart
<?xml version="1.0" standalone='no'?><!--*-nxml-*-->
<!DOCTYPE service-group SYSTEM "avahi-service.dtd">

screen -dm socat UDP6-RECVFROM:5684,fork UDP6-SENDTO:[2001:470:4bb0:ffff:2167:632b:6241:6d76]:20220,sourceport=20221,reuseaddr

After a lot of fiddling, a moderate amount of tracking down bugs in Riot, and far too much searching for bugs in Riot before determining that they were in fact problems with my not-yet-working Tradfri implementation, the setup actually works.

finding the gateway

The Tradfri app finds gateways via an mDNS query for _coap._udp.local. It looks for responses that have a service name with a particular pattern. I saw queries going out on IPv6, but the app seems to ignore any AAAA records in the responses, and it won't connect to a gateway that has no A record. The app does let you input an IPv4 address manually, but later on in the setup process it will refuse to reconnect to the gateway if it fails to resolve it with mDNS.

It may accept a few patterns for the mDNS service name, but the one I stuck with is gw-aabbccddeeff, which is gw- followed by the mac address of the gateway. As for the hostname, in the end I found that it doesn't care what it is, although real gateways use (at least) TRADFRI-Gateway-aabbccddeeff.local. It also doesn't care whether you use the real mac address of the device. What's important is that the mac address in the mDNS service name matches the one you input in the app during setup as well as the one returned by one of the CoAP queries that the gateway must implement.
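The naming rule above is simple enough to sketch in a few lines of C. This is only an illustration of the gw-aabbccddeeff pattern; the function name gw_service_name and the buffer-size constant are hypothetical and are not taken from Riot or from the Tradfri gateway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* "gw-" (3) + 12 hex digits + terminating NUL. */
#define GW_SERVICE_NAME_LEN 16

/*
 * Hypothetical helper: format the mDNS service name as "gw-" followed by
 * the six mac address bytes in lowercase hex, with no separators.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int gw_service_name(const uint8_t mac[6], char *out, size_t out_len)
{
    assert(mac != NULL);
    if (out == NULL || out_len < GW_SERVICE_NAME_LEN) {
        return -1;
    }
    int n = snprintf(out, out_len, "gw-%02x%02x%02x%02x%02x%02x",
                     mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return (n == GW_SERVICE_NAME_LEN - 1) ? 0 : -1;
}
```

Whatever produces the name, the same twelve hex digits have to show up in the mDNS service name, in the value typed into the app, and in the CoAP response that reports the serial number.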

DTLS handshake

Upon finding a gateway to connect to (either via mDNS or manual input), the Tradfri app asks the user for a serial number, which is the gateway's mac address (I used 111111111111), and then a PSK for the initial DTLS connection. After you input those, it initiates a DTLS handshake with the gateway using an ID of Client_identity and the PSK you input, which in my case was 2222222222222222. Once that succeeds, the first CoAP query it makes is to provide a new ID and have the gateway generate a new PSK.
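On the gateway side this implies a small table mapping DTLS identities to keys: it starts out holding the Client_identity pair and gains an entry each time the app registers a new ID. Below is a minimal, self-contained sketch of such a table. The names (psk_store_t, psk_store_set, psk_store_get) and the size limits are hypothetical, it does not call into tinydtls, and it assumes the PSK is used as the raw ASCII bytes the user typed, which is an assumption rather than something verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PSK_MAX_ENTRIES 4   /* illustrative limits, not from any real stack */
#define PSK_ID_MAX      32
#define PSK_KEY_MAX     32

typedef struct {
    char id[PSK_ID_MAX];            /* NUL-terminated identity */
    unsigned char key[PSK_KEY_MAX];
    size_t key_len;
    int used;
} psk_entry_t;

typedef struct {
    psk_entry_t entries[PSK_MAX_ENTRIES];
} psk_store_t;

/* Add a key for an identity, or replace the key if the identity exists.
 * Returns 0 on success, -1 if the table is full or an argument is too long. */
int psk_store_set(psk_store_t *s, const char *id,
                  const unsigned char *key, size_t key_len)
{
    assert(s != NULL && id != NULL && key != NULL);
    if (strlen(id) >= PSK_ID_MAX || key_len > PSK_KEY_MAX) {
        return -1;
    }
    psk_entry_t *slot = NULL;
    for (size_t i = 0; i < PSK_MAX_ENTRIES; i++) {
        psk_entry_t *e = &s->entries[i];
        if (e->used && strcmp(e->id, id) == 0) {
            slot = e;               /* existing identity wins */
            break;
        }
        if (!e->used && slot == NULL) {
            slot = e;               /* remember the first free slot */
        }
    }
    if (slot == NULL) {
        return -1;
    }
    strcpy(slot->id, id);
    memcpy(slot->key, key, key_len);
    slot->key_len = key_len;
    slot->used = 1;
    return 0;
}

/* Look up the key for an identity given as bytes plus length (DTLS
 * identities are not NUL-terminated on the wire).
 * Returns the key length, or -1 if unknown or out is too small. */
int psk_store_get(const psk_store_t *s, const char *id, size_t id_len,
                  unsigned char *out, size_t out_len)
{
    assert(s != NULL && id != NULL && out != NULL);
    for (size_t i = 0; i < PSK_MAX_ENTRIES; i++) {
        const psk_entry_t *e = &s->entries[i];
        if (e->used && strlen(e->id) == id_len &&
            memcmp(e->id, id, id_len) == 0) {
            if (e->key_len > out_len) {
                return -1;
            }
            memcpy(out, e->key, e->key_len);
            return (int)e->key_len;
        }
    }
    return -1;
}
```

A DTLS stack's PSK callback would do the equivalent of psk_store_get with the identity it received in the handshake; the CoAP handler that issues a new PSK would do the equivalent of psk_store_set.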

initial CoAP requests

As soon as the new PSK is exchanged, the app initiates a new DTLS handshake with the new ID/PSK pair and never uses the original pair again. Once the new handshake is done, the app asks the gateway for some configuration/status details. I don't know what all of the queries are for, but one that needs to be handled correctly is GET /15011/15012 which needs to return the serial number (mac address) of the gateway.

The authoritative source of information about devices and their state is the gateway, so the app queries it each time it reconnects as well as periodically. There are several more queries than shown here, for things like groups and presets.

ok blink the damn led already

The end result of all of this is that when you turn on a light in the Tradfri app on your phone, it sends a CoAP PUT request encrypted with DTLS using a unique PSK to the IPv4 address it thinks is a Tradfri gateway, then socat on the pi receives the packet at that address and forwards it to the Riot node's IPv6 address over the uart SLIP link, then Riot uses tinydtls to decrypt the DTLS payload and passes it to nanocoap which runs gpio_set() which is Riot's function for setting a gpio pin.

You can probably guess that what happened next was I saw an led turn on. Well actually what happened next was I didn't see an led turn on and then I remembered that the led on a kw41z-mini is active-low so I changed the code from gpio_set() to gpio_clear() and recompiled and reflashed and tried it again... and then an led turned on. Sometimes I have wondered whether dumb mistakes like that ever stop being a thing. I am convinced that they do not.

the rest of the story

Well, it wouldn't be interesting if everything had worked out of the box. Above was everything that went right. Below is a selection of that which did not. If you want to use CoAP or DTLS with Riot, or are just curious about the state of using Riot to talk to other IPv6 devices in general, you could find this interesting, as I ran into a few real-world snags which were not expected. It's also not a bad look into the daily life of embedded software development.

wrong IPv6 source addresses

The first problem I encountered after getting mDNS resolution working and getting the Tradfri app to send a DTLS Client Hello packet to the Riot node was that Riot's DTLS Hello Verify Request reply packet wasn't reaching the app. I saw it coming over the SLIP link, but I didn't see it going out the pi's ethernet interface. This was because the source address of the packet was set to a link-local address while the destination address was global. That will not work.

I'm not certain exactly which layer should be responsible for setting the correct source address in this case, and for now I did not stop to find out. I patched gnrc_sock_udp.c to set the source address of a socket equal to the destination address of packets received on the socket. That resulted in the replies from tinydtls having the right source address, which allowed the DTLS handshake to progress further.

wrong IPv6 source addresses, again

Somewhere along the way I found myself messing with ntpdate so that I could reference timestamps between different nodes with millisecond resolution to aid debugging. Here I discovered that I wasn't through dealing with incorrect source addresses yet. I could run ntpdate on the link-local address of the pi, but it couldn't reach my laptop one more hop away. As it turns out, I couldn't even ping a global address from the Riot border router node. Wireshark shows it's because Riot is sending out packets with a global destination address yet a link-local source address again. In this case the pi is kind enough to send back ICMP Destination unreachable: Beyond scope of source address. That's handy.

This time I ended up in Riot's IP layer, specifically the part where gnrc_netif selects a source IP address for an outgoing packet in the case where an upper layer didn't specify one explicitly. It turns out that currently gnrc_netif only selects a source IP address from among the addresses assigned to the interface on which the packet is being sent. This can result in a situation where you're trying to send a packet to a global address but the stack sends the packet out with a link-local (fe80::) source address. This is a problem.

I've occasionally seen wrong source addresses coming out of Riot for a long time, but I hadn't looked into why until now. I see now that this problem arises only when Riot has more than one network interface. But that is exactly the configuration you have when you run Riot's gnrc-border-router example out-of-the-box. You have one wireless interface and one SLIP interface. And the global address that the node autoconfigures for itself from the prefix routed to it by the pi gets assigned (as it should) to the wireless interface. This means that the out-of-the-box experience of running Riot's gnrc-border-router is that the border router itself is unable to send proper packets out to the internet or anywhere else past the link-local SLIP link to the pi, while the Riot nodes downstream of the border router node are unaffected. I found that very unexpected, and it took me a while to establish that I wasn't just doing something wrong and to commence looking for a bug in Riot. Eventually, from reading the code comments, I learned it was less of a bug and more of a known deficiency.

Fixing this turned into a bit of work. It looked like a small change but then that cascaded into refactoring several functions and I suppose that's why no one has gotten around to doing it yet. I didn't want to spend a lot of time here but I hacked together something usable enough for now, which can be found in this commit.
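To make the idea concrete, here is a self-contained sketch of source selection that is allowed to look beyond the outgoing interface. It is not the code from that commit and not gnrc_netif's implementation; the types and names (ip6_addr_t, iface_t, select_src) are hypothetical, and it makes the loud simplification that every address outside fe80::/10 counts as global. Real stacks follow the much fuller rules of RFC 6724.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the stack's address and interface types. */
typedef struct {
    uint8_t u8[16];
} ip6_addr_t;

typedef struct {
    const ip6_addr_t *addrs;
    size_t num_addrs;
} iface_t;

/* True for addresses in fe80::/10. */
static bool is_link_local(const ip6_addr_t *a)
{
    return a->u8[0] == 0xfe && (a->u8[1] & 0xc0) == 0x80;
}

/* First address on the interface with the wanted scope, or NULL. */
static const ip6_addr_t *find_scope(const iface_t *ifc, bool link_local)
{
    for (size_t i = 0; i < ifc->num_addrs; i++) {
        if (is_link_local(&ifc->addrs[i]) == link_local) {
            return &ifc->addrs[i];
        }
    }
    return NULL;
}

/*
 * Choose a source address for a packet leaving through ifaces[out_if].
 * 1. Prefer an address of matching scope on the outgoing interface.
 * 2. For a global destination only, fall back to a global address that
 *    is assigned to any other interface.
 * A link-local address is never borrowed from another interface, since
 * it is only meaningful on its own link.
 */
const ip6_addr_t *select_src(const iface_t *ifaces, size_t num_ifaces,
                             size_t out_if, const ip6_addr_t *dst)
{
    assert(ifaces != NULL && dst != NULL && out_if < num_ifaces);
    bool dst_ll = is_link_local(dst);
    const ip6_addr_t *src = find_scope(&ifaces[out_if], dst_ll);
    if (src != NULL || dst_ll) {
        return src;
    }
    for (size_t i = 0; i < num_ifaces; i++) {
        if (i == out_if) {
            continue;
        }
        src = find_scope(&ifaces[i], false);
        if (src != NULL) {
            return src;
        }
    }
    return NULL;
}
```

With a SLIP interface that holds only a link-local address and a wireless interface that holds the global one, step 2 is what lets a packet to a global destination leave over SLIP with the global source address instead of fe80::.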

DTLS handshake hanging before completion

Now that I had synchronized timestamps, I returned to the DTLS handshake where at this point the Hello Verify Request packets were getting past the pi to the Tradfri app and the handshake was getting much further, but it still wasn't completing. The app would retry a few times before giving up, and the sequence each time looked like this in wireshark:

I started out not knowing anything about how DTLS is supposed to work, so I took this opportunity to learn. It seemed that Riot wasn't properly handling either the Client Key Exchange or Change Cipher Spec messages because it was failing to decrypt the first message sent in epoch 1 which is Handshake Finished. This was triggering it to send the Decrypt Error. It would actually turn out that the problem wasn't in the DTLS code at all but was somewhere else entirely.

Figuring out why this was happening was tricky. There's enough debug output in tinydtls to keep track of what's going on, except that enabling the output causes delays that prevent basically the whole thing from working. That's a common problem with embedded systems and there are a few ways to deal with it. One thing you can do is print your debug output into a ring buffer in ram instead of printing it directly to the uart. That way you can read it out later, when printing to uart won't interfere with whatever moment in time you're trying to debug. Several months ago I wrote something for Riot that helps me do exactly that. It traps most of the output that would normally go out the uart and stores it in ram instead. Then from the shell a dmesg command is used to read and clear the buffer. Only a small change to dtls_debug.h was necessary to intercept the debug output that tinydtls was sending to the uart.
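The ring buffer technique itself fits in a few dozen lines. The sketch below is not the Riot code mentioned above; the names (dbg_ring_t, dbg_ring_puts, dbg_ring_drain) and the tiny buffer size are made up for illustration, and it ignores locking, so it assumes writes and reads never interleave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DBG_RING_SIZE 16    /* deliberately tiny; a real log wants kilobytes */

typedef struct {
    char data[DBG_RING_SIZE];
    size_t head;            /* index where the next byte is written */
    size_t count;           /* number of valid bytes, at most DBG_RING_SIZE */
} dbg_ring_t;

/* Append a string to the ring. When the ring is full the oldest bytes
 * are overwritten, so the most recent output always survives. */
void dbg_ring_puts(dbg_ring_t *r, const char *s)
{
    assert(r != NULL && s != NULL);
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        r->data[r->head] = s[i];
        r->head = (r->head + 1) % DBG_RING_SIZE;
        if (r->count < DBG_RING_SIZE) {
            r->count++;
        }
    }
}

/* Copy the buffered bytes, oldest first, into out as a NUL-terminated
 * string and clear the ring (what a dmesg-style shell command would do).
 * Bytes that do not fit in out are discarded. Returns bytes copied. */
size_t dbg_ring_drain(dbg_ring_t *r, char *out, size_t out_len)
{
    assert(r != NULL && out != NULL && out_len > 0);
    size_t start = (r->head + DBG_RING_SIZE - r->count) % DBG_RING_SIZE;
    size_t n = (r->count < out_len - 1) ? r->count : out_len - 1;
    for (size_t i = 0; i < n; i++) {
        out[i] = r->data[(start + i) % DBG_RING_SIZE];
    }
    out[n] = '\0';
    r->head = 0;
    r->count = 0;
    return n;
}
```

The write path only touches ram, which is why it disturbs timing far less than pushing each character out the uart. Overwriting the oldest data is a design choice; dropping new data when full is the alternative if the start of a trace matters more than the end.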

This is a good starting point, but the problem still wasn't clear at first. What's not obvious from this output is that some of the packets are getting dropped at a lower layer and aren't reaching tinydtls at all. Having heavily verbose debug output turned on all the time isn't really practical, unfortunately, but I have repeatedly encountered the need to know, preferably from the lowest layer possible, each time a packet is physically received or sent. Usually I'm already familiar with how to get that particular output from Riot, because usually I either discovered or wrote it while getting the radio driver working for whatever new platform I'm using. In this case, however, the driver in question is not a radio driver but the ethos driver which is providing the uart SLIP interface to the pi, which I've not used before. I turned on the debug output from ethos.c and that revealed some error outputs that looked clearly related to the problem I was debugging, but that still didn't give me the "packet sent/received" outputs that I wanted. Adding the desired debug output to ethos.c would have worked fine, but I also happened to notice some output in gnrc_netif_ethernet.c for sent/received packets, and when I turned that on I saw even more error outputs that appeared relevant. This produced the "general overview of what's going on" log dump that I needed in order to narrow down what was happening and where:

Now finally it was possible to compare this debug output with the wireshark capture from the pi side of the uart SLIP link and establish with only a little effort that indeed the packets containing the Client Key Exchange and Change Cipher Spec messages are failing to reach tinydtls, and presumably the error outputs indicate the layer at which they're being lost. But why? Well, some rabbit holes you get to go down immediately, and some you have to postpone for another time. This time I decided to spend a few minutes poking around for a shortcut rather than committing to the hours it would take to properly understand the situation.

One of the first things I tried was reducing the baud rate of the uart SLIP link. That did not help the situation, however increasing it did. That turned out to be a rabbit hole too, but one that I thought worth going down. It didn't take too long, and I ended up switching the kw41z's uart clock source from CLOCK_MCGIRCLK to CLOCK_MCGFLLCLK and then I was able to run the uart SLIP link at 921600 baud, up from 115200 baud, which ended up making this particular packet loss go away. The underlying problem still exists, and I'm presuming it's due to incoming packets not being freed from ethos's buffer fast enough, but this change made it stop showing itself during the DTLS handshake with Tradfri and I finally saw a completed DTLS handshake and the first encrypted CoAP query coming from the Tradfri app:
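For a sense of scale on the baud rate change described above, here is a back-of-the-envelope sketch of how long a frame occupies the uart. It assumes 8N1 framing (10 bits on the wire per byte), which is an assumption and not something stated above, it ignores any escaping overhead added by the link framing, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 8N1 framing: 1 start bit + 8 data bits + 1 stop bit. */
#define UART_BITS_PER_BYTE 10u

/* Microseconds needed to clock num_bytes through a uart at the given
 * baud rate, rounded up. */
uint32_t uart_wire_time_us(uint32_t num_bytes, uint32_t baud)
{
    assert(baud > 0);
    uint64_t bits = (uint64_t)num_bytes * UART_BITS_PER_BYTE;
    return (uint32_t)((bits * 1000000u + baud - 1) / baud);
}
```

Under those assumptions an illustrative 1152-byte burst takes 100 ms at 115200 baud and 12.5 ms at 921600 baud, an eightfold difference in how long incoming data keeps arriving while earlier packets may still be sitting in the receive buffer.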

what I was actually doing when I accidentally did this instead

Next I want to implement Tradfri end devices (lights and switches) in Riot nodes that can connect to legit Tradfri devices, which I understand use Zigbee Light Link. I thought I would never touch Zigbee, but the IETF standards are just taking a long time to become mainstream, and in the meantime the off-the-shelf IoT devices and the open-source IoT devices exist in separate ecosystems. Recently I've gotten the open-source (GPL) ZBOSS stack working in Riot and I'm ready to get my hands on a pair of actual Tradfri devices so I can sit down and figure out just what kind of rabbit hole I've gotten myself into this time.


