Google Summer of Code 2016: External netifd Device Handlers – Final Milestone

TL;DR

FINISHING THE PROJECT

The past weeks have felt very satisfying in terms of coding. With the main structure in place, all that I had to do was add features, test, iron out the wrinkles and prepare my code for submission to the maintainer. Adding features was fun because it felt like taking big steps forward every day with immediate feedback.
There were a few stubborn bugs, though. Despite taking the better part of a week to find, they were — luckily — easy to fix. They usually required little more than changing the order in which code executed which meant swapping a few lines of code.
Rebasing the netifd source code on to the current version went very fast and resulted in three separate commits; two small ones to prepare the existing code base for my additions and one pretty big one adding my ~2000 line .c-file.

As this is the blog post wrapping up my Google Summer of Code, I am going to summarize the entire project. This probably means that I will repeat some of the points I have already explained in my previous posts.

GOALS MET

I set out to implement a way to open up OpenWRT/LEDE’s network interface daemon (netifd) for new device classes.
Until now, netifd was only able to handle device classes for which a handler was hard-coded into the daemon. I added a way to generate the necessary device handler stub from JSON descriptions and interface them over ubus with another process that handles all device-related actions such as creation and configuration. Netifd is more open to experimentation and does not require maintenance from someone introducing a new device class.
Configuration of these devices still happens in the familiar /etc/config/network file.

Along with the ubus interface in netifd, I wrote ovsd, an external device handler for Open vSwitch. Together with ovsd, netifd can create and configure Open vSwitch bridges without any Open vSwitch-specific code added to it. It simply parses the configuration in /etc/config/network, sees that there is an interface on a device with type ‘Open vSwitch’, uses this name to look up the corresponding device handler from a list and calls the ‘create’ function that is part of the device handler interface.
The Open vSwitch device handler stub then relays the command along with the configuration blob to ovsd via ubus.

Ovsd processes the command asynchronously and answers using the ubus subscription mechanism. Netifd can then bring up the device as usual and attach protocol handlers and interfaces to it.

My work on netifd is not likely to get included upstream before GSoC ends which is why I have created a patch with my changes and put it in a repository here. The repo includes the LEDE source code as a git submodule and is easy to build.
The ovsd source code is hosted at GitHub along with instructions on how to install it.

DETAILS OF THE SUBMITTED WORK

This example file demonstrates all the available configuration options for Open vSwitch devices in a possible scenario featuring an Open vSwitch bridge with interface eth0 and a fake bridge on top of it:

# /etc/config/network
config interface ‘lan’
    option ifname ‘eth0’
    option type ‘Open vSwitch’
    option proto ‘static’
    option ipaddr ‘1.2.3.4’
    option gateway ‘1.2.3.1’
    option netmask ‘255.255.255.0’
    option ip6assign ’60’

    option ofcontrollers ‘tcp:1.2.3.2:9999’
    option controller_fail_mode ‘standalone’
    option ssl_cert ‘/root/cert.pem’
    option ssl_private_key ‘/root/key.pem’
    option ssl_ca_cert ‘/root/cacert_bootstrap.pem’
    option ssl_bootstrap ‘true’

config interface ‘guest’
    option type ‘Open vSwitch’
    option proto ‘static’
    option ipaddr ‘1.2.3.5’
    option netmask ‘255.255.255.0’

    option parent ‘ovs-lan’
    option vlan ‘2’
    option empty ‘true’

The lines set apart are Open vSwitch-specific options. They configure OpenFlow settings, control channel encryption and the nature of the device itself, which can either be a ‘real’ or a ‘fake’ bridge on a parent bridge and VLAN tagging enabled.

This is the JSON description from which the device handler stub is generated:

# /lib/netifd/ubusdev-config/ovsd.json
{
    “name” : “Open vSwitch”,
    “ubus_name” : “ovs”,
    “bridge” : “1”,
    “br-prefix” : “ovs”,
    “config” : [
        [“name”, 3],
        [“ifname”, 1],
        [“empty”, 7],
        [“parent”, 3],
        [“vlan”, 5],
        [“ofcontrollers”, 1],
        [“controller_fail_mode”, 3],
        [“ssl_private_key”, 3],
        [“ssl_cert”, 3],
        [“ssl_ca_cert”, 3],
        [“ssl_bootstrap”, 7]
    ],
    “info” : [
        [“ofcontrollers”, 1],
        [“fail_mode”, 3],
        [“ports”, 1],
        [“parent”, 3],
        [“vlan”, 5],
        [“ssl”, 2]
    ]
}

The first four fields tell netifd the type of the devices and the name of the external handler to connect to via ubus. “bridge” and “br-prefix” signal the bridge capabilities of the devices and give a short prefix to prepend to devices of type Open vSwitch creating devices named “ovs-lan” and “ovs-guest” for the interfaces “lan” and “guest”, respectively.
The “config” and “info” field detail the configuration parameters and their data types for device creation and for responses to the ‘dump_info’ method of the external device handler (see my earlier blog post).
I have created a manual on how to write a JSON description of an external device handler that goes into greater detail. You can find it here.

POSSIBLE LIMITATIONS AND OPEN WORK

Because of the nature of the device handler I implemented, I could only test scenarios that arise from using Open vSwitch. These are bridge-like devices by design and not every device class someone might want to integrate with netifd will behave like them. I had to anticipate much of the behavior for non-bridge-like devices trying to make the ubus interface device class agnostic.

Also, while I tested ovsd with all the complex setups I could think of using fake bridges and wireless interfaces, there are probably edge cases where my implementation is insufficient. Ovsd may have to become stateful at some time.

At the moment, the JSON files from which the device handler stubs are created are a bit “human-unfriendly”. Parsing the JSON objects fails unless they are written on a single line with proper EOF termination. Allowing pretty-printed JSON would greatly increase readability. What’s more, the parameters’ data types are cryptic enumeration values from libubox. Another translation layer turning numbers into readable types like “string” would also help users read and write description files.

IMPROVEMENTS TO MAKE

In the future, I would like to add a way for external device handlers to send detailed feedback about requests they receive back to netifd within the context of the request. This way authors of external device handlers could provide device-specific information such as specific error messages when a command fails.
All the building blocks for such a mechanism are there:
  – ubus replies that are logically connected to an ongoing request and
  – the possibility to tell netifd what messages to expect via an entry in the JSON description file.

This way, messages and their formats could be defined by the authors of external device handlers on a per-device-class — and even a per-command — base. Netifd would then be able to handle these messages and write them to the logs along with information about the context of the ongoing transaction.

MY EXPERIENCE WITH GSOC

All in all, GSoC was a blast. Although challenging and frustrating at times, I am happy to have done it. This was the first time I had to read other people’s code to this extent and at this level of — for lack of a better word — sophistication. It took me the better part of a year to really get a grip on the code base when I was working with it as part of the student project at TU Berlin but once I had a feeling for its structure, playing around with the features I added was fun. It feels rewarding to have worked on and contributed to “real world” free software.
From the organizational, side I am very happy with both Freifunk’s and Google’s way of managing the process. I always knew what was expected of me and was provided the necessary information in advance.

ACKNOWLEDGEMENTS

Now that my first GSoC is over, some thanks are due. First and foremost, I would like to thank my mentor and advisor Julius Schulz-Zander for introducing me to GSoC and for his counsel throughout the summer. Thanks to Felix Fietkau, original author of netifd and its maintainer, of whom I’ve learned a lot both indirectly by working with his code and directly when receiving feedback on my work. Finally, I want to thank Freifunk for letting me do this project and — of course — Google for organizing GSoC.

Google Summer of Code 2016: External netifd Device Handlers – Milestone 1

OVERVIEW OF THE LAST WEEKS
During the last 5 to 6 weeks I have implemented the possibility to include wireless interfaces in Open vSwitch bridges, rewritten a lot of the code for creating bridges with external device handlers and brought my development environment up to speed. I am now working with an up-to-date copy of the LEDE repository.
I have also implemented the possibility for users of external device handlers to define what information and statistics of their devices look like and to query the external device handler for that data through netifd.

CHANGES TO THE DEVELOPMENT ENVIRONMENT
So far, I have been using quilt to create and manage patches for my alterations of the netifd code.
One day they were completely broken. I still do not know what and how it happened but it was not the first time and this time recovery would have been way too tedious.
This is why I switched to a git-only setup. I now have a clone of the netifd repo on my development machine that is symlinked into the LEDE source tree using the nice ‘enable package source tree override’ option in the main Makefile. I used the oppportunity to update both the LEDE source tree and the netifd repository to the most recent versions.
Before, I was working on an OpenWRT Chaos Calmer tree, because of a bug causing the Open vSwitch package to segfault with more recent kernels.
Now, everything is up-to-date: LEDE, netifd and Open vSwitch.

MY PROGRESS IN DETAIL

More Dynamic Device Creation
In actual coding news, I have refined the callback mechanism for creating bridges with external device handlers and the way they are created and brought up.
Previously, a bridge and its ports were created and activated immediately when /etc/config/network was parsed. Now, the ubus call is postponed until the first port on the bridge is brought up.
Because of the added asynchronicity, I had to add a ‘timeout and retry’-mechanism to keep track of the state of the structures in netifd and the external device handler.

A few questions have come up regarding the device handler interface. As I have explained in my first blog post, I am working on Open vSwitch integration into LEDE writing an external device handler called ovsd. Obviously, this is very useful for testing as well.
I have come across the issue of wanting to disable an bridge without deleting it. This means bringing down the bridge and removing the L2 devices from it. The device handler interface that I mirror for my ubus methods doesn’t really have a method for this. The closest thing is ‘hotplug remove’, which using feels a bit like a dirty hack to me.
I have reached out no netifd’s maintainer about this issue. For the meantime, I stick to (ab)using the hotplug mechanism.

On the ovsd side I have added a pretty central feature: OpenFlow controllers. Obviously, someone who uses Open vSwitch bridges is likely to want to use its SDN capabilities.
Controllers can be configured directly in /etc/config/network with the UCI option ‘ofcontrollers’:

config interface ‘lan’
        option ifname ‘eth0’
        option type ‘Open vSwitch’
        option proto ‘static’
        option ipaddr ‘1.2.3.4’
        option gateway ‘1.2.3.1’
        option netmask ‘255.255.255.0’
        option ofcontrollers ‘tcp:1.2.3.4:5678’
        option controller_fail_mode ‘standalone’

The format in which the controllers are defined is exactly the one the ovs-vsctl command line tool expects.
The other new UCI option below ofcontrollers configures the bridge’s behavior in case the configured controller is unreachable. It is a direct mapping to the ovs-vsctl command ‘set-fail-mode’. The default behavior in case of controller absence is called ‘standalone’ which makes the Open vSwitch behave like a learning switch. ‘secure’ disables the adding of flows if no controller is present.

Function Coverage: Information and Statistics
Netifd device handlers have functions to dump information and statistics about devices in JSON format: ‘dump_info’ and ‘dump_stats’. Usually, these just collect data from the structures in netifd and the kernel but with my external device handlers, it is not as simple. I have to relay the query to an external device handler program and parse the response. Since the interface is generic, I cannot hard-code the fields and types in the response. This is why I relied once more on the JSON data type description mechanism that I have already used for dynamic creation of device configuration descriptions.
In addition to the mandatory ‘config’ description, users can now optionally provide ‘info’ and/or ‘stats’ fields. Just like the configuration descriptions they are stored in the stub device handler structs within netifd where they are available to serve as blueprints for how information and statistics coming from external device handlers have to be parsed.

For my Open vSwitch setup, it currently looks like this in /lib/netifd/ubusdev-config/ovsd.json:
{
    “name” : “Open vSwitch”,
    “ubus_name” : “ovs”,
    “bridge” : “1”,
    “br-prefix” : “ovs”, 
    “config” : [
        [“name”, 3],
        [“ifname”, 1],
        [“empty”, 7],
        [“parent”, 3],
        [“ofcontrollers”, 1],
        [“controller_fail_mode”, 3],
        [“vlan”, 6]
    ],
    “info” : [
        [“ofcontrollers”, 1],
        [“fail_mode”, 3],
        [“ports”, 1]
    ]
}

This is how it looks when I query the Open vSwitch bridge ‘ovs-lan’:

THE NEXT STEPS

During the weeks to come I want to look into some issues which occurred sometimes when I disabled and re-enabled a bridge: Some protocol-realated configuration went missing. This could mean that sometimes the configured IP address was gone. Something which could help me overcome the problem is also in need of some work: reloading/reconfiguring devices.
Along with this, I want to get started with the documentation to prepare for the publication of the source code.