Load-correlated distributed bandwidth analysis for LibreMesh networks – #4: Conclusions and further work

Here I describe everything I did for my Google Summer of Code project this year.

First of all, thanks to Freifunk and LibreMesh communities and developers for the opportunity!
The work I did is quite spread, from general documentation to bug fixing and actual coding, I’ll try to collect everything in a more-or-less ordered fashion.

Compiling the firmware: methods, fixes and documentation

At the beginning of my GSoC, I tested various methods for compiling the latest LibreMesh firmware.

OpenWrt buildroot

At first I tried using the LibreRouter organization fork of the OpenWrt source code repository. After updating a small thing here (merged PR) I decided to use directly OpenWrt repository on the openwrt-18.06 branch in order to have all the fixes which will enter the next OpenWrt release in the 18.06 family.

As explained in my second blog post I decided to compile all the LibreMesh packages but to not include them in the binary image, this allowed me to flash a safe image (plain OpenWrt) and to add the juicy bits using OPKG from a local-network packages repository. Looking back, maybe this was an overkill and including all the packages into the images would have been just fine.

The list of packages I selected and I suggest to use as default for the next LibreMesh release are:

check-date-http first-boot-wizard hotplug-initd-services lime-app lime-debug lime-hwd-ground-routing lime-hwd-openwrt-wan lime-proto-anygw lime-proto-babeld lime-proto-batadv lime-proto-wan lime-system shared-state shared-state-babeld_hosts shared-state-dnsmasq_hosts shared-state-bat_hosts shared-state-dnsmasq_leases shared-state-nodes_and_links lime-docs lime-docs-minimal libremap-agent

I documented the process here.

lime-sdk

lime-sdk was the recommended local compilation method for the last stable release LibreMesh 17.06. I fixed its master branch here (merged PR). But more problems persist, see my issues here and here. I didn’t try to fix them as the main developer decided to abandon the support to the latest stable release (see here) and for the next release it won’t be used anyway.

openwrt-metabuilder

I casually found Paul Spooren’s openwrt-metabuilder which has the potential to provide the same user experience as lime-sdk. I fixed a small thing in the examples here (merged PR) and created two new examples: one for compiling LibreMesh 17.06 and another for compiling the latest code, they can be found here (open PR). This system downloads and installs compiled packages, which for the latest LibreMesh code case are compiled by Travis continuous integration. Travis configuration was broken, I updated the configuration here (merged PR) and works again. The list of the packages being compiled was not complete so that some of the ones needed for the latest LibreMesh could not be installed, I added all of them to the to-be-compiled list here (open PR).

Documentation on compilation

What I concluded was used for updating the compilation instructions on the LibreMesh website, with plentiful of other updates and improvements, can be read here (open PR).

One thing that the documentation is still missing is how to use the network-profiles (introduced with LibreMesh 17.06 to be used with lime-sdk for having a network community wide customization) with the OpenWrt buildroot (while openwrt-metabuilder already supports it, simply indicating the network-profiles name as a package to install works). I started some discussion on the topic here.

Test network: supporting unsupported routers and unexpected bugs

Supporting more routers

LibreMesh default configuration creates three interfaces on each radio (two access points with different ESSID and one IEEE802.11s mesh). This works on a very limited set of routers, which are the officially LibreMesh-supported ones. I own many home routers from various ISP, which are perfectly supported by OpenWrt but not by LibreMesh and I wanted to expand LibreMesh support to these abundant and “free as in free beer” routers.

On LibreMesh, by default, the routing (BATMAN-adv and Babeld) happens on top of IEEE802.11s mesh interfaces. For using these routers I had to expand the configuration scope for AP and client interfaces and the result can be seen here (open PR).

Memory leak of YouHua WR1200JS on ethernet when using VLAN 802.1ad

While testing with the LibreMesh-supported routers I have, TP-Link WDR3600 and YouHua WR1200JS, I had some interesting trouble. The first router saw the routing peers also via ethernet cable connection while the second didn’t. Digging deep into the packets with tcpdump on various interfaces I realized that the YouHua WR1200JS leaks memory (I don’t know from which memory) into the packets’ content when using VLAN of type 802.1ad (the common VLAN 802.1q works just ok) breaking the packets and leaking information.

I reported this fact here and here and received no answer nor confirmation yet.

Data collection: lime-report and bandwidth-test

The objective of my GSoC included the development of reporting utilities and the smart scheduling of their execution.

Regarding the first part, I completed the development of lime-report (based on a draft by Paul Spooren) and developed from scratch bandwidth-test. The former can be seen here (open PR) and the latter here (open PR).

lime-report

lime-report is a very simple shell script outputting a set of debugging commands output and configuration files content. A few options allow the user to select the needed information type.

bandwidth-tool

bandwidth-test is a tool for estimating the maximum available download bandwidth from the internet. In order to work even on restricted connections, it just uses port 80 with HTTP connections. It has be designed for working also on a common Linux machine (requires lua, wget and pv), not only on OpenWrt.

By default, a few large files are downloaded during 20 seconds. After this timeout, the download gets interrupted and the speed estimated. The failed downloads gets ignored and more files gets downloaded until having 5 successful tests. At this point the outputted value is the median of the 5 results.

Tests scheduling at peak and night time

In order to have interesting information, the network status and performances have to be referenced to the network load. Active tests which risk to affect the users experience should be run during the night time, when the network is at rest, while passive tests can be safely done at the network usage peak time, when problems are more likely to show up. The tests results should be stored on the router for allowing the diagnosis of problems after an accident.

Each router determines the peak time based on three different commands giving an estimation of the network-wide connected clients. Once one day-time of load data is collected, each router starts scheduling the passive tests at the peak time, using the classical at command. The load time-profile is constantly updated considering both previous days and today’s loads.

The most heavy test to be run during the night time is the bandwidth test. In order to avoid cross-correlation between the tests, they have to be performed at different times. The synchronization is obtained using the shared-state routine and assuming that all the router’s clock are synchronized (we are performing bandwidth tests towards the internet, so it’s safe to assume that the clocks are synchronized, either via NTP or via check-date-http routine). The implemented strategy is: run the tests-scheduler routine at a randomized time, so that each router does it at a different time. Select the 6 hours in the day where the network load (number of clients) is minimum. Read the time the other routers announced they will run the tests at, this works via shared-state. Between these 6 hours chose the one which has less scheduled tests by other routers. Within this hour, group the other routers scheduled tests in 5 minutes groups and chose the less populated group. Randomize the execution time in this 5 minutes range.

The code is not yet tested enough to be considered ready, but can be seen at this commit. The actual PR will have a rewritten version of this, from another branch, but this link will be kept valid for GSoC reference.

More minor fixes and documentation

I reported here and proposed a fix here (open PR) for a problem noticed by an user. Some very minor errors I noticed and fixed are here (merged PR), here (merged PR) and here (open PR).

In this already mentioned pull request I also updated and expanded the lime-example file which is the most complete documentation on the LibreMesh configuration. Some more improvements to the website are here (merged PR), here (merged PR) and in this already mentioned pull request.

Further work

  • Complete the testing of tests-scheduler
  • Use LibLogNorm for normalizing the logs collected by lime-report and reducing their size
  • Make the tests results available to an external Prometheus monitor
  • Implement a strategy for saving the tests results on flash memory rather than on RAM (so that they are persistent over reboot): frequent writing has to be avoided for limiting the memory tearing, logs can be written on flash just when certain problems are detected (e.g. internet connection lost)
  • Implement a strategy for deleting old tests results when RAM or flash start getting full

Maaany hugs!
Ilario

Leave a Reply

Your email address will not be published. Required fields are marked *