Load-correlated distributed bandwidth analysis for LibreMesh networks – #4: Conclusions and further work

Here I describe everything I did for my Google Summer of Code project this year.

First of all, thanks to Freifunk and LibreMesh communities and developers for the opportunity!
The work I did is quite spread, from general documentation to bug fixing and actual coding, I’ll try to collect everything in a more-or-less ordered fashion.

Compiling the firmware: methods, fixes and documentation

At the beginning of my GSoC, I tested various methods for compiling the latest LibreMesh firmware.

OpenWrt buildroot

At first I tried using the LibreRouter organization fork of the OpenWrt source code repository. After updating a small thing here (merged PR) I decided to use directly OpenWrt repository on the openwrt-18.06 branch in order to have all the fixes which will enter the next OpenWrt release in the 18.06 family.

As explained in my second blog post I decided to compile all the LibreMesh packages but to not include them in the binary image, this allowed me to flash a safe image (plain OpenWrt) and to add the juicy bits using OPKG from a local-network packages repository. Looking back, maybe this was an overkill and including all the packages into the images would have been just fine.

The list of packages I selected and I suggest to use as default for the next LibreMesh release are:

check-date-http first-boot-wizard hotplug-initd-services lime-app lime-debug lime-hwd-ground-routing lime-hwd-openwrt-wan lime-proto-anygw lime-proto-babeld lime-proto-batadv lime-proto-wan lime-system shared-state shared-state-babeld_hosts shared-state-dnsmasq_hosts shared-state-bat_hosts shared-state-dnsmasq_leases shared-state-nodes_and_links lime-docs lime-docs-minimal libremap-agent

I documented the process here.

lime-sdk

lime-sdk was the recommended local compilation method for the last stable release LibreMesh 17.06. I fixed its master branch here (merged PR). But more problems persist, see my issues here and here. I didn’t try to fix them as the main developer decided to abandon the support to the latest stable release (see here) and for the next release it won’t be used anyway.

openwrt-metabuilder

I casually found Paul Spooren’s openwrt-metabuilder which has the potential to provide the same user experience as lime-sdk. I fixed a small thing in the examples here (merged PR) and created two new examples: one for compiling LibreMesh 17.06 and another for compiling the latest code, they can be found here (open PR). This system downloads and installs compiled packages, which for the latest LibreMesh code case are compiled by Travis continuous integration. Travis configuration was broken, I updated the configuration here (merged PR) and works again. The list of the packages being compiled was not complete so that some of the ones needed for the latest LibreMesh could not be installed, I added all of them to the to-be-compiled list here (open PR).

Documentation on compilation

What I concluded was used for updating the compilation instructions on the LibreMesh website, with plentiful of other updates and improvements, can be read here (open PR).

One thing that the documentation is still missing is how to use the network-profiles (introduced with LibreMesh 17.06 to be used with lime-sdk for having a network community wide customization) with the OpenWrt buildroot (while openwrt-metabuilder already supports it, simply indicating the network-profiles name as a package to install works). I started some discussion on the topic here.

Test network: supporting unsupported routers and unexpected bugs

Supporting more routers

LibreMesh default configuration creates three interfaces on each radio (two access points with different ESSID and one IEEE802.11s mesh). This works on a very limited set of routers, which are the officially LibreMesh-supported ones. I own many home routers from various ISP, which are perfectly supported by OpenWrt but not by LibreMesh and I wanted to expand LibreMesh support to these abundant and “free as in free beer” routers.

On LibreMesh, by default, the routing (BATMAN-adv and Babeld) happens on top of IEEE802.11s mesh interfaces. For using these routers I had to expand the configuration scope for AP and client interfaces and the result can be seen here (open PR).

Memory leak of YouHua WR1200JS on ethernet when using VLAN 802.1ad

While testing with the LibreMesh-supported routers I have, TP-Link WDR3600 and YouHua WR1200JS, I had some interesting trouble. The first router saw the routing peers also via ethernet cable connection while the second didn’t. Digging deep into the packets with tcpdump on various interfaces I realized that the YouHua WR1200JS leaks memory (I don’t know from which memory) into the packets’ content when using VLAN of type 802.1ad (the common VLAN 802.1q works just ok) breaking the packets and leaking information.

I reported this fact here and here and received no answer nor confirmation yet.

Data collection: lime-report and bandwidth-test

The objective of my GSoC included the development of reporting utilities and the smart scheduling of their execution.

Regarding the first part, I completed the development of lime-report (based on a draft by Paul Spooren) and developed from scratch bandwidth-test. The former can be seen here (open PR) and the latter here (open PR).

lime-report

lime-report is a very simple shell script outputting a set of debugging commands output and configuration files content. A few options allow the user to select the needed information type.

bandwidth-tool

bandwidth-test is a tool for estimating the maximum available download bandwidth from the internet. In order to work even on restricted connections, it just uses port 80 with HTTP connections. It has be designed for working also on a common Linux machine (requires lua, wget and pv), not only on OpenWrt.

By default, a few large files are downloaded during 20 seconds. After this timeout, the download gets interrupted and the speed estimated. The failed downloads gets ignored and more files gets downloaded until having 5 successful tests. At this point the outputted value is the median of the 5 results.

Tests scheduling at peak and night time

In order to have interesting information, the network status and performances have to be referenced to the network load. Active tests which risk to affect the users experience should be run during the night time, when the network is at rest, while passive tests can be safely done at the network usage peak time, when problems are more likely to show up. The tests results should be stored on the router for allowing the diagnosis of problems after an accident.

Each router determines the peak time based on three different commands giving an estimation of the network-wide connected clients. Once one day-time of load data is collected, each router starts scheduling the passive tests at the peak time, using the classical at command. The load time-profile is constantly updated considering both previous days and today’s loads.

The most heavy test to be run during the night time is the bandwidth test. In order to avoid cross-correlation between the tests, they have to be performed at different times. The synchronization is obtained using the shared-state routine and assuming that all the router’s clock are synchronized (we are performing bandwidth tests towards the internet, so it’s safe to assume that the clocks are synchronized, either via NTP or via check-date-http routine). The implemented strategy is: run the tests-scheduler routine at a randomized time, so that each router does it at a different time. Select the 6 hours in the day where the network load (number of clients) is minimum. Read the time the other routers announced they will run the tests at, this works via shared-state. Between these 6 hours chose the one which has less scheduled tests by other routers. Within this hour, group the other routers scheduled tests in 5 minutes groups and chose the less populated group. Randomize the execution time in this 5 minutes range.

The code is not yet tested enough to be considered ready, but can be seen at this commit. The actual PR will have a rewritten version of this, from another branch, but this link will be kept valid for GSoC reference.

More minor fixes and documentation

I reported here and proposed a fix here (open PR) for a problem noticed by an user. Some very minor errors I noticed and fixed are here (merged PR), here (merged PR) and here (open PR).

In this already mentioned pull request I also updated and expanded the lime-example file which is the most complete documentation on the LibreMesh configuration. Some more improvements to the website are here (merged PR), here (merged PR) and in this already mentioned pull request.

Further work

  • Complete the testing of tests-scheduler
  • Use LibLogNorm for normalizing the logs collected by lime-report and reducing their size
  • Make the tests results available to an external Prometheus monitor
  • Implement a strategy for saving the tests results on flash memory rather than on RAM (so that they are persistent over reboot): frequent writing has to be avoided for limiting the memory tearing, logs can be written on flash just when certain problems are detected (e.g. internet connection lost)
  • Implement a strategy for deleting old tests results when RAM or flash start getting full

Maaany hugs!
Ilario

Load-correlated distributed bandwidth analysis for LibreMesh networks – #3: Completed test network and broadened scope of the work

The planned test network has been built, employing both fully supported (I just documented them in the tested routers list here) and common home routers (officially unsupported by LibreMesh but supported by OpenWrt).

Employing non supported routers required an expansion of my previous work about making possible an AP-sta (point to multi point access point to clients) network architecture (instead of the default IEEE802.11s mesh). My previous solution relied on BMX6 which will not be included in the next release, in favor of Babeld, so the problem is open again. I provisionally managed to have Babeld on AP and client interfaces adding the following setting in /etc/config/lime on the access point:

config wifi 'radio0'
     list modes 'apname'
     option country 'ES'
     option channel_2ghz '11'
     option apname_ssid 'LibreMesh.org/%H'
     option apname_key 'someAPpassword'
     option apname_encryption 'psk2'
     option distance '100'

 config net 'wirelessap'
     option linux_name 'wlan0-apname'
     list protocols 'babeld:17'

and the following in the /etc/config/lime of the client (taking advantage of the client protocol I added some time ago here):

config wifi 'radio0'
     list modes 'client'
     option country 'ES'
     option channel_2ghz '11'
     option client_ssid 'LibreMesh.org/LiMe-eb7f64'
     option client_key 'someAPpassword'
     option client_encryption 'psk2'
     option distance '100'

 config net 'wirelessclient'
     option linux_name 'wlan0-sta'
     list protocols 'client'
     list protocols 'babeld:17'

For some reason this solution does not propagate the default route obtained from Babeld to the whole network, this does not directly affect my project, anyway I’ll surely manage to fix this in the upcoming days.
In case the usage of such perfectly-working trashware was a blocker, I will receive a few more supported routers in the following days and I will just use those.

Also due to the switch to Babeld, to obtain a complete graph of the network is not yet possible (Babeld being based on the distance vector principle, does not know the whole topography and we’ll have to aggregate it using the new shared-state LibreMesh feature).

During the building of the test network, the planned topography changed a bit resulting in this one (solid lines are cabled connections, directional dotted lines with arrows points from the client to the access point, non-directional dotted lines are proper IEEE802.11s mesh):

All the routers were flashed with OpenWrt 18.06-SNAPSHOT image, which is OpenWrt 18.06.4 with additional fixes appeared in the release branch here compiled locally using OpenWrt buildroot. LibreMesh packages were also compiled in the same process but not included in the compiled image, and installed later using opkg and serving the packages over the local network. This approach showed to be more convenient than expected, additionally, the fallback image is a plain OpenWrt, which decrease the risk of “brikking” the devices.

The complete list of the installed packages from the LibreMesh ones is:

check-date-http first-boot-wizard hotplug-initd-services lime-app lime-debug lime-hwd-ground-routing lime-hwd-openwrt-wan lime-proto-anygw lime-proto-babeld lime-proto-batadv lime-proto-wan lime-system shared-state shared-state-babeld_hosts shared-state-dnsmasq_hosts shared-state-bat_hosts shared-state-persist shared-state-dnsmasq_leases shared-state-pirania

Lately, I got also involved in the development of lime-log-review, which uses liblognorm to decrease the volume of the logs and can be used in my project for storing the key information from the voluminous logs when an incident is detected.

Load-correlated distributed bandwidth analysis for LibreMesh networks – #2: Setting up the LibreMesh test network

In order to use the latest version of everything, I merged the latest commits from the LibreMesh community into my forked lime-packages repository.

To set up the test network was more complex than expected.
I managed to collect a very disperse set of routers: 8 routers of 6 different producers and 7 different models.
Two of these are officially supported by LibreMesh (TP-Link TL-WDR3600, Ubiquiti NanoStation Loco M2) and the others which are supported by OpenWrt but not by LibreMesh (Comtrend AR-5387un, Huawei HG556a-C, Observa VH4032N, Comtrend AR-5315u, Astoria ARV7519RW22-A-LT).

The non-LibreMesh-supported routers either cannot do multi-AP or mesh via IEEE802.11s, but this was not expected to be a problem as I took care to add the support to AP-client networks (no need for the routers to support IEEE802.11s mesh, only the last mentioned router does not have support for wifi at all).
My solution was based on BMX6 which seems will be dropped in the next LibreMesh release in favour of Babeld, and this will require an adaptation of the AP-client solution.

As mentioned in the previous post, I started compiling my LibreMesh firmware based on LibreRouter fork of OpenWrt 18.06 repository.
When I flashed my routers and configured the wireless interfaces for using AP or client rather than the default AP+AP+IEEE802.11s, most of them were showing strongly erratic behaviours.

So I decided to flash the routers with plain OpenWrt 18.06.2 without using LibreRouter fork and to install all LibreMesh packages via opkg.
In order to ensure that the compiled packages will be compatible with OpenWrt 18.06.2 release, the LibreMesh packages were compiled in my local buildroot of OpenWrt branch openwrt-18.06.
Then the openwrt/bin/ directory was served via HTTP from my local machine.
In order to have the routers accept my local repositories I had to install usign, create a key pair, sign the Packages file, push the public key to the routers and add the directions of the local repositories to /etc/opkg/customfeeds.conf
So for example, the customfeeds.conf file of the Observa VH4032N router will look like:

src/gz local_base http://192.168.1.3/packages/mips_mips32/base
src/gz local_libremap http://192.168.1.3/packages/mips_mips32/libremap
src/gz local_libremesh http://192.168.1.3/packages/mips_mips32/libremesh
src/gz local_luci http://192.168.1.3/packages/mips_mips32/luci
src/gz local_packages http://192.168.1.3/packages/mips_mips32/packages
src/gz local_routing http://192.168.1.3/packages/mips_mips32/routing
src/gz local_brcm63xx_smp http://192.168.1.3/targets/brcm63xx/smp/packages

Once completely configured, the network structure planned is represented in black in the following scheme.

Planned test network structure.

In order to better test the on a proper mesh, I ordered 3 additional routers fully supported by LibreMesh: YouHua WR1200JS (see here and here) from here.
They come with OpenWrt pre-installed and they fully support multi-AP + IEEE802.11s.
Once I will receive these two additional routers I will be able to add the mesh part of the test network as indicated in the scheme in red.

Regarding the load analysis of the network, the first approach will be to obtain this value from the number of clients currently connected to the network.
This number will be obtained in at least the following ways:

batadv-vis -f jsondoc | sort -u | wc -l

ip neigh show nud reachable | wc -l

In the meanwhile, a minor enhancement has been suggested and two others were accepted.

GSoC 2019 – Load-correlated distributed bandwidth analysis for LibreMesh networks

Introduction

Performance tests are key for identifying the bottlenecks and optimize the network topology.
The main indicator is the bandwidth, but also other values can be useful like latency, number of active users for each node, load average and RAM consumption of each node.
The value of these quantities can vary greatly between the peak time and the night time, for this reason some of the measurements should be carried on in both these two moments.
Some other measurements, which can affect the user experience, could instead be run just in the night time.
To identify the night time we can’t relay on the router’s internal clock which could be years away from the actual time.
So a method for getting the network-wide peak time will be sought.
Each router in the network should separately run these tests, and for avoiding to influence each others’ results they should run at different times.
This synchronization should be possible taking advantage of the LibreMesh architecture and the shared-state service.

About me

Here’s Ilario, I studied organic chemistry in Pisa, Italy and I’m currently in a PhD on perovskite solar cells in Tarragona, Catalonia, Spain.
During the university I contributed to the mesh network eigenNet, part of the Italian community network consortium Ninux.
I started the (nowadays stalled) NinuxVerona community and once in Spain I started actively contributing to GuifiCamp and LibreMesh.

Setup of develop environment and initial interactions with LibreMesh community

After proposing a fix, I managed to build the LibreMesh firmware at its current stable release (17.06) using lime-sdk.
Then I built the latest LibreMesh code on top of forked OpenWrt 18.06 buildroot as suggested by the mentors; at first this was not possible on Arch Linux but after contacting with the community they updated the forked OpenWrt repository and it worked, thanks!
Finally, in order to be able to have the most updated OpenWrt code available, I compiled the latest LibreMesh code on top of the trunk (master branch, the unstable version) of OpenWrt buildroot, this was possible after adapting some configuration to the latest OpenWrt.
For pushing my code I forked lime-packages repository and created a gsoc2019 branch which can be accessed here.
Additionally, in case modifications to OpenWrt 18.06 core were needed, I will push them here.
All the buildroot-based compilation methods are already setup with the new branch as a feed, while the possibility of a back port to the stable LibreMesh 17.06 release will be evaluated once the project is completed.

Objectives

  • Flash with LibreMesh 4+ routers (preferably different models with different performances, if needed buy some) and setup a test network;
  • define a set of information to collect, divide it in network safe (e.g. number of clients)/network intensive (e.g. bandwidth test) and understand how to collect this data;
  • understand how a Prometheus exporter works and develop one in lua for the “network safe” quantities;
  • choose a reasonable “network safe” quantity for identifying the usage peak of the whole network (e.g. number of clients);
  • develop a script that locally identifies the peak and the night time;
  • develop the scripts for the network intensive tests, these should also store on the flash memory the results;
  • discuss with the mentors if the previous logs can be overwritten or if they should accumulate on the router for a certain period of time, in the latter case implement it;
  • implement a strategy for avoiding network intensive tests on different routers to happen at the same time;
  • if for achieving this last point a synchronization of the routers’ clocks is unavoidable, find a converging way for doing so or an available tool which does not require internet access (no NTP);
  • write a small Prometheus exporter for serving the last peak and off-peak network intensive tests results;
  • write the init service;
  • create a Makefile for the package;
  • test in a real-world community;
  • adapt the code written for LibreMesh trunk version to run also on LibreMesh 17.06 release;
  • adapt the code to plain OpenWrt, evaluate needed dependencies, if possible push the created package to upstream repository.