GSoC 2024: eBPF performance optimizations for a new OpenWrt Firewall, Final report

Posted on 26. August 2024 (updated 3. September 2024) by Til Kaiser

Hello again everybody! With GSoC 2024 coming to an end, it is time to present my final blog post/report for my project, showing you what I have achieved during the summer.

Project Goal

This project aims to introduce a new firewall software offloading variant to OpenWrt by intercepting an incoming data packet from the NIC as early as possible inside or even before the network stack through the eBPF XDP or TC hook. After that, the packet might be mangled (e.g., NAT) and redirected to another network interface or dropped.

The result should be that we see a performance increase, either by having a higher throughput, dropping packets faster, or lowering the overall CPU load.

More detailed descriptions of this project can be found in my first blog post here and my midterm update here.

What I did

To achieve the goals of this project, I had to design and implement three parts of the software:

The eBPF program which intercepts, mangles, and forwards or drops incoming data packets
A user-space program that attaches the eBPF program, reads Firewall rules, and makes routing decisions for the received packets
An eBPF hashmap for the communication between the eBPF and user-space programs

Finally, a performance evaluation is required to compare the results of this eBPF implementation against OpenWrt’s current Firewall.

You can find my implementation, measurement scripts, and some plots in my dedicated GitHub repository here: https://github.com/tk154/GSoC2024_eBPF-Firewall

The current implementation state

eBPF kernel-space program

When the eBPF kernel-space program receives a packet, it parses the layer 2, 3, and 4 packet headers, and if a DSA switch receives the packet, it also parses the DSA tag. If any header is unknown in the respective layer, it passes the packet to the network stack. The following headers/DSA tags are currently supported:

Layer 2	Layer 3	Layer 4	DSA Tags
Ethernet, 802.1Q VLAN, PPPoE	IPv4, IPv6	TCP, UDP	mtk, gswip, qca

It then checks inside the eBPF hashmap what to do with the received packet:

If there is no entry yet for the packet/flow, it creates a new one to signal the user-space program.
If there is an entry but the eBPF program should not redirect the packet, the packet is passed to the network stack or dropped.
If there is an entry and the eBPF program should redirect the packet, it …
- Mangles the packet by applying NAT (if applicable) and adjusting the TTL and checksums
- Pushes the Ethernet header and possible additional L2 header onto the packet
- Sends the packet out of the designated network interface
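
The three cases above can be sketched in plain C as follows. This is only an illustration under assumptions: the names flow_action, flow_entry, and handle_packet are hypothetical and not taken from the repository, and the verdict values merely stand in for the return codes a real XDP or TC program would use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical action values that user space stores in a flow entry. */
enum flow_action {
    ACTION_NONE = 0,   /* fresh entry, user space has not decided yet */
    ACTION_PASS,       /* hand the packet to the network stack */
    ACTION_DROP,       /* drop the packet inside the eBPF program */
    ACTION_REDIRECT    /* mangle the packet and send it out directly */
};

/* Hypothetical verdicts, standing in for the hook's real return codes. */
enum verdict { VERDICT_PASS, VERDICT_DROP, VERDICT_REDIRECT };

struct flow_entry {
    enum flow_action action;
};

/*
 * entry == NULL models a map lookup that found nothing. In that case
 * the caller is expected to insert a fresh ACTION_NONE entry to signal
 * user space; *create_entry reports whether that is needed.
 */
static enum verdict handle_packet(const struct flow_entry *entry,
                                  bool *create_entry)
{
    *create_entry = (entry == NULL);
    if (entry == NULL)
        return VERDICT_PASS;

    switch (entry->action) {
    case ACTION_DROP:
        return VERDICT_DROP;
    case ACTION_REDIRECT:
        return VERDICT_REDIRECT;
    default:
        /* ACTION_NONE and ACTION_PASS both go to the network stack. */
        return VERDICT_PASS;
    }
}
```

Passing undecided flows to the network stack keeps Netfilter in charge until user space has explicitly handed the flow over.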

User-space program

When the user starts the binary, it attaches the eBPF program to the lowest possible network interfaces on the system or to all network interfaces given by the user via command-line arguments.

It then loops every n seconds through the flow entries of the eBPF hashmap and checks via nf_conntrack whether a connection tracking entry exists for that flow.

If so, and if the flow entry is new, it …
- Retrieves possible NAT information via nf_conntrack
- Makes the routing decision
- Checks if the eBPF program needs to push layer 2 headers
- Determines the next hop via rtnetlink
- Saves all that information inside the eBPF map to signal the eBPF program that it can take over now
For all existing flow entries, it updates the nf_conntrack timeouts as long as an established connection tracking entry exists.
If a connection tracking entry does not exist, it checks the Firewall rules via OpenWrt’s scripting language ucode to determine whether the eBPF program should drop the packet.

When a configurable flow timeout occurs, the user-space program deletes the flow entry from the eBPF map.
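
As a rough sketch, the per-flow decision inside that loop could be modeled like this. All names are hypothetical, and the timeout is assumed to be measured as idle time since the last packet of the flow, which the description above does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow states as seen by the user-space program. */
enum flow_state { FLOW_NEW, FLOW_OFFLOADED };

/* Hypothetical outcomes of one loop iteration for one flow entry. */
enum step_result {
    STEP_START_OFFLOAD,    /* collect NAT, route, and next hop; update map */
    STEP_REFRESH_TIMEOUT,  /* keep the conntrack entry alive */
    STEP_CHECK_RULES,      /* no conntrack entry: ask the firewall rules */
    STEP_DELETE            /* flow timed out: remove the map entry */
};

/*
 * ct_exists:     result of the nf_conntrack lookup for this flow
 * idle_seconds:  assumed time since the flow last saw a packet
 * flow_timeout:  the configurable flow timeout in seconds
 */
static enum step_result flow_step(enum flow_state state, bool ct_exists,
                                  uint32_t idle_seconds,
                                  uint32_t flow_timeout)
{
    if (idle_seconds >= flow_timeout)
        return STEP_DELETE;
    if (!ct_exists)
        return STEP_CHECK_RULES;
    return state == FLOW_NEW ? STEP_START_OFFLOAD : STEP_REFRESH_TIMEOUT;
}
```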

What is left to do

Submitting an XDP generic patch

Currently, for XDP generic, if the pre-allocated SKB does not have a packet headroom of 256 Bytes, it gets expanded, which involves copy operations consuming so many CPU cycles that the hoped-for performance gain is negated. I have already created a patch that makes the XDP generic packet headroom configurable, but I still need to submit it to upstream Linux.

Routing changes

When there is a new flow entry, the user-space program makes a routing decision and stores the result inside the eBPF map. But such a route can change later, e.g., because the user explicitly changed it or a network interface went down. The user-space program doesn’t react to routing changes yet, which means that the eBPF program keeps forwarding packets to the old routing destination.

Counter updates

As soon as the eBPF program starts forwarding packets, network interface and nf_conntrack counters aren’t updated anymore. Updating the interface counters shouldn’t be a problem, but in my testing, nf_conntrack counter updates seem to get ignored from user-space.

Performance results

Similar to my first blog post, I tested the throughput performance on an AVM FRITZ!Box 7360 v2 running OpenWrt with Linux Kernel version 6.6.41, whose CPU is too weak to saturate its Gigabit ports. I used iperf3 to generate IPv6 UDP traffic for 60 seconds, with NAT applied to the source and destination IPs and ports; you can find the results inside the following plot:

The parts are the following:

default: OpenWrt’s Firewall running but without any offloading enabled
sw_flow: Netfilter’s software offloading enabled (flow_offloading)
xdpg256: The eBPF program attached to the XDP generic hook with the default packet headroom of 256 Bytes
xdpg32: The eBPF program attached to the XDP generic hook with a custom packet headroom set to 32 Bytes
tc: The eBPF program attached to the TC hook
xdpg32_dsa: The eBPF program attached to the XDP generic hook of the DSA switch with a custom packet headroom set to 32 Bytes
tc_dsa: The eBPF program attached to the TC hook of the DSA switch

Unfortunately, there is no performance gain when using the XDP generic mode with the default 256 Bytes packet headroom. TC is on the same level as Netfilter’s software offloading implementation. XDP generic with the custom 32 Bytes packet headroom is around 50 MBit/s faster.

The actual performance gain comes into play when attaching the eBPF program to the DSA switch. While XDP generic with 256 Bytes packet headroom is now at least faster than without offloading, XDP generic with 32 Bytes packet headroom is about 250 MBit/s faster than any other offloading, which means about 50% more throughput. TC is also a little bit faster, but there is not such a performance increase as for XDP.

I have created the following graphs using the Linux command line tool perf and scripts from the FlameGraph repository. They show how many CPU cycles Linux kernel functions used for the OpenWrt Firewall running without any offloading and the XDP generic with 32 Bytes packet headroom attached to the DSA switch.

As you can see, since the eBPF program saves some Linux kernel function calls, the CPU can poll for more data via the xrx200_poll_rx function, which consequently benefits the throughput performance.

Soon, I will also upload the graphs for the other measured parts and the packet dropping performance to my already mentioned GitHub repository.

Concluding thoughts

While implementing this new Firewall offloading variant, I learned a lot of new things, not just about eBPF but also about the Linux kernel and network stack itself. It was not always easy (for example, I had to delve into Netlink first), but I also had a lot of fun while coding.

As I have shown, the performance gain is somewhat mixed compared to OpenWrt’s current Firewall. To have a higher throughput, my XDP generic patch would need to be accepted for the upstream Linux kernel.

Finally, I would like to thank my mentor, Thomas, for giving me the chance to participate in GSoC 2024, and the OpenWrt core developer, Felix, for guiding me through the project. Furthermore, I appreciate that Andi and all Freifunk members involved in GSoC make it possible to participate in such a project.

This concludes my GSoC 2024 project, but as I already mentioned, there is still some work to do. Should you have questions, do not hesitate to contact me. I hope you enjoyed the project as much as I did!

GSoC 2024: eBPF performance optimizations for a new OpenWrt Firewall, Midterm update

Posted on 12. July 2024 by Til Kaiser

Hello again, everybody! This is the Midterm follow-up blog post for my GSoC 2024 topic: “eBPF performance optimizations for a new OpenWrt Firewall.” It will cover how I started trying to solve the task, what the current implementation looks like, and what I will do in the upcoming weeks.

As a quick reminder: The project’s goal is to implement a new OpenWrt Firewall offloading variant using eBPF. Why eBPF? Because with eBPF, you can intercept an incoming data packet from the NIC very early inside or even before the Linux network stack. After intercepting the packet with an eBPF program at the so-called XDP or TC hook, you can mangle it, redirect it to another or out of the same network interface, or drop it. Mangling the packet could mean, for example, applying possible Network Address Translation (NAT), adjusting the Time-To-Live (TTL), or recalculating the checksum(s).
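
To make the TTL adjustment concrete, here is a minimal sketch in plain C that decrements the TTL of a raw IPv4 header and updates the header checksum incrementally instead of recomputing it. The function names are hypothetical; the arithmetic follows the usual one's-complement update, where lowering the TTL by one raises the checksum by 0x0100 with an end-around carry. The full recomputation is only included to cross-check the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Full IPv4 header checksum, skipping the checksum field at bytes 10-11. */
static uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2) {
        if (i == 10)
            continue;
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    }
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Decrement the TTL (byte 8) and patch the checksum in place.
 * The TTL is the high byte of a 16-bit header word, so the checksum
 * grows by 0x0100. The caller must make sure the TTL is larger than 1.
 */
static void ipv4_decrement_ttl(uint8_t *hdr)
{
    uint32_t check = (uint32_t)(hdr[10] << 8 | hdr[11]);

    check += 0x0100;
    check += (check >= 0xFFFF);   /* end-around carry */
    hdr[8]--;
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)check;
}
```

The incremental variant touches only three bytes of the header, which is why forwarding paths usually prefer it over summing the whole header again.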

The result should be that we see a performance increase, either by having a higher throughput, dropping packets faster, or lowering the CPU load.

Current implementation

The implementation consists of three components:

The eBPF program which intercepts, mangles, and forwards or drops incoming data packets from a network interface
A user-space program that attaches the eBPF program to the appropriate network interfaces and determines whether to forward a received packet and where to
An eBPF map (in this case, a key-value hash map) so that the eBPF and user-space program can communicate with each other

Originally, I wanted to parse all OpenWrt Firewall rules and dump them into the eBPF map when the user-space program starts. When the eBPF program received a packet, it would try to match it with one of the parsed rules. But I had a few talks with the OpenWrt community and my mentor and concluded that this approach poses some problems:

eBPF has limited looping support, but for rule matching, it is necessary to loop.
OpenWrt uses the Netfilter framework as its firewall backend, which has features that are too complex to implement in eBPF, for example, the logging of packets.

That is why we decided to go for a “flow-based” approach. When the eBPF program receives a packet, it creates a tuple from some crucial packet identifiers (Ingress interface index, L3 and L4 protocols, and source and destination IPs and ports). The program uses this tuple as the key for the eBPF hash map to signal the user-space program that it has received a packet for a new flow so that it can look up what the eBPF program should do with packets for that particular flow.
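
Such a tuple could be laid out as in the following sketch. The struct and its field names are assumptions for illustration and not the layout used in the repository. One detail is worth showing, though: an eBPF hash map treats the key as raw bytes, so any padding has to be zeroed or two logically identical keys will not match.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key: ingress interface plus the L3/L4 identifiers. */
struct flow_key {
    uint32_t ifindex;      /* ingress interface index */
    uint8_t  src_ip[16];   /* IPv4 uses the first 4 bytes, rest is zero */
    uint8_t  dst_ip[16];
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  l3_version;   /* 4 or 6 */
    uint8_t  l4_proto;     /* e.g. 6 for TCP, 17 for UDP */
    uint8_t  pad[2];       /* explicit padding, always zero */
};

/* Fill a key for an IPv4 flow; memset() clears unused bytes and padding. */
static void flow_key_init_v4(struct flow_key *key, uint32_t ifindex,
                             const uint8_t src[4], const uint8_t dst[4],
                             uint16_t src_port, uint16_t dst_port,
                             uint8_t l4_proto)
{
    memset(key, 0, sizeof(*key));
    key->ifindex = ifindex;
    memcpy(key->src_ip, src, 4);
    memcpy(key->dst_ip, dst, 4);
    key->src_port = src_port;
    key->dst_port = dst_port;
    key->l3_version = 4;
    key->l4_proto = l4_proto;
}
```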

Until the user-space program responds, the eBPF program passes all packets belonging to that flow to the network stack, where the Netfilter framework processes them for now. In the meantime, the user-space program checks what the eBPF program should do with packets from that flow and stores the result inside the hash map as the value.

Connection Tracking must also be available because the to-be-implemented offloading variant should be stateful instead of stateless. I first thought about implementing it in the eBPF or user-space program. But then I realized I would somewhat reinvent the wheel because OpenWrt uses the Netfilter framework, which has a connection tracking implementation called nf_conntrack.

The Netfilter project provides an API through their user-space library libnetfilter_conntrack to add, retrieve, modify, and delete connection tracking entries. I am using this API in my implementation to check whether a conntrack entry exists for a packet flow. In the case of TCP, it only forwards packets while a connection is in the “Established” state so that Netfilter can still handle the opening and closing states of the TCP connections. In the case of UDP, the eBPF offloader starts forwarding packets on its own as soon as and as long as a conntrack entry exists. The user-space program meanwhile updates the timeouts for offloaded connections.
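
The offloading condition described above boils down to a small predicate. The following sketch uses a simplified, hypothetical TCP state enum rather than the constants from libnetfilter_conntrack, so treat it as a model of the rule, not as the actual code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PROTO_TCP 6    /* IP protocol number of TCP */
#define PROTO_UDP 17   /* IP protocol number of UDP */

/* Simplified, hypothetical subset of conntrack TCP states. */
enum tcp_ct_state {
    TCP_CT_SYN_SENT,
    TCP_CT_SYN_RECV,
    TCP_CT_ESTABLISHED,
    TCP_CT_FIN_WAIT,
    TCP_CT_CLOSE
};

/* tcp_state is only evaluated when l4_proto is TCP. */
static bool may_offload(uint8_t l4_proto, bool ct_exists,
                        enum tcp_ct_state tcp_state)
{
    if (!ct_exists)
        return false;
    if (l4_proto == PROTO_TCP)
        return tcp_state == TCP_CT_ESTABLISHED;
    return l4_proto == PROTO_UDP;
}
```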

There is another benefit of using nf_conntrack: such a connection tracking entry directly has NAT information available, so you don’t have to retrieve it by parsing OpenWrt firewall rules. Furthermore, this means that the forwarding part of the eBPF offloader can run independently of the Linux operating system used. It only depends on an OS that runs the Netfilter framework, including nf_conntrack.

Packet Forwarding

The following simplified activity diagram illustrates how incoming packets are forwarded by the current implementation of the offloader:

Here is a step-by-step explanation of what is happening:

The eBPF program receives a data packet from the NIC for a not-yet-seen flow. It creates the packet tuple key and uses it to check whether an entry for that flow already exists inside the eBPF hash map. Since it hasn’t seen the flow yet, there is no entry, so the eBPF program creates a new empty entry inside that map to signal the user-space program. Meanwhile, it passes all the following packets of that flow to the network stack until the user-space program responds.
When the user-space program wakes up, it retrieves the new flow entry from the map and checks through libnetfilter_conntrack whether a conntrack entry for the flow exists. If not, or the TCP state isn’t established, it doesn’t respond to the eBPF program (yet), so packets continue passing to the network stack. If there is an (established) conntrack entry, it also looks up inside that entry if NAT needs to be applied and, if so, calculates the checksum difference. Finally, it updates the flow entry accordingly to signal the eBPF program that it can take over now.
When the eBPF program receives a new data packet for that flow again, it reads from the flow entry that it can forward the packet now, so it does possible NAT and checksum adjustments and redirects the packet to the target network interface. When there is a TCP FIN or RST or a conntrack timeout occurs, the eBPF program doesn’t forward the packet anymore and passes it to the network stack again.
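
The split between user space calculating a checksum difference and the eBPF program applying it can be illustrated with one's-complement arithmetic in the style of RFC 1624. The helper names below are hypothetical, and the sketch works on host-order values for readability; it is not taken from the repository.

```c
#include <stdint.h>

/* Fold a 32-bit one's-complement sum into 16 bits. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Accumulate the difference for replacing one 16-bit word (e.g. a port). */
static uint16_t csum_diff16(uint16_t diff, uint16_t old_word,
                            uint16_t new_word)
{
    return csum_fold((uint32_t)diff + (uint16_t)~old_word + new_word);
}

/* Accumulate the difference for replacing a 32-bit value (IPv4 address). */
static uint16_t csum_diff32(uint16_t diff, uint32_t old_val,
                            uint32_t new_val)
{
    diff = csum_diff16(diff, (uint16_t)(old_val >> 16),
                       (uint16_t)(new_val >> 16));
    return csum_diff16(diff, (uint16_t)(old_val & 0xFFFF),
                       (uint16_t)(new_val & 0xFFFF));
}

/* Apply a precomputed difference to an existing checksum. */
static uint16_t csum_apply(uint16_t check, uint16_t diff)
{
    return (uint16_t)~csum_fold((uint32_t)(uint16_t)~check + diff);
}
```

With this split, the difference is computed once per flow, and each forwarded packet only needs one addition and fold per checksum. Note that a UDP checksum of zero over IPv4 means "no checksum" and must be left untouched.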

Where to attach? Where to send?

There are two things I haven’t mentioned yet about the implementation:

On which network interfaces should I attach my eBPF program?
What is the next hop for the packet, i.e., to which output interface and neighbor to send it?

I implemented the latter within the user-space program using the Linux routing socket RTNETLINK. When I started to implement this, I performed the following three steps to determine the next hop:

Send an RTM_GETROUTE message containing the packet tuple to determine the route type and output interface. I only offload unicast flows.
Send an RTM_GETLINK message containing the output interface to determine the source MAC address.
Send an RTM_GETNEIGH message containing the output interface and the destination IP to determine the destination MAC address.
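
For readers unfamiliar with RTNETLINK, the first of these requests could be assembled roughly as follows. This is a sketch under assumptions: it uses the raw netlink macros from the Linux UAPI headers, only sets the IPv4 destination address instead of the full packet tuple, and leaves out the socket handling and reply parsing. The struct route_req and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request buffer: netlink header, route header, attributes. */
struct route_req {
    struct nlmsghdr nlh;
    struct rtmsg    rtm;
    char            attrs[64];
};

/* Build an RTM_GETROUTE request for one IPv4 destination address. */
static size_t build_getroute_v4(struct route_req *req,
                                const uint8_t dst[4], uint32_t seq)
{
    struct rtattr *rta;

    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nlh.nlmsg_type = RTM_GETROUTE;
    req->nlh.nlmsg_flags = NLM_F_REQUEST;
    req->nlh.nlmsg_seq = seq;
    req->rtm.rtm_family = AF_INET;
    req->rtm.rtm_dst_len = 32;   /* look up a single host address */

    /* Append the RTA_DST attribute right after the route header. */
    rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nlh.nlmsg_len));
    rta->rta_type = RTA_DST;
    rta->rta_len = RTA_LENGTH(4);
    memcpy(RTA_DATA(rta), dst, 4);
    req->nlh.nlmsg_len = NLMSG_ALIGN(req->nlh.nlmsg_len)
                         + RTA_ALIGN(rta->rta_len);

    return req->nlh.nlmsg_len;
}
```

The resulting buffer would be sent over an AF_NETLINK socket opened with the NETLINK_ROUTE protocol. The RTM_GETLINK and RTM_GETNEIGH requests follow the same pattern with different header structs and attributes.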

Finally, the user-space program stores the output interface, source, and destination MAC address inside the flow entry. The eBPF program then rewrites the MAC header and redirects the packet to the output interface. But I wasn’t satisfied with that approach yet; I will explain the reason based on the following picture:

Figure 2: Example network interfaces on an OpenWrt device

The picture shows the network interfaces of my AVM FritzBox 7530 running OpenWrt. As you can see, all four LAN ports of my private network and my WiFi are bridged (which is typical, I think, and generally default for an OpenWrt installation). My dsl0 WAN port has a Point-to-Point Protocol over Ethernet (PPPoE) interface on top to establish a VDSL connection to my ISP, which additionally requires tagged VLAN packets (dsl0.7).

When no offloading is happening and, for example, my notebook connected to phy1-ap0 sends traffic to the internet, the packets travel through all shown interfaces except the LAN ports (Figure 3). Regarding the eBPF offloader, the simple way would be to attach the eBPF program to the br-lan and pppoe-wan interfaces because I wouldn’t have to parse any additional L2 headers. The same goes for making routing decisions since you wouldn’t have to query more interface information or push L2 headers. But the eBPF fast path would be minimal in that case (Figure 4).

I thought this was not an acceptable solution for this project because the idea is to intercept an incoming packet as soon as possible. At the same time, the offloader should also send out packets at the lowest possible network interface. Therefore, the user-space program currently attaches the eBPF program to the lowest possible network interface and, while making the routing decision, also tries to resolve to the lowest possible network interface (Figure 5).

Figure 3, 4, and 5: Packet traversal for different offloading variants

The following flowchart shows how the user-space program currently does the next-hop determination:

Figure 6: Next Hop determination via Netlink

The eBPF program can currently parse the following headers of the respective layers. If it receives any packet containing an L2, L3, or L4 header not mentioned here, it passes the packet to the network stack.

L2: VLAN (currently only one) and PPPoE
L3: IPv4 and IPv6
L4: TCP and UDP
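
A bounds-checked L2 parser for this header set could look like the following sketch. It is written as ordinary C over a byte buffer so it can be tried outside the kernel; the function and enum names are hypothetical. The EtherType and PPP protocol numbers are the standard ones (0x8100 for 802.1Q, 0x8864 for PPPoE session frames, 0x0021 and 0x0057 for IPv4 and IPv6 inside PPP).

```c
#include <stddef.h>
#include <stdint.h>

enum l3_proto { L3_UNKNOWN = 0, L3_IPV4, L3_IPV6 };

/* Read a big-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] << 8 | p[1]);
}

/*
 * Walk Ethernet, at most one VLAN tag, and an optional PPPoE header.
 * On success, *l3_off is the offset of the IPv4/IPv6 header.
 * L3_UNKNOWN means: pass the packet to the network stack.
 */
static enum l3_proto parse_l2(const uint8_t *pkt, size_t len, size_t *l3_off)
{
    size_t off = 12;   /* skip destination and source MAC */
    uint16_t proto;

    if (len < off + 2)
        return L3_UNKNOWN;
    proto = rd16(pkt + off);
    off += 2;

    if (proto == 0x8100) {           /* VLAN: TCI (2) + inner EtherType (2) */
        if (len < off + 4)
            return L3_UNKNOWN;
        proto = rd16(pkt + off + 2);
        off += 4;
    }

    if (proto == 0x8864) {           /* PPPoE header (6) + PPP protocol (2) */
        uint16_t ppp;

        if (len < off + 8)
            return L3_UNKNOWN;
        ppp = rd16(pkt + off + 6);
        off += 8;
        proto = ppp == 0x0021 ? 0x0800 : ppp == 0x0057 ? 0x86DD : 0;
    }

    *l3_off = off;
    if (proto == 0x0800)
        return L3_IPV4;
    if (proto == 0x86DD)
        return L3_IPV6;
    return L3_UNKNOWN;
}
```

In a real eBPF program, every one of these length checks is mandatory, because the verifier rejects any packet access that is not preceded by a bounds check.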

DSA: Going one step further down

As you might have seen in the flowchart of Figure 6, the user-space program also parses DSA interfaces. DSA stands for Distributed Switch Architecture. Routers typically contain an Ethernet Switch for their LAN ports, which has a management port connected to an Ethernet controller capable of receiving Ethernet frames from the switch. While Linux creates a network interface for that Ethernet controller, you can observe that the DSA driver also creates network interfaces (DSA ports) for the front panel ports.

Ideally, when the switch and management interface exchange packets, they tag the packets with a switch tag (DSA tag), which contains the front panel port ID. When the management interface receives a packet from the switch, it can determine from the tag from which front panel port the packet comes and pass it to the appropriate DSA port/interface. When the switch receives a packet from the management interface, it can figure out from the tag to which front panel port it must send the packet.

Let’s consider the following picture, which shows how OpenWrt on default settings uses DSA on a Banana Pi BPI-R64. The DSA switch (also called conduit) is eth0, and lan1, lan2, lan3, lan4, and wan are the DSA ports (also called users).

Figure 7: Example network interfaces on an OpenWrt device using a DSA driver

Without offloading, a network packet sent from the private LAN to WAN would go through eth0, lan*, br-lan, wan, and eth0 again (Figure 8). When using the eBPF offloader without attaching to the DSA switch eth0, it is possible to avoid the bridge br-lan (Figure 9). But if you now attach the eBPF program to the DSA switch eth0, it can read and write the DSA tags of packets itself, and the user-space program can then figure out which front panel port received a packet and to which one to send it. So when the eBPF program receives a packet on eth0, it can send it out of eth0 again without any intermediate interface (Figure 10).

Figure 8, 9, and 10: Packet traversal through a DSA switch for different offloading variants

Although this has the disadvantage that an eBPF program isn’t “generic” anymore because you need to compile it for the DSA driver used by the target device, it has the potential to further increase the forwarding performance.

Work to do in the upcoming weeks

There are a few problems I have encountered or thought of:

I am unsure if nf_conntrack is sufficient for connection tracking because it isn’t possible to query conntrack entries based on the interface that received the packet. I think this can lead to collisions when different interfaces receive identical L3 and L4 flows.
Unfortunately, it is currently impossible to update the nf_conntrack packet and byte counters. This might be patchable in the Linux kernel, but my current workaround is to turn off the counters because I think it is better to have no counters than wrong counters.
I have shown that I retrieve PPPoE information in user space. The problem is that you cannot do that directly via Netlink since the interface attributes don’t provide PPPoE information. This is why I currently retrieve the interface’s link-local peer IPv6 address, convert it to a MAC address, and try to find that MAC inside the file “/proc/net/pppoe”, which is populated by the ppp daemon. I am anything but satisfied with that, but I haven’t found a better way yet.
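
The address conversion mentioned in the last point can be sketched as follows, assuming the peer derived its link-local interface identifier from its MAC address via modified EUI-64 (ff:fe inserted in the middle, universal/local bit flipped). The function name is hypothetical, and the conversion fails for peers that use a random or otherwise unrelated interface identifier.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Recover a MAC address from an EUI-64 based link-local IPv6 address.
 * Returns false if the address is not in fe80::/10 or does not carry
 * the ff:fe marker in the middle of the interface identifier.
 */
static bool ll6_to_mac(const uint8_t addr[16], uint8_t mac[6])
{
    if (addr[0] != 0xfe || (addr[1] & 0xc0) != 0x80)
        return false;
    if (addr[11] != 0xff || addr[12] != 0xfe)
        return false;

    mac[0] = addr[8] ^ 0x02;   /* flip the universal/local bit back */
    mac[1] = addr[9];
    mac[2] = addr[10];
    mac[3] = addr[13];
    mac[4] = addr[14];
    mac[5] = addr[15];
    return true;
}
```

For example, fe80::211:22ff:fe33:4455 maps to the MAC address 00:11:22:33:44:55.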

Besides trying to solve those problems, the next milestone is to implement an eBPF packet dropper in the offloader because, for now, it only forwards packets on its own. After that, I will finally evaluate the performance of the implementation.

If you have questions, as always, feel free to ask them, and thank you for reading my Midterm update!

GSoC 2024: eBPF performance optimizations for a new OpenWrt Firewall

Posted on 31. May 2024 (updated 3. June 2024) by Til Kaiser

Introduction

Hello everybody! My name is Til, and I am currently a Master’s engineering student at the University of Applied Sciences Nordhausen. For the past year, I focused on Linux network programming, especially eBPF, which brings me to my GSoC 2024 topic.

Over the years, the network stack has evolved to support new protocols and features. The side effect is that this introduces overhead and might hinder network throughput, especially on CPU-limited devices. One solution to overcome this is eBPF.

eBPF and its hooks

eBPF (extended Berkeley Packet Filter) is a technology within the Linux kernel that allows you to dynamically reprogram the kernel without recompiling or restarting it. Developers can write programs in C code, compile them to BPF objects, and attach them to several so-called hooks inside the Linux kernel. You can use two hooks to redirect incoming data packets to other NICs inside the Linux network stack: XDP and TC.

The XDP (eXpress Data Path) hook is located even before the network stack itself; its programs are attached to the driver of the NIC. That should make it the fastest eBPF hook throughput-wise for redirecting packets.
But there is a catch: NIC drivers must support this so-called “XDP native” mode. Otherwise, you need to attach the programs through the so-called “XDP generic” mode inside the network stack, which is significantly slower than native, as you will see later in this post.

The TC (Traffic Control) eBPF hook is already located inside the Linux network stack but is driver-independent, so only the Linux Kernel needs to support it, i.e., it must be compiled with the respective symbol enabled.

My GSoC 2024 project

My GSoC 2024 topic/goal is to introduce a new firewall software offloading variant to OpenWrt using eBPF. The idea is to intercept an incoming data packet from the NIC as early as possible inside or even before the network stack through XDP or TC. Then, for the actual firewall part, apply possible NAT (Network Address Translation) and drop or redirect the intercepted packet to another network interface. That saves some CPU cycles and hopefully increases the network throughput on CPU-limited devices.

There are mainly three parts that need to be designed and implemented:

The eBPF program which intercepts, modifies, and redirects or drops network packets
A user-space program that attaches the eBPF program to the NIC(s) and checks the actual firewall rules
A BPF map used for the communication between the BPF and the user-space program

The caveat of XDP generic

As mentioned, you can use the XDP generic mode if your NIC doesn’t support XDP native. Theoretically, the XDP generic mode should be faster than TC because the XDP generic hook still comes before the TC hook. But there is a problem: For XDP generic, the Linux kernel has already allocated the SKB for the network packet, and XDP programs must have a packet headroom of 256 Bytes. That means if the pre-allocated SKB doesn’t have a sufficient headroom of 256 Bytes, it gets expanded, which involves copy operations that effectively negate the actual performance gain.

I created a patch that, rather than making this packet headroom value constant, creates a Linux kernel variable for that headroom exposed through the sysfs interface to Linux user space. It still has the default value of 256 Bytes, but then the user can explicitly lower the XDP generic packet headroom according to their requirements.
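
The effect of a configurable headroom can be modeled with a few lines of C. This is a deliberately simplified model with a hypothetical function name: it only looks at the headroom and ignores the other conditions under which the kernel reallocates an SKB, and the 64 Bytes of driver-provided headroom in the usage note are an assumed example value.

```c
/*
 * Simplified model: how many Bytes of headroom are missing before an
 * XDP generic program can run? Zero means the SKB is used as-is,
 * anything else means it has to be expanded, which involves copying.
 */
static unsigned int headroom_shortfall(unsigned int skb_headroom,
                                       unsigned int required_headroom)
{
    if (skb_headroom >= required_headroom)
        return 0;
    return required_headroom - skb_headroom;
}
```

If a driver hands over SKBs with 64 Bytes of headroom, the default requirement of 256 Bytes leaves a shortfall of 192 Bytes on every packet, while a requirement of 32 Bytes leaves none.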

The following table presents the head of a report generated by the Linux tool perf after running an iperf3 TCP test for 30 seconds through a MIPS router. I tested using an XDP generic packet headroom of 256 Bytes first and then 32 Bytes. It shows how many CPU cycles this copy operation wastes.

XDP generic packet headroom of 256 Bytes:

26.91%  ksoftirqd/0  __raw_copy_to_user
 3.73%  ksoftirqd/0  __netif_receive_skb_core.constprop.0
 3.04%  ksoftirqd/0  bpf_prog_3b0d72111862cc6a_ipv4_forward_func
 2.52%  ksoftirqd/0  __kmem_cache_alloc_node
 2.32%  ksoftirqd/0  do_xdp_generic
 2.06%  ksoftirqd/0  __qdisc_run
 1.99%  ksoftirqd/0  bpf_prog_run_generic_xdp

XDP generic packet headroom of 32 Bytes:

 5.70%  ksoftirqd/0  bpf_prog_3b0d72111862cc6a_ipv4_forward_func
 5.23%  ksoftirqd/0  __netif_receive_skb_core.constprop.0
 3.68%  ksoftirqd/0  do_xdp_generic
 3.02%  ksoftirqd/0  __qdisc_run
 3.00%  ksoftirqd/0  bpf_prog_run_generic_xdp

I will tidy up that patch and try to submit it to the upstream Linux Kernel. Other people have tried similar approaches to fix the XDP generic performance by reducing the packet headroom constant, but their patches were never accepted. So I hope my different approach has more success.

What to expect throughput-wise

To get an impression of the performance potential of eBPF, I have created a little BPF program that forwards all incoming IPv4 network packets. To test the program, I used an AVM FRITZ!Box 7360 v2 running OpenWrt with Linux Kernel version 6.6.30, whose CPU limits the throughput performance of its Gigabit ports. Then I grabbed a PC with two network ports, connected each of them to a port of the FRITZ!Box, and created two network namespaces on the PC to force the network traffic through the FRITZ!Box. I used iperf3 to generate TCP traffic for 60 seconds for each tested setting; you can find the results inside the following plot:

The settings/parts are the following:

off: OpenWrt’s Firewall stopped (/etc/init.d/firewall stop)
on: OpenWrt’s Firewall started but without any offloading enabled
sw_flow: Netfilter’s software offloading enabled (flow_offloading)
hw_flow: Netfilter’s hardware offloading enabled (flow_offloading_hw)
xdp256: The eBPF IPv4 forwarding program attached to the XDP generic hook with the default packet headroom of 256 Bytes
xdp32: The eBPF IPv4 forwarding program attached to the XDP generic hook with a custom packet headroom set to 32 Bytes allowed by my patch
tc: The eBPF IPv4 forwarding program attached to the TC hook

Unfortunately, I couldn’t test XDP native yet because I don’t have any hardware around whose driver supports XDP.

As you can see and as I already mentioned, there is no performance gain from the XDP generic mode with the default 256 Bytes packet headroom due to the SKB re-allocation. In contrast, with the patched XDP generic and with TC, the network throughput roughly doubled compared to OpenWrt’s Firewall without any offloading.

Compared to Netfilter’s offloading implementations, there is also a performance gain, but admittedly only a small one. When we look at the Linux kernel source code here, this becomes plausible because the TC hook is located right before the Netfilter ingress hook. The XDP generic hook comes earlier than those two, even before taps (e.g., where tcpdump could listen).

So what’s next?

These are the upcoming milestones for the project:

Creation of a user-space program for loading and communicating with the eBPF program
Extending the user-space program with a firewall NAT/rule parser
Extending the eBPF program to apply the parsed firewall configuration on incoming packets
Evaluating the forwarding and dropping performance of the implementation

You can find the current eBPF IPv4 packet forwarder, the XDP generic headroom patch, and the measurement and plotting script in my GitHub repository here: https://github.com/tk154/GSoC2024_eBPF-Firewall.

I will also upload the source code of the firewall there, which I am excited to implement in the upcoming weeks. I think the first small forwarding performance test already shows the potential of eBPF as a new offloading variant, and I haven’t even tested XDP native yet.

If you have any questions, don’t hesitate to ask them, and thank you for reading my first post.

Freifunkblog

Author: Til Kaiser