GSoC 2024: Visualise community data, update 1

I’ve set up a lot of services since planning out the data pipeline in June; the new infrastructure diagram looks like this:

So, I wrote a little Rust utility called json_adder which reads the JSON files and loads the data into a MongoDB collection. We also have a GraphQL server, running through Node.js, which can handle some (very) simple queries and return data. Many of these services are individually easy to set up; the tricky part is making sure everything works together correctly. This is what your favourite technical consultancies call “fullstack development”.

Data loading

The first change from the initial plan was to use MongoDB instead of MySQL as a database. MongoDB has a built-in feature for handling time-series data, which is what we’re working with. It integrates well enough with GraphQL, and since one of my mentors (Andi Bräu) works with MongoDB on a daily basis, there’s a lot of experience to draw upon.

Here is the Rust code for the utility which adds the data to the database, and here’s what it does (a rough sketch follows the list):

  • Read over each JSON file in the directory in a loop.
  • Extract the last-modified time from each community in the file and do some conversion to BSON.
  • Read the data into a struct, and add each object to a vector.
  • Insert the vector into the MongoDB collection.
  • GOTO next file.
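
That loop might look something like the sketch below. This is not the actual json_adder code: the database, collection, and field names (“freifunk”, “communities”, “mtime”) are assumptions, it reads into a generic serde_json::Value where the real utility uses structs, and it assumes the 2.x series of the mongodb crate.

    // Minimal sketch of the loading loop; names and structure are illustrative only.
    use std::fs;
    use mongodb::{
        bson::{doc, to_bson, DateTime, Document},
        Client,
    };

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = Client::with_uri_str("mongodb://localhost:27017").await?;
        let collection = client
            .database("freifunk")
            .collection::<Document>("communities");

        // Loop over each JSON snapshot file in the data directory.
        for entry in fs::read_dir("./data")? {
            let text = fs::read_to_string(entry?.path())?;
            let snapshot: serde_json::Value = serde_json::from_str(&text)?;

            // Build one BSON document per community, converting the
            // last-modified timestamp into a BSON DateTime on the way.
            let mut docs = Vec::new();
            if let Some(communities) = snapshot.as_object() {
                for (name, community) in communities {
                    let mtime = community["mtime"].as_str().unwrap_or_default();
                    docs.push(doc! {
                        "name": name.as_str(),
                        "mtime": DateTime::parse_rfc3339_str(mtime).ok(),
                        "data": to_bson(community)?,
                    });
                }
            }

            // Insert the whole vector for this file in one call, then move on.
            if !docs.is_empty() {
                collection.insert_many(docs, None).await?;
            }
        }
        Ok(())
    }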

When setting up the connection to MongoDB, the setup_db module also checks whether the collection exists and creates it if it doesn’t. This extra bit of logic will matter once the utility runs as part of an automated process, which might set up the database and fill it with data in one go.
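
As a rough illustration of that check (again not the real setup_db code, and with the time and metadata field names assumed), creating the time-series collection on the fly could look like this:

    // Sketch: create the collection as a time-series collection if it is missing.
    // Assumes the mongodb 2.x crate; "mtime" and "name" are placeholder field names.
    use mongodb::{bson::doc, Database};

    async fn ensure_collection(db: &Database, name: &str) -> mongodb::error::Result<()> {
        let existing = db.list_collection_names(None).await?;
        if !existing.iter().any(|c| c == name) {
            // Issue the raw createCollection command with time-series options.
            db.run_command(
                doc! {
                    "create": name,
                    "timeseries": { "timeField": "mtime", "metaField": "name" },
                },
                None,
            )
            .await?;
        }
        Ok(())
    }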

I don’t have a binary to link here, and will write build instructions in future.1 On each commit which changes Rust code, GitHub Actions runs cargo check on the repository. When it comes to deploying this in reality, we can change that workflow to provide release builds.

At the moment it is a one-way process: you point json_adder at the data directory and it adds everything to the database. In future I’ll need a way to keep the database in sync, only adding files which have been modified and running the utility on a schedule. For now, it works fine.

GraphQL

Here is the GraphQL server code. Fetch the dependencies:

npm install express express-graphql graphql mongodb

(The order of the packages doesn’t matter, but express-graphql does expect graphql to be installed alongside it as a peer dependency.)

Then run npm start and open http://localhost:4000/api in your browser; you should see the GraphQL IDE.

At the moment, the GraphQL server can handle a query with an argument for metadata. I’ll build out the schema and the resolver to handle more arguments and more complicated queries. A lot of the documentation for GraphQL assumes a fairly simple schema, built for handling calls to a web app, which is slightly different from our case. The JSON Schema for the Freifunk API is not only quite long, it is also versioned, and newer versions are not necessarily backwards-compatible. I am going to sidestep this complexity by writing a GraphQL schema which only includes fields available in all versions. My first working example to try is a query which counts the number of nodes per community.
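
To give a flavour of the end goal, that query in the GraphQL IDE might look something like the shape below; the field names are hypothetical, since the schema isn’t settled yet.

    # Hypothetical query shape, not the final schema.
    {
      communities {
        name
        nodeCount
      }
    }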

You’ll notice that the last step in the pipeline is yet to be completed: I don’t have a visualisation to show. But now that we can reliably pull data all the way through the pipeline, we’re almost ready to make some graphs.

  1. In the meantime, you only need to run cargo build in the json_adder directory. ↩︎

GSoC 2024: Visualise community data

Hello everyone, I’m Pierre. I was first introduced to Freifunk when searching for free WiFi in a cafe in Berlin, and I’m inspired by the radical visions of people attaching antennas to apartment buildings. Now I’m here to work on the community data API.

Since 2014, Freifunk has been collecting a snapshot of information about all communities, every hour, in a big directory available here. At the moment the data is being used for a calendar, a news feed,1 and a map of communities, among other services. Maybe you can think of lots of other things which could be done with that data, but first you would have to parse some 80,000+ compressed JSON files.

The goal of this Summer of Code project is to overcome that hurdle up front: make this community data more easily accessible, and uncover some insights about the network along the way. This is a typical situation for large archives, where one of the key challenges is making the data readable and useful.

I intend to build a pipeline from the JSON directory through to some visualisations on a web page. One step behind that, I want to expose a GraphQL interface so that anyone else can query the data for their own purposes. The main part of this project revolves around setting up infrastructure to handle each section of this pipeline.

Although this system is conceptually very simple, there are lots of moving parts involved. The JSON files are loaded into a SQL database; the database talks to GraphQL, which itself talks to some browser-based visualisation library.

I’ve tried to make choices which prioritise boring technologies, reduce the fragility of this process, and hopefully make it easy to maintain in future. I will build it from the top down, starting with the database and finishing with the visualisations themselves.

I’ll be working in this repository. There’s not much to show at the moment, but it should rapidly take shape in the coming weeks. My next post will be about the process of loading these files into a database.

  1. On which this blog appears, hello! 👋 ↩︎