Hello everyone! As we reach the midpoint of Google Summer of Code 2025, I’m excited to share the progress on my project, Deep Q Network-based Rate Adaptation for IEEE 802.11ac Networks. The goal of this project is to build an intelligent rate control mechanism that can dynamically adapt transmission rates in Wi-Fi networks using Deep Reinforcement Learning. So far, I’ve implemented a basic DQN-based agent with a simplified environment, introduced group-based rate handling inspired by Minstrel_HT, incorporated success probability tracking, and collected real-world data using a shielding box. This post will walk through these key developments, the challenges encountered, and the next steps toward building a more robust and generalizable solution.
Background
Rate adaptation in Wi-Fi is a critical task that selects the most suitable Modulation and Coding Scheme (MCS) index for transmission. The MCS index encapsulates key physical-layer parameters such as modulation type, coding rate, bandwidth (BW), guard interval (GI), and the number of spatial streams (NSS). Traditional rate adaptation algorithms often rely on heuristics or statistical measurements to infer the best MCS under varying channel conditions.
In this project, we aim to explore a Deep Q-Network (DQN) based approach for intelligent rate adaptation. Rather than relying on predefined rules, the agent learns to select the optimal MCS indices by interacting with the environment and receiving feedback on transmission success and failure. Unlike many prior works focused solely on simulations, our primary aim is to integrate this learning framework with real-world experimentation using ORCA (Open-source Resource Control API). This enables us to evaluate learning-based adaptation in actual 802.11ac Wi-Fi environments. For more detailed background and motivation, please refer to my initial blog post.
1. What We’ve Accomplished So Far
1. Initial Setup - DQN with Dummy States
In the early phase of the project, a basic DQN agent was set up with a placeholder environment to validate the training pipeline. The agent was designed to choose among discrete actions (initially up/down rate shifts), but the state representation was simplified to dummy values.
Each state was initially defined as a list of the Received Signal Strength Indicator (RSSI), Signal-to-Noise Ratio (SNR), transmission successes, retransmission attempts, and the current MCS (among 8 fixed MCS values). This served as a proof-of-concept to validate the learning loop of a DQN interacting with a Gym-style environment. The environment follows the OpenAI Gym interface, where an agent observes a state, takes an action, and receives a reward and next state in return. This setup modularizes the environment and makes it easier to test and evolve reinforcement learning agents systematically.
2. Design of the DQN-Based Rate Adaptation Algorithm
In this section, we outline the overall system architecture of our Deep Q-Learning-based Rate adaptation framework. The goal is to give a clear block-level understanding of how the components interact, followed by an explanation of key choices around state, action, reward, and the agent.
Below is the high-level block diagram of the DQN-based rate adaptation system.
Fig1. Current Implementation
PyTorch was chosen as the deep learning framework because it offers easier debugging and more flexibility than static-graph libraries such as TensorFlow 1.x. The API is clean and beginner-friendly, and since PyTorch is widely used in reinforcement learning research, community support is strong, which makes it easier to find references and extend the work if needed.
Components of the DQN
1. Q-Network
A simple feed-forward neural network that takes the current state as input and outputs Q-values for each possible action (rate group). This helps to estimate how good each action is from the current state.
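A minimal sketch of such a network in PyTorch, assuming the 4-dimensional state described later and 10 candidate rates; the hidden-layer sizes are illustrative, not the exact ones used in the project:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feed-forward network mapping a state vector to one Q-value per rate."""
    def __init__(self, state_dim: int = 4, n_actions: int = 10, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per candidate rate
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```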
2. Target Network
To stabilize training, we maintain a separate target Q-network that is periodically synced with the main Q-network.
3. Custom Gym Environment
A tailored environment was built following the OpenAI Gym interface to simulate the Wi-Fi rate adaptation scenario.
- The state includes variables like RSSI, current rate index, number of attempts, and success history.
- The action space corresponds to selecting among a defined set of MCS rates.
- The step function computes the resulting throughput, success, and termination condition based on real trace data.
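A trimmed skeleton of how such an environment could look under the OpenAI Gym interface; the reset/step bodies here are placeholders, while the real step() derives success, throughput, and termination from the trace data:

```python
import numpy as np
import gym
from gym import spaces

class RateAdaptationEnv(gym.Env):
    """Simplified Wi-Fi rate adaptation environment (illustrative skeleton)."""

    def __init__(self, n_rates: int = 10):
        super().__init__()
        self.n_rates = n_rates
        # Action: pick one of the candidate rates directly.
        self.action_space = spaces.Discrete(n_rates)
        # State: [current_rate, rssi, tx_success_rate, tx_retransmissions]
        low = np.array([0, -100.0, 0.0, 0], dtype=np.float32)
        high = np.array([n_rates - 1, 0.0, 1.0, 10], dtype=np.float32)
        self.observation_space = spaces.Box(low, high, dtype=np.float32)

    def reset(self):
        self.state = np.array([0, -60.0, 1.0, 0], dtype=np.float32)
        return self.state

    def step(self, action):
        # Placeholder dynamics: the real step() computes success, throughput,
        # and termination from the collected trace data.
        reward = 0.0
        done = False
        return self.state, reward, done, {}
```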
4. Agent
This component is responsible for selecting actions using an epsilon-greedy strategy, balancing exploration and exploitation during learning.
5. Experience Replay Buffer
Transitions (state, action, reward, next_state) are stored in a fixed-size buffer. During training, a batch of past experiences is randomly sampled to break temporal correlations of sequential experiences and improve sample efficiency.
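A minimal replay buffer along these lines (the capacity is illustrative):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Fixed-size buffer; uniform random sampling breaks temporal correlations."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```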
6. Epsilon-Greedy Exploration Strategy
To balance exploration and exploitation:
Initially, the exploration rate is set to 100%, meaning the agent selects random actions (rates) to discover which ones yield better rewards.
The exploration rate is then decayed linearly to promote exploitation (selecting the best-known action) in the later stages.
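A compact sketch of this strategy; only the start at 100% exploration and the linear decay come from the setup described above, while the final epsilon value and decay horizon are illustrative:

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore: random rate
    with torch.no_grad():
        return int(q_net(state).argmax().item())    # exploit: best-known rate

# Linear decay from full exploration to mostly exploitation.
EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.05, 10_000  # end value and horizon are assumptions

def epsilon_at(step: int) -> float:
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)
```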
7. Training Loop and Episode Tracking
The agent was trained over a series of interactions with the environment using a well-structured training loop, which mimics how reinforcement learning systems learn from trial and error.
In our setup:
An episode is a complete cycle of interaction between the agent and the environment. It starts with an environment reset and continues until the agent reaches a terminal state (e.g., hitting the retransmission threshold or a low success rate), or the maximum allowed steps for the episode are completed.
A run refers to one complete training session, which in our case consists of 500 episodes. We can repeat runs with different random seeds or hyperparameters for comparison.
Each episode proceeds step-by-step, where the agent:
- Selects an action (rate) using an epsilon-greedy policy.
- Takes the action in the environment and receives the next state, reward, and a termination flag.
- Stores the transition (state, action, reward, next_state, done) in the experience replay buffer.
- Samples a batch from the buffer to update the Q-network using the Bellman equation.
- Synchronizes the target network with the policy network at fixed intervals.
This structure enables stable learning over time, leveraging both exploration and exploitation.
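Putting the pieces together, below is a condensed sketch of such a training loop, reusing the QNetwork, ReplayBuffer/Transition, select_action, and epsilon_at sketches above; the hyperparameters are illustrative:

```python
import numpy as np
import torch
import torch.nn.functional as F

GAMMA, BATCH_SIZE, SYNC_EVERY = 0.99, 64, 100  # illustrative hyperparameters

def train(env, policy_net, target_net, buffer, optimizer, n_episodes=500):
    step = 0
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # 1. Select an action with the epsilon-greedy policy.
            epsilon = epsilon_at(step)
            action = select_action(policy_net, torch.as_tensor(state),
                                   epsilon, env.action_space.n)
            # 2. Take the action and 3. store the transition.
            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            step += 1

            # 4. Sample a batch and apply the Bellman update.
            if len(buffer) >= BATCH_SIZE:
                batch = Transition(*zip(*buffer.sample(BATCH_SIZE)))
                states = torch.as_tensor(np.array(batch.state), dtype=torch.float32)
                actions = torch.as_tensor(batch.action).unsqueeze(1)
                rewards = torch.as_tensor(batch.reward, dtype=torch.float32)
                next_states = torch.as_tensor(np.array(batch.next_state), dtype=torch.float32)
                dones = torch.as_tensor(batch.done, dtype=torch.float32)

                q_sa = policy_net(states).gather(1, actions).squeeze(1)
                with torch.no_grad():
                    q_next = target_net(next_states).max(1).values
                target = rewards + GAMMA * q_next * (1.0 - dones)

                loss = F.smooth_l1_loss(q_sa, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # 5. Periodically sync the target network with the policy network.
            if step % SYNC_EVERY == 0:
                target_net.load_state_dict(policy_net.state_dict())
```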

Fig2. Sample Training Behavior
2. Core DQN Logic: State Design, Reward & Group-Based Adaptation
As we transitioned toward real-world deployment of our DQN-based rate adaptation agent using ORCA, we faced a key practical constraint: Signal-to-Noise Ratio (SNR) is not directly available from the ORCA interface. This led to a redesign of the agent’s input state and reward function, grounding our system in metrics that are actually observable in real Wi-Fi environments.
a. Removal of SNR
In early versions of the environment, SNR was included as part of the agent’s observation space. However, since ORCA (the testbed we use for data collection and deployment) does not expose SNR, it cannot be used in real training or evaluation. As a result:
- We removed SNR from the environment’s observation space.
- Instead, we rely on RSSI (Received Signal Strength Indicator), which can be collected reliably from transmission feedback logs in ORCA.
In the simulation, RSSI is initialized at –60 dBm and updated with small noise in each step. In real training, RSSI will be sourced directly from logs collected during hardware tests.
b. State Definition
Each state observed by the DQN agent is a 4-dimensional vector:
[current_rate, rssi, tx_success_rate, tx_retransmissions]
- current_rate: The transmission rate index selected by the agent in the previous step.
- rssi: Signal strength (used instead of SNR), simulated in the dummy setup and real in actual experiments.
- tx_success_rate: Maintains an exponential moving average of recent transmission outcomes.
- tx_retransmissions: Counts retransmissions, capped at 10.
This formulation provides a minimal yet sufficient summary of recent channel behavior and agent action history.
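For illustration, a tiny helper showing how these four quantities could be packed into the observation vector; the function name is hypothetical, and the cap of 10 retransmissions follows the description above:

```python
import numpy as np

def build_state(current_rate, rssi, tx_success_rate, tx_retransmissions):
    """Pack the four observed quantities into the agent's observation vector."""
    return np.array(
        [current_rate, rssi, tx_success_rate, min(tx_retransmissions, 10)],
        dtype=np.float32,
    )
```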
c. Reward Calculation
The reward is designed to reflect actual performance, balancing success and throughput:
The success probability for a rate is computed as success_prob = successes / attempts.
If the success probability for a rate is too low (<0.4), the agent receives a penalty.
Otherwise, the reward scales with the theoretical data rate of the selected rate and its current success probability.
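A minimal sketch of this reward logic; the 0.4 threshold and the scaling by data rate times success probability follow the description above, while the exact penalty value is an assumed placeholder:

```python
LOW_SUCCESS_THRESHOLD = 0.4   # below this, the chosen rate is considered unreliable
PENALTY = -1.0                # illustrative penalty value

def compute_reward(data_rate_mbps, successes, attempts):
    """Penalize unreliable rates; otherwise scale reward by rate and reliability."""
    success_prob = successes / attempts if attempts > 0 else 0.0
    if success_prob < LOW_SUCCESS_THRESHOLD:
        return PENALTY
    return data_rate_mbps * success_prob
```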
d. Attempt and Success Counts
The environment internally tracks the number of transmission attempts and successes per rate in two dictionaries (a sketch follows the list below).
On each action step, the environment:
- Increments the attempt count for the selected rate.
- Samples a binary transmission outcome using the current estimated success probability.
- Updates the success count if the transmission succeeded.
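A simplified sketch of this bookkeeping, with an illustrative set of candidate-rate indices:

```python
import random

RATE_INDICES = list(range(10))                   # illustrative candidate-rate indices
attempts = {rate: 0 for rate in RATE_INDICES}    # per-rate attempt counts
successes = {rate: 0 for rate in RATE_INDICES}   # per-rate success counts

def record_transmission(rate, est_success_prob):
    """Count an attempt and sample a Bernoulli outcome for the selected rate."""
    attempts[rate] += 1
    success = random.random() < est_success_prob
    if success:
        successes[rate] += 1
    return success
```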
e. Why Warm-Start the Attempt and Success Counts
At the start of training, all rates have 0 attempts and 0 successes. This makes success probability undefined or zero, which results in poor early rewards and unstable learning.
To mitigate this, we use a warm-start strategy:
- A few rates are preloaded with 5 attempts and 4 successes.
This gives the agent initial traction and prevents early negative feedback loops.
This trick significantly stabilizes training in the first few episodes and is especially useful when rewards are sparse or noisy.
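Continuing the dictionaries from the previous sketch, the warm start can be as simple as the following; which rates are preloaded is an illustrative choice:

```python
WARM_START_RATES = [3, 4, 5]   # illustrative: a few mid-range candidate rates

for rate in WARM_START_RATES:
    attempts[rate] = 5     # preloaded attempts
    successes[rate] = 4    # preloaded successes -> initial success probability of 0.8
```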
f. Group-Based Rate Adaptation (Inspired by Minstrel_HT)
In our implementation, we adopted a group-based rate adaptation strategy inspired by the structure used in Minstrel_HT and aligned with the rate grouping logic defined within ORCA. Each group corresponds to a specific combination of key physical-layer parameters: the number of spatial streams (NSS), bandwidth, guard interval, and MCS (modulation and coding).
Rate grouping is employed to generalize success estimates across similar rates. When a rate has never been used (0 attempts), its success probability falls back to that of other, more-used rates in the same group (a sketch of this fallback follows the table below). For example, our project uses Group 20, which corresponds to 3 spatial streams, 40 MHz bandwidth, and a short guard interval (SGI). The associated rate indices and their characteristics are:
| Rate Index | MCS | Modulation | Coding | Data Rate (Mbps) | Airtime (ns) |
|---|---|---|---|---|---|
| 200 | BPSK, 1/2 | BPSK | 1/2 | 45.0 | 213,572 |
| 201 | QPSK, 1/2 | QPSK | 1/2 | 90.0 | 106,924 |
| 202 | QPSK, 3/4 | QPSK | 3/4 | 135.0 | 71,372 |
| 203 | 16-QAM, 1/2 | 16-QAM | 1/2 | 180.0 | 53,600 |
| 204 | 16-QAM, 3/4 | 16-QAM | 3/4 | 270.0 | 35,824 |
| 205 | 64-QAM, 2/3 | 64-QAM | 2/3 | 360.0 | 26,824 |
| 206 | 64-QAM, 3/4 | 64-QAM | 3/4 | 405.0 | 23,900 |
| 207 | 64-QAM, 5/6 | 64-QAM | 5/6 | 450.0 | 21,424 |
| 208 | 256-QAM, 3/4 | 256-QAM | 3/4 | 540.0 | 18,048 |
| 209 | 256-QAM, 5/6 | 256-QAM | 5/6 | 600.0 | 16,248 |
Fig3. Details of Group 20
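As a concrete illustration of the fallback described above, here is a minimal sketch; the exact fallback rule used here (borrowing the estimate of the most-attempted rate in the group) is an assumed choice, not necessarily the rule used in ORCA or Minstrel_HT:

```python
def group_success_prob(rate, group_rates, attempts, successes):
    """Success probability for a rate, with an in-group fallback for unused rates."""
    if attempts.get(rate, 0) > 0:
        return successes[rate] / attempts[rate]
    # Fallback: borrow the estimate of the most-attempted rate in the same group.
    used = [r for r in group_rates if attempts.get(r, 0) > 0]
    if not used:
        return 0.0
    best = max(used, key=lambda r: attempts[r])
    return successes[best] / attempts[best]
```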
g. From Incremental Actions to Direct Rate Selections
Initially, the action space consisted of just 2 actions: increase or decrease the rate (MCS-style adaptation). However, after several trials it was clear that this incremental strategy was too slow, and most supported rates were never reached during training.
For a simple start, we chose 10 representative rates out of all the supported rates and updated the action space to direct rate selection:
- Instead of stepping through rates one at a time, the agent now chooses directly from this subset of 10 rates.
- This allows faster exploration and better convergence.
This change made a major difference in the agent’s ability to adapt and explore diverse transmission configurations.
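A minimal sketch of what this direct-selection action space could look like, assuming the Group 20 rate indices from Fig3 as the candidate set:

```python
from gym import spaces

# Illustrative candidate set: the 10 Group 20 rate indices from Fig3.
CANDIDATE_RATES = ["200", "201", "202", "203", "204",
                   "205", "206", "207", "208", "209"]

# Discrete action i now maps directly to CANDIDATE_RATES[i],
# instead of a relative "rate up" / "rate down" adjustment.
action_space = spaces.Discrete(len(CANDIDATE_RATES))
```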
3. Real Data Collection with Shielding Box
To enable rate adaptation in real-world Wi-Fi networks, we began by collecting real channel data using controlled hardware setups. This data is essential for grounding our DQN training in reality and ensuring the learned policy performs well in practical deployment.
Rather than relying solely on simulated environments, our aim is to train and evaluate the DQN algorithm on actual wireless conditions. This is crucial for developing a rate adaptation mechanism that is robust and effective in real network scenarios.
Test Setup
We used a shielding box setup to ensure accurate and isolated measurements. The shielding box allowed us to eliminate external interference, offering a reproducible environment to study channel behavior. ORCA provided fine-grained control over the STA, such as adjusting MCS rates and attenuation, and collecting relevant metrics.
Measurement Scope:
We systematically tested the full set of supported transmission rates to understand their performance under varying signal conditions:
- Covered all supported rates (hex: 120 to 299)
- Grouped rates into prefix-based clusters (e.g., 12, 40)
- Each group was tested across attenuation levels from 40 dB to 70 dB in 5 dB steps, for 20 seconds at each step.
- For each group, we collected:
- RSSI over time
- Selected transmission rates
- Corresponding throughput
These metrics were then visualized in the form of plots to understand behavior patterns and build insights.
The real-world dataset will be instrumental in training the DQN agent directly on real channel traces. We can validate the reward function logic and evaluate the effectiveness of the learned rate adaptation policy with this data. Finally, it will be helpful in deploying and testing the trained policy in live networks.
For Group 20 => Rates 200-209 (hex) (Fig3.)

Fig4. RSSI vs Time

Fig5. Rates and Corresponding Throughput vs Time
So far, the environment simulated RSSI dynamics using random noise, which lacks the variability and structure of real wireless channels. Moving forward, this collected dataset opens the door to a more realistic simulation and training loop, such as:
- Conditional sampling: RSSI values and throughputs can be sampled based on the selected rate and attenuation level, simulating how a rate would realistically perform under current channel conditions.
- Replay of real RSSI traces: Instead of generating random RSSI values, the environment can replay actual RSSI sequences from the logs.
- Rate-throughput mapping: Instead of assigning theoretical data rates (Mbps), we can use empirical throughput measurements for each (rate, RSSI) pair, making the reward signal much more grounded.
This integration would bring the environment much closer to real-world performance, enabling the agent to learn from more structured feedback. This is a critical step for bridging the gap between simulation and real deployment.
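As a rough illustration of the conditional-sampling idea, the sketch below assumes the shielding box logs have been flattened into a hypothetical CSV (shielding_box_traces.csv) with columns attenuation_db, rate_idx, rssi, and throughput_mbps; the file name and column names are placeholders, not the actual log format:

```python
import pandas as pd

# Hypothetical flattened trace file; the name and columns
# (attenuation_db, rate_idx, rssi, throughput_mbps) are placeholders.
traces = pd.read_csv("shielding_box_traces.csv")

def sample_channel(rate_idx, attenuation_db):
    """Draw one (rssi, throughput) sample observed for this rate/attenuation pair."""
    subset = traces[(traces["rate_idx"] == rate_idx) &
                    (traces["attenuation_db"] == attenuation_db)]
    if subset.empty:
        return None  # no measurement recorded for this combination
    row = subset.sample(n=1).iloc[0]
    return float(row["rssi"]), float(row["throughput_mbps"])
```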
4. Next Steps
Environment Refinement Using Real Data
- Use the collected shielding box data to model real rate-to-reward mappings.
- Replace random assignment of noise and RSSI values with the real trace data.
Train with Full Action Space
- Expand the action space from the current 10-rate subset toward direct selection over the full set of supported rates.
- Evaluate trade-offs in convergence speed and learning stability.
Integration with ORCA and Real Hardware
- Plug the DQN model into the ORCA framework for live experimentation.
- Use real-time stats from RateMan to drive the agent’s decisions.
Logging & Visualization
- Visualize Q-values, state evolution, and rate changes over time.
- Add detailed logging of the action-reward history.
5. Challenges and Open Questions
While progress has been substantial, several challenges and open questions remain, which need thoughtful solutions going forward:
1. Kickstarting Learning Without Warm-Up Rates
To prevent the agent from receiving only negative rewards in early episodes, we manually pre-filled success/attempt stats for certain rates. However, in real deployments, such a warm-up is not always possible.
Challenge: How do we design exploration or reward-shaping mechanisms that allow the agent to learn from scratch, without manual initialization?
2. Large Action Space: Curse of Dimensionality
Our setup supports roughly 232 rates. A large action space may:
- Make learning slower and unstable
- Lead to sparse exploration and poor generalization
- Cause overfitting to rarely used or unreachable rates
3. High Attenuation Handling
Under higher attenuation (e.g., 65–70 dB), only lower MCS rates may be feasible.
Challenge: Should the agent learn rate-attenuation mappings and restrict its actions contextually, or penalize high-rate selections under weak channel conditions?
4. Real-Time Constraints
The agent needs to make decisions at frame-level or sub-second intervals. Any delay in inference or policy selection may render the decision obsolete due to fast-varying wireless environments.
Challenge: Can we compress or distill the trained agent for low-latency environments?
Conclusion
The first half of GSoC has been a learning-intensive and productive phase. From environment design to agent training, we now have a complete pipeline to simulate, train, and evaluate DQN-based rate adaptation. The next steps will focus on making the model robust, interpretable, and hardware-ready.
Please feel free to reach out if you’re interested in reinforcement learning in networks or Wi-Fi resource control. Thank you for reading!
References
1. Pawar, S. P., Thapa, P. D., Jelonek, J., Le, M., Kappen, A., Hainke, N., Huehn, T., Schulz-Zander, J., Wissing, H., & Fietkau, F.
Open-source Resource Control API for real IEEE 802.11 Networks.
2. Queirós, R., Almeida, E. N., Fontes, H., Ruela, J., & Campos, R. (2021).
Wi-Fi Rate Adaptation using a Simple Deep Reinforcement Learning Approach. IEEE Access.