Six-Robot Warehouse RL
A simulated warehouse where six autonomous forklifts learn to pick pallets and avoid each other. Three RL methods are benchmarked against three classical path-planning baselines, and every comparison goes through a formal statistical analysis.
Why this page is light on numbers
The current pilot is training the full algorithm sweep at N = 6 robots, and headline metrics are expected in about ten hours. Once the runs complete, this page will be updated with per-algorithm delivery success rates, reported as rliable IQM with 95 percent bootstrap CIs, together with Mann-Whitney tests and the Bayesian BEST posterior probability against each classical baseline.
6
Agents in simulation
3
RL methods
3
Classical baselines
190+
Unit tests
What the agents will learn
Three deep-RL agents share the same observation space, action space, and reward function. The comparison is meant to isolate which exploration strategy and which optimisation family best handle the sparse-reward, multi-agent warehouse setting.
DQN
Vanilla deep Q-learning with epsilon-greedy exploration. The baseline against which the Bayesian-exploration variant is measured.
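As a minimal sketch of the exploration rule named above, the snippet below picks a random action with probability epsilon and the greedy action otherwise. The function names, the linear annealing schedule, and its default values are illustrative assumptions, not the project's actual settings.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end (hypothetical schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

With epsilon at 0 the agent is purely greedy; with epsilon at 1 it acts uniformly at random, regardless of what the Q-network has learned.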
Bootstrapped DQN
An ensemble of Q-networks, where disagreement between ensemble members serves as the agent's uncertainty estimate. Replaces epsilon-greedy with approximate posterior sampling, the same idea Osband et al. 2016 showed yields deep exploration in sparse-reward MDPs.
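The ensemble idea can be sketched without any neural networks. In the toy below, each "head" is a lookup table standing in for one Q-network. One head is sampled per episode and followed greedily, and the spread across heads is read as uncertainty. The class name, the table-based heads, and the Bernoulli mask probability are all assumptions made for illustration.

```python
import random
import statistics

class BootstrappedQ:
    """K independent Q-estimates standing in for the K heads of an ensemble."""

    def __init__(self, n_heads, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        # One table per head; random initial values make the heads disagree.
        self.heads = [dict() for _ in range(n_heads)]
        self.active = 0

    def q(self, head, state):
        table = self.heads[head]
        if state not in table:
            table[state] = [self.rng.gauss(0.0, 1.0) for _ in range(self.n_actions)]
        return table[state]

    def begin_episode(self):
        # Sample one head and commit to it for the whole episode.
        self.active = self.rng.randrange(len(self.heads))

    def act(self, state):
        qs = self.q(self.active, state)
        return max(range(self.n_actions), key=lambda a: qs[a])

    def disagreement(self, state, action):
        # Standard deviation of the heads' estimates for one state-action pair.
        values = [self.q(h, state)[action] for h in range(len(self.heads))]
        return statistics.pstdev(values)

    def bootstrap_mask(self, p=0.5):
        # Decide which heads get to train on a given transition.
        return [self.rng.random() < p for _ in self.heads]
```

Committing to one head for a full episode is what distinguishes this from per-step dithering: the agent follows a single consistent hypothesis long enough to reach distant rewards.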
PPO
Stable-Baselines3 implementation of Proximal Policy Optimization. A policy-gradient method, so a different family from the value-based DQN variants above. It tends to be sample-hungry but stable, which makes it a useful sanity check that any gains aren't specific to Q-learning.
What the RL has to beat
A*
Informed shortest-path planner. Requires the full map and re-plans from scratch on every change.
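For reference, a single-agent A* on a 4-connected grid might look like the sketch below. The grid encoding (0 for free, 1 for blocked), unit step costs, and the Manhattan heuristic are assumptions; the project's own map representation may differ.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry, a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable
```

Because the planner has no memory between calls, any change to the map means calling `astar` again on the updated grid.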
Cooperative A*
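A* with space-time reservation. Agents plan one at a time in priority order, and each reserves the (cell, timestep) slots along its path so that later agents route around them, avoiding the typical pile-up at chokepoints.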
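The reservation idea can be sketched as A* over (cell, time) states plus a shared reservation table. This is a simplified illustration under several assumptions: a fixed priority order given by list position, a finite planning horizon, and agents that park on their goal cell once they arrive. All function and parameter names are hypothetical.

```python
import heapq

# Wait in place, or move one cell in the four grid directions.
MOVES = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))

def spacetime_astar(grid, start, goal, cells, edges, horizon):
    """A* over (cell, time) states that avoids reserved cells and swaps."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start, (start,))]
    seen = {(start, 0)}
    while open_heap:
        _, t, cell, path = heapq.heappop(open_heap)
        # Accept the goal only if the agent can park there until the horizon.
        if cell == goal and all((goal, u) not in cells for u in range(t, horizon + 1)):
            return list(path)
        if t >= horizon:
            continue
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            if (nxt, t + 1) in cells or (nxt, cell, t + 1) in edges:
                continue  # vertex conflict, or head-on swap with an earlier agent
            if (nxt, t + 1) in seen:
                continue
            seen.add((nxt, t + 1))
            heapq.heappush(open_heap, (t + 1 + h(nxt), t + 1, nxt, path + (nxt,)))
    return None

def cooperative_astar(grid, starts, goals, horizon=50):
    """Plan agents in list order; each path is reserved before the next is planned."""
    cells, edges, paths = set(), set(), []
    for start, goal in zip(starts, goals):
        path = spacetime_astar(grid, start, goal, cells, edges, horizon)
        if path is None:
            return None  # prioritized planning is incomplete and can fail here
        for t, cell in enumerate(path):
            cells.add((cell, t))
            if t > 0:
                edges.add((path[t - 1], cell, t))
        for t in range(len(path), horizon + 1):
            cells.add((goal, t))  # the agent stays parked on its goal
        paths.append(path)
    return paths
```

The result depends on the priority order: an ordering that works for one pair of agents can leave a later agent with no feasible path, which is one reason an optimal solver such as CBS is worth including alongside it.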
Conflict-Based Search (CBS)
A two-level search: a constraint tree at the top, with single-agent A* re-run at each tree node for the agent that just received a new constraint. Optimal under standard MAPF assumptions but expensive at scale.
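The high-level step of CBS can be illustrated in isolation: find the earliest conflict among the current paths, then split into two children, each forbidding one of the two agents from the contested cell or edge at that time. The sketch below covers only this branching step and omits the constrained low-level search and the best-first ordering of the tree. The tuple layouts and function names are assumptions.

```python
def first_conflict(paths):
    """Return the earliest vertex or swap conflict between two paths, or None."""
    horizon = max(len(p) for p in paths)

    def at(path, t):
        # Agents are assumed to wait on their final cell after arriving.
        return path[min(t, len(path) - 1)]

    for t in range(horizon):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                if at(paths[i], t) == at(paths[j], t):
                    return ("vertex", i, j, at(paths[i], t), t)
                if (t > 0
                        and at(paths[i], t) == at(paths[j], t - 1)
                        and at(paths[j], t) == at(paths[i], t - 1)):
                    # Agent i moved src -> dst while agent j moved dst -> src.
                    return ("edge", i, j, (at(paths[i], t - 1), at(paths[i], t)), t)
    return None

def branch(constraints, conflict):
    """Split one constraint set into two children, one per conflicting agent."""
    kind, i, j, where, t = conflict
    if kind == "vertex":
        left, right = (i, where, t), (j, where, t)
    else:
        src, dst = where
        left, right = (i, (src, dst), t), (j, (dst, src), t)
    return constraints | {left}, constraints | {right}
```

Each child would then re-plan only the newly constrained agent, and the tree grows until some node's paths are conflict-free. The number of nodes can grow exponentially with the number of conflicts, which is the scaling cost mentioned above.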
Hungarian task assignment maps pallets to robots before each episode so both RL and classical sides face the same assignment problem.
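To make the assignment problem concrete, the sketch below finds the minimum-cost one-to-one matching of robots to pallets. It uses exhaustive search over permutations rather than the Hungarian algorithm itself: for six robots that is only 720 candidates and returns the same optimal total cost, while keeping the example dependency-free. The Manhattan-distance cost and the function names are assumptions.

```python
from itertools import permutations

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def assign(robots, pallets):
    """Return (assignment, total_cost), where assignment[i] is robot i's pallet index.

    Brute force over all permutations; assumes equally many robots and pallets.
    """
    n = len(robots)
    cost = [[manhattan(r, p) for p in pallets] for r in robots]

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)
```

A polynomial-time solver becomes necessary as the fleet grows, since the number of permutations is n factorial.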
Per-algorithm delivery success at N = 6, with confidence
Once training finishes, the headline is the per-algorithm delivery success rate over 50 evaluation episodes, reported for both the best checkpoint and the final checkpoint. Beyond that:
- rliable IQM with 95 percent bootstrap confidence intervals to guard against single-seed luck.
- Mann-Whitney U for non-parametric pairwise comparisons.
- BEST (Bayesian Estimation Supersedes the t-test) for the posterior probability that one method actually beats another, not just "p < 0.05".
- Benjamini-Hochberg FDR correction across the full comparison matrix.
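Two of the pieces listed above are small enough to sketch in plain Python: the interquartile mean with a percentile bootstrap interval, and the Benjamini-Hochberg adjustment. This is not the rliable API. It is a simplified stand-in that resamples runs directly, and the function names and defaults are assumptions.

```python
import random

def iqm(scores):
    """Interquartile mean: average of the middle 50 percent of sorted scores."""
    xs = sorted(scores)
    cut = int(0.25 * len(xs))
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)

def bootstrap_ci(scores, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat, resampling runs with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(
        stat([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison is then declared significant at a 5 percent false discovery rate when its adjusted p-value is at most 0.05, which matters here because the full method-by-baseline matrix involves many pairwise tests.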
190+ unit tests cover the environment, agents, baselines, and stats pipeline. The aim of this discipline is that the final comparison can be challenged only on interpretation, not on methodology.