parham.
In progress: pilot training executing

Six-Robot Warehouse RL

A simulated warehouse where six autonomous forklifts learn to pick pallets and avoid each other. Three RL methods benchmarked against three classical path-planning baselines under rigorous statistical analysis.

Why this page is light on numbers

The current pilot is training the full algorithm sweep at N = 6 robots. Headline metrics land in about ten hours. Once the runs complete, this page will update with per-algorithm delivery success rates, reported as rliable IQM with 95 percent bootstrap CIs, alongside Mann-Whitney U tests and Bayesian BEST posterior probabilities against each classical baseline.

  • 6 agents in simulation
  • 3 RL methods
  • 3 classical baselines
  • 190+ unit tests

The three RL methods

What the agents will learn

Three deep-RL methods share the same observation space, action space, and reward function, so any difference in performance comes from the algorithm itself. The comparison is meant to isolate which exploration strategy and which optimisation family best handle the sparse-reward, multi-agent warehouse setting.

DQN

Vanilla deep Q-learning with epsilon-greedy exploration. The baseline against which the Bayesian-exploration variant is measured.
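
As a rough illustration of epsilon-greedy exploration, the sketch below picks a uniformly random action with probability epsilon and the highest-valued action otherwise. The function name `epsilon_greedy` and the plain-list Q-values are assumptions for illustration, not the project's actual code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values for the current observation.
    rng: any object exposing random() and randrange() (hypothetical hook
         so the choice can be seeded in tests).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value (first index on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the choice is purely greedy; with `epsilon = 1` it is purely random. Training schedules typically anneal between the two.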

Bootstrapped DQN

An ensemble of Q-networks; disagreement between the ensemble members serves as the agent's uncertainty estimate. Epsilon-greedy is replaced with approximate posterior sampling, the same idea Osband et al. 2016 showed yields deep exploration in sparse-reward MDPs.
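
A minimal sketch of the action-selection side of this idea, assuming each ensemble member is a plain callable that maps an observation to a list of Q-values. The class name `BootstrappedPolicy` and its methods are hypothetical; the per-member bootstrap training that gives the method its name is not shown.

```python
import random
from statistics import pvariance


class BootstrappedPolicy:
    """Acts greedily with respect to one ensemble member sampled per episode."""

    def __init__(self, heads, seed=0):
        self.heads = heads  # list of callables: obs -> list of Q-values
        self.rng = random.Random(seed)
        self.active = 0

    def start_episode(self):
        # Sampling one member per episode stands in for a posterior sample.
        self.active = self.rng.randrange(len(self.heads))

    def act(self, obs):
        q = self.heads[self.active](obs)
        return max(range(len(q)), key=lambda a: q[a])

    def disagreement(self, obs):
        # Variance of each action's Q-value across members, averaged over actions.
        per_head = [head(obs) for head in self.heads]
        variances = [pvariance(values) for values in zip(*per_head)]
        return sum(variances) / len(variances)
```

Because the sampled member is held fixed for a whole episode, the agent commits to one consistent hypothesis about the environment instead of dithering step by step.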

PPO

Stable-Baselines3 implementation of Proximal Policy Optimization. A policy-gradient method, so it belongs to a different family from the two value-based DQN variants. It tends to be sample-hungry but stable, which makes it a useful sanity check that the gains aren't architecture-specific.

The three classical baselines

What the RL has to beat

A*

Informed shortest-path planner. Requires the full map and re-plans from scratch on every change.
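
For reference, a compact A* sketch on a 4-connected occupancy grid with a Manhattan-distance heuristic. The grid encoding (`'#'` for blocked cells) and the function name `astar` are assumptions made for illustration; the project's planner and map representation may differ.

```python
import heapq


def astar(grid, start, goal):
    """Shortest path on a grid of strings; returns a list of (row, col) or None."""

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    g_cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry; a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == "#":
                continue
            ng = g + 1
            if ng < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = ng
                came_from[nxt] = cell
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

Nothing is cached between calls, which is what "re-plans from scratch" means here: any map change requires running the whole search again.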

Cooperative A*

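A* with a space-time reservation table. Agents plan one at a time in priority order, each reserving the (cell, timestep) pairs along its path so that later agents route around them, which avoids the typical pile-up at chokepoints.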
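
The reservation idea can be sketched as follows, under simplifying assumptions: a 4-connected grid, unit-time moves plus a wait action, and agents that vanish from the table once they reach their goal. The names `space_time_astar` and `cooperative_astar` are hypothetical.

```python
import heapq


def space_time_astar(grid, start, goal, reserved, max_t=50):
    """A* over (cell, t) states that avoids (cell, t) pairs already reserved."""

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    heap = [(h(start), 0, start, [start])]
    seen = set()
    while heap:
        _, t, cell, path = heapq.heappop(heap)
        if cell == goal:
            return path
        if (cell, t) in seen or t >= max_t:
            continue
        seen.add((cell, t))
        r, c = cell
        # The first candidate is "wait in place"; the rest are the four moves.
        for nxt in ((r, c), (r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == "#":
                continue
            if (nxt, t + 1) in reserved:
                continue  # vertex conflict with an earlier agent
            if (nxt, t) in reserved and (cell, t + 1) in reserved:
                continue  # conservative check against swapping cells
            heapq.heappush(heap, (t + 1 + h(nxt), t + 1, nxt, path + [nxt]))
    return None


def cooperative_astar(grid, starts, goals):
    """Plan agents in list order; earlier agents have priority."""
    reserved = set()
    paths = []
    for start, goal in zip(starts, goals):
        path = space_time_astar(grid, start, goal, reserved)
        if path is None:
            raise RuntimeError("no path under current reservations")
        reserved.update((cell, t) for t, cell in enumerate(path))
        paths.append(path)
    return paths
```

The planning order acts as a priority ranking, so the result is not guaranteed to be optimal or even complete: a low-priority agent can be boxed in by reservations made before it planned.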

Conflict-Based Search (CBS)

A two-level search: a constraint tree at the top, with single-agent A* re-run at each node under that node's constraints. Optimal under standard MAPF assumptions but expensive at scale.
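
The high-level step of CBS can be illustrated with a small sketch: find the first vertex conflict among the current paths, then branch into two child nodes, each forbidding one of the two agents from occupying that cell at that time. The function names and the `(agent, cell, t)` constraint tuple are assumptions for illustration; edge conflicts and the low-level re-planning are omitted.

```python
def first_vertex_conflict(paths):
    """Return (agent_i, agent_j, cell, t) for the earliest shared cell, or None."""
    horizon = max(len(path) for path in paths)
    for t in range(horizon):
        occupied = {}
        for agent, path in enumerate(paths):
            # Agents are assumed to wait at their goal after arriving.
            cell = path[min(t, len(path) - 1)]
            if cell in occupied:
                return occupied[cell], agent, cell, t
            occupied[cell] = agent
    return None


def branch_constraints(conflict):
    """Split one conflict into the two constraints that define the child nodes."""
    agent_i, agent_j, cell, t = conflict
    return [(agent_i, cell, t), (agent_j, cell, t)]
```

Each child node re-plans only the constrained agent, and the search expands the cheapest node first until it finds one with no conflicts.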

Hungarian task assignment maps pallets to robots before each episode so both RL and classical sides face the same assignment problem.
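
To make the assignment problem concrete, the sketch below finds the minimum-total-distance matching by brute force over permutations. This is not the Hungarian algorithm itself, only a stand-in that returns the same optimal cost and is feasible at six robots (720 permutations). In practice, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would handle the same problem. Manhattan distance as the cost and the name `assign_pallets` are assumptions.

```python
from itertools import permutations


def assign_pallets(robots, pallets):
    """Return a list where entry i is the pallet index assigned to robot i.

    Assumes equal numbers of robots and pallets, given as (x, y) positions.
    """

    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def total_cost(perm):
        return sum(dist(robots[i], pallets[p]) for i, p in enumerate(perm))

    best = min(permutations(range(len(pallets))), key=total_cost)
    return list(best)
```

Fixing the assignment before the episode means neither side gains or loses from task allocation, so the comparison isolates path planning and collision avoidance.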

Statistical analysis

Per-algorithm delivery success at N = 6, with confidence

Once training lands, the headline is the per-algorithm delivery success rate over 50 evaluation episodes, reported for both the best checkpoint and the final policy. Beyond that:

  • rliable IQM with 95 percent bootstrap confidence intervals to guard against single-seed luck.
  • Mann-Whitney U for non-parametric pairwise comparisons.
  • BEST (Bayesian Estimation Supersedes the t-test) for the posterior probability that one method actually beats another, rather than only a "p < 0.05" verdict.
  • Benjamini-Hochberg FDR correction across the full comparison matrix.
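
Two of the pieces above can be sketched in plain Python. The interquartile mean (IQM) is the mean of the middle 50 percent of scores. The confidence interval here uses a simple percentile bootstrap, which is an assumption and likely simpler than what rliable does internally. The Benjamini-Hochberg step-up procedure is standard. The function names are illustrative and no real results are used.

```python
import random


def iqm(scores):
    """Mean of the middle 50 percent of scores (25 percent trimmed per tail)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, stat=iqm, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(
        stat([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value clears its threshold.
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

With three RL methods compared against three baselines, the comparison matrix has at least nine pairwise tests, which is why an uncorrected 0.05 threshold would overstate the number of real differences.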

190+ unit tests cover the environment, agents, baselines, and stats pipeline. The aim of this discipline is that the final comparison can be challenged only on interpretation, not on methodology.

Tech stack

Full implementation stack

  • Python 3.11
  • PyTorch (custom MARL DQN and Bootstrapped-DQN with parameter sharing + agent-ID embeddings)
  • Stable-Baselines3 (PPO)
  • Optuna for hyperparameter tuning
  • Gymnasium-style custom env with vectorised LIDAR (16 rays per agent per step)
  • Differential-drive physics with sub-step collision resolution
  • Cooperative A* with space-time reservation
  • Conflict-Based Search (CBS)
  • Hungarian task assignment
  • rliable (IQM + bootstrap CIs)
  • Bayesian BEST test, Mann-Whitney U, Wilcoxon, Benjamini-Hochberg FDR
  • TensorBoard live diagnostics + per-run JSONL summaries
  • pytest (190+ tests)