Active Object Localization with Deep RL
An RL agent that learns to find objects in natural images by iteratively refining a bounding box through geometric actions, built on frozen CLIP features. A reimplementation-with-modern-components of Caicedo and Lazebnik (ICCV 2015) on Pascal VOC 2007, not a strict reproduction.
Before you read the numbers
Because this is not a strict reproduction, the numbers below need context. The paper used VGG features, whereas I use frozen CLIP, so a direct mAP comparison with the paper's 0.46 is not apples to apples. The realistic target band for CLIP-backbone work on this problem is 0.28 to 0.35. What this project is really about is closing the gap between representation quality and policy quality, and v10 is the iteration that tests the diagnosis.
Current best mAP@0.5: 0.287
Iteration cycle: v10
Target band (CLIP): 0.28 to 0.35
Paper baseline (VGG): 0.46
What changed, and why
The original paper's framework is preserved: an episodic bounding-box search via discrete geometric actions that ends on a TRIGGER action, starting from a full-image box, with success defined as IoU ≥ 0.5. Six components differ.
| Component | Paper (2015) | Ours |
|---|---|---|
| Visual backbone | VGG-16 fc7 (4096-d) | Frozen CLIP ViT-L/14 (768-d region embedding) |
| Action space | 9 actions | 10 actions (added a fine-grained scale-smaller for endgame refinement) |
| RL algorithm | Vanilla DQN | Double DQN + n-step Bellman + persistent DQfD margin loss |
| Trigger reward | Binary +5 / -1 | Continuous and monotone: shaped reward above IoU 0.5; smooth failure penalty scaling with IoU below threshold |
| Auxiliary tasks | None | Aux IoU prediction head with pairwise ranking loss, gradient flowing into the Q-trunk |
| Conditioning signal | None | Class-conditional SCLIP saliency map (16x16) concatenated into the observation |
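The episodic search that both columns of the table share can be sketched in a few lines. This is a minimal illustration, not the project's code: the `(x1, y1, x2, y2)` box layout, the action names, and the step fraction `ALPHA = 0.2` are all assumptions made for the example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels

ALPHA = 0.2  # assumed step size, as a fraction of the current box width/height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def apply_action(box: Box, action: str, width: float, height: float,
                 alpha: float = ALPHA) -> Box:
    """Apply one hypothetical geometric action and clip the box to the image."""
    x1, y1, x2, y2 = box
    dx, dy = alpha * (x2 - x1), alpha * (y2 - y1)
    if action == "right":
        x1, x2 = x1 + dx, x2 + dx
    elif action == "left":
        x1, x2 = x1 - dx, x2 - dx
    elif action == "down":
        y1, y2 = y1 + dy, y2 + dy
    elif action == "up":
        y1, y2 = y1 - dy, y2 - dy
    elif action == "smaller":  # shrink around the centre
        x1, x2, y1, y2 = x1 + dx / 2, x2 - dx / 2, y1 + dy / 2, y2 - dy / 2
    elif action == "bigger":   # grow around the centre
        x1, x2, y1, y2 = x1 - dx / 2, x2 + dx / 2, y1 - dy / 2, y2 + dy / 2
    else:
        raise ValueError(f"unknown action: {action}")
    clip_x = lambda v: min(max(v, 0.0), width)
    clip_y = lambda v: min(max(v, 0.0), height)
    return (clip_x(x1), clip_y(y1), clip_x(x2), clip_y(y2))


# Episode start: the box covers the whole (here 100x100) image.
full_image = (0.0, 0.0, 100.0, 100.0)
ground_truth = (25.0, 25.0, 75.0, 75.0)
after_shrink = apply_action(full_image, "smaller", 100.0, 100.0)
```

In this toy case the full-image box has IoU 0.25 with the ground truth, and one "smaller" step raises it to about 0.39; an episode would continue until the agent picks TRIGGER, and it counts as a success if the final IoU is at least 0.5.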
Each version is a hypothesis test
Every iteration has its own Drive root for a clean A/B comparison. Each version's success criterion is explicit: it fixes a specific mechanism and names the specific metric that should move.
Why: Cure Q-overestimation. v5's greedy policy almost never picked TRIGGER because Q(best_move) was unboundedly positive (a classic DQN pathology).
Result: mAP 0.257. Q-mean stayed in [-2, +2] for the full training run.
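The overestimation fix named here is the standard Double DQN target, which can be sketched as follows. This is a generic illustration with made-up Q-values, not the project's training code; the function names and `gamma = 0.9` are assumptions.

```python
import numpy as np


def vanilla_dqn_target(reward, q_target_next, gamma=0.9, done=False):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    if done:
        return float(reward)
    return float(reward + gamma * np.max(q_target_next))


def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.9, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return float(reward)
    best_action = int(np.argmax(q_online_next))
    return float(reward + gamma * q_target_next[best_action])


# Illustrative next-state Q-values (not measured data).
q_online_next = np.array([1.0, 3.0, 2.0])
q_target_next = np.array([1.5, 2.0, 4.0])

vanilla = vanilla_dqn_target(0.0, q_target_next)
double = double_dqn_target(0.0, q_online_next, q_target_next)
```

With these numbers the vanilla target is 3.6 and the Double DQN target is 1.8. Decoupling selection from evaluation removes the upward bias of taking a max over noisy estimates, which is the mechanism that would otherwise let Q(best_move) drift above the value of TRIGGER.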
Why: Propagate the trigger reward 3x faster through the Bellman target and prevent BC-anchor decay.
Result: mAP 0.287 (current best across all iterations).
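Both mechanisms in this iteration have compact textbook forms. The sketch below assumes that "3x faster" corresponds to a 3-step return and that the BC anchor is the large-margin supervised loss from DQfD; the function names, `gamma`, and `margin` values are illustrative choices, not values taken from the project.

```python
import numpy as np


def n_step_target(rewards, bootstrap_q, gamma=0.9):
    """n-step Bellman target: discounted sum of n rewards plus a discounted bootstrap.

    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_q` is the value estimate at
    step t+n (use 0.0 if the episode ended inside the window).
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_q


def dqfd_margin_loss(q_values, expert_action, margin=0.8):
    """DQfD large-margin loss: max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E).

    l is `margin` for non-expert actions and 0 for the expert action, so the
    loss is zero only when the expert action leads by at least `margin`.
    """
    q = np.asarray(q_values, dtype=float)
    penalty = np.full_like(q, margin)
    penalty[expert_action] = 0.0
    return float(np.max(q + penalty) - q[expert_action])


# A trigger reward arriving two steps ahead reaches the current state in one backup.
target = n_step_target([0.0, 0.0, 1.0], bootstrap_q=2.0)

loss_close = dqfd_margin_loss([1.0, 0.5, 0.2], expert_action=0)  # lead smaller than margin
loss_clear = dqfd_margin_loss([2.0, 0.5, 0.2], expert_action=0)  # lead larger than margin
```

Keeping the margin term in the loss for the whole run, rather than annealing it away, is one way to read "persistent": the demonstration anchor keeps contributing gradient whenever a non-expert action gets within `margin` of the demonstrated one.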
Why: Improve representation quality in the critical IoU band so the policy has a sharper signal to act on.
Result: Pearson rho between trunk features and true IoU in [0.5, 0.85] climbed from 0.18 to 0.48, but mAP did not move. The aux head fixed the representation; the binding constraint was elsewhere (the trigger-reward structure).
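A band-restricted correlation and a pairwise ranking loss of the kind described here could be computed as below. This is a sketch under assumptions: it treats the trunk's IoU signal as a scalar readout per sample (for example the aux head's prediction), uses a hinge form for the ranking loss, and the `margin` value is arbitrary.

```python
import numpy as np


def band_pearson(pred_iou, true_iou, lo=0.5, hi=0.85):
    """Pearson correlation between predicted and true IoU, restricted to lo <= true <= hi."""
    pred = np.asarray(pred_iou, dtype=float)
    true = np.asarray(true_iou, dtype=float)
    mask = (true >= lo) & (true <= hi)
    if mask.sum() < 2:
        raise ValueError("need at least two samples inside the IoU band")
    return float(np.corrcoef(pred[mask], true[mask])[0, 1])


def pairwise_ranking_loss(pred_iou, true_iou, margin=0.05):
    """Hinge loss over all ordered pairs: the box with higher true IoU should
    score higher than the other by at least `margin`."""
    pred = np.asarray(pred_iou, dtype=float)
    true = np.asarray(true_iou, dtype=float)
    sign = np.sign(true[:, None] - true[None, :])   # +1 where i should outrank j
    diff = pred[:, None] - pred[None, :]
    losses = np.maximum(0.0, margin - sign * diff)
    mask = sign != 0                                # skip ties and self-pairs
    return float(losses[mask].mean()) if mask.any() else 0.0


# Synthetic check: a readout that is an exact linear function of the true IoU.
true_iou = np.linspace(0.0, 1.0, 21)
perfect_readout = 2.0 * true_iou + 0.1
r_band = band_pearson(perfect_readout, true_iou)
rank_loss_good = pairwise_ranking_loss(perfect_readout, true_iou)
rank_loss_bad = pairwise_ranking_loss(-perfect_readout, true_iou)
```

Restricting the correlation to the band matters because a readout can correlate well over the full [0, 1] range while being nearly flat between 0.5 and 0.85, which is exactly where the trigger decision is made.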
Why: Test whether closing the policy-vs-representation gap recovers mAP. The rep-quality win from v9.2 should now translate, given v7's reward structure.
Result: The floor is to equal v7 (0.287); the stretch target is above 0.30. The trigger IoU distribution should also shift right (mean trigger_iou >= 0.6 vs v9.2's ~0.55).
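The success criterion above is concrete enough to check mechanically. The helper below is a hypothetical sketch: the function name and return layout are invented for illustration, and the inputs in the example call are placeholders, not measured results.

```python
from statistics import mean

V7_MAP = 0.287             # floor: equal v7
STRETCH_MAP = 0.30         # stretch: strictly above 0.30
TRIGGER_IOU_TARGET = 0.6   # mean IoU at the moment TRIGGER is chosen


def check_success_criteria(map_at_50, trigger_ious):
    """Compare one run's mAP@0.5 and trigger IoUs against the stated targets."""
    mean_trigger_iou = mean(trigger_ious)
    return {
        "floor_met": map_at_50 >= V7_MAP,
        "stretch_met": map_at_50 > STRETCH_MAP,
        "mean_trigger_iou": mean_trigger_iou,
        "trigger_shift_met": mean_trigger_iou >= TRIGGER_IOU_TARGET,
    }


# Placeholder inputs for illustration only.
report = check_success_criteria(0.29, [0.55, 0.65, 0.70])
```

Reporting the trigger IoU mean alongside mAP separates two failure modes: a policy that triggers at the right boxes but too rarely, and one that triggers often but on boxes that sit just under the 0.5 threshold.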
The contribution is the diagnosis
Across v6 → v9.2 the iterations isolated which component was actually limiting performance. The aux head proved representation quality could be lifted dramatically (rho 0.18 → 0.48) but mAP did not move. That non-result is the most informative result in the stack: the binding constraint is the trigger reward, not the representation. v10 tests the fix; whether or not the stretch target lands, the contribution is the diagnostic process and the isolated mechanisms, not a benchmark beat.
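To make the "continuous and monotone" trigger reward from the component table concrete, here is one possible shape. The source does not give the actual formula, so everything in this sketch is an assumption: the piecewise-linear form, the name `trigger_reward`, and the scales `max_bonus = 5.0` and `max_penalty = 1.0`.

```python
def trigger_reward(iou, tau=0.5, max_bonus=5.0, max_penalty=1.0):
    """Hypothetical shaped reward for the TRIGGER action.

    Above the threshold the reward grows linearly from 0 at IoU = tau to
    max_bonus at IoU = 1. Below it, the penalty grows linearly from 0 at
    IoU = tau to -max_penalty at IoU = 0. The two pieces meet at tau, so the
    function is continuous and non-decreasing in IoU.
    """
    if not 0.0 <= iou <= 1.0:
        raise ValueError("IoU must lie in [0, 1]")
    if iou >= tau:
        return max_bonus * (iou - tau) / (1.0 - tau)
    return -max_penalty * (tau - iou) / tau


# Sample the reward on a grid to inspect its shape.
grid = [i / 20 for i in range(21)]
rewards = [trigger_reward(x) for x in grid]
```

Compared with a binary reward, a shape like this gives the agent a reason to keep refining a box that is already above threshold, and makes a near miss at IoU 0.45 cheaper than a trigger on an empty region, which is the kind of signal a sharper representation can actually exploit.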