parham.
In progress: v10 training; v7 currently holds best mAP 0.287

Active Object Localization with Deep RL

An RL agent that learns to find objects in natural images by iteratively refining a bounding box through geometric actions, built on frozen CLIP features. A reimplementation-with-modern-components of Caicedo and Lazebnik (ICCV 2015) on Pascal VOC 2007, not a strict reproduction.

Before you read the numbers

This is a reimplementation-with-modern-components of Caicedo and Lazebnik (ICCV 2015), not a strict reproduction. The paper used VGG features; I use frozen CLIP, so a direct mAP comparison with the paper's 0.46 is not apples to apples. The realistic target band for CLIP-backbone work on this problem is 0.28 to 0.35. The project is really about closing the gap between representation quality and policy quality, and v10 is the iteration that tests the diagnosis.

Current best mAP@0.5: 0.287

Iteration cycle: v10

Target band (CLIP): 0.28 to 0.35

Paper baseline (VGG): 0.46

Six differences from the paper

What changed, and why

The original paper's framework is preserved: episodic search of a bounding box via discrete geometric actions, ending on a TRIGGER, with the full-image initial box and the IoU greater than or equal to 0.5 success criterion. Six specific components differ.
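
To make the preserved framework concrete, here is a minimal sketch of the two primitives it relies on: the IoU success test and a geometric box update. The action names, the `ALPHA = 0.2` step fraction, and the clipping behaviour are assumptions chosen for illustration, not the project's actual environment code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

ALPHA = 0.2  # assumed step size, as a fraction of the current box side


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_success(box: Box, gt: Box, threshold: float = 0.5) -> bool:
    """Success criterion from the text: IoU >= 0.5 when TRIGGER fires."""
    return iou(box, gt) >= threshold


def apply_action(box: Box, action: str, img_w: float, img_h: float) -> Box:
    """Apply one hypothetical geometric action, then clip to the image."""
    x1, y1, x2, y2 = box
    dw, dh = ALPHA * (x2 - x1), ALPHA * (y2 - y1)
    if action == "right":
        x1, x2 = x1 + dw, x2 + dw
    elif action == "left":
        x1, x2 = x1 - dw, x2 - dw
    elif action == "down":
        y1, y2 = y1 + dh, y2 + dh
    elif action == "up":
        y1, y2 = y1 - dh, y2 - dh
    elif action == "bigger":
        x1, x2, y1, y2 = x1 - dw / 2, x2 + dw / 2, y1 - dh / 2, y2 + dh / 2
    elif action == "smaller":
        x1, x2, y1, y2 = x1 + dw / 2, x2 - dw / 2, y1 + dh / 2, y2 - dh / 2
    elif action == "fatter":  # assumed: shrink height, keep width
        y1, y2 = y1 + dh / 2, y2 - dh / 2
    elif action == "taller":  # assumed: shrink width, keep height
        x1, x2 = x1 + dw / 2, x2 - dw / 2
    else:
        raise ValueError(f"unknown action: {action}")
    x1, x2 = max(0.0, x1), min(float(img_w), x2)
    y1, y2 = max(0.0, y1), min(float(img_h), y2)
    return (x1, y1, x2, y2)
```

An episode would start from the full-image box `(0, 0, img_w, img_h)`, call `apply_action` once per step, and evaluate `is_success` when the agent chooses TRIGGER.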

| Component | Paper (2015) | Ours |
| --- | --- | --- |
| Visual backbone | VGG-16 fc7 (4096-d) | Frozen CLIP ViT-L/14 (768-d region embedding) |
| Action space | 9 actions | 10 actions (added a fine-grained scale-smaller for endgame refinement) |
| RL algorithm | Vanilla DQN | Double DQN + n-step Bellman + persistent DQfD margin loss |
| Trigger reward | Binary +5 / -1 | Continuous and monotone: shaped reward above IoU 0.5; smooth failure penalty scaling with IoU below threshold |
| Auxiliary tasks | None | Aux IoU prediction head with pairwise ranking loss, gradient flowing into the Q-trunk |
| Conditioning signal | None | Class-conditional SCLIP saliency map (16x16) concatenated into the observation |
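
The trigger-reward row describes a shape rather than a formula, so the following is only a sketch of one function with that shape. The constants (`base`, `bonus`, `max_penalty`) and the linear ramps are hypothetical; the project's actual reward may differ in both form and scale. Note that this version still jumps from roughly 0 to `base` at the threshold, so it is monotone in IoU but not a continuous function.

```python
def trigger_reward(
    iou: float,
    threshold: float = 0.5,
    base: float = 3.0,         # hypothetical reward at exactly IoU == threshold
    bonus: float = 2.0,        # hypothetical extra reward earned by IoU == 1.0
    max_penalty: float = 3.0,  # hypothetical penalty at IoU == 0.0
) -> float:
    """Reward for firing TRIGGER at a box with the given IoU.

    Above the threshold the reward grows linearly with IoU; below it the
    penalty shrinks linearly as IoU approaches the threshold.
    """
    if not 0.0 <= iou <= 1.0:
        raise ValueError("iou must lie in [0, 1]")
    if iou >= threshold:
        return base + bonus * (iou - threshold) / (1.0 - threshold)
    return -max_penalty * (threshold - iou) / threshold
```

Compared with a binary reward, a shape like this gives the agent a reason to keep refining a box that already clears 0.5, and it penalizes a near miss less than a trigger on a hopeless box.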
Iteration history

Each version is a hypothesis test

Every iteration has its own Drive root for clean A/B comparison. Each version's success criterion is explicit: a specific mechanism is fixed and a specific metric should move.

v6: Added Double DQN

Why: Cure Q-overestimation; v5's greedy policy almost never picked TRIGGER because Q(best_move) was unboundedly positive (classic DQN pathology).

Result: mAP 0.257. Q-mean stayed in [-2, +2] for the full training run.
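
The difference between the two targets is small in code, which a framework-free sketch makes visible. This is not the project's Stable-Baselines3 subclass; the function names and the list-based Q-values are illustrative assumptions.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def dqn_target(
    reward: float, done: bool, gamma: float, q_target_next: Sequence[float]
) -> float:
    """Vanilla DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(
    reward: float,
    done: bool,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
) -> float:
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]


# Example: the target network has one inflated estimate (3.0 for action 2).
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [0.8, 1.1, 3.0]
vanilla = dqn_target(0.0, False, 0.9, q_target_next)
double = double_dqn_target(0.0, False, 0.9, q_online_next, q_target_next)
```

In the example, vanilla DQN bootstraps from the inflated 3.0, while Double DQN bootstraps from 1.1 because the online network prefers action 1. Decoupling selection from evaluation is what damps the upward bias that made move actions look better than TRIGGER.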

v7: n-step Bellman + persistent DQfD margin loss

Why: Propagate trigger reward 3x faster through the Bellman target and prevent BC-anchor decay.

Result: mAP 0.287 (current best across all iterations).
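
Both v7 mechanisms fit in a few lines, sketched here under stated assumptions. The margin of 0.8 is borrowed from the DQfD paper rather than from this project, and "persistent" is read as keeping the margin term in the loss for the whole run instead of annealing it away after the behaviour-cloning warmup.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float], gamma: float, bootstrap_value: float, done: bool
) -> float:
    """n-step return: discounted sum of n rewards plus a discounted bootstrap.

    `rewards` holds r_t .. r_{t+n-1}; `bootstrap_value` is the Q estimate at
    state s_{t+n} and is dropped if the episode ended inside the window.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    if not done:
        total += (gamma ** len(rewards)) * bootstrap_value
    return total


def dqfd_margin_loss(
    q_values: Sequence[float], expert_action: int, margin: float = 0.8
) -> float:
    """Large-margin loss: max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E).

    l is `margin` for every non-expert action and 0 for the expert action,
    so the loss is zero only when the expert action leads by at least `margin`.
    """
    augmented = [
        q + (0.0 if a == expert_action else margin)
        for a, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]
```

With a 3-step window, a trigger reward earned at step t+2 reaches the target for step t in a single update instead of three, which is one way to read the "3x faster" claim above.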

v9.2: Auxiliary IoU prediction head with gradient flow into Q-trunk

Why: Improve the representation quality at the critical IoU band so the policy has a sharper signal to act on.

Result: Pearson rho between trunk features and true IoU in [0.5, 0.85] climbed from 0.18 to 0.48, but mAP did not move. The aux head fixed the representation; the binding constraint was somewhere else (the trigger-reward structure).
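
For reference, here is a sketch of how a band-limited correlation and a pairwise ranking loss could be computed. It assumes the trunk is read out as one scalar IoU prediction per sample (for example, the aux head's output); the hinge form and the `margin` value are illustrative choices, not the project's confirmed loss.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def band_pearson(
    pred: Sequence[float], true_iou: Sequence[float],
    lo: float = 0.5, hi: float = 0.85,
) -> float:
    """Correlation computed only on samples whose true IoU lies in [lo, hi]."""
    pairs = [(p, t) for p, t in zip(pred, true_iou) if lo <= t <= hi]
    return pearson([p for p, _ in pairs], [t for _, t in pairs])


def pairwise_ranking_loss(
    pred: Sequence[float], true_iou: Sequence[float], margin: float = 0.1
) -> float:
    """Mean hinge loss over pairs (i, j) with true_iou[i] > true_iou[j].

    A pair costs nothing once pred[i] exceeds pred[j] by at least `margin`.
    """
    losses: List[float] = []
    for i in range(len(pred)):
        for j in range(len(pred)):
            if true_iou[i] > true_iou[j]:
                losses.append(max(0.0, margin - (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0
```

A ranking loss only asks the head to order boxes correctly by IoU, which matches what the policy needs when deciding whether one more refinement step beats triggering now.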

v10: Training in progress

Why: Test whether closing the policy-vs-representation gap recovers mAP. The rep-quality win from v9.2 should now translate, given v7's reward structure.

Result: Pending. Floor: match v7 (0.287). Stretch: above 0.30. The trigger IoU distribution should also shift right (mean trigger_iou >= 0.6 vs v9.2's ~0.55).

What I'm presenting

The contribution is the diagnosis

Across v6 → v9.2 the iterations isolated which component was actually limiting performance. The aux head proved representation quality could be lifted dramatically (rho 0.18 → 0.48) but mAP did not move. That non-result is the most informative result in the stack: the binding constraint is the trigger reward, not the representation. v10 tests the fix; whether or not the stretch target lands, the contribution is the diagnostic process and the isolated mechanisms, not a benchmark beat.

Tech stack

Full implementation stack

Python
PyTorch
Stable-Baselines3 with custom DoubleDQN(DQN) subclass
n-step Bellman target propagation
Persistent DQfD margin loss
OpenAI CLIP ViT-L/14 (frozen vision and text encoder, 768-d region embeddings)
SCLIP for class-conditional saliency (16x16 map concatenated into the observation)
Gymnasium environment API
Behaviour-cloning warmup from a greedy IoU oracle
Auxiliary IoU-prediction head with pairwise ranking loss
Importance-sampled replay buffer
10 discrete actions (paper's 9 + a fine-grained scale-smaller for endgame refinement)
Continuous monotone trigger reward (replaces paper's binary +5 / -1)
Google Colab with chunked checkpoints persisted to Google Drive
TensorBoard live diagnostics