parham.
In progress · Algorithms validated, web app live; awaiting permission to collect participant data · MSc thesis

Preference Shielding for Human-Robot Interaction (MSc thesis)

MSc thesis. A web-based HRI study comparing four shielding conditions for a Q-learning agent on a 7x7 grid: no shielding, standard preference shielding, Adaptive Shielding (confidence gate), and Hard/Soft per-object Shielding. Participants will watch the agent navigate, express directional preferences, and answer questionnaires.

Thesis hypothesis

Does adding a confidence gate (Adaptive Shielding) or a Hard/Soft per-object enforcement split to the existing Preference Shielding mechanism improve how transparent and trustworthy a learning robot looks to a human observer, without slowing down how quickly it learns the task?

  • Study conditions: 4
  • Pre-study runs: 240
  • Seeds per condition: 30
  • Grid size: 7×7

The four study conditions

What participants will compare

Each participant sees one of four shielding regimes for a Q-learning agent on a 7x7 grid. The two new contributions (Adaptive and Hard / Soft) are the heart of the thesis; Baseline and Standard PS are the controls.

1. Baseline

Q-learning agent with no shielding. The agent learns from environmental reward only; participant preferences are recorded but never enforced.
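A minimal sketch of what the Baseline learner could look like: tabular Q-learning with NumPy on a 7x7 grid. The action set, the goal cell, the reward values (-1 per move, +10 at the goal), and the hyperparameters are all assumptions for illustration, not the thesis settings. Preferences play no part in the update.

```python
import numpy as np

N = 7  # grid side length, from the study description
# Assumed action set: up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step(state, action, goal=(N - 1, N - 1)):
    """Hypothetical environment: -1 per move, +10 on reaching the goal."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    done = nxt == goal
    return nxt, (10.0 if done else -1.0), done


def q_update(q, state, action, reward, nxt, done, alpha=0.1, gamma=0.95):
    """Standard one-step Q-learning update, applied in place."""
    r, c = state
    target = reward if done else reward + gamma * np.max(q[nxt])
    q[r, c, action] += alpha * (target - q[r, c, action])


def train(episodes=300, epsilon=0.1, max_steps=200, seed=0):
    """Epsilon-greedy training from the top-left corner."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N, N, len(ACTIONS)))
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = int(rng.integers(len(ACTIONS)))
            else:
                action = int(np.argmax(q[state]))
            nxt, reward, done = step(state, action)
            q_update(q, state, action, reward, nxt, done)
            state = nxt
            if done:
                break
    return q
```

Calling `train()` returns a `(7, 7, 4)` Q-table; the shielded conditions differ only in what happens between choosing an action and executing it.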

2. Standard Preference Shielding

The original mechanism from the literature. Every preference is enforced unconditionally near matching objects, regardless of how confident the agent is or how important the object is to the participant.
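The unconditional rule can be sketched as a small filter between the agent's proposed action and the action actually executed. This is an illustrative reading, not the published mechanism: "near" is assumed to mean Manhattan distance of at most 1, and the object kinds and direction strings are hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]


def near(a: Cell, b: Cell, radius: int = 1) -> bool:
    """Assumed proximity test: Manhattan distance within `radius`."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) <= radius


def standard_shield(
    agent_pos: Cell,
    proposed: str,
    objects: Dict[Cell, str],      # cell -> object kind (hypothetical names)
    preferences: Dict[str, str],   # object kind -> preferred direction
) -> Tuple[str, bool]:
    """Return (executed action, whether the shield overrode the agent)."""
    for cell, kind in objects.items():
        if kind in preferences and near(agent_pos, cell):
            preferred = preferences[kind]
            # Enforced unconditionally: Q-values are never consulted.
            return preferred, preferred != proposed
    return proposed, False
```

For example, with a "plant" at `(3, 3)` and the preference `{"plant": "left"}`, an agent at `(3, 2)` proposing `"up"` would be overridden to `"left"`, while the same proposal at `(0, 0)` passes through unchanged.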

3. Adaptive Shielding

New contribution. The shield defers to the agent once its Q-value confidence crosses a threshold. Early in learning the participant's preferences carry full weight; as the agent becomes confident, it earns autonomy.
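One way to picture the confidence gate is below. The page does not say how Q-value confidence is measured, so this sketch assumes the softmax probability of the greedy action, with a hypothetical threshold of 0.8.

```python
import math
from typing import Optional, Sequence


def confidence(q_values: Sequence[float], temperature: float = 1.0) -> float:
    """Assumed confidence measure: softmax probability of the best action."""
    best = max(q_values)
    exps = [math.exp((q - best) / temperature) for q in q_values]
    return max(exps) / sum(exps)


def adaptive_shield(
    proposed: str,
    preferred: Optional[str],
    q_values: Sequence[float],
    threshold: float = 0.8,  # hypothetical value, not from the thesis
) -> str:
    """Enforce the preference only while the agent is still unsure."""
    if preferred is None:
        return proposed          # no preference applies in this state
    if confidence(q_values) >= threshold:
        return proposed          # confident agent: the shield defers
    return preferred             # unsure agent: the preference wins
```

With an untrained Q-table every action scores the same, confidence sits at 0.25 for four actions, and the preference is always enforced. Once one action clearly dominates, the gate opens and the agent's own choice is executed.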

4. Hard / Soft Shielding

New contribution. Participants tag each object as Strict or Flexible. Strict objects get unconditional override; Flexible objects let the agent learn freely. Users keep guarantees on what matters most without strangling task performance everywhere.
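A sketch of the per-object split, under the same assumptions as before (proximity as Manhattan distance of at most 1, directions as strings). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Enforcement(Enum):
    STRICT = "strict"
    FLEXIBLE = "flexible"


@dataclass(frozen=True)
class TaggedObject:
    cell: Tuple[int, int]
    preferred: str            # participant's preferred direction near this object
    enforcement: Enforcement  # participant's Strict / Flexible tag


def hard_soft_shield(
    agent_pos: Tuple[int, int],
    proposed: str,
    objects: Iterable[TaggedObject],
    radius: int = 1,
) -> str:
    """Override only near Strict objects; Flexible ones leave the agent free."""
    for obj in objects:
        dist = abs(agent_pos[0] - obj.cell[0]) + abs(agent_pos[1] - obj.cell[1])
        if dist <= radius and obj.enforcement is Enforcement.STRICT:
            return obj.preferred
    return proposed
```

In this sketch, tagging every object Strict collapses the rule into unconditional enforcement, and tagging every object Flexible collapses it into the unshielded agent.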

Pre-study validation

Algorithm benchmark before the humans arrive

Before opening the experiment to participants, I ran a 2×2×2 factorial algorithmic pre-study (8 conditions × 30 seeds = 240 runs) to confirm the extensions behave as designed. In particular, the all-Strict configuration of Hard / Soft Shielding reproduces the classic safety-performance tradeoff (perfect preference alignment, lost task success), which is exactly the failure mode the Strict / Flexible split is designed to escape.
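The run count follows directly from the design: three on/off extensions give 2 × 2 × 2 = 8 conditions, and 8 × 30 seeds = 240 runs. A short sketch of that enumeration, assuming the extensions are labelled A, B and C (the `COND_AC` style matches the figure captions; the `BASELINE` label and the seed numbering are assumptions):

```python
from itertools import product

EXTENSIONS = ("A", "B", "C")   # three binary factors
SEEDS = range(30)              # 30 seeds per condition


def condition_name(flags):
    """Name a condition after the extensions that are switched on."""
    active = "".join(name for name, on in zip(EXTENSIONS, flags) if on)
    return f"COND_{active}" if active else "BASELINE"


conditions = [condition_name(flags) for flags in product((False, True), repeat=3)]
runs = [(cond, seed) for cond in conditions for seed in SEEDS]

print(len(conditions), len(runs))  # 8 240
```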

Pre-study figure 1 of 4
Learning curves across the 8 factorial pre-study conditions
Reward over training across the 8 factorial pre-study conditions.
How to read it: each curve is one of the 8 algorithmic conditions (Baseline + 7 extension combinations) averaged across 30 seeds. By episode 200 the conditions split into two regimes: those that find a viable policy (top cluster, near zero cumulative reward) and those that get stuck in the failure mode (bottom cluster, near -1000). This is a pre-study algorithmic benchmark, not participant data.
Pre-study figure 2 of 4
Interaction effects bar charts for the 2×2×2 factorial design
Interaction effects across the 2×2×2 factorial design.
How to read it: red bars mean two extensions cancel each other out (antagonistic), green bars mean they amplify (synergistic), grey bars mean they add independently. M_FIN shows a clean antagonistic interaction in COND_AC and COND_ABC, the signal that extensions A and C should not be combined unconditionally.
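The three bar colours can be read as the sign of a two-factor interaction contrast. The sketch below uses the textbook contrast for a 2×2 slice of the design and assumes that higher metric values are better; the thesis may define or threshold the effect differently, and the tolerance is a hypothetical parameter.

```python
def interaction(y_none: float, y_a: float, y_b: float, y_ab: float) -> float:
    """Two-factor interaction contrast: combined effect minus the sum of
    the two individual effects, all measured against the no-extension mean."""
    return (y_ab - y_none) - ((y_a - y_none) + (y_b - y_none))


def classify(y_none: float, y_a: float, y_b: float, y_ab: float,
             tol: float = 1e-9) -> str:
    """Label the interaction, assuming higher metric values are better."""
    effect = interaction(y_none, y_a, y_b, y_ab)
    if abs(effect) <= tol:
        return "additive"       # grey: the extensions add independently
    return "synergistic" if effect > 0 else "antagonistic"  # green / red
```

With made-up means (not pre-study results), `classify(0.0, 1.0, 1.0, 1.0)` comes out antagonistic: each extension alone gains 1.0, but together they still gain only 1.0.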
Pre-study figure 3 of 4
Radar chart comparing 8 conditions across 5 metrics
Extension comparison across the final 100 episodes (5 metrics, 8 conditions).
How to read it: each axis is one outcome metric (preference alignment, task success, short episodes, final reward, low override rate). Higher is better. Most extensions cluster tightly on the bottom 4 metrics but split sharply on Preference Alignment (the vertical axis), the cleanest empirical statement of the safety-performance tradeoff that the Hard/Soft Shielding contribution is designed to solve.
Pre-study figure 4 of 4
Heatmap of policy density across the 8 conditions
Per-condition policy density on the 7x7 grid.
How to read it: each panel is the visited-state density of the converged agent under one condition. Conditions that concentrate density on a tight diagonal are converging to a single goal-directed path; conditions with density spread across rooms are stuck or oscillating.

The web app

Where participants will meet the agent

The study runs in a browser. A FastAPI backend streams the agent's training over a WebSocket while a React 18 frontend paints the 7x7 grid live, frame by frame. Participants express directional preferences with an arrow grid, watch the agent adapt (or not), and answer questionnaires after each session.

Backend

  • FastAPI with async WebSocket training loop
  • aiosqlite for participant data + questionnaires
  • Per-condition Q-table runs, async streaming
  • Admin REST API for analytics, settings, and live monitoring
  • Docker-deployable on Fly.io
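The async streaming idea can be sketched with the standard library alone: an async generator produces one frame per agent step, and a consumer forwards each frame through an injected `send` coroutine. In a FastAPI WebSocket endpoint that coroutine would typically be `websocket.send_text` (after `await websocket.accept()`). The frame fields and the placeholder movement rule are hypothetical and stand in for the real training step.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Tuple


async def training_frames(episodes: int, steps: int) -> AsyncIterator[Dict]:
    """Yield one frame per step. The movement rule is a placeholder,
    not a learning agent."""
    for episode in range(episodes):
        pos = [0, 0]
        for t in range(steps):
            axis = t % 2                          # alternate row / column moves
            pos[axis] = min(pos[axis] + 1, 6)     # stay inside the 7x7 grid
            yield {"episode": episode, "step": t, "agent": list(pos)}
            await asyncio.sleep(0)                # hand control back to the event loop


async def stream(send: Callable[[str], Awaitable[None]],
                 frames: AsyncIterator[Dict]) -> int:
    """Forward every frame as JSON text and return how many were sent."""
    count = 0
    async for frame in frames:
        await send(json.dumps(frame))
        count += 1
    return count


def demo() -> Tuple[int, List[str]]:
    """Run the stream against an in-memory stand-in for the socket."""
    sent: List[str] = []

    async def fake_send(message: str) -> None:
        sent.append(message)

    count = asyncio.run(stream(fake_send, training_frames(episodes=2, steps=3)))
    return count, sent
```

Injecting `send` keeps the training loop testable without a running server, and the `asyncio.sleep(0)` between frames is what lets other connections make progress while one participant's session trains.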

Frontend

  • React 18 + Vite + Tailwind CSS v4
  • Live animated 7x7 GridCanvas, looping AnimatedPath
  • PreferencePanel (arrow grid) and EnforcementPanel (Hard / Soft per object)
  • Mid-training popup questionnaires + post-session screens
  • Consent gate + onboarding + debrief flow

Status

Gated on data-collection permission

Algorithms and the web app are complete; the algorithmic pre-study above confirms the contributions behave as designed. The participant experiment opens once the relevant institutional review grants permission to collect human-subjects data.

Once participants come through, the headline analysis will compare the four conditions on three outcome families: perceived robot transparency (questionnaire scales), perceived trust (Likert + qualitative), and learning speed (objective task-success time). I'll wire the live results into a dashboard here once the data is in.
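Comparing four conditions pairwise means six comparisons per outcome, which is where a multiple-comparison correction comes in. The sketch below only plans the comparisons and applies a Bonferroni adjustment; it assumes the raw p-values come from a two-sample test such as `scipy.stats.mannwhitneyu(x, y).pvalue`, and the short condition labels and the choice of comparison family are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Short labels for the four study conditions (labels are illustrative).
CONDITIONS = ["Baseline", "Standard PS", "Adaptive", "Hard/Soft"]


def pairwise_comparisons(conditions: List[str]) -> List[Pair]:
    """Every unordered pair of conditions."""
    return list(combinations(conditions, 2))


def bonferroni(p_values: Dict[Pair, float],
               alpha: float = 0.05) -> Dict[Pair, Tuple[float, bool]]:
    """Multiply each raw p-value by the number of comparisons (capped at 1)
    and flag it as significant if the adjusted value is below alpha."""
    m = len(p_values)
    adjusted = {}
    for pair, p in p_values.items():
        p_adj = min(1.0, p * m)
        adjusted[pair] = (p_adj, p_adj < alpha)
    return adjusted


pairs = pairwise_comparisons(CONDITIONS)
print(len(pairs))  # 6
```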

Tech stack

Full implementation stack

  • Python
  • Tabular Q-learning (NumPy) on a 7x7 grid environment
  • FastAPI backend with async WebSocket training loop
  • aiosqlite for participant data and questionnaires
  • React 18 + Vite + Tailwind CSS v4 frontend
  • Live animated grid (GridCanvas, AnimatedPath)
  • Adaptive Shielding (confidence-gated override)
  • Hard/Soft Shielding (per-object Strict vs Flexible enforcement)
  • SciPy stats (Mann-Whitney U, Welch's t-test, Bonferroni correction)
  • Plotly (interactive) + Matplotlib (paper figures)
  • reportlab (PDF report generation)
  • Docker on Fly.io for participant access