🚘 Learning Social Navigation from Positive and Negative Demonstrations and Rule-Based Specifications

Department of Artificial Intelligence, Korea University1
Department of Computer Science and Engineering, Korea University2
Department of Artificial Intelligence, Yonsei University3
Robotics Institute, Carnegie Mellon University4
Mobinn5
Department of Electrical and Computer Engineering, Queen's University6
Department of Statistics, Korea University7

This video presents real-world results of PioneeR (Positive and Negative demonstration density–driven Rewards with Rule-based specifications), a framework that combines data-driven learning and rule-based safety for socially aware robot navigation.

Abstract

Mobile robot navigation in dynamic human environments requires policies that balance adaptability to diverse behaviors with compliance with safety constraints. We hypothesize that integrating data-driven rewards with rule-based objectives enables navigation policies to achieve a more effective balance of adaptability and safety. To this end, we develop a framework that learns a density-based reward from positive and negative demonstrations and augments it with rule-based objectives for obstacle avoidance and goal reaching. A sampling-based lookahead controller produces supervisory actions that are both safe and adaptive, which are subsequently distilled into a compact student policy suitable for real-time operation with uncertainty estimates. Experiments in synthetic and elevator co-boarding simulations show consistent gains in success rate and time efficiency over baselines, and real-world demonstrations with human participants confirm the practicality of deployment.

Overview

System Overview

PioneeR is a framework that combines data-driven learning from demonstrations with rule-based safety to achieve reliable social navigation. Stage 1 – Reward Learning: Construct density-based rewards from positive and negative demonstrations, augmented with rule-based safety and goal terms. Stage 2 – Teacher Policy: Generate safe and adaptive supervisory actions through sampling-based lookahead control using the combined reward. Stage 3 – Student Policy: Distill the teacher’s guidance into a compact, uncertainty-aware policy for real-time robot navigation.

Effect of Reward Design

Reward visualizations: (a) Learning-Based (Pos.) · (b) Learning-Based (Pos. & Neg.) · (c) Learning & Rule-Based (Pos. & Neg. & Rule)

This synthetic example illustrates how each component contributes to navigation. (a) Positive demonstrations only: the learned reward highlights both feasible corridors but lacks explicit safety awareness. (b) Positive and negative demonstrations: unsafe regions near humans are suppressed, guiding the robot toward safer trajectories. (c) With rule-based specifications: the final reward yields smooth, goal-directed paths that preserve clearance and achieve reliable navigation.
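
To make the combination concrete, the sketch below shows one plausible way to blend a demonstration-density term with rule-based goal and clearance terms. The kernel-density estimator, the weights, and the clearance radius are illustrative assumptions, not the exact formulation used by PioneeR.

```python
# Hedged sketch: density-driven reward from positive/negative demos plus
# rule-based terms. Weights, bandwidth, and clearance are assumed values.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_density(states, bandwidth=0.3):
    """Fit a Gaussian KDE on demonstration states (N x D array)."""
    return KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(states)

def combined_reward(state, kde_pos, kde_neg, goal, obstacles,
                    w_demo=1.0, w_goal=0.5, w_obs=2.0, clearance=0.6):
    """Data-driven term (positive minus negative log-density) plus rule terms."""
    s = np.asarray(state, dtype=float).reshape(1, -1)
    # Prefer states near positive demonstrations, avoid states near negative ones.
    r_demo = kde_pos.score_samples(s)[0] - kde_neg.score_samples(s)[0]
    # Rule-based goal term: negative Euclidean distance to the goal.
    r_goal = -np.linalg.norm(s[0, :2] - np.asarray(goal))
    # Rule-based safety term: penalize violating a clearance radius around humans/obstacles.
    dists = [np.linalg.norm(s[0, :2] - np.asarray(o)) for o in obstacles]
    d_min = min(dists) if dists else np.inf
    r_obs = -max(0.0, clearance - d_min)
    return w_demo * r_demo + w_goal * r_goal + w_obs * r_obs
```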

Dataset Generation

Positive Demonstrations

Negative Demonstrations

We trained the density-based reward on a dataset of positive and negative demonstrations collected through keyboard teleoperation.
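
As a rough illustration of how such teleoperated episodes could be stored with their positive or negative labels, the snippet below defines a hypothetical logging structure; the field names and state representation are assumptions, not the actual dataset schema.

```python
# Hypothetical logging structure for labeled teleoperation episodes.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Transition:
    robot_state: Tuple[float, float, float]   # (x, y, heading)
    human_states: list                        # nearby pedestrian poses
    action: Tuple[float, float]               # commanded (linear, angular) velocity
    label: int                                # +1 positive demo, -1 negative demo

@dataclass
class Episode:
    label: int                                # applies to every transition in the episode
    transitions: List[Transition] = field(default_factory=list)

    def add(self, robot_state, human_states, action):
        self.transitions.append(
            Transition(robot_state, human_states, action, self.label))
```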

Teacher Policy

HR-RL (Human-Right, Robot-Left)

HL-RR (Human-Left, Robot-Right)

Legend: Selected Action · High Reward · Low Reward

The teacher policy was constructed with sampling-based lookahead control over a reward that combines the density-based term learned from positive and negative demonstrations with rule-based terms for obstacle avoidance and goal seeking.
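
A minimal sketch of such sampling-based lookahead control is given below, assuming a unicycle rollout model and a generic reward_fn over states (for example, the combined_reward sketched earlier); the horizon, sample count, and velocity bounds are illustrative choices rather than the paper's settings.

```python
# Hedged sketch of sampling-based lookahead control for the teacher policy.
import numpy as np

def rollout(state, action, horizon=10, dt=0.2):
    """Propagate an assumed unicycle model (x, y, heading) under a fixed action."""
    x, y, th = state
    v, w = action
    traj = []
    for _ in range(horizon):
        x += v * np.cos(th) * dt
        y += v * np.sin(th) * dt
        th += w * dt
        traj.append((x, y, th))
    return traj

def teacher_action(state, reward_fn, horizon=10, n_samples=64, rng=None):
    """Sample candidate velocities, score lookahead rollouts, return the best action."""
    rng = rng if rng is not None else np.random.default_rng()
    candidates = np.column_stack([
        rng.uniform(0.0, 1.0, n_samples),    # linear velocity [m/s] (assumed bound)
        rng.uniform(-1.0, 1.0, n_samples),   # angular velocity [rad/s] (assumed bound)
    ])
    scores = [sum(reward_fn(s) for s in rollout(state, a, horizon))
              for a in candidates]
    return candidates[int(np.argmax(scores))]
```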

Student Policy

HR-RL (Human-Right, Robot-Left)

HL-RR (Human-Left, Robot-Right)

The teacher policy, which relies on privileged information, was distilled into a compact student policy, enabling deployment in real-world environments.
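
The snippet below sketches one standard way to perform such distillation, regressing a small student network onto teacher actions from onboard observations only; the network size, observation dimensionality, and mean-squared-error loss are assumptions for illustration, not the exact training setup.

```python
# Hedged sketch of teacher-to-student distillation (behavior-cloning style).
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Compact policy that sees only onboard observations (no privileged info)."""
    def __init__(self, obs_dim=16, act_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def distill_step(student, optimizer, obs_batch, teacher_actions):
    """One supervised step: regress student actions onto teacher supervisory actions."""
    pred = student(obs_batch)
    loss = nn.functional.mse_loss(pred, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```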

Uncertainty Analysis

Danger! (Epistemic Uncertainty: 0.741)
Danger! (Epistemic Uncertainty: 0.524)
Safe! (Epistemic Uncertainty: 0.360)

Uncertainty analysis reveals that higher epistemic uncertainty consistently corresponds to risky interactions, enabling the policy to distinguish safe and risky situations.
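
One common way to obtain such epistemic uncertainty estimates is disagreement across an ensemble of student policies, sketched below; the paper may use a different estimator, so the ensemble recipe and the risk threshold here are assumptions.

```python
# Hedged sketch: epistemic uncertainty as ensemble disagreement.
import torch

@torch.no_grad()
def epistemic_uncertainty(students, obs, threshold=0.5):
    """Return mean action, per-sample uncertainty, and a risk flag (assumes >= 2 students)."""
    preds = torch.stack([s(obs) for s in students])   # (K, B, act_dim)
    uncertainty = preds.std(dim=0).mean(dim=-1)       # disagreement per sample, (B,)
    mean_action = preds.mean(dim=0)
    return mean_action, uncertainty, uncertainty > threshold
```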

Real-World Demonstrations

Single Policy Across Diverse Scenarios

Scenario Involving Multiple Human Interactions