Mobile robot navigation in dynamic human environments requires policies that balance adaptability to diverse behaviors with compliance with safety constraints. We hypothesize that integrating data-driven rewards with rule-based objectives enables navigation policies to strike a more effective balance between adaptability and safety. To this end, we develop a framework that learns a density-based reward from positive and negative demonstrations and augments it with rule-based objectives for obstacle avoidance and goal reaching. A sampling-based lookahead controller produces supervisory actions that are both safe and adaptive, and these are subsequently distilled into a compact student policy that runs in real time and provides uncertainty estimates. Experiments in synthetic and elevator co-boarding simulations show consistent gains in success rate and time efficiency over baselines, and real-world demonstrations with human participants confirm that the framework is practical to deploy.
PioneeR is a framework that combines data-driven learning from demonstrations with rule-based safety to achieve reliable social navigation.
Stage 1 – Reward Learning: Construct density-based rewards from positive and negative demonstrations, augmented with rule-based safety and goal terms.
Stage 2 – Teacher Policy: Generate safe and adaptive supervisory actions through sampling-based lookahead control using the combined reward.
Stage 3 – Student Policy: Distill the teacher’s guidance into a compact, uncertainty-aware policy for real-time robot navigation.
Panels: (a) Learning-Based (Pos.) · (b) Learning-Based (Pos. & Neg.) · (c) Learning & Rule-Based (Pos. & Neg.)
The synthetic example illustrates how each component contributes to navigation. (a) Positive demonstrations only: the learned reward highlights both feasible corridors but lacks explicit safety awareness. (b) Positive and negative demonstrations: unsafe regions near humans are suppressed, guiding the robot toward safer trajectories. (c) With rule-based specifications added: the final reward yields smooth, goal-directed paths that preserve clearance and achieve reliable navigation.
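As a concrete illustration of how these terms combine, the sketch below composes a data-driven density term with rule-based goal and clearance terms. All function names, weights, and the clearance margin are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_reward(state, log_p_pos, log_p_neg, goal, obstacles,
                    w_density=1.0, w_goal=0.5, w_obs=2.0, d_safe=0.6):
    """Illustrative composition of learned and rule-based reward terms.

    log_p_pos / log_p_neg are callables returning log-densities of the state
    under positive / negative demonstrations (Stage 1).  The weights and the
    clearance margin d_safe are placeholder values.
    """
    # (a)/(b) Data-driven term: favor states resembling positive demos and
    # penalize states resembling negative (unsafe) demos.
    r_density = log_p_pos(state) - log_p_neg(state)

    # (c) Rule-based goal term: reward progress toward the goal position.
    r_goal = -np.linalg.norm(np.asarray(state[:2]) - np.asarray(goal))

    # (c) Rule-based safety term: penalize intruding into a clearance margin
    # around nearby humans or obstacles.
    dists = [np.linalg.norm(np.asarray(state[:2]) - np.asarray(o)) for o in obstacles]
    r_obs = -sum(max(0.0, d_safe - d) for d in dists)

    return w_density * r_density + w_goal * r_goal + w_obs * r_obs
```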
Positive Demonstrations / Negative Demonstrations
We trained the density-based reward using a dataset gathered through keyboard teleoperation that included both positive and negative demonstrations.
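One plausible way to realize such a density-based reward is to fit separate kernel density estimators to the states visited in positive and negative demonstrations, for example with scikit-learn; the kernel choice and bandwidth below are assumed hyperparameters, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_density_reward(pos_states, neg_states, bandwidth=0.3):
    """Fit separate Gaussian KDEs to positive and negative demonstration
    states and return log-density callables for use in the reward."""
    kde_pos = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(np.asarray(pos_states))
    kde_neg = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(np.asarray(neg_states))

    def log_p_pos(state):
        # score_samples returns log-density per row; evaluate a single state.
        return kde_pos.score_samples(np.asarray(state).reshape(1, -1))[0]

    def log_p_neg(state):
        return kde_neg.score_samples(np.asarray(state).reshape(1, -1))[0]

    return log_p_pos, log_p_neg
```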
The teacher policy was constructed using lookahead control on a reward that combines the density-based term learned from positive and negative demonstrations with rule-based terms for obstacle avoidance and goal seeking.
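A minimal sketch of such a sampling-based lookahead (random-shooting) controller is given below; `step_fn`, the command bounds, horizon, and sample count are illustrative assumptions rather than the exact controller used in the paper.

```python
import numpy as np

def lookahead_teacher_action(state, reward_fn, step_fn, horizon=10,
                             n_samples=256, v_max=0.5, w_max=1.0, rng=None):
    """Random-shooting lookahead: sample action sequences, roll them out with
    a one-step prediction model step_fn(state, action), score each rollout
    with reward_fn, and return the first action of the best sequence."""
    if rng is None:
        rng = np.random.default_rng()
    # Sample candidate sequences of (linear, angular) velocity commands.
    candidates = rng.uniform(low=[-v_max, -w_max], high=[v_max, w_max],
                             size=(n_samples, horizon, 2))

    best_return, best_first_action = -np.inf, np.zeros(2)
    for seq in candidates:
        s, ret = state, 0.0
        for a in seq:                  # roll the sequence out with the model
            s = step_fn(s, a)
            ret += reward_fn(s)
        if ret > best_return:          # keep the best-scoring sequence
            best_return, best_first_action = ret, seq[0]

    # Execute only the first action; replan at the next control step.
    return best_first_action
```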
The teacher policy, which relies on privileged information, was distilled into a student policy, enabling deployment in real-world environments.
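One way to perform this distillation is plain behavior cloning of the teacher's supervisory actions onto the robot's onboard observations; the sketch below trains an ensemble of small students (a common route to uncertainty estimates), with the architecture and hyperparameters assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Compact student network; layer sizes are illustrative."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def distill(students, observations, teacher_actions, epochs=100, lr=1e-3):
    """Behavior-clone teacher actions into each student in the ensemble.

    observations are onboard (non-privileged) inputs paired with the
    teacher's supervisory actions, e.g.
        students = [StudentPolicy(obs_dim, act_dim) for _ in range(5)]
    """
    obs = torch.as_tensor(observations, dtype=torch.float32)
    acts = torch.as_tensor(teacher_actions, dtype=torch.float32)
    for student in students:
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(student(obs), acts)
            loss.backward()
            opt.step()
    return students
```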
Uncertainty analysis reveals that higher epistemic uncertainty consistently corresponds to risky interactions, enabling the policy to distinguish safe from risky situations.
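Under the ensemble assumption above, epistemic uncertainty can be estimated as disagreement among the student policies; the threshold used to flag a risky interaction below is purely illustrative.

```python
import torch

def epistemic_uncertainty(students, obs, threshold=0.05):
    """Estimate epistemic uncertainty as variance across the student ensemble
    and flag the situation as risky when it exceeds an assumed threshold."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        preds = torch.stack([s(obs_t) for s in students])  # (n_students, 1, act_dim)
    mean_action = preds.mean(dim=0).squeeze(0)     # ensemble-averaged command
    uncertainty = preds.var(dim=0).mean().item()   # scalar disagreement score
    risky = uncertainty > threshold                # flag risky interactions
    return mean_action, uncertainty, risky
```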