Semantic-Metric Bayesian Risk Fields: Learning Robot Safety from Human Videos with a VLM Prior

Abstract

Humans interpret safety not as a binary signal but as a continuous, context- and spatially-dependent notion of risk. We extract an implicit human risk model by introducing a semantically conditioned and spatially varying Bayesian risk field supervised directly from safe human demonstration videos and VLM common sense. The prior is furnished by a pretrained vision-language model, and the likelihood is a learned ViT that maps pretrained features (e.g., DINOv3) to pixel-aligned risk values. Our pipeline produces pixel-dense risk images usable as value predictors for robot planning or as 3D fields for trajectory optimization, enabling generalization to novel objects and contexts and scaling to larger datasets. The Bayesian formulation supports adaptation to additional observations or common-sense rules. We show that the resulting risk aligns with human preferences and supports downstream applications, including visuomotor value shaping and classical trajectory optimization, taking a step toward robots with human-like risk reasoning.

Method Overview

Likelihood

Likelihood diagram

We learn a distance-conditioned likelihood from safe human demonstrations. RGB-D tabletop videos are segmented and tracked (YOLOv8 + SAM2 + HoistFormer), inter-object distances are aggregated into per-trajectory histograms, and discrete CDFs supervise a transformer that maps DINOv3 features of the manipulated and reference objects to Bézier control points of the CDF. The model is permutation-invariant across object pair samples and yields pixel-aligned, distance-aware risk tendencies from actions that humans naturally avoid or approach.
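As a rough illustration of the supervision target and the output parameterization described above, the sketch below builds a discrete CDF from inter-object distances and evaluates a CDF parameterized by Bézier control values. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the bin layout, the normalization of distance to [0, 1], and the use of scalar (1D) control values are all hypothetical choices made for the example.

```python
from math import comb

import numpy as np


def discrete_cdf(distances, bin_edges):
    """Histogram distances and return the empirical CDF at each bin's right edge."""
    counts, _ = np.histogram(distances, bins=bin_edges)
    total = counts.sum()
    if total == 0:
        raise ValueError("no distances fall inside the bin range")
    return np.cumsum(counts) / total


def bezier_cdf(control_values, t):
    """Evaluate a 1D Bezier curve in the Bernstein basis at parameters t in [0, 1].

    Assumption: t is distance normalized to [0, 1]. If the control values are
    non-decreasing, start at 0, and end at 1, the curve is a valid CDF in t.
    """
    c = np.asarray(control_values, dtype=float)
    t = np.asarray(t, dtype=float)
    n = len(c) - 1
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)],
        axis=-1,
    )
    return basis @ c


# Toy data (hypothetical): four observed distances in metres, four bins.
edges = np.linspace(0.0, 0.4, 5)
target = discrete_cdf([0.05, 0.15, 0.15, 0.35], edges)

# Hypothetical control values such as a network might predict.
curve = bezier_cdf([0.0, 0.2, 0.9, 1.0], np.linspace(0.0, 1.0, 11))
```

Non-decreasing control values guarantee a monotone curve because the derivative of a Bézier curve is itself a Bézier curve over the differences of consecutive control values.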


Prior

Prior diagram

We build a semantics-driven prior using an LLM-assisted object-pair risk table. An LLM enumerates household objects by room and rates every pair’s proximity risk (1–5) with rationales (e.g., electrocution, fire, sharpness, fragility, water damage). Images per object are embedded with DINOv3 to form an object lookup; pixel features vote for the nearest object IDs, which query a pairwise risk LUT to yield object-aware priors. The prior is interpretable, scalable via LLM data generation, and adapts to task-specific rulesets.
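The lookup step above can be sketched as follows. This is an illustrative sketch only: the cosine-similarity matching, the averaging of ratings over the k nearest objects, the rescaling of the 1–5 ratings to [0, 1], and every name and number in the toy data are assumptions made for the example rather than details confirmed by the text.

```python
import numpy as np


def knn_object_ids(pixel_features, object_embeddings, k=1):
    """Return the k most similar object IDs per pixel (cosine similarity)."""
    p = pixel_features / np.linalg.norm(pixel_features, axis=-1, keepdims=True)
    o = object_embeddings / np.linalg.norm(object_embeddings, axis=-1, keepdims=True)
    similarity = p @ o.T  # shape (H, W, num_objects)
    return np.argsort(-similarity, axis=-1)[..., :k]  # shape (H, W, k)


def prior_risk_map(pixel_features, object_embeddings, risk_lut, manipulated_id, k=1):
    """Query the pairwise risk table for each pixel's voted objects.

    risk_lut[i, j] is the 1-5 proximity risk rating for objects i and j.
    The output is rescaled to [0, 1] (an assumption for this sketch).
    """
    ids = knn_object_ids(pixel_features, object_embeddings, k)
    ratings = risk_lut[manipulated_id][ids]  # shape (H, W, k)
    return (ratings.mean(axis=-1) - 1.0) / 4.0


# Toy data (hypothetical): three objects with 3-D stand-in embeddings.
OBJECTS = ["cup", "laptop", "knife"]
object_embeddings = np.eye(3)
risk_lut = np.array([[1, 5, 3],
                     [5, 1, 2],
                     [3, 2, 1]], dtype=float)

# A 1x2 "image": the first pixel resembles the laptop, the second the cup.
pixel_features = np.array([[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]]])
prior = prior_risk_map(pixel_features, object_embeddings, risk_lut, manipulated_id=0)
```

Keeping the table separate from the embeddings is what makes a task-specific ruleset cheap to apply in this sketch: only `risk_lut` entries change, while the object lookup stays fixed.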

Experiments

We evaluate the pixel-wise risk fields and their posterior combination. Qualitatively, prior and likelihood compose into posterior risk that depends on the manipulated object and discounts far-away hazards. As a value signal, the posterior ranks options (e.g., shelving choices) by predicted risk. Quantitatively, trajectories generated by a classical optimizer around 3D risk buffers are judged more risk-aware than state-of-the-art learned policies in tabletop tasks, while remaining competitive in human-trajectory similarity.
Note that in the videos, the manipulated object is out of frame, so the risk map is computed relative to the chosen manipulated object.
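The text does not spell out the rule used to combine prior and likelihood, so the following is only one plausible per-pixel reading, offered as a sketch. It assumes a binary "risky" versus "safe" hypothesis per pixel, a prior map giving the probability of "risky", and a likelihood map read as the probability of the observed evidence under "risky" (with its complement under "safe"). The function name and the clipping constant are hypothetical.

```python
import numpy as np


def posterior_risk(prior, likelihood, eps=1e-6):
    """Per-pixel binary Bayes update.

    Assumptions: prior = P(risky), likelihood = P(evidence | risky),
    and 1 - likelihood = P(evidence | safe). Returns P(risky | evidence).
    """
    prior = np.clip(np.asarray(prior, dtype=float), eps, 1.0 - eps)
    likelihood = np.clip(np.asarray(likelihood, dtype=float), eps, 1.0 - eps)
    risky = prior * likelihood
    safe = (1.0 - prior) * (1.0 - likelihood)
    return risky / (risky + safe)


# Toy 1x3 maps (hypothetical values).
prior_map = np.array([[0.5, 0.8, 0.9]])
likelihood_map = np.array([[0.8, 0.5, 0.1]])
posterior_map = posterior_risk(prior_map, likelihood_map)
```

Under these assumptions, an uninformative map (value 0.5) leaves the other map unchanged, and a semantically hazardous object is discounted when the distance-aware likelihood is low, which matches the qualitative behavior described above.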

The Turbo colormap is used to denote risk, ranging from blue (safe) to red (highest risk).

Turbo colormap

Downstream Applications

Risk-aware Value Learner

Posterior risk maps act as value functions for visuomotor decision-making: streaming risk over candidate rollouts enables selecting lower-risk behaviors (e.g., choosing the safest shelf for placing a cup among alternatives that contain electronics).
In the videos, the manipulated object is the cup, which is out of frame, so the risk map is relative to the cup. Note that the videos may fall out of sync with each other after scrolling.
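A minimal sketch of how posterior risk maps might be used to rank candidate options is shown below. The aggregation (mean risk over all frames and pixels of a rollout) and the names `rank_by_risk`, `shelf_a`, and `shelf_b` are assumptions for illustration; the text does not state which aggregation is used.

```python
import numpy as np


def rank_by_risk(risk_rollouts):
    """Rank candidates from lowest to highest mean posterior risk.

    risk_rollouts maps a candidate name to an array of shape (T, H, W)
    holding posterior risk frames in [0, 1] for that candidate's rollout.
    """
    scores = {name: float(np.mean(frames)) for name, frames in risk_rollouts.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy rollouts (hypothetical): shelf_b holds electronics, so its risk is higher.
rollouts = {
    "shelf_b": np.full((4, 2, 2), 0.7),
    "shelf_a": np.full((4, 2, 2), 0.2),
}
ranking = rank_by_risk(rollouts)
safest_option = ranking[0][0]
```

A more conservative variant could score each rollout by its peak risk rather than its mean, so that a single high-risk frame dominates the ranking.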

The Turbo colormap is used to denote risk in the Posterior Risk Map, ranging from blue (safe) to red (highest risk).

Turbo colormap

Video panels: RGB, Posterior, Risk Graph


Prior Reasoning

This section demonstrates how the prior component of our Bayesian risk field provides semantic reasoning about object interactions. The prior encodes common-sense knowledge about which objects pose risks when in proximity to each other.

This is useful for downstream applications in which risk is context dependent. Using a pressure cooker is a good example: although it is an electronic device, the risk of electric shock from nearby water is accepted in that environment. Other examples include cooking (water, electronics, heat), welding (heat), and sailing (water, some electronics).



Risk-aware Robot Navigation

We convert posterior viability thresholds to buffer radii around depth point clouds and plan with a classical optimizer that avoids the resulting union of balls. Across 33 experiments (trials 6–33 with all methods), human raters prefer our risk-aware trajectories more often than those of GR00T (VLA) and Diffusion Policy, while our trajectories remain competitive in similarity to tele-operated references, as measured by dynamic time warping (DTW).
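The union-of-balls constraint can be illustrated with a small clearance computation. This sketch takes the buffer radii as given inputs (how they are derived from posterior thresholds is not reproduced here) and uses a squared hinge penalty as one possible cost term for an optimizer. The function names, the `margin` parameter, and the toy numbers are hypothetical.

```python
import numpy as np


def clearance(waypoints, centers, radii):
    """Signed distance from each waypoint to a union of balls.

    waypoints: (T, 3) trajectory points; centers: (N, 3) point-cloud points;
    radii: (N,) buffer radius per point. Negative values mean the waypoint
    lies inside at least one ball.
    """
    diff = waypoints[:, None, :] - centers[None, :, :]  # (T, N, 3)
    dist = np.linalg.norm(diff, axis=-1)  # (T, N)
    return np.min(dist - radii[None, :], axis=1)  # (T,)


def buffer_penalty(waypoints, centers, radii, margin=0.0):
    """Squared hinge penalty on waypoints closer than `margin` to any ball."""
    violation = np.maximum(0.0, margin - clearance(waypoints, centers, radii))
    return float(np.sum(violation**2))


# Toy scene (hypothetical): one hazard point with a 0.1 m buffer.
centers = np.array([[0.0, 0.0, 0.0]])
radii = np.array([0.1])
waypoints = np.array([[0.3, 0.0, 0.0], [0.05, 0.0, 0.0]])

gaps = clearance(waypoints, centers, radii)
cost = buffer_penalty(waypoints, centers, radii)
```

Taking the minimum over balls is what makes this the signed distance to their union: a waypoint is feasible only if it clears every buffered point.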
