We learn a distance-conditioned likelihood from safe human demonstrations. RGB-D tabletop videos are segmented and tracked (YOLOv8 + SAM2 + HoistFormer), inter-object distances are aggregated into per-trajectory histograms, and the resulting discrete CDFs supervise a transformer that maps DINOv3 features of the manipulated and reference objects to the Bézier control points of a continuous CDF. The model is permutation-invariant across object-pair samples and yields pixel-aligned, distance-aware risk tendencies derived from the distances that humans naturally avoid or approach.
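To make the supervision target and the predicted representation concrete, the sketch below shows (i) how per-trajectory inter-object distances could be binned into a normalized histogram and its discrete CDF, and (ii) how a Bézier curve evaluated from control points yields a smooth CDF over distance. Function names, bin counts, and the 2D (distance, probability) control-point parameterization are illustrative assumptions, not the paper's exact implementation; feature extraction and the transformer head are elided.

```python
import numpy as np
from math import comb

def histogram_to_cdf(distances, n_bins=32, d_max=1.0):
    """Aggregate one trajectory's inter-object distances into a normalized
    histogram and its discrete CDF (the supervision target).
    n_bins and d_max are illustrative choices."""
    hist, edges = np.histogram(distances, bins=n_bins, range=(0.0, d_max))
    pmf = hist / max(hist.sum(), 1)              # normalize to a probability mass
    cdf = np.cumsum(pmf)                         # monotone, ends near 1
    bin_centers = 0.5 * (edges[:-1] + edges[1:])
    return bin_centers, cdf

def bezier_cdf(control_points, ts):
    """Evaluate a Bézier curve at parameters ts in [0, 1] from (K+1, 2)
    control points of the form (distance, probability).  Monotonicity is
    assumed to be enforced by the prediction head (e.g., cumulative
    non-negative increments), which is an assumption of this sketch."""
    P = np.asarray(control_points, dtype=float)  # shape (K+1, 2)
    K = len(P) - 1
    # Bernstein basis: B_k(t) = C(K, k) * t^k * (1 - t)^(K - k)
    B = np.stack(
        [comb(K, k) * ts**k * (1 - ts) ** (K - k) for k in range(K + 1)],
        axis=1,
    )
    curve = B @ P                                # (T, 2): distance, CDF value
    return curve[:, 0], curve[:, 1]

# Minimal usage: build a target CDF and evaluate a hypothetical prediction.
dists = np.random.uniform(0.05, 0.6, size=500)          # placeholder distances (m)
centers, target_cdf = histogram_to_cdf(dists, d_max=1.0)
ctrl = np.array([[0.0, 0.0], [0.2, 0.3], [0.5, 0.8], [1.0, 1.0]])  # example control points
d_pred, cdf_pred = bezier_cdf(ctrl, np.linspace(0.0, 1.0, 32))
```

The appeal of this parameterization is that a handful of control points gives a smooth, low-dimensional CDF that can be regressed against the discrete per-trajectory CDF targets, rather than predicting a full histogram directly.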