It's like a transforming-distributions procedure, but f is not constrained to be differentiable, so you can still sample from it. We draw a random sample of a possible next state (weighted by the action taken, i.e. an instantiation s' \sim T(\cdot | s,a)) together with the reward R(s,a) from the current state.
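A minimal sketch of that sampling step for a tabular MDP. The names `T`, `R`, and `step` are hypothetical illustrations, not from any particular library: `T[s][a]` holds the next-state distribution and `R[s][a]` the reward.

```python
import random

# Hypothetical tabular MDP (illustrative values):
# T[s][a] maps each next state s' to its probability T(s' | s, a).
T = {
    "s0": {"a0": {"s0": 0.3, "s1": 0.7}},
}
# R[s][a] is the reward for taking action a in state s.
R = {
    "s0": {"a0": 1.0},
}

def step(s, a, rng=random):
    """Sample s' ~ T(. | s, a) and return (s', R(s, a))."""
    next_states = list(T[s][a].keys())
    probs = list(T[s][a].values())
    # Weighted draw: no differentiability of the dynamics is needed,
    # we only need to be able to sample from the distribution.
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s][a]
```

Calling `step("s0", "a0")` repeatedly yields `"s1"` about 70% of the time, matching T.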
