Start with an initial sample \tau. At each iteration, sample \tau' \sim g\left(\cdot \mid \tau\right) (for instance, \tau' \sim \mathcal{N}\left(\cdot \mid \tau, \sigma^{2}\right)), then accept the sample with probability \min\left(1, \frac{\bar{p}\left(\tau'\right) g\left(\tau \mid \tau'\right)}{\bar{p}\left(\tau\right) g\left(\tau' \mid \tau\right)}\right); otherwise keep \tau. This accept/reject rule is also called the Metropolis-Hastings criterion.

intuition

The kernel is often chosen to be symmetric, so:

\begin{equation} g\left(\tau \mid \tau'\right) = g\left(\tau' \mid \tau\right) \end{equation}

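The accept/reject loop above can be sketched in Python. With a symmetric Gaussian kernel the proposal terms cancel, so the acceptance ratio reduces to \bar{p}\left(\tau'\right) / \bar{p}\left(\tau\right). This is a minimal sketch under assumed names (`metropolis_hastings`, `log_p_bar`, and all parameter values are illustrative, not from the source); the burn-in discard described later is included:

```python
import numpy as np

def metropolis_hastings(log_p_bar, tau0, sigma=1.0, n_samples=10_000, burn_in=1_000):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    log_p_bar: log of the (unnormalized) target density p-bar.
    Because the Gaussian kernel is symmetric, g(tau | tau') = g(tau' | tau),
    so the acceptance ratio is just p_bar(tau') / p_bar(tau).
    """
    rng = np.random.default_rng(0)
    tau = tau0
    samples = []
    for i in range(n_samples + burn_in):
        tau_prime = tau + sigma * rng.standard_normal()   # tau' ~ N(. | tau, sigma^2)
        # Accept with probability min(1, p_bar(tau') / p_bar(tau)), in log space.
        if np.log(rng.uniform()) < log_p_bar(tau_prime) - log_p_bar(tau):
            tau = tau_prime
        if i >= burn_in:                                  # discard burn-in samples
            samples.append(tau)
    return np.array(samples)

# Example: sample a standard normal from its unnormalized log-density.
samples = metropolis_hastings(lambda t: -0.5 * t**2, tau0=0.0)
```

Working in log space avoids numerical underflow when \bar{p} is tiny, which is the common case for rare failure events.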
we want to sample from areas of high density more often. With a symmetric kernel, the acceptance ratio reduces to \frac{\bar{p}\left(\tau'\right)}{\bar{p}\left(\tau\right)}. Consider the two cases:

- \bar{p}\left(\tau'\right) \geq \bar{p}\left(\tau\right): \tau' is at least as likely under the target density as \tau, so \frac{\bar{p}\left(\tau'\right)}{\bar{p}\left(\tau\right)} \geq 1 and we always accept.
- \bar{p}\left(\tau'\right) < \bar{p}\left(\tau\right): \frac{\bar{p}\left(\tau'\right)}{\bar{p}\left(\tau\right)} < 1, so we still accept some of the time, but mostly we do not; the larger the drop in density, the less likely we are to accept the move.

Key: these guarantees only hold in the limit of infinite samples.

problems with MCMC

The problem is that, done naively, we may never reach a failure at all (think of multi-modal distributions).

burn in

A useful hack is to start sampling and throw away the samples at the beginning of the chain, before it has mixed.

failure smoothing

When the failure distribution is multi-modal, it's really hard to perturb your way from one mode to another. So, let's define \Delta\left(\tau\right) = 0 if \tau \not\in \psi (i.e., \tau is a failure), and \Delta\left(\tau\right) > 0 otherwise. For instance, a function that behaves like this is \max\left(\rho\left(\tau, \psi\right), 0\right). Now, we can write:

\begin{align} \bar{p}\left(\tau \mid \tau \not\in \psi\right) &= \mathbb{1}\left\{\tau \not\in \psi\right\} p\left(\tau\right) \\ &= \mathbb{1}\left\{\Delta\left(\tau\right) = 0\right\} p\left(\tau\right) \end{align}

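The margin \Delta can be made concrete with a tiny helper, assuming \rho\left(\tau, \psi\right) is a robustness value that is nonpositive exactly on failures (both `delta` and the example `rho` below are hypothetical names for illustration):

```python
def delta(tau, rho):
    """Smoothing margin: 0 exactly on failures (rho <= 0), positive otherwise.

    rho is an assumed robustness function: rho(tau) <= 0 iff tau violates psi.
    """
    return max(rho(tau), 0.0)

# Hypothetical robustness: failure whenever tau >= 2, so rho(tau) = 2 - tau.
rho = lambda tau: 2.0 - tau
assert delta(3.0, rho) == 0.0   # failure: margin is zero
assert delta(0.5, rho) == 1.5   # non-failure: positive distance to failure
```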
which we can now approximate using a smoothing parameter \epsilon:

\begin{align} \bar{p}\left(\tau \mid \tau \not\in \psi\right) &= \mathbb{1}\left\{\tau \not\in \psi\right\} p\left(\tau\right) \\ &= \mathbb{1}\left\{\Delta\left(\tau\right) = 0\right\} p\left(\tau\right) \\ &\approx \mathcal{N}\left(\Delta\left(\tau\right) \mid 0, \epsilon^{2}\right) p\left(\tau\right) \end{align}

If we got a non-failure from the smoothed distribution, just throw it away.

a good kernel

Using the gradient of the likelihood to define the kernel (as in HMC, NUTS, etc.) is sometimes a good way to increase the likelihood of the discovered failures. More generally: use probabilistic programming, which allows us to treat probabilistic models as computer programs.
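The smoothed target above can be written as an unnormalized log-density that any Metropolis-Hastings sampler can plug in directly. This is a sketch under assumed names (`log_smoothed_failure_density`, the example `log_p` and `delta` are illustrative):

```python
import numpy as np

def log_smoothed_failure_density(tau, log_p, delta, eps=0.1):
    """Unnormalized log-density: log N(delta(tau) | 0, eps^2) + log p(tau).

    The Gaussian replaces the hard indicator 1{delta(tau) = 0}: failures
    (delta = 0) keep their prior density, near-failures are only mildly
    damped, so the chain can travel between separated failure modes.
    """
    d = delta(tau)
    log_gauss = -0.5 * (d / eps) ** 2 - np.log(eps * np.sqrt(2.0 * np.pi))
    return log_gauss + log_p(tau)

# Hypothetical setup: standard-normal prior over tau, failure iff tau >= 2.
log_p = lambda t: -0.5 * t ** 2
delta = lambda t: max(2.0 - t, 0.0)

# tau = 2.5 and tau = -2.5 are equally likely under the prior, but only
# tau = 2.5 is a failure, so the smoothed density strongly prefers it.
assert log_smoothed_failure_density(2.5, log_p, delta) > \
       log_smoothed_failure_density(-2.5, log_p, delta)
```

Shrinking \epsilon recovers the hard indicator; a common trick is to anneal \epsilon downward as the chain mixes.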
