Public View
Suggest
Download this page (.md) Download entire wiki (.zip)
Clone entire wiki

Optimal Exploration Policy

Suppose we have offline statistic regarding wins and losses of each slot machine as our state:

\begin{equation} w_1, l_{1}, \dots, w_{n}, l_{n} \end{equation}

What if we want to create a policy that maximises exploration?
We construct a value function:

\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}

our policy is the greedy policy:

\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \arg\max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}

Now, how do we go about calculating the action-value:

\begin{align} Q ([w_1, l_{1}, \dots, w_{n}, l_{n}], a) =\ & \frac{w_{a}+1}{w_{a}+l_{a}+2} (R(w) + U^{*}(\dots, w_{a}+1, l_{a}, \dots)) \&+ \left(1-\frac{w_{a}+1}{w_{a}+l_{a}+2}\right)(R(l) + U^{*}(\dots, w_{a}, l_{a}+1, \dots)) \end{align}

“the probability of you win” (expectation of Beta Distribution), times the instantaneous reward you win + the utility you gain in terms of information of you doing that.
To solve this in a finite horizon, note that at time t=k, your U^{*} is 0 because you have nothing to do anymore.
Then, you can back up slowly to get each previous state:
calculate Q[w_1-1, l_1, …, 1] calculate Q[w_1, l_1-1, …,1] and so on, and choosing the utility of each state from there.

[[curator]]
I'm the Curator. I can help you navigate, organize, and curate this wiki. What would you like to do?