Off Policy Eligibility Traces



Original paper by Doina Precup et al.: https://webdocs.cs.ualberta.ca/~sutton/papers/PSS-00.pdf


Mu denotes the behaviour policy and Pi denotes the target policy.

Note that the more dissimilar the target policy is from the behaviour policy, the longer it takes to learn an accurate value estimate. Convergence is fastest when Mu = 0.5 = Pi, that is, when the behaviour policy and the target policy coincide.
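
One way to see why a mismatch between Mu and Pi slows learning is to look at the importance-sampling ratio Pi(a)/Mu(a) that off-policy corrections of this kind rely on. The sketch below is illustrative only and is not taken from the paper's experiments. It assumes a two-action problem in which Mu and Pi are each given as the probability of one action (say, stepping right), and the helper name ratio_variance is made up for this example. It computes the variance of the ratio when actions are drawn from Mu.

```python
def ratio_variance(pi_right, mu_right):
    """Variance of rho = Pi(a)/Mu(a) with a drawn from Mu (two actions).

    Assumes 0 < mu_right < 1 so that both actions have non-zero
    probability under the behaviour policy.
    """
    pi = [pi_right, 1.0 - pi_right]
    mu = [mu_right, 1.0 - mu_right]
    # E_mu[rho^2] = sum_a mu(a) * (pi(a)/mu(a))^2 = sum_a pi(a)^2 / mu(a)
    second_moment = sum(p * p / m for p, m in zip(pi, mu))
    # E_mu[rho] is always 1, so the variance is the second moment minus 1.
    return second_moment - 1.0


# Target policy fixed at 0.5; behaviour policy moves away from it.
for mu_right in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(mu_right, round(ratio_variance(0.5, mu_right), 4))
```

The variance is zero when Mu = Pi = 0.5 and grows as Mu moves away from Pi, which is consistent with slower convergence for more dissimilar policies.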

RMS is the root-mean-square error between our value estimate and the true value of the random walk under the equirandom policy.
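
As a rough illustration of how this error could be computed, here is a minimal sketch. The random walk used in these notes is not specified here, so the setup is an assumption: a textbook-style walk with 5 non-terminal states, reward +1 on the right terminal and 0 elsewhere, for which the true value of state k under the equirandom policy is k/6. The function name rms_error is hypothetical.

```python
import math


def rms_error(estimates, true_values):
    """Root-mean-square error between two equal-length value vectors."""
    if len(estimates) != len(true_values):
        raise ValueError("vectors must have the same length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))


# Assumed setup: 5 non-terminal states, true value of state k is k / 6.
n_states = 5
true_values = [k / (n_states + 1) for k in range(1, n_states + 1)]

# Error of a flat initial estimate of 0.5 for every state.
initial_estimate = [0.5] * n_states
print(rms_error(initial_estimate, true_values))
```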

Off-policy eligibility traces play a crucial role and can be used effectively in intra-option learning to learn about policies other than the one being followed. When used in control, the target policy is the greedy policy with respect to the current value function.
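
The control case can be sketched as follows. This is a hedged illustration and not the paper's algorithm: the helper names greedy_policy and importance_ratio and the example numbers are assumptions made for this example. It builds a greedy target policy from the current action values of a single state and computes the ratio Pi(a)/Mu(a) for each action the behaviour policy might take.

```python
def greedy_policy(q_values):
    """Target-policy probabilities that are greedy w.r.t. q_values.

    Ties between maximal actions are split evenly.
    """
    best = max(q_values)
    winners = [i for i, q in enumerate(q_values) if q == best]
    return [1.0 / len(winners) if i in winners else 0.0
            for i in range(len(q_values))]


def importance_ratio(pi_probs, mu_probs, action):
    """Ratio Pi(a)/Mu(a) for the action actually taken by the behaviour policy."""
    return pi_probs[action] / mu_probs[action]


# Hypothetical action values and behaviour policy for one state.
q = [0.2, 0.7, 0.1]
mu = [0.25, 0.5, 0.25]

pi = greedy_policy(q)
ratios = [importance_ratio(pi, mu, a) for a in range(len(q))]
print(pi, ratios)
```

With a greedy target policy the ratio is zero for every non-greedy action, so any trace that is scaled by this ratio is cut as soon as the behaviour policy explores.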