Sridhar Thiagarajan: Off Policy Eligibility Traces

Original paper by Doina Precup et al.: https://webdocs.cs.ualberta.ca/~sutton/papers/PSS-00.pdf

MATLAB code for comparing various behaviour policies on the random walk:

Mu denotes the behaviour policy and Pi denotes the target policy.

Note that the more dissimilar the target policy is from the behaviour policy, the longer it takes to learn an accurate value estimate. Clearly, convergence is fastest when Mu = Pi = 0.5, that is, when the behaviour policy matches the target policy.

RMS is the root-mean-square error between our value estimate and the true values of the random walk under the equirandom policy.
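
To make the comparison concrete, here is a minimal Python sketch of the kind of experiment described above. It is not the MATLAB code referred to on this page. It assumes a 5-state random walk with reward 1 only on exiting to the right, no discounting, and the importance-weighted accumulating trace z = rho * (gamma * lambda * z + x) from the Sutton and Barto textbook, which is a close relative of, but not identical to, the algorithm in the paper. The names `off_policy_td_lambda`, `rms_error`, `mu_right` and `pi_right`, and the step size and lambda values, are illustrative choices.

```python
import random

# Assumed setup (not specified in the text): 5 non-terminal states in a row,
# episodes start in the middle, reward 1 only on exiting right, gamma = 1.
N_STATES = 5
# True values under the equirandom policy for this assumed walk: 1/6, ..., 5/6.
TRUE_V = [(i + 1) / (N_STATES + 1) for i in range(N_STATES)]


def rms_error(v):
    """Root-mean-square difference between an estimate and TRUE_V."""
    return (sum((a - b) ** 2 for a, b in zip(v, TRUE_V)) / N_STATES) ** 0.5


def off_policy_td_lambda(mu_right, pi_right=0.5, episodes=5000,
                         alpha=0.02, lam=0.5, gamma=1.0, seed=0):
    """Estimate the target policy's values from episodes generated by mu.

    mu_right / pi_right are the probabilities of stepping right under the
    behaviour and target policies (hypothetical parameter names).
    """
    rng = random.Random(seed)
    v = [0.5] * N_STATES
    for _ in range(episodes):
        z = [0.0] * N_STATES          # eligibility traces, reset every episode
        s = N_STATES // 2
        while 0 <= s < N_STATES:
            go_right = rng.random() < mu_right
            # Importance-sampling ratio pi(a|s) / mu(a|s) for the action taken.
            if go_right:
                rho = pi_right / mu_right
            else:
                rho = (1.0 - pi_right) / (1.0 - mu_right)
            s_next = s + 1 if go_right else s - 1
            reward = 1.0 if s_next == N_STATES else 0.0
            v_next = v[s_next] if 0 <= s_next < N_STATES else 0.0
            delta = reward + gamma * v_next - v[s]
            # Decay the traces, mark the current state, reweight by rho.
            z = [rho * (gamma * lam * z[i] + (1.0 if i == s else 0.0))
                 for i in range(N_STATES)]
            for i in range(N_STATES):
                v[i] += alpha * delta * z[i]
            s = s_next
    return v


if __name__ == "__main__":
    for mu_right in (0.5, 0.6, 0.7):
        print(mu_right, rms_error(off_policy_td_lambda(mu_right)))
```

With `mu_right=0.5` every ratio equals 1 and the update reduces to ordinary on-policy TD(lambda). Moving Mu away from 0.5 makes the ratios more variable, which is one way to see why learning slows down as the two policies diverge.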

Off-policy eligibility traces play a crucial role and can be used effectively in intra-option learning to learn about policies other than the one being followed. When used in control, the target policy is the greedy policy with respect to the current value function.
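
As a small illustration of the control case, the sketch below computes the importance-sampling ratio when the target policy is greedy. It assumes action values stored in a plain list and breaks ties in favour of the lowest-indexed action. The function name `greedy_ratio` and its arguments are hypothetical, not taken from the paper.

```python
def greedy_ratio(q_values, action, mu_probs):
    """Ratio pi(a|s) / mu(a|s) when pi is greedy with respect to q_values.

    q_values: estimated action values for the current state.
    action:   index of the action the behaviour policy actually took.
    mu_probs: behaviour-policy probabilities for each action in this state.
    """
    # Ties are broken in favour of the lowest index (an arbitrary choice).
    greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
    pi_prob = 1.0 if action == greedy_action else 0.0
    # mu_probs[action] > 0 because the behaviour policy chose this action.
    return pi_prob / mu_probs[action]


print(greedy_ratio([0.2, 0.7], 1, [0.5, 0.5]))  # greedy action taken -> 2.0
print(greedy_ratio([0.2, 0.7], 0, [0.5, 0.5]))  # exploratory action  -> 0.0
```

If the trace is scaled by this ratio at every step, an exploratory action gives a ratio of 0 and wipes the trace, so states visited before that action receive no further credit from the rest of the trajectory.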
