Intra-Option Q-Learning

Hey guys,

Today we will compare intra-option Q-learning to the case of only primitive actions, on a 21x21 grid world with bottleneck states.

Options help direct exploration towards states that we believe will help us reach the goal. At the same time, in continuous domains, having localised options helps because each function approximator can concentrate on its option's initiation set, leading to better value function approximation.

*Matlab Code*

Here, we will assume the options are given, and the policies for these options will be learned online.
The learning rate and exploration parameters were both set to 0.1. The no-options baseline used one-step Q-learning. The domain was deterministic, with 4 actions: up, down, right and left.
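To make the baseline concrete, here is a minimal sketch of the one-step Q-learning setup, not the original Matlab code. The empty grid, corner goal, reward of 1 at the goal and discount factor are all my own placeholder assumptions; the bottleneck walls of the actual domain are omitted.

```matlab
% Sketch of the primitive-action baseline: one-step Q-learning on a
% plain 21x21 grid (walls omitted). alpha and epsilon match the values
% quoted above; the goal location, reward and gamma are assumptions.

alpha = 0.1; epsilon = 0.1; gamma = 0.9;
N = 21;                                  % grid is N x N
nS = N * N; nA = 4;                      % actions: up, down, right, left
goal = sub2ind([N N], N, N);             % assumed goal in a corner
Q = zeros(nS, nA);

for ep = 1:150
    s = randi(nS);                       % random start state
    while s ~= goal
        % epsilon-greedy action selection
        if rand < epsilon
            a = randi(nA);
        else
            [~, a] = max(Q(s, :));
        end

        % deterministic dynamics: moves off the grid leave the state unchanged
        [row, col] = ind2sub([N N], s);
        switch a
            case 1, row = max(row - 1, 1);   % up
            case 2, row = min(row + 1, N);   % down
            case 3, col = min(col + 1, N);   % right
            case 4, col = max(col - 1, 1);   % left
        end
        sNext = sub2ind([N N], row, col);
        r = double(sNext == goal);           % reward 1 at the goal (assumed)

        % one-step Q-learning backup
        Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));
        s = sNext;
    end
end
```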

The agent started at a random state in the left portion of the left hall. All plots are averaged over 15 runs for smoothness.


21x21 grid world. Option targets are the bottleneck states.



The options can also be discovered automatically by diverse density estimation, graph cuts, or a few other methods. For DDE, head over to: Diverse Density Estimation for bottleneck detection

Note that intra-option Q-learning is different from plain Q-learning in that it makes greater use of experience: each primitive transition is used to update the value of every option whose policy is consistent with the action taken, not only the option currently executing.

However, this can only be done if the options are Markov!
If you have an option that is semi-Markov, say "go 2 steps forward and then 3 left" (its behaviour depends on its internal step count, not just the current state), we cannot use this method and have to resort to SMDP Q-learning.
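Here is a sketch of the intra-option update for Markov options, again my own rendering rather than the post's code. The table layout of Q and the helpers optionPolicy and termProb are placeholder assumptions.

```matlab
% Intra-option Q-learning update (sketch). Q is an nStates x nOptions
% table; optionPolicy(o, s) returns the primitive action option o takes
% in state s; termProb(o, s) returns its termination probability there.
% (s, a, r, sNext) is the primitive transition just observed.

function Q = intraOptionUpdate(Q, s, a, r, sNext, optionPolicy, termProb, alpha, gamma)
    for o = 1:size(Q, 2)
        % Update every option whose policy would have chosen the action
        % actually taken in s -- this is the "greater use of experience".
        if optionPolicy(o, s) == a
            % Value of sNext under option o: continue following o with
            % probability 1 - termProb, otherwise terminate and act greedily.
            U = (1 - termProb(o, sNext)) * Q(sNext, o) + ...
                termProb(o, sNext) * max(Q(sNext, :));
            Q(s, o) = Q(s, o) + alpha * (r + gamma * U - Q(s, o));
        end
    end
end
```

Primitive actions fit the same table as one-step options that always terminate (termination probability 1), so a single value table can cover both.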



First 150 episodes. Note the faster convergence to a better policy for intra-option Q-learning and the smaller number of steps taken during the initial runs. This plot also serves to show that both converge to the same optimal policy after sufficient training episodes.


Feel free to contact me if you have any questions regarding this post. Also check out my post on autonomous option discovery using diverse density estimation - here