Q-learning and SARSA: Cliff Task

 Hey guys!
Today, we'll be comparing the SARSA and Q-learning algorithms on the cliff walking task.
The agent starts at the Start state. All transitions (up, down, right and left) are deterministic. If it hits the boundaries, it stays where it is. Falling into the cliff gives a reward of -100 and sends the agent back to the Start state. All other transitions give a reward of -1.
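To make the setup concrete, here is a minimal sketch of the environment in Python. The 4x12 grid with the Start at the bottom-left and the Goal at the bottom-right is the standard Sutton and Barto layout; the post doesn't spell these details out, so treat the exact coordinates as assumptions:

```python
# Minimal sketch of the cliff-walking grid, assuming the standard
# 4x12 layout (the post doesn't give the grid size).
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # up, down, right, left

def step(state, action):
    """Deterministic move; hitting a boundary leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:       # the cliff cells along the bottom row
        return START, -100            # fall: -100 and back to Start
    return (r, c), -1                 # every other transition costs -1
```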


SARSA is what is known as an on-policy algorithm. Q-learning, on the other hand, is an off-policy algorithm.
What do on-policy and off-policy mean?
An on-policy algorithm updates its value estimates based on the actual actions it takes!
SARSA Update:
Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha [r_{t+1}+\gamma Q(s_{t+1},a_{t+1})-Q(s_{t},a_{t})]
Clearly, the Q values are updated based on the actions the agent actually takes.
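As a sketch, the SARSA update in code (the Q-table convention here, a dict mapping each state to a list of four action values, is illustrative, not from the original post):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: the target uses the action actually taken next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```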
An off-policy algorithm like Q-learning, on the other hand, updates its Q values based on another control policy, in this case the greedy policy, so its estimates track the optimal value function Q*.
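For comparison, the Q-learning update replaces the sampled next action with a max:
Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha [r_{t+1}+\gamma \max_{a'} Q(s_{t+1},a')-Q(s_{t},a_{t})]
And a matching sketch in code, using the same illustrative Q-table convention as above:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD update: the target uses the greedy (max) next value,
    regardless of the action the behaviour policy actually takes."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```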
Experiments were conducted with the same values of alpha (learning rate) and epsilon (exploration parameter) in both cases.
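Both agents choose actions epsilon-greedily. The actual values of alpha and epsilon used in the runs aren't listed here, so any defaults below are placeholders; a minimal action selector shared by both agents might look like this:

```python
import random

def epsilon_greedy(Q, s, epsilon, n_actions=4):
    """With probability epsilon pick a random action, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])
```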

Can you guess the policies found by the 2 algorithms?

Q-learning has learnt the shortest path, whereas SARSA has learnt the safer route. Why is this?
This is due to the exploration parameter epsilon. SARSA realises that when it walks along the edge of the cliff, exploratory actions occasionally make it fall in, even when it doesn't choose to go down. This makes the Q values for the states along the edge largely negative, so it learns to keep its distance.
Q-learning, on the other hand, learns the shorter route despite epsilon. This is because even when it falls, the Q values for these states are updated based on the optimal next value, i.e. the max over a' of Q(s',a'), which ignores the cost of exploratory slips!
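Putting the pieces together, here is a sketch of a training episode that works for either algorithm; the hyperparameter values are placeholders, not the ones used in these runs. The only difference between the two agents is which update rule gets called:

```python
from collections import defaultdict

def run_episode(Q, alpha=0.5, gamma=1.0, epsilon=0.1, use_sarsa=True):
    """One training episode on the cliff grid; returns the total reward."""
    s = START
    a = epsilon_greedy(Q, s, epsilon)
    total = 0
    while s != GOAL:
        s_next, r = step(s, a)
        a_next = epsilon_greedy(Q, s_next, epsilon)
        if use_sarsa:
            sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)
        else:
            q_learning_update(Q, s, a, r, s_next, alpha, gamma)
        s, a = s_next, a_next
        total += r
    return total

Q = defaultdict(lambda: [0.0] * 4)  # terminal values stay at zero
episode_rewards = [run_episode(Q) for _ in range(500)]
```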
Let us compare the average rewards obtained.



SARSA obtains better reward overall, as it takes the safer route and doesn't fall into the cliff very often. Q-learning, on the other hand, although it learns the best route, accumulates more negative reward during training because exploratory steps along the cliff edge are costly. Note that the plots are smoothed!
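The smoothing can be reproduced with a simple running mean over the per-episode rewards, for example (the window size here is arbitrary):

```python
import numpy as np

def smooth(rewards, window=10):
    """Running mean over a fixed window; used only to de-noise the curves."""
    return np.convolve(rewards, np.ones(window) / window, mode="valid")
```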
A few more questions to think about.
What would happen if the exploration parameter were brought to zero after the run?
How will the task change if all transitions except those into the cliff are given zero reward instead of -1, and the goal state a reward of +1? Do we still need a discount factor?
Cheers.