Dyna-Q Planning on a Maze Task

Hey guys,
Today we will be learning about incorporating planning into the learning task. We will take the simple example of a maze task to see how learning and planning, when integrated, can converge faster than either one used alone.
Learning is basically an agent improving from the experience given to it. In this case we only have a sample model! That is, we can use the model to generate trajectories, but we do not have the transition probabilities explicitly!
Planning methods involve using a full model of the environment to compute an optimal policy for the task.
Combining these two, we get a learning-planning agent: we end up alternating between learning steps and planning steps.
Here is the algorithm (picture taken from the book Reinforcement Learning: An Introduction by Sutton and Barto):
The gist is that we learn an estimate of the model as we experience an episode. Once the episode ends, we use this model, for whatever states are available in it, to improve our Q-value estimates! Assume for simplicity that the environment is deterministic, i.e. the agent goes exactly where it chooses to (loosely speaking).
Now, once we experience a transition, we save in our memory that taking action A in state S results in state S' (along with the reward received).
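Here is a minimal sketch of the idea, assuming a small deterministic grid maze; the layout, the function names (step, dyna_q_maze), and the hyperparameters are illustrative choices, not the exact code provided with this post. Note that classic Dyna-Q interleaves planning after every real step, whereas this sketch follows the episode-end variant described here.

```python
import random
from collections import defaultdict

ROWS, COLS = 6, 9
START, GOAL = (2, 0), (0, 8)
WALLS = {(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    # Deterministic transition: the agent goes exactly where it chooses,
    # unless it would leave the grid or walk into a wall.
    r = max(0, min(ROWS - 1, state[0] + action[0]))
    c = max(0, min(COLS - 1, state[1] + action[1]))
    nxt = state if (r, c) in WALLS else (r, c)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def dyna_q_maze(n_planning, episodes=50, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)   # Q[(state, action)] -> value estimate
    model = {}               # model[(state, action)] = (next_state, reward)
    steps_per_episode = []
    for _ in range(episodes):
        s, steps = START, 0
        while s != GOAL:
            # epsilon-greedy action selection, breaking ties at random
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                qvals = [Q[(s, act)] for act in ACTIONS]
                best = max(qvals)
                a = random.choice([act for act, q in zip(ACTIONS, qvals) if q == best])
            s2, r = step(s, a)
            # direct RL: one-step Q-learning update from the real transition
            best_next = max(Q[(s2, act)] for act in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # model learning: remember that taking A in S led to S' and R
            model[(s, a)] = (s2, r)
            s, steps = s2, steps + 1
        # planning: after the episode, replay N remembered (S, A) pairs
        # using the learned model instead of fresh real experience
        for _ in range(n_planning):
            (ps, pa), (ps2, pr) = random.choice(list(model.items()))
            best_next = max(Q[(ps2, act)] for act in ACTIONS)
            Q[(ps, pa)] += alpha * (pr + gamma * best_next - Q[(ps, pa)])
        steps_per_episode.append(steps)
    return steps_per_episode
```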
So each run of an episode is followed by N planning steps. The value of N has been varied in the code provided to see its effect. Here is the figure our code generates:
[Figure: comparison of different values of N]
Clearly, as the number of planning steps is increased, the algorithm converges in fewer iterations.
However, note that as we increase the number of planning steps, each iteration takes longer to run! 
So shouldn't we be comparing the time taken before convergence to get a more meaningful estimate?
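One rough way to check this, reusing the hypothetical dyna_q_maze sketch above, is to record the wall-clock time alongside the episode lengths for a few values of N:

```python
import time

# Hypothetical timing experiment: compare the final episode length and the
# wall-clock cost of training for a few values of N.
for n_planning in (0, 5, 50):
    t0 = time.perf_counter()
    steps = dyna_q_maze(n_planning)
    elapsed = time.perf_counter() - t0
    print(f"N={n_planning:>2}  steps in last episode={steps[-1]:>4}  "
          f"wall-clock time={elapsed:.2f}s")
```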
However, note that in the real world it may be expensive for an agent to experience episodes: each episode consumes time and may result in some sort of damage to the agent. Hence it makes sense that, after our robot has gone through the task once, we use planning steps to improve our Q-value estimates.
Some questions to ponder:
Would eligibility traces perform better for some values of gamma and lambda?
How would one proceed in the case of a stochastic environment?


Cheers.