Double Q-Learning

The second post in my series on RL, which I created in graduate school at Georgia Tech, covers the Double Q-Learning algorithm. I will use the algorithm to “solve” the OpenAI CartPole environment.

If you missed the previous post, here is the first one in the series.

Please go to my GitHub repo, grab the 03-DoubleQLearning Jupyter Notebook, and follow along. It will make this much easier to follow and will fill in any pieces I leave out of this write-up. Also, my current hosting tier doesn’t allow the plugins I would need to embed the full code here, so the notebook is where the complete implementation lives.

Double Q-Learning was introduced by Hado van Hasselt (who actually replied to my email while I was building this project) in 2010. He noticed that the ‘max’ in the standard Q-learning target overestimates action values, because the same noisy estimates are used both to pick the best next action and to evaluate it, and that this can make results vary widely. His solution was to keep a second Q table and randomly alternate between the two tables on each update, using the “other” table to supply the value in the update equation.
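To make that concrete, here is a minimal sketch of the tabular double Q-learning step in Python. The table names, the coin flip, and the default alpha/gamma values are my own illustration, not the exact code from the notebook.

```python
import numpy as np

def double_q_update(q1, q2, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99):
    """One double Q-learning step: flip a coin to pick which table to update,
    select the greedy next action with that table, but evaluate it with the other."""
    if np.random.rand() < 0.5:
        best = np.argmax(q1[next_state])                 # action chosen by Q1
        target = reward + gamma * q2[next_state, best]   # value taken from Q2
        q1[state, action] += alpha * (target - q1[state, action])
    else:
        best = np.argmax(q2[next_state])                 # action chosen by Q2
        target = reward + gamma * q1[next_state, best]   # value taken from Q1
        q2[state, action] += alpha * (target - q2[state, action])
```

When acting, the paper uses the two tables together (for example, their sum or average) for epsilon-greedy action selection.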

In the notebook, you can see the updated equations and try your hand at coding them up yourself. After getting a correct solution, you can continue on to the fully coded algorithm and see if you can beat my best result.

One thing to point out: the yellow line in this notebook is much smoother than in the original Q-learner. That reduction in variance is exactly what the second Q table fixes.

Please download the notebook and give it a try. I even challenge you at the end to beat my solution in fewer iterations.

Open in Google Colab: 03-DoubleQLearning.ipynb

References
Hasselt, H. V. (2010). Double Q-learning. Advances in Neural Information Processing Systems 23, 2613–2621. Retrieved from http://papers.nips.cc/paper/3964-double-q-learning.pdf

Q-Learning with CartPole

The first post in my series on RL, which I created in graduate school at Georgia Tech, covers the Q-Learning algorithm. I will use the algorithm to “solve” two different OpenAI Gym environments: an altered FrozenLake and CartPole.

Just note that I am skipping over the first notebook, as it is only an introduction to MDPs and policy iteration/value iteration (PI/VI).

Please go to my GitHub repo, grab the 02-QLearning Jupyter Notebook, and follow along. It will make this much easier to follow and will fill in any pieces I leave out of this write-up. Also, my current hosting tier doesn’t allow the plugins I would need to embed the full code here, so the notebook is where the complete implementation lives.

Quick Introduction: Q-learning is a model-free RL technique. It was “discovered” in 1989 by Chris Watkins [web page], building on Sutton and Barto’s earlier work on reinforcement learning. During his research he came up with an algorithm that learns action values directly from experience, without needing a model of the environment’s dynamics the way the PI/VI methods for MDPs do.

The next few segments of the notebook explain the hyperparameters and the overall methodology, and then walk through some pen-and-paper examples.
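For reference, the kinds of hyperparameters discussed there typically look like the following; the values are illustrative placeholders, not the ones used in the notebook.

```python
# Illustrative Q-learning hyperparameters (placeholder values, not the notebook's).
alpha = 0.1            # learning rate: how far each update moves Q toward the target
gamma = 0.99           # discount factor: how much future reward is worth today
epsilon = 1.0          # initial exploration rate for epsilon-greedy action selection
epsilon_min = 0.01     # floor so the agent never stops exploring entirely
epsilon_decay = 0.995  # multiplicative decay applied after each episode
episodes = 10_000      # number of training episodes
```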

Next, I cover solving the FrozenLake example by creating a custom environment that removes the slippage. I do this so readers can see the optimal solution emerge quickly, without running many extra iterations just because the slippery dynamics send the agent somewhere other than where its action pointed.
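The notebook builds its own deterministic version of the map; as a rough equivalent for anyone experimenting outside the notebook, newer Gym releases expose a switch for the slippage (the environment id and keyword below depend on your installed Gym version):

```python
import gym

# A deterministic FrozenLake: the agent moves exactly where it is told,
# which makes the learned optimal policy easy to verify by hand.
env = gym.make("FrozenLake-v1", is_slippery=False)
```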

This is a pretty straightforward example for getting a grasp on the update rule as well as on how the Gym environments work.
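For readers who want the update rule in front of them while following along, here is a bare-bones tabular Q-learning loop for a discrete Gym environment. It uses the older Gym API (reset returns just the state, step returns four values), and it only mirrors the structure of what the notebook builds; it is my sketch, not the notebook’s code.

```python
import numpy as np

def train_q_table(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(q[state])
            next_state, reward, done, _ = env.step(action)
            # Standard Q-learning update using the max over next-state actions.
            q[state, action] += alpha * (
                reward + gamma * np.max(q[next_state]) - q[state, action]
            )
            state = next_state
    return q
```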

Continuous Environments: This section is where I introduce an environment whose state space is continuous, so a Q-table can’t hold every possible state. This requires us to “discretize” the observations. I walk through the range of values for each of the observed variables, then bucket those values into bins, trimming the possible state space down to something manageable.
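One common way to do that kind of binning, in the spirit of what the notebook walks through, looks like the sketch below; the bounds and bin counts are illustrative placeholders, not the notebook’s actual choices.

```python
import numpy as np

# Example bin edges for CartPole's four observations:
# cart position, cart velocity, pole angle, pole angular velocity.
# The ranges and bin counts below are illustrative placeholders.
BINS = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-3.0, 3.0, 9),    # cart velocity (clipped range)
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 9),    # pole angular velocity (clipped range)
]

def discretize(observation):
    """Map a continuous CartPole observation to a tuple of bin indices,
    which can then be used as a key into a Q-table."""
    return tuple(int(np.digitize(value, edges))
                 for value, edges in zip(observation, BINS))
```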

Finally, I put everything together and code up the algorithm with fairly good results.

Please download the notebook and give it a try. I even challenge you at the end to beat my solution in fewer iterations.

Open in Google Colab: 02-QLearning.ipynb