I decided to sharpen my skills and signed up to watch the MIT Deep Learning class. The lecture runs 3:00 to 4:30 every day for four weeks; most likely, I will catch the videos on YouTube. My plan is to try to create some content (notes, notebooks, etc.) that I can put in my GitHub repo. So far, it has greatly helped my learning by forcing me to think through what was done.
If you missed any of the previous blogs, here is the first.
Please go to my GitHub repo, grab the 03-DoubleQLearning Jupyter notebook, and follow along. It will make this a lot easier and will fill in any pieces I leave out of this write-up. Also, I can’t put code into these posts without plugins that are not allowed on my current tier.
Double Q-Learning was created by Hado van Hasselt (who actually replied to my email while I was working on this project) in 2010. He noticed that using ‘max’ in the update overestimates the action values, which can make your results vary wildly. His solution was to keep a second Q table and randomly pick between the two on each update, using the “other” table to grab the value for the update equation.
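The notebook has the real implementation; here is just a rough sketch of that update rule in Python, assuming tabular Q values stored as NumPy arrays (the function name and default hyperparameters are my own, not from the notebook):

```python
import numpy as np

def double_q_update(qa, qb, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99, rng=np.random):
    """One Double Q-learning update (van Hasselt, 2010).

    qa, qb: two independent Q tables of shape (n_states, n_actions).
    A fair coin decides which table to update; the greedy next action
    is chosen with one table but evaluated with the other, which is
    what removes the max-operator overestimation of plain Q-learning.
    """
    if rng.random() < 0.5:
        # Choose the greedy next action from table A...
        best = np.argmax(qa[next_state])
        # ...but evaluate it with table B.
        target = reward + gamma * qb[next_state, best]
        qa[state, action] += alpha * (target - qa[state, action])
    else:
        # Mirror image: choose with B, evaluate with A.
        best = np.argmax(qb[next_state])
        target = reward + gamma * qa[next_state, best]
        qb[state, action] += alpha * (target - qb[state, action])
```

For acting, a common choice (used in the original paper) is to be epsilon-greedy with respect to the sum qa + qb, so both tables inform the policy even though only one is updated per step.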
In the notebook, you can see the updated equations and try coding them up yourself. After arriving at a correct solution, you can continue on to the fully coded algorithm and see if you can beat my best result.
One thing to point out: the yellow line in this notebook is much smoother than in the original Q-learner. That is what the second Q table fixed.
Please download the notebook and give it a try. I even challenge you at the end to beat my solution in fewer iterations.
Open in Google Colab: 03-DoubleQLearning.ipynb
van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems 23, 2613-2621. Retrieved from http://papers.nips.cc/paper/3964-double-q-learning.pdf