In this post I am going to walk through the k-armed bandit. I will be using Professor Sutton's book "Reinforcement Learning: An Introduction"; a link to an updated version is on his web site, located here. I will be walking through coding up the figures in Chapter 2: the basic k-armed bandit in sections 2.1 and 2.2, the Softmax in section 2.3, and finally the Optimistic Initial Values in section 2.7. As always, I will post my code in my GitHub repository.
First, let's talk about the k-armed bandit problem. According to Wikipedia, it is a problem in which a gambler at a row of slot machines has to decide which levers to pull to maximize his winnings. Each machine returns a random reward drawn from a probability distribution specific to that machine. The goal is to maximize your total reward through a sequence of lever pulls.
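To make the setup concrete, here is a tiny simulation of the problem (a sketch of my own; the machines and their payouts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical slot machines, each with its own true mean payout
true_means = [0.2, 0.5, 0.9]

def pull(arm):
    # Each pull returns a random reward from that machine's distribution
    return rng.normal(true_means[arm], 1.0)

# If we somehow knew machine 2 was best, always pulling it would
# maximize our expected winnings
rewards = [pull(2) for _ in range(10_000)]
print(np.mean(rewards))  # close to 0.9
```

The whole difficulty of the problem is that we do not know `true_means` up front; we only see the noisy rewards.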
In practice, this model could be used to manage a research project, or any R&D department, where the rewards are unknown and you have to decide where to allocate resources. Does this ever happen? Probably not. But it is a cool theory and I can code it!
Action-Value (figure 2.1):
In this figure, Professor Sutton uses the epsilon-greedy method, where you explore epsilon percent of the time and exploit the rest. He sets up 2,000 randomly generated k-armed bandit tasks, each with 10 arms. Rewards are drawn from a normal (Gaussian) probability distribution (numpy has a method for this, thankfully) with mean Q*(a) and variance 1, and Q* itself was generated from a normal distribution with mean 0 and variance 1. Let's get started.
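Putting that description together, the whole experiment behind the figure can be sketched end to end (a sketch of my own, with the plotting step omitted; names like `run_testbed` are mine, not from the book):

```python
import numpy as np

def run_testbed(n_tasks=2000, n_steps=1000, arms=10, epsilon=0.1):
    # Average reward at each time step, averaged over many independent tasks
    avg_reward = np.zeros(n_steps)
    for _ in range(n_tasks):
        q_star = np.random.normal(0, 1, arms)  # true values for this task
        estimates = np.zeros(arms)
        counts = np.zeros(arms)
        for t in range(n_steps):
            if np.random.random() > epsilon:
                arm = int(np.argmax(estimates))  # exploit
            else:
                arm = np.random.randint(arms)    # explore
            reward = np.random.normal(q_star[arm], 1)
            counts[arm] += 1
            # Incremental sample-average update
            estimates[arm] += (reward - estimates[arm]) / counts[arm]
            avg_reward[t] += reward
    return avg_reward / n_tasks
```

Plotting the returned array for a few values of epsilon gives curves like the ones in the figure. The class below breaks these same pieces out into methods.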
First, we create Q* from the normal distribution with mean 0 and variance 1. We also need arrays for the number of times each arm was selected and for the estimated values.
self.q_star = np.random.normal(0, math.sqrt(1), arms) #True action values
self.K = np.zeros(arms) #Pull counts per arm
self.estimate_values = np.zeros(arms) #Running reward estimates
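Pulling those fragments together, the constructor might look like this (a sketch; the attribute names match the snippets in this post, and storing `arms` and `epsilon` is my assumption about the rest of the class):

```python
import math

import numpy as np

class Bandit:
    def __init__(self, arms=10, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        # True action values Q*, drawn once per task
        self.q_star = np.random.normal(0, math.sqrt(1), arms)
        # Pull counts and reward estimates, one entry per arm
        self.K = np.zeros(arms)
        self.estimate_values = np.zeros(arms)
```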
Second, we need a function that handles arm selection. With an epsilon of 0.1, 10% of the time we explore the space and the other 90% we exploit the arm with the best estimated value.
#Randomly determine if we are going to explore or exploit
if np.random.random() > self.epsilon:
    return np.argmax(self.estimate_values) #Exploit: best estimate so far
return np.random.randint(self.arms) #Explore: pick a random arm
Third, we need a function that generates our rewards. The book says each reward is drawn from a normal distribution with mean Q*(a) for the selected arm and variance 1.
def get_reward(self, arm):
    #Normal (Gaussian) distribution with mean Q*(arm) and variance 1
    reward = np.random.normal(self.q_star[arm], math.sqrt(1))
    return reward
Fourth, we need a method that updates our estimate array with the latest reward.
def update_estimate_values(self, arm, reward):
    self.K[arm] += 1
    k = self.K[arm]
    #New estimate is the running average (r1 + r2 + ... + rk) / k,
    #computed incrementally: ((k-1)/k) * current_total + (1/k) * reward
    current_total = self.estimate_values[arm]
    self.estimate_values[arm] = ((k - 1) / float(k)) * current_total + (1 / float(k)) * reward
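As a quick sanity check (mine, not from the book), the incremental form should give exactly the plain average of all rewards seen so far for that arm:

```python
import numpy as np

rewards = [1.0, 3.0, 2.0, 6.0]

estimate, k = 0.0, 0
for r in rewards:
    k += 1
    # Same incremental update as update_estimate_values above
    estimate = ((k - 1) / k) * estimate + (1 / k) * r

print(estimate)          # 3.0
print(np.mean(rewards))  # 3.0
```

The advantage of the incremental form is that we never have to store the full reward history, just the current estimate and the count.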
That completes our bandit code. I have uploaded it all to my GitHub account for you to see it in action.
Softmax:
This is a variation on the get action method; you would add the following code. As the temperature approaches 0 the selection becomes greedier, and as the temperature grows the action probabilities flatten out and you do more exploring.
def calc_softmax(self):
    #Convert the estimates into selection probabilities
    e = np.exp(self.estimate_values / self.temperature)
    return e / np.sum(e)
Then the action is sampled from that distribution:
sf = self.calc_softmax()
action = np.random.choice(self.arms, p=sf)
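To see the temperature effect, here is the same softmax as a standalone function with some made-up estimate values (my own sketch):

```python
import numpy as np

def softmax(values, temperature):
    # Higher temperature -> flatter (more uniform) probabilities
    e = np.exp(np.asarray(values) / temperature)
    return e / np.sum(e)

est = [0.1, 0.5, 2.0]
print(softmax(est, 0.1))   # low temperature: nearly all mass on the best arm
print(softmax(est, 10.0))  # high temperature: close to uniform
```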
Optimistic Initial Values:
Optimistic initial values is exactly what it sounds like: you set the initial entries of your estimate array to some value instead of 0. Picking a good value usually takes some domain knowledge.
def __init__(self, arms, epsilon, initial):
    self.estimate_values[:] = initial #Set initial value
    self.step_size = 0.1 #Set alpha to 0.1
I also added update code using the constant step size (alpha) that the book introduces before this figure.
def update_est(self, arm, k, current_total, reward):
    #Constant step-size update: Q(a) += alpha * (reward - Q(a))
    self.estimate_values[arm] += self.step_size * (reward - current_total)
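Here is a small illustration of why optimism drives exploration (my own sketch, with made-up true values): since every arm starts out looking better than it really is, each pull is a disappointment, so a purely greedy agent still rotates through all the arms before settling down.

```python
import numpy as np

np.random.seed(0)
true_values = np.array([0.1, 0.4, 0.8])
estimates = np.full(3, 5.0)  # optimistic: higher than any realistic reward
alpha = 0.1                  # constant step size

chosen = []
for _ in range(30):
    arm = int(np.argmax(estimates))  # purely greedy, no epsilon
    reward = np.random.normal(true_values[arm], 1)
    # Constant step-size update, as in update_est above
    estimates[arm] += alpha * (reward - estimates[arm])
    chosen.append(arm)

print(set(chosen))  # all three arms get tried despite greedy selection
```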
Here is a link to this graph.