In this post I am going to walk through the k-armed bandit problem, following Professor Sutton's book "Reinforcement Learning: An Introduction". A link to an updated version is on his web site located here. I will be coding up the figures in Chapter 2: the basic k-armed bandit from sections 2.1 and 2.2, the Softmax variation from section 2.3, and finally the Optimistic Initial Values from section 2.7. As always, I will post my code in my GitHub repository.

**K-Armed Bandit**:

First, let's talk about the k-armed bandit problem. According to Wikipedia, it is a problem in which a gambler at a row of slot machines has to decide which levers to pull to maximize his winnings. Each machine returns a random reward drawn from a probability distribution specific to that machine, and the goal is to maximize the total reward over a sequence of lever pulls.
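The setup can be sketched in a few lines. This is a toy illustration of my own (the names `true_means` and `pull` are not from the book): each arm has a fixed but unknown mean payout, and every pull returns a noisy sample around it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten slot machines; each arm's true mean payout is fixed but unknown to the player
true_means = rng.normal(0, 1, 10)

def pull(arm):
    """One lever pull: a noisy reward centered on that arm's true mean."""
    return rng.normal(true_means[arm], 1)

# Averaging many pulls of one arm recovers its true mean payout
samples = [pull(3) for _ in range(10000)]
print(abs(np.mean(samples) - true_means[3]) < 0.1)
```

The whole difficulty of the problem is that the player only sees the noisy samples, never `true_means`.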

In practice, this model could be used to manage a research project, or any R&D department, where the rewards are not known in advance and you have to decide where to allocate resources. Does this ever happen? Probably not. But it is a cool theory and I can code it!

**Action-Value (figure 2.1)**:

In this figure, Professor Sutton uses the epsilon-greedy method: you explore a random arm with probability epsilon and exploit the best-looking arm the rest of the time. He sets up 2,000 randomly generated k-armed bandit tasks with 10 arms each. The rewards are drawn from a normal (Gaussian) probability distribution (numpy has a method for this, thankfully) with mean Q*(a) and variance 1, where Q*(a) itself was generated from a normal distribution with mean 0 and variance 1. Let's get started.

First, we create Q* from the normal distribution with mean 0 and variance 1. We also need to create arrays for the number of times each arm was selected and for the estimated values.

[python]
self.q_star = np.random.normal(0, 1, arms)  #True action values, mean 0, variance 1
self.K = np.zeros(arms)                     #Times each arm was selected
self.estimate_values = np.zeros(arms)       #Estimated values
[/python]

Second, we need to create a function that handles the arm selection. With an epsilon of 0.1, 10% of the time we explore the space and the other 90% we grab the arm with the best estimated value.

[python]
def get_arm(self):
    #Randomly determine if we are going to explore or exploit
    if np.random.random() > self.epsilon:
        #Exploit
        return np.argmax(self.estimate_values)
    else:
        #Explore
        return np.random.randint(0, self.arms)
[/python]

Third, we need to create a function that generates our rewards. The book says the reward is drawn from a normal distribution with mean Q*(a) and variance 1.

[python]
def get_reward(self, arm):
    #Normal (Gaussian) distribution with mean Q*(arm) and variance 1
    reward = np.random.normal(self.q_star[arm], 1)
    return reward
[/python]

Fourth, we need to create a method that updates our estimate array with each new reward, keeping a running average per arm.

[python]
def update_estimate_values(self, arm, reward):
    self.K[arm] += 1
    k = self.K[arm]
    #New estimate is the running average (r1 + r2 + … + rk)/k
    current_total = self.estimate_values[arm]
    #((k-1)/k) * current_total + (1/k) * reward
    self.estimate_values[arm] = ((k - 1) / float(k)) * current_total + (1 / float(k)) * reward
[/python]

That completes our bandit code. I have uploaded it all to my GitHub account for you to see it in action.
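Putting the snippets above together, the simulation loop looks roughly like this. It is a sketch: the `Bandit` class name and the wiring are my own, and the sample-average update is written in its equivalent incremental form.

```python
import numpy as np

class Bandit:
    """One 10-armed bandit task with an epsilon-greedy agent."""
    def __init__(self, arms, epsilon):
        self.arms = arms
        self.epsilon = epsilon
        self.q_star = np.random.normal(0, 1, arms)   # true action values
        self.K = np.zeros(arms)                      # pull counts per arm
        self.estimate_values = np.zeros(arms)        # sample-average estimates

    def get_arm(self):
        # Explore with probability epsilon, otherwise exploit
        if np.random.random() > self.epsilon:
            return np.argmax(self.estimate_values)
        return np.random.randint(0, self.arms)

    def get_reward(self, arm):
        # Noisy reward around the arm's true value
        return np.random.normal(self.q_star[arm], 1)

    def update_estimate_values(self, arm, reward):
        self.K[arm] += 1
        k = self.K[arm]
        # Incremental form of the running average
        self.estimate_values[arm] += (reward - self.estimate_values[arm]) / k

bandit = Bandit(arms=10, epsilon=0.1)
for _ in range(1000):
    arm = bandit.get_arm()
    reward = bandit.get_reward(arm)
    bandit.update_estimate_values(arm, reward)
```

To reproduce the figure you would average the per-step rewards over 2,000 such tasks, as the book does.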

Here is a link to the Rewards graph

Here is a link to the Optimal Selection graph

**Softmax Variation**:

This is a variation on how you run the get action method; you would add the following code. As the temperature approaches 0 the selection becomes greedier, and as the temperature increases the probabilities spread out toward uniform, so you do more exploring.

[python]
def calc_softmax(self):
    e = np.exp(self.estimate_values / self.temperature)
    dist = e / np.sum(e, axis=0)
    return dist

def choose_softmax(self):
    sf = self.calc_softmax()
    action = np.random.choice(self.arms, 1, p=sf)[0]
    return action
[/python]

instead of

[python]

np.argmax(self.estimate_values)

[/python]
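To see the temperature effect concretely, here is a standalone version of the softmax (my own function and variable names) evaluated at a low and a high temperature on the same estimates:

```python
import numpy as np

def calc_softmax(estimate_values, temperature):
    # Boltzmann distribution over the current estimates
    e = np.exp(estimate_values / temperature)
    return e / np.sum(e)

est = np.array([1.0, 2.0, 3.0])

cold = calc_softmax(est, 0.1)   # low temperature: nearly all mass on the best arm
hot = calc_softmax(est, 100.0)  # high temperature: close to uniform

print(cold.round(3))  # [0. 0. 1.]
print(hot.round(3))
```

Both are valid probability distributions; the temperature only controls how sharply they concentrate on the highest estimate.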

**Optimistic Initial Values**:

Optimistic initial values are exactly what they sound like: you set the initial entries of your estimate array to some value instead of 0. You will need some domain knowledge to pick a sensible value.

[python]
def __init__(self, arms, epsilon, initial):
    super().__init__(arms, epsilon)
    self.estimate_values[:] = initial  #Set initial value
    self.step_size = 0.1               #Set alpha to 0.1
[/python]

I also added update code using the constant step size (alpha) that the book introduces just before this figure.

[python]
def update_est(self, arm, reward):
    current_estimate = self.estimate_values[arm]
    #Q(a) = Q(a) + alpha * (reward - Q(a))
    self.estimate_values[arm] += self.step_size * (reward - current_estimate)
[/python]
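To see why the optimistic start encourages exploration, here is a small self-contained experiment of my own (not from the book): a purely greedy agent with a constant step size, run twice, differing only in its initial estimates.

```python
import numpy as np

np.random.seed(0)
q_star = np.random.normal(0, 1, 10)  # true action values for one task

def run(initial, steps=2000, step_size=0.1):
    """Greedy agent (epsilon = 0) with constant step size alpha."""
    est = np.full(10, float(initial))
    picked = np.zeros(10)
    for _ in range(steps):
        arm = np.argmax(est)
        reward = np.random.normal(q_star[arm], 1)
        est[arm] += step_size * (reward - est[arm])  # Q += alpha * (R - Q)
        picked[arm] += 1
    return picked

# With Q1 = 5 every real reward is a "disappointment" that lowers that arm's
# estimate below the untried arms, so the greedy agent samples all of them.
optimistic = run(5.0)
realistic = run(0.0)
print(np.count_nonzero(optimistic), np.count_nonzero(realistic))
```

The optimistic run touches every arm even though it never explicitly explores, which is the whole trick behind the figure.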

Here is a link to this graph.