Rock, Paper, Scissors in Linear Programming

Today I want to work through some stuff I struggled with in my current class. I hadn't done linear programming in my life (college was a long time ago). Well, that changed a few weeks ago while trying to replicate a graph for Foe-Q and a graph for Correlated Equilibrium (CE-Q) of a zero-sum game. In order to work my way through the assignment I started small and used the game of Rock, Paper, Scissors. The "trick" is adding the 'U' variable, the value of the game, which you maximize (by minimizing -U, since the solver minimizes).

For the solver, I went with cvxopt and glpk since it was an order of magnitude faster than anything else. But since I was on Windows, it was a pain to get installed and I didn't get nearly the speed-up that Linux users got.

[python]
from cvxopt import matrix, solvers
glpksolver = 'glpk'  #cvxopt's lp() expects 'glpk' to hand the problem to GLPK
[/python]

For the 'c' matrix we set the first entry (the coefficient on U) to -1: we want to maximize the game value U, and since cvxopt's solver minimizes, we minimize -U instead.

[python]
#Objective: maximize U, so minimize -U (variables are [U, R, P, S])
c = matrix([-1.,0.,0.,0.])
[/python]

For the G matrix we need to build our constraints. The first three rows say that the game value U can be no better than your expected payoff against each of the opponent's pure throws. The first row covers Rock, which loses (-1) to Paper but wins (+1) against Scissors. Row 2 is Paper's results and row 3 is Scissors'.

[python]
#1. U - P + S <= 0 => [1, 0,-1, 1]
#2. U + R - S <= 0 => [1, 1, 0,-1]
#3. U - R + P <= 0 => [1,-1, 1, 0]
[/python]

The second set of 3 rows ensures that each probability is non-negative. Note that since we are using cvxopt, which wants every constraint in the form Gx <= h, we need to negate these. So, instead of R greater than or equal to 0 we write -R less than or equal to 0.

[python]
#4. -R <= 0 => [0,-1, 0, 0]
#5. -P <= 0 => [0, 0,-1, 0]
#6. -S <= 0 => [0, 0, 0,-1]
[/python]

The last two rows stand in for an equality constraint: the probabilities of your throws must sum to exactly 1. Since everything goes through G and h as inequalities, R + P + S = 1 is written as the pair below.

[python]
#7. R + P + S <= 1 => [0, 1, 1, 1]
#8. -R - P - S <= -1 => [0,-1,-1,-1]
[/python]

The h matrix holds the right-hand sides of the eight inequalities above.

[python]
h = matrix([0.,0.,0.,0.,0.,0.,1.,-1.])
[/python]

Here is the code:

[python]
from cvxopt import matrix, solvers
glpksolver = 'glpk'  #cvxopt's lp() expects 'glpk' to hand the problem to GLPK
#Objective: maximize the game value U, so minimize -U (variable order is [U, R, P, S])
c = matrix([-1.,0.,0.,0.])
#1. U - P + S <= 0 => [1, 0,-1, 1]
#2. U + R - S <= 0 => [1, 1, 0,-1]
#3. U - R + P <= 0 => [1,-1, 1, 0]
#4. -R <= 0 => [0,-1, 0, 0]
#5. -P <= 0 => [0, 0,-1, 0]
#6. -S <= 0 => [0, 0, 0,-1]
#7. R + P + S <= 1 => [0, 1, 1, 1]
#8. -R - P - S <= -1 => [0,-1,-1,-1]
#cvxopt builds the matrix column by column: one list per variable (U, R, P, S)
G = matrix([[1.,1.,1.,0.,0.,0.,0.,0.],[0.,1.,-1.,-1.,0.,0.,1.,-1.],[-1.,0.,1.,0.,-1.,0.,1.,-1.],[1.,-1.,0.,0.,0.,-1.,1.,-1.]])
h = matrix([0.,0.,0.,0.,0.,0.,1.,-1.])
sol = solvers.lp(c, G, h, solver=glpksolver)
print(sol['status'])
print(sol['x'])
[/python]
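As a quick sanity check (a sketch, assuming GLPK actually solved the problem), the known equilibrium of Rock, Paper, Scissors is the uniform 1/3 mix with a game value of 0, so the solution vector should reflect that:

[python]
#Solution vector is ordered [U, R, P, S]
x = sol['x']
print('Game value U:', x[0])                        #expect ~0.0
print('Rock, Paper, Scissors:', x[1], x[2], x[3])   #expect ~0.333 each
[/python]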

To wrap this up, I was able to get the Foe-Q graph replicated by setting up the linear equations to solve for minimax. But CE-Q defeated me.

K-Armed Bandit

In this post I am going to walk through the k-armed bandit. I will be using Professor Sutton's book "Reinforcement Learning: An Introduction". A link to an updated version is on his web site, located here. I will be walking through coding up the figures in Chapter 2: the normal k-armed bandit in sections 2.1 and 2.2, the Softmax in section 2.3, and finally, the Optimistic Initial Values in section 2.7. As always, I will post my code in my GitHub repository.

K-Armed Bandit:
First, let's talk about the k-armed bandit problem. According to Wikipedia, it is a problem in which a gambler at a row of slot machines has to decide which lever to pull to maximize his winnings. Each machine returns a random reward from a probability distribution specific to that machine. The goal is to maximize your reward through a sequence of lever pulls.

In practice, this model could be used to manage research projects, or any R&D department, where the rewards are not known up front and you have to decide where to allocate resources. Does this ever happen? Probably not. But it is a cool theory and I can code it!

Action-Value (figure 2.1):
In this figure, Professor Sutton uses the epsilon-greedy method, where you explore epsilon percent of the time and exploit the rest. He sets up 2,000 randomly generated k-armed bandit tasks with 10 arms. The rewards are drawn from a normal (Gaussian) probability distribution (numpy has a method for this, thankfully) with mean Q*(a) and variance 1. Q* itself was generated from a normal distribution with mean 0 and variance 1. Let's get started.

First, we create Q* from the normal distribution with mean 0 and variance 1. We will also need arrays for the number of times each arm was selected and for the estimated values.

[python]
#True action values Q*(a), drawn from N(0, 1)
self.q_star = np.random.normal(0, math.sqrt(1), arms)
#Number of times each arm has been pulled
self.K = np.zeros(arms)
#Running estimate of each arm's value
self.estimate_values = np.zeros(arms)
[/python]

Second, we need a function that handles arm selection. If we have an epsilon of 0.1 that means 10% of the time we explore the space and the other 90% we grab the arm with the best estimated value.

[python]
def get_arm(self):
    #Randomly determine if we are going to explore or exploit
    if np.random.random() > self.epsilon:
        #Exploit
        return np.argmax(self.estimate_values)
    else:
        #Explore
        return np.random.randint(0, self.arms)
[/python]

Third, we need a function that generates our rewards. The book says the reward is drawn from a normal distribution with mean Q*(a) and variance 1.

[python]
def get_reward(self, arm):
    #Normal (Gaussian) distribution with mean Q*(arm) and variance 1
    reward = np.random.normal(self.q_star[arm], math.sqrt(1))
    return reward
[/python]

Fourth, we need a method that updates our estimate array with the new reward.

[python]
def update_estimate_values(self, arm, reward):
    self.K[arm] += 1
    k = self.K[arm]
    #New estimate is the running average (r1 + r2 + ... + rk) / k
    current_total = self.estimate_values[arm]
    #((k-1)/k) * current_total + (1/k) * reward
    self.estimate_values[arm] = ((k-1) / float(k)) * current_total + (1 / float(k)) * reward
[/python]
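To see how these pieces fit together, here is a minimal, hypothetical driving loop for a single bandit task. The class name Bandit and the run length are assumptions for illustration; the figure in the book averages 2,000 such runs of 1,000 pulls each.

[python]
import numpy as np

#Hypothetical single-run loop; the real experiment averages many of these
bandit = Bandit(arms=10, epsilon=0.1)   #assumed class/constructor name
rewards = []
for step in range(1000):
    arm = bandit.get_arm()
    reward = bandit.get_reward(arm)
    bandit.update_estimate_values(arm, reward)
    rewards.append(reward)
print(np.mean(rewards))
[/python]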

That completes our bandit code. I have uploaded it all to my GitHub account so you can see it in action.

Here is a link to the Rewards graph
Here is a link to the Optimal Selection graph

Softmax Variation:
This is a variation on how you run the get-action method. You would add the following code, keeping in mind that as the temperature approaches 0 the selection becomes greedier, and as the temperature grows the probabilities flatten out and you do more exploring.

[python]
def calc_softmax(self):
    #Boltzmann distribution over the current estimates
    e = np.exp(self.estimate_values / self.temperature)
    dist = e / np.sum(e, axis=0)
    return dist

def choose_eps_greedy(self):
    sf = self.calc_softmax()
    action = np.random.choice(self.arms, 1, p=sf)[0]
    return action
[/python]

instead of

[python]
np.argmax(self.estimate_values)
[/python]
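To see the temperature effect numerically, here is a small standalone sketch (the estimate values are made up):

[python]
import numpy as np

def softmax(values, temperature):
    #Boltzmann distribution over the estimated values
    e = np.exp(values / temperature)
    return e / np.sum(e)

estimates = np.array([1.0, 0.5, 0.1])
print(softmax(estimates, 0.1))    #low temperature: probability piles onto the best arm
print(softmax(estimates, 10.0))   #high temperature: nearly uniform, more exploring
[/python]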

Optimistic Initial Values:
Optimistic initial values are exactly what they sound like: you set the initial values of your estimate array to some optimistic value instead of 0. This pushes the learner to explore early, since every untried arm looks better than it really is. You will need some domain knowledge to pick a sensible value.

[python]
def __init__(self, arms, epsilon, initial):
    super().__init__(arms, epsilon)
    self.estimate_values[:] = initial  #Set initial (optimistic) value
    self.step_size = 0.1               #Set alpha to 0.1
[/python]

I also added an update method that uses the constant step size (alpha) introduced just before this figure.

[python]
def update_est(self, arm, k, current_total, reward):
    #Constant step-size update: Q <- Q + alpha * (reward - Q)
    self.estimate_values[arm] += self.step_size * (reward - current_total)
[/python]
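For reference, the comparison in the book pits an optimistic, greedy learner (initial estimate 5, epsilon 0) against a realistic epsilon-greedy learner (initial estimate 0, epsilon 0.1). Hypothetical usage with the constructor above might look like this (the class name is an assumption):

[python]
#Class name is assumed; the constructor shape matches the snippet above
optimistic = OptimisticBandit(arms=10, epsilon=0.0, initial=5.0)
realistic = OptimisticBandit(arms=10, epsilon=0.1, initial=0.0)
[/python]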

Here is a link to this graph.

Discretization with OpenAI

For my code, go to my GitHub account and look at Discretization.py.

In this post, I will continue from last time talking about discretization but will use it against an OpenAI environment. Most of this post was picked up after watching my TA work through one of his notebooks on the subject. If you want a good resource check out Miguel’s GitHub.

I am going to start with determining the state space from the OpenAI gym CartPole v0. You can check out the GitHub for the actual source. There are 4 state values. The first two are the cart's position and velocity and the second two are the pole's angle and angular velocity. Once the pole tilts more than about 12 degrees or the cart moves more than 2.4 units from the center, you lose.

Next, we need to determine their values. There are 2 discrete actions: push the cart right (+1) or left (-1). The state is made up of the x, x_dot, theta, and theta_dot values. It wouldn't make sense to have an infinite state space of every combination of all 4 of those, so we will need to create our own.

Using a random approach, we could run 10k random sessions against the CartPole problem and then look at the range of values that actually show up. My results were (-1.4,1.4), (-3,3), (-0.2,0.2), and (-3.3,3.3). Some of these fall just outside the failure thresholds because the environment still performs the final step's calculations before setting the 'done' flag. We can also get the official bounds by just inspecting the observation space. Grabbing the high and low values for the position and angle and dividing them by 2 you get (-2.4,+2.4) and (-0.2,+0.2). These values make sense because a failure is when you are 2.4 units away from center or tilted about 12 degrees (0.2 radians).
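Here is a rough sketch of that random approach, using the classic gym API; the episode count is trimmed down for illustration:

[python]
import gym
import numpy as np

#Run random episodes and track the observed range of each state variable
env = gym.make('CartPole-v0')
observations = []
for episode in range(1000):   #the post used ~10k sessions; fewer here for speed
    obs = env.reset()
    done = False
    observations.append(obs)
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
        observations.append(obs)
observations = np.array(observations)
print('observed low: ', observations.min(axis=0))
print('observed high:', observations.max(axis=0))
print('official bounds:', env.observation_space.low, env.observation_space.high)
[/python]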

Last post, I used digitize to break the entries into bins. This time, I will use numpy's linspace to build the bin edges. This gives an even spread over each range and lets us limit the number of states.
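A minimal sketch of that binning might look like the following; the bounds come from the discussion above, and the count of 10 edges per dimension is an assumption rather than a tuned value:

[python]
import numpy as np

#Hand-picked bounds for x, x_dot, theta, theta_dot (from the discussion above)
bounds = [(-2.4, 2.4), (-3.0, 3.0), (-0.2, 0.2), (-3.3, 3.3)]
bins = [np.linspace(low, high, 10) for low, high in bounds]

def discretize(observation):
    #Map each continuous value to a bin index, then combine the four
    #indices into a tuple that can be used as a Q-table key
    return tuple(int(np.digitize(value, b)) for value, b in zip(observation, bins))
[/python]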

With some tweaking you should be able to narrow the states down enough that your q-learner will be able to solve a simple continuous-state environment.
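For reference, the tabular q-learning update that those discretized state tuples feed into looks roughly like this (alpha, gamma, and the defaultdict layout are assumptions):

[python]
from collections import defaultdict
import numpy as np

alpha, gamma = 0.1, 0.99
q_table = defaultdict(lambda: np.zeros(2))   #2 actions in CartPole

def q_update(state, action, reward, next_state, done):
    #Standard one-step q-learning target
    target = reward if done else reward + gamma * np.max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
[/python]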

Next time, I will introduce solving this environment with a neural network and q-learning.

Discretization

From Wikipedia, discretization is the process of transferring continuous values into discrete counterparts. This is usually done because of the memory/space requirements that come with continuous states.

In a simple q-learning environment you would have a grid of X spaces. In the Frozen Lake environment you have a 4×4 grid, so you would set up your learner with a table of expected values for each of its actions across the 16 states. You would then run the q-learning process (get action, retrieve next state and reward, update, and so on) until you have the optimal policy.
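For the 4×4 Frozen Lake that table is tiny. A sketch of the setup, assuming numpy and the environment's 4 actions (left, down, right, up):

[python]
import numpy as np

#16 discrete states (one per grid square), 4 actions each
q_table = np.zeros((16, 4))
[/python]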

But when you move to a problem that doesn't have a discrete state space, you will need to discretize it. A simple array example can be shown with numpy's digitize call. I will walk through it.

Create your "continuous" state space (imagine this is huge):

[python]
state_space = np.array([0,1,2,3,4,5,6,7,8,9])
[/python]

Create the bins that you can handle (let's assume that you can only store 3 spaces):

[python]
bins = np.array([0,2,7])
[/python]

Now, run digitize to map all the numbers in state_space into their new discretized spaces:

[python]
dis = np.digitize(state_space, bins)
#dis is now array([1, 1, 2, 2, 2, 2, 2, 3, 3, 3], dtype=int64)
[/python]

This output shows that the numbers 0 and 1 land in the first bin, 2 through 6 land in the second, and 7 through 9 land in the third.

You can go into more depth if you don't know the numbers ahead of time. This is what I had to do in the Machine Learning for Trading grad class when it revolved around technical indicators. In that case, we took it a step further by using the digitized values from 4 indicators and then concatenating them. For example, I used an indicator that determined whether the price was up (+) or down (-) from the previous day: if it was up I gave it a 1, and if it was down I gave it a 0. I then did the same thing for whether the price was above a simple moving average. A positive on the first indicator and a positive on the second returned 11 as my state; a positive and then a negative would have been 10, and so on.
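Here is a small, purely illustrative sketch of that concatenation with just two indicators; the indicator logic is made up to match the description, not the actual trading code:

[python]
def make_state(price_change, price, sma):
    #1 if the price is up from the previous day, 0 if it is down
    momentum_bit = 1 if price_change > 0 else 0
    #1 if the price is above the simple moving average, 0 otherwise
    sma_bit = 1 if price > sma else 0
    #Concatenate the digits: up/above -> 11, up/below -> 10, and so on
    return momentum_bit * 10 + sma_bit
[/python]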

Discretization allows you to put massive state spaces into something you can store.