Moving to Google Cloud

I have decided to move my life to the Google Cloud. Previously, my web site was hosted in Azure through GoDaddy. I have since moved my HTML to a Google Cloud bucket and my domain to their domain service.

Over the next few weeks I will discuss what I have done and why, including trying to put my DevPost submission in the cloud before Iowa gets finished approving legalized gambling.

My first post will be about moving my web site to the cloud. This includes updating my DNS, implementing deployments through GitHub, and setting permissions.

My second post will be about trying to use the AI-Engine with TensorFlow v2.

My third and final post will be about moving my DevPost project to the cloud and using TensorFlowJS.

Information Gain in Decision Trees

One of the ways to determine HOW to split a decision tree is by using the Information Gain of each column. The idea is that it tells you how much uncertainty you remove by splitting on that column. The entropy function I measures how mixed a group is: if ALL of the values in a group have the same value in the Target column the entropy is 0 (a pure group), and if they are split down the middle the entropy is 1. A split whose children have low entropy gives a high Information Gain.

The equation is:

IG(A) = Entropy(Target-or-Prior) - Remainder(A)
This breaks down to:
Entropy(Target-or-Prior) = I(P_1,P_2,...,P_n) = \sum_{i=1}^{n} -p_i log_2 p_i
Remainder(A) = \sum_{i=1}^{v}\frac{a_i + b_i}{a+b} I(\frac{a_i}{a_i + b_i},\frac{b_i}{a_i + b_i})

WHERE v is the number of values the attribute A can take, and a_i and b_i are the counts of the two Target values (No and Yes here) among the rows where A takes its i-th value.

Now that we have the ugly formulas out of the way, let's put in some actual numbers to determine how we want to split this tree.

Sample | Clouds | Sun | Rain
1 | Yes | Yes | No
2 | Yes | Yes | No
3 | No | Yes | No
4 | No | No | Yes
5 | Yes | Yes | Yes
6 | No | No | No
7 | No | No | Yes
8 | Yes | Yes | Yes
9 | No | Yes | No
10 | Yes | No | No

In this case 'Rain' is our Target column. We have 6 No and 4 Yes values. Our first step is to find the Entropy of Rain. We then need to find the Remainder of Clouds and Sun. Finally, we find the Information Gain and use that to split our tree.

Entropy of Rain

If we remember, I(P_1,P_2) = \sum_{i=1}^{2} -p_i log_2 p_i

We can now put in the actual numbers:

I(\frac{6}{10},\frac{4}{10}) = -\frac{6}{10} log_2 \frac{6}{10} - \frac{4}{10} log_2 \frac{4}{10}

I(\frac{6}{10},\frac{4}{10}) = -0.6 * -0.74 + -0.4 * -1.32

I(\frac{6}{10},\frac{4}{10}) = 0.44 + 0.53

I(\frac{6}{10},\frac{4}{10}) = 0.97

We know that the Entropy of Rain is 0.97.

On a side note, I(0,1) = 0 (taking 0 log_2 0 = 0) and I(\frac{1}{2},\frac{1}{2}) = 1.
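If you want to sanity-check the entropy arithmetic, a couple of lines of Python will do it (my own quick check, not part of the original derivation):

```python
import math

# Entropy of Rain: 6 No's and 4 Yes's out of 10 samples.
p_no, p_yes = 6 / 10, 4 / 10
entropy_rain = -p_no * math.log2(p_no) - p_yes * math.log2(p_yes)
print(round(entropy_rain, 2))  # 0.97
```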

Remainder of Cloud and Sun

If we remember, the formula is Remainder(A) = \sum_{i=1}^{v}\frac{a_i + b_i}{a+b} I(\frac{a_i}{a_i + b_i},\frac{b_i}{a_i + b_i})

Let's get some actual numbers and break down the formula.

Cloud | Rain=No | Rain=Yes
No | 3 | 2
Yes | 3 | 2

Sun | Rain=No | Rain=Yes
No | 2 | 2
Yes | 4 | 2

For the remainder we need to count the Rain Yes/No values within each value of the attribute we are splitting on. These are the values in the grids above.

Let's start with Cloud:

Remainder(A) = \sum_{i=1}^{v}\frac{a_i + b_i}{a+b} I(\frac{a_i}{a_i + b_i},\frac{b_i}{a_i + b_i})

Remainder(Cloud) = \frac{5}{10}I(\frac{3}{5},\frac{2}{5})+\frac{5}{10}I(\frac{3}{5},\frac{2}{5})

WHERE the first \frac{5}{10} comes from the number of Cloud No's total. \frac{3}{5} comes from the Cloud No's that are No's in the Rain column, and \frac{2}{5} from the Cloud No's that are Yes's in the Rain column.

The second \frac{5}{10} comes from the number of Cloud Yes's total, and its \frac{3}{5} and \frac{2}{5} are again the Rain No's and Yes's within that group.

Remainder(Cloud) = \frac{5}{10}(-\frac{3}{5} log_2 \frac{3}{5} - \frac{2}{5} log_2 \frac{2}{5})+\frac{5}{10}(-\frac{3}{5} log_2 \frac{3}{5} - \frac{2}{5} log_2 \frac{2}{5})

Remainder(Cloud) = \frac{5}{10}(0.44 + 0.53)+\frac{5}{10}(0.44 + 0.53)

Remainder(Cloud) = \frac{5}{10}(0.97)+\frac{5}{10}(0.97)

Remainder(Cloud) = 0.97

Notice that both Cloud groups have the same 3:2 mix of Rain values as the whole table, so splitting on Cloud leaves the entropy completely unchanged.

Now, let's do Sun:

Remainder(Sun) = \frac{4}{10}I(\frac{2}{4},\frac{2}{4})+\frac{6}{10}I(\frac{4}{6},\frac{2}{6})

WHERE \frac{4}{10} comes from the number of Sun No's total. \frac{2}{4} comes from the Sun No's that are No's in the Rain column, and the second \frac{2}{4} from the Sun No's that are Yes's in the Rain column.

\frac{6}{10} comes from the number of Sun Yes's total. \frac{4}{6} comes from the Sun Yes's that are No's in the Rain column, and \frac{2}{6} from the Sun Yes's that are Yes's in the Rain column.

Remainder(Sun) = \frac{4}{10}(-\frac{2}{4} log_2 \frac{2}{4} - \frac{2}{4} log_2 \frac{2}{4})+\frac{6}{10}(-\frac{4}{6} log_2 \frac{4}{6} - \frac{2}{6} log_2 \frac{2}{6})

Remainder(Sun) = \frac{4}{10}(0.5 + 0.5)+\frac{6}{10}(0.39 + 0.53)

Remainder(Sun) = \frac{4}{10}(1)+\frac{6}{10}(0.92)

Remainder(Sun) = 0.4 + 0.55

Remainder(Sun) = 0.95

Information Gain

Remember IG(A) = Entropy(Rain) - Remainder(A)

IG(Cloud) = 0.97 - 0.97 = 0.00

IG(Sun) = 0.97 - 0.95 = 0.02

So, Sun has the higher Information Gain and is the column you should split on first. The gain of 0 for Cloud matches what we saw above: knowing the Cloud value tells you nothing new about Rain. (Information Gain is never negative; if you compute a negative value, something went wrong.)

Now, I made this simple since I only had 2 columns, but with more columns you would repeat this process inside each branch of the Sun split, using the remaining columns on just the rows that fall in that branch.
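To double-check all of the arithmetic, here is a short Python sketch (my own helper functions, not code from the original post) that recomputes the Information Gain for both columns:

```python
import math
from collections import Counter

def entropy(labels):
    """I(p_1,...,p_n) = sum of -p_i * log2(p_i) over the label proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(attribute, target):
    """IG(A) = Entropy(target) - Remainder(A)."""
    total = len(target)
    remainder = 0.0
    for value in set(attribute):
        # Target values in the rows where the attribute takes this value.
        subset = [t for a, t in zip(attribute, target) if a == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(target) - remainder

# The ten samples from the table above.
clouds = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
sun    = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"]
rain   = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No"]

print(round(information_gain(clouds, rain), 3))  # 0.0  -- Cloud tells us nothing
print(round(information_gain(sun, rain), 3))     # 0.02 -- split on Sun first
```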

NFL Point Spread Model

After completing the DevPost project, I decided I would take what I learned and try it with the NFL. I went in knowing that I wouldn't do very well, but that didn't stop me from trying to get close.

Data Collection
My first step was to figure out how to collect data. With the college basketball project I just scraped the NCAA web site. Since my inputs for that project were just based around scoring, it was pretty straightforward. With the NFL, I wanted to use more than just the score. I wanted to grab some offensive and defensive stats.

To grab the data, I was able to use nflscrapR from Maksim Horowitz, which collects the game stats for each game since 2009 in a JSON file. I was then able to import that into a C# project I had created. A few hundred formatted lines later, I had that data.

But, I wanted more. I wanted data from 2000 forward to increase my training data. For those years, I used Pro Football Reference. While the JSON was nice and easy, this was not: I had to download the HTML and then use some string manipulation to get the data.

I now have 4,848 games to use. Breaking that down into an 80/20 split, I have 3,878 games for training and 970 for testing.

Model Creation
For the model, I again used TensorFlow v2 and Keras. Since this is a regression project, I am using MSE (mean squared error) as my loss function and MAE (mean absolute error) as my accuracy metric, as in the sketch below.
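As a rough sketch, the setup looks something like this in Keras (the layer sizes and the feature count here are placeholders I picked for illustration, not the actual network):

```python
import tensorflow as tf

# Hypothetical feature count; the real inputs are the offensive/defensive
# stats described above.
NUM_FEATURES = 40

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),  # single continuous output: the point spread
])

# MSE as the loss to train on, MAE as the human-readable "points off" metric.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```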

I will most likely be tweaking the network layers and nodes for a while to see if I can increase my accuracy. Right now, I am around 8 points away. I would like to see this under a touchdown (7).

 

Master Theorem

Since I started tutoring college-level computer science, I have had to relearn a lot of things that I haven't used since college (both undergrad and masters). One of them is the Master Theorem, which is used to analyze the running time of divide-and-conquer algorithms.

Every time I look at these I have to take a minute to remind myself how to determine the run times. So, this post is to handle just that.

First, the general form of the equations:

T(n) = aT(\frac{n}{b}) + f(n)

This has two parts. The first, aT(\frac{n}{b}), is the sub problem part where the algorithm does the divide-and-conquer. The second part, f(n), is the time it takes to divide the problem and combine the results.

There are 3 cases to determine the running time of this algorithm. Each of them is determined by the time it takes to run each part.

Determine the "cost" of each part of the equation:
To determine the cost of the sub problems you need to solve log_b a.
To determine the cost of the combine step you need to find 'c', the exponent of n in f(n), which plays a different role in each case.

Case 1: f(n) = O(n^c)
Case 2: f(n) = O(n^c log^k n)
Case 3: f(n) = \Omega(n^c)

Case 1: When the work to combine the problem is dwarfed by the sub problem.

aT(\frac{n}{b}) > f(n)

Case 2: When the work to combine the problem is comparable to the sub problem.

aT(\frac{n}{b}) \approx f(n)

Case 3: When the work to combine the problem dominates the sub problem

aT(\frac{n}{b}) < f(n)

To start I am going to work through a single instance of each (from wikipedia) and then give multiple examples of each.

Case 1 Example:
T(n) = 8T(\frac{n}{2}) + 1000n^2
First, we need to determine the variables a, b, and c from the recurrence.
Here, a = 8 and b = 2.
For c, f(n) = 1000n^2. Since we know f(n) = O(n^c), we get c=2.

For case 1 we need to have log_b a > c
Doing the math, log_2 8 = 3, which is greater than the 2 from c. This confirms we are in case 1.
Using the formula T(n) = O(n^{log_b a}) we get O(n^3)

Case 2 Example:
T(n) = 2T(\frac{n}{2}) + 10n
First, we need to determine the variables a, b, and c from the recurrence.
Here, a = 2 and b = 2.
For c, f(n) = 10n. Since we know f(n) = O(n^c log^k n), we get c=1 and k=0.

For case 2 we need to have log_b a \approx c
Doing the math, log_2 2 = 1, which is the same as the 1 from c. This confirms we are in case 2.
Using the formula T(n) = O(n^c log^{k+1} n) we get O(n log n)

Case 3 Example:
T(n) = 2T(\frac{n}{2}) + n^2
First, we need to determine the variables a, b, and c from the recurrence.
Here, a = 2 and b = 2.
For c, f(n) = n^2. Since we know f(n) = \Omega(n^c), we get c=2.

For case 3 we need to have log_b a < c
Doing the math, log_2 2 = 1, which is less than the 2 from c. This confirms we are in case 3.
Using the formula T(n) = O( f(n) ) we get O(n^2)
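These checks are mechanical enough to put in code. Below is a rough Python cheat-sheet function (my own sketch; it assumes f(n) = n^c log^k n and ignores the regularity condition and odd subcases like n / log n):

```python
import math

def master_theorem(a, b, c, k=0):
    """Classify T(n) = a*T(n/b) + f(n), assuming f(n) = n^c * (log n)^k."""
    e = math.log(a, b)  # log_b a, the "cost" of the sub problems

    if math.isclose(e, c):
        # Case 2: both parts are comparable, so we gain one log factor.
        return f"O(n^{c} log^{k + 1} n)"
    if e > c:
        # Case 1: the sub problems dominate.
        return f"O(n^{e:g})"
    # Case 3: the combine step dominates.
    return f"O(n^{c} log^{k} n)" if k else f"O(n^{c})"

# The three worked examples above:
print(master_theorem(8, 2, 2))  # O(n^3)
print(master_theorem(2, 2, 1))  # O(n^1 log^1 n), i.e. O(n log n)
print(master_theorem(2, 2, 2))  # O(n^2)
```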

Samples:

Now, I am going to bang through a few examples of each case. Cases 1 and 3 are pretty straightforward, but we start getting some interesting cases in 2.

Each of these will be broken into 4 columns. The first column is the formula, the second is the result of log_b a, the third is the value of c, and the last is the resulting notation.

These were pulled from Abdul Bari’s YouTube channel.

Case 1 Samples:
T(n) = 2T(\frac{n}{2}) + 1 -> log_b a = 1 -> c=0 -> O(n^1)
T(n) = 4T(\frac{n}{2}) + 1 -> log_b a = 2 -> c=0 -> O(n^2)
T(n) = 4T(\frac{n}{2}) + n -> log_b a = 2 -> c=1 -> O(n^2)
T(n) = 8T(\frac{n}{2}) + n^2 -> log_b a = 3 -> c=2 -> O(n^3)
T(n) = 16T(\frac{n}{2}) + n^2 -> log_b a = 4 -> c=2 -> O(n^4)

Case 3 Samples:
T(n) = T(\frac{n}{2}) + n -> log_b a = 0 -> c=1 -> O(n)
T(n) = 2T(\frac{n}{2}) + n^2 -> log_b a = 1 -> c=2 -> O(n^2)
T(n) = 2T(\frac{n}{2}) + n^2 log n -> log_b a = 1 -> c=2 -> O(n^2 log n)
T(n) = 4T(\frac{n}{2}) + n^3 log n -> log_b a = 2 -> c=3 -> O(n^3 log n)
T(n) = 2T(\frac{n}{2}) + \frac{n^2}{log n} -> log_b a = 1 -> c=2 -> O(n^2)

Case 2 Samples:
Remember, you are multiplying f(n) times log n.
T(n) = T(\frac{n}{2}) + 1 -> log_b a = 0 -> c=0 -> O(log n)
T(n) = 2T(\frac{n}{2}) + n -> log_b a = 1 -> c=1 -> O(n log n)
T(n) = 2T(\frac{n}{2}) + n log n -> log_b a = 1 -> c=1 -> O(n log^2 n)
T(n) = 4T(\frac{n}{2}) + n^2 -> log_b a = 2 -> c=2 -> O(n^2 log n)
T(n) = 4T(\frac{n}{2}) + (n log n)^2 -> log_b a = 2 -> c=2 -> O(n^2 log^3 n)
T(n) = 2T(\frac{n}{2}) + \frac{n}{log n} -> log_b a = 1 -> c=1 -> O(n log log n)
T(n) = 2T(\frac{n}{2}) + \frac{n}{log^2 n} -> log_b a = 1 -> c=1 -> O(n)

Conclusion:

I hope this helps anyone who is struggling through this stuff. Writing it all down with pen and paper is what finally made it click for me.

500lb Deadlift PR

Workout

I changed up my workout on January 7th of this year to something called Greasing the Groove, from Soviet strength coach Pavel Tsatsouline. The idea is that you train as often as possible while staying as fresh as possible. You don't do your typical workout where you wipe out the muscle and your central nervous system.

For me, that is training every day during the week but only doing 2 sets of 5 at a weight that I can do 10 times. For deadlift this is 315, and for bench press it is 225. I get a little stress from them, but I am never sore, so I can pick back up the next day. One nice side effect: I don't have great grip strength, and doing 2×5 at 315 pushes my grip, so that is getting better as well.

This workout is perfect for me at my age (38) and goals. I don’t need some crazy CrossFit style workout. First, I don’t need that intensity for what I need to accomplish. More power to those people but that isn’t for me. Second, I need to have a workout that will be able to go with me through life.

Deadlift

I didn’t max out before I started but I was probably at 425×3 which comes out to a max of about 450. In March (10 weeks in), I got 455×4 which is calculated at 496. But, with multiple reps I do tend to get a little bounce at the bottom. Now, at 17 weeks, I hit a single rep of 500.

Bench Press

With bench, I can't make the same comparisons since I was using dumbbells up until recently. When I started I was doing 240×6, and at week 10 I did that same weight 8 times, which comes out to about 288lbs. Since I don't have to worry about getting stuck under the weight with this workout, I switched to the traditional bench press. Before this, I had only gotten 300lbs once, and that was in college when I was doing some crazy bench workout designed only to grow your max. Last week I easily hit 315, and I could maybe have put on an extra 10 pounds.

Goals

My goals for the future are to hit 550 on deadlift and 350 on bench by my birthday in August (~18 weeks). It is a pretty large jump at these weights for my age/body. I assume that I will need to up my current workout weight since I am getting stronger.

Here is the YouTube video: https://youtu.be/nwPp1kZV_4c

How I almost became a billionaire

First, some background information.

  • A regression problem is when your neural network outputs a continuous value. Think housing prices, stock market value, etc.
  • A spread in sports is what the sports book thinks the difference in the final score of a game will be. For example, if Team 1 ends with 24 points and Team 2 ends with 21 points, the spread was -3 for Team 1.

I am assuming that some readers can guess where this is going. After doing the DevPost project on college basketball, I tried to mess around with NFL scores. Going in, I knew that I wouldn't actually get anything of value, but it would be "fun" to try. Anyway, I started training my model and I was getting amazing results: I was within a quarter of a point. If I actually had a model that could do that, I would be a billionaire and would pretty much shut down the sports betting market.

Well, since I am still here and not on a personal island you can assume I messed up. It turns out that I had skipped the step where I removed the ACTUAL SCORE OF THE GAME from my training data. The network picked up on this and was ignoring all my other inputs.

Back to work, I guess.

K Nearest Neighbor in TF v2

As I am going through the Google Developer Expert process, I was tasked with coding up KNN in TensorFlow. I couldn't find an example online, so I decided to create one.

The gist was that I could use *tf.math.squared_difference* to measure the per-feature differences. Then, I used reduce_sum to combine the 4 feature differences into a single number. Finally, I reversed the values using tf.negative and called tf.math.top_k to grab the nearest neighbors. Pretty straightforward, as the sketch below shows.
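Here is a minimal sketch of those steps (my own condensed version, not the exact code from the notebook below; the majority-vote at the end is my assumption about the final classification step):

```python
import tensorflow as tf

def knn_predict(train_x, train_y, query, k=5):
    """Label `query` by majority vote of its k nearest training points."""
    # Per-feature squared differences between the query and every training row.
    diffs = tf.math.squared_difference(train_x, query)  # (n_samples, n_features)
    # Collapse the feature differences into a single distance per sample.
    dists = tf.reduce_sum(diffs, axis=1)                # (n_samples,)
    # top_k finds the LARGEST values, so negate to get the k smallest distances.
    _, neighbor_idx = tf.math.top_k(tf.negative(dists), k=k)
    # Majority vote among the k nearest labels (my assumed final step).
    votes = tf.gather(train_y, neighbor_idx)
    labels, _, counts = tf.unique_with_counts(votes)
    return labels[tf.argmax(counts)]
```

The "4 feature differences" mentioned above suggest the iris data set, in which case train_x would be a (samples, 4) float tensor and train_y the integer class labels.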

GitHub: https://github.com/ehennis/Blog/blob/master/TensorFlow/K_Nearest_Neighbor.ipynb

YouTube: https://youtu.be/dA7QK80jTzA