There are few things for frustrating to me in the machine learning/AI world than seeing Buzzfeed type companies write about ML/AI. Sites like that burn down the actual facts into clickbait titles. It seems any “learning” that occurs is a step away from Skynet. I honestly don’t have a single site that I trust that I can go to and see the facts displayed. I always have to fall back on the technical paper that was hopefully written or something from the actual authors.
I come from an ML background with very little AI experience but most of the latest advancements are similar to Reinforcement Learning that I can piece it together. So, this is my attempt to do just that.
Poker and Machines
I have long wanted to be part of something that would be able to “solve” poker. The entire state and action space fascinate me. I was always of the opinion that with infinite knowledge you could beat emotional players. It always seemed the best and most consistent players were the mellow mathematicians that approached the game like a math problem. At each step, there are known percentages on different plays. A computer with a much larger memory than we could ever hope could have all of this available.
With Google doing so well with Chess and Go I figured it was fairly close to getting poker. Obviously, poker doesn’t have “perfect” information since you don’t know what the opponent has in their hand.
Learning From Machines
My biggest excitement going forward is how humans can learn from machines. It has been stated that the top chess players have learned a few new opening strategies after playing AlphaZero.
EXCLUDING THE ETHICAL ASPECT, I am curious to see what we could learn on the battlefield from an advanced war simulation. If there is something out there that could save lives during our ongoing battles that humans have never even thought about.
From what I can tell, the researchers at CMU had the same problems that early Q-learning did where the action space and state space were too large to handle in a traditional array. Q-learning went towards neural networks eventually but early on they did discretize the input. That is what Pluribus is doing with the actions and information.
For the action state they group the bet sizes (think $105 is the same as $100 or $110) and then during play they use a search algorithm to narrow down the decision.
For the information collected they group similar hands together IN FUTURE ROUNDS since they are played the same. But, in the current round they use the exact hand.
Offline Training (Blueprint)
To build the offline model they used what is common called counterfactual regret (CFR) minimization. Basically, after the hand is over they go back through and decide what “should” have been done and how it affected the outcome. What is new (to me at least) is they used a Monte Carlo simulation to sample actions versus traversing through the entire game. Because they have all the players in the offline game the AI knows all of this information.
CFR guarantees the overall interactions converge to a nash equilibrium in a zero-sum game.
CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. This, in turn, guarantees in the limit that the average performance of CFR on each iteration that was played matches the average performance of the best single fixed strategy in hindsight. CFR is also proven to eliminate iteratively strictly dominated actions in all finite gamesSuperhuman AI for multiplayer poker
According to the paper the training was done of 8 day on a 64-core server with 12,4000 CPU core hours. They used less than 512 GB of memory. The assume current cloud rates and said it would cost them about $144 to produce.
Because the offline playing is “course” because of the complexity of poker they only use it for the first round where they considered it safe to use. After that, or if the player confuses it by betting oddly, they use real time search.
While real time search has been successful in perfect information games there is a problem in imperfect information games. The paper says it is “fundamentally broken”. There is example is Rock/Paper/Scissors. In my previous blog post I showed that the expected value of each move is 1/3. Using the search algorithm you would think each move is the same and will just pick scissors each time. This would cause an issue as the other player would know this and win every time with rock.
There were 2 alternatives that were used in the past. AI DeepStack determined the leaf nodes value based on the strategy used to get to the leaf. This doesn’t scale with a large tree. The other alternative was used in AI Libratus who would only use the search when they could extend to the end of the game. This wouldn’t work with the addition of extra poker players.
Pluribus instead uses a modified form of an approach that we recently designed—previously only for two-player zero-sum games (41)— in which the searcher explicitly considers that any or all players may shift to different strategies beyond the leaf nodes of a subgame. Specifically, rather than assuming all players play according to a single fixed strategy beyond the leaf nodes (which results in the leaf nodes having a single fixed value) we instead assume that each player may choose between k different strategies, specialized to each player, to play for the remainder of the game when a leaf node is reached.Superhuman AI for multiplayer poker
They used 4 strategies that were all based on the baseline blueprint strategy. The first was the actual blueprint, the second was the blueprint with a bias towards folding, the third is a bias toward raising, and the fourth is a bias towards calling.
Another issue of the imperfect game is around bluffing. The optimal strategy with the best hand it to play and the worst hand is to fold. If you do this all the time the other players will know what you have. To help solve this, Pluribus determines the probability of reaching the current point with each possible hand. It then creates a balanced strategy and does that action.
Pluribus was tested against elite human players (won at least $1 million playing poker) in 2 formats. The first was 5 v 1 with 5 players and the second was with 5 Pluribus instances.
The measurement was the number of milli big blinds per game. This is the measurement of the number of big blinds won per 1k hands of poker. Pluribus ended up winning 48mbb/game which is considered very high and shows it is “better” than the players it was against.
I hope this helps clear the hype around most articles written about the subject and maybe a little easier to read than the Science article linked above.
I am excited to see if this will lead any poker players to change their game or if this will kill the online poker world.