Reinforcement Learning Tic Tac Toe with Value Function

A simple reinforcement learning algorithm for agents to learn the game tic-tac-toe. This project demonstrates the purpose of the value function.

You begin by training the agents: two agents (agent X and agent O) are created and trained against each other in simulation. The two agents play a number of games determined by the 'number of episodes' setting, and you can adjust 1) the learning rate and 2) the probability of exploration of each agent. After training, you can play against agent X.
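
The project's own implementation isn't reproduced here, but the idea can be sketched roughly as follows: each agent keeps a table mapping board states to estimated win probabilities, picks its next state epsilon-greedily over those estimates, and nudges each visited state's value toward the value of the state that follows it. The Agent class, method names, and default parameters below are illustrative assumptions, not the project's actual API.

```python
import random

class Agent:
    """Minimal tabular value-function agent (illustrative sketch only)."""

    def __init__(self, learning_rate=0.2, epsilon=0.05):
        self.learning_rate = learning_rate  # how strongly each result shifts the estimates
        self.epsilon = epsilon              # probability of exploring instead of exploiting
        self.values = {}                    # board state (as a tuple) -> estimated win probability

    def value(self, state):
        # States never seen before start with a neutral 0.5 estimate.
        return self.values.get(state, 0.5)

    def choose_state(self, candidate_states):
        # Explore: with probability epsilon, pick a random legal next state.
        if random.random() < self.epsilon:
            return random.choice(candidate_states)
        # Exploit: otherwise pick the next state with the highest estimated value.
        return max(candidate_states, key=self.value)

    def update(self, state, next_state):
        # Temporal-difference update: move V(state) part of the way toward V(next_state).
        v = self.value(state)
        self.values[state] = v + self.learning_rate * (self.value(next_state) - v)
```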

The nine percentages on the right show the agent's value for each cell at the current state. Don't give up when playing against the trained agent, because as humans we can actually outsmart it. Try it! The agents are still learning as you play: you will notice the percentages of cells that used to be very confident dropping over time if the agent fails to win for a few games.
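
To see why a confident cell's percentage drops, note that each loss pulls the values of the states the agent visited toward the losing outcome (0). A tiny illustration, using the same style of update and an assumed learning rate of 0.2:

```python
# Illustration only: how a once-confident estimate decays under repeated losses.
value, learning_rate = 0.95, 0.2
for game in range(5):
    value = value + learning_rate * (0.0 - value)  # terminal value of a loss is 0
    print(f"after loss {game + 1}: {value:.2f}")
# Prints roughly 0.76, 0.61, 0.49, 0.39, 0.31 -- the confidence fades game by game.
```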

Demo

(Demo UI: a Train Agent control with the number of games the agents play against each other in simulation, a Play against agent mode with a skill-level indicator, and a 3x3 grid showing the Agent's Value Function.)

Evaluate Epsilon Greedy

In order to figure out the most suitable epsilon-greedy value, I test a range of epsilon values. First, agent X is initialised with an epsilon of 0.01, which means there is a 1% probability that agent X chooses to explore instead of exploit. The agents then play against each other for 10,000 games, and I record the number of times agent X wins. After 10,000 games, I increase the epsilon value to 0.02 and the agents play another 10,000 games, and so on.
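
A rough sketch of this sweep, reusing the illustrative Agent class from above. Here `play_games` is a hypothetical stand-in for the project's own simulation loop (it should return agent X's win count), and the 0.01-to-1.00 range is an assumption about how far the sweep runs.

```python
def sweep_epsilon(play_games, fixed_epsilon=0.05, episodes=10_000):
    """Record agent X's win count for each epsilon value it is given."""
    results = []
    epsilon = 0.01
    while epsilon <= 1.0:                       # assumed sweep range: 0.01, 0.02, ..., 1.00
        agent_x = Agent(epsilon=epsilon)        # the epsilon value under test
        agent_o = Agent(epsilon=fixed_epsilon)  # opponent with a fixed epsilon
        results.append((epsilon, play_games(agent_x, agent_o, episodes)))
        epsilon = round(epsilon + 0.01, 2)      # increase epsilon before the next 10,000 games
    return results
```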

Agent X (eps increment) vs Agent O (eps 0.05)

The blue line is the number of times agent X won against agent O. The win count peaks at 9,268 games at an epsilon of 0.05 (agent X explores 5% of the time) and decreases as epsilon increases further. Agent O begins to win more games once agent X explores more than 50% of the time.


Agent O (eps increment) vs Agent X (eps 0.05)

Let us try something else: experimenting with different epsilon values for agent O, challenging an agent X that keeps the optimal 5% epsilon.

Given that agent X has the optimal epsilon and the advantage of moving first, agent O keeps losing before it is able to learn the game.


Agent O (eps increment) vs Agent X (eps 1)

Let's tweak agent X's epsilon to 100%, so that agent X plays random actions all the time.

Agent O is now able to learn the game and win against agent X, peaking at an epsilon of 4% and beginning to lose at 30%.
