In the classical supervised learning model, one chooses between possible scenarios that are external to the subject, e.g. rainy day vs. non-rainy day. In reinforcement learning, the agent has to make choices itself, moving from one state to another. It then becomes a matter of keeping track of the outcome of every state and action pair. The Markov chain becomes a Markov decision process.
Our goal, then, is to learn a function Q(state, action) that estimates the reward we can expect from taking a given action in a given state.
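As a rough sketch (my own naming, not necessarily the assignment's), the standard Q-learning update nudges the old estimate toward the new evidence, with alpha controlling the step size:

def update_q(q, state, action, reward, best_future, alpha=0.5):
    # q is a dictionary keyed by (state, action); unseen pairs default to 0.
    old = q.get((state, action), 0)
    # Move the old estimate toward (reward + best future reward) by a step of alpha.
    q[(state, action)] = old + alpha * (reward + best_future - old)
    return q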
It seems appropriate, then, to implement greedy decision-making, i.e. always choose the move that maximizes the expected reward. But a purely greedy agent might get stuck, never exploring enough to find the optimal decision. One thus needs to reintroduce an element of randomness so that the robot/agent occasionally moves at random (epsilon-greedy).
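Something like this (again, names are my own for illustration): with probability epsilon pick a random legal action, otherwise pick the best-known one.

import random

def choose_action(q, state, actions, epsilon=0.1):
    # Explore: occasionally try a random legal action.
    if random.random() < epsilon:
        return random.choice(list(actions))
    # Exploit: otherwise take the action with the highest known Q value.
    return max(actions, key=lambda a: q.get((state, a), 0))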
The lecture illustrates the idea, and the assignment is to complete some of the Nim functions... (Cute game: each player removes as many dots as desired from a single row on each move; whoever removes the last dot loses.)
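One natural way to represent the game (an assumption on my part, not necessarily the distribution code's representation) is a list of row sizes, with an action being a (row, count) pair:

def available_actions(piles):
    # All legal moves: pick a row i and remove between 1 and piles[i] dots.
    actions = set()
    for i, pile in enumerate(piles):
        for count in range(1, pile + 1):
            actions.add((i, count))
    return actions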
The computer then trains by playing 10,000 games against itself, with alpha at 0.5 (the factor for updating the value of a move from new information) and epsilon at 0.10 (how often a random move will be tried). The computer should be a formidable opponent after this practice...
In the case of Nim, it's pretty cut and dried that the computer can keep track of everything, since the state space is small enough to enumerate. In a larger game space, one might instead use approximations that recognize patterns rather than storing a value for every state.