### An Alternate Formulation of Markov Decision Processes for Solving Probabilistic Games

Today I'm going to talk about one of the coolest things I know about: Markov Decision Processes (MDPs). I was formally introduced to
MDPs in my undergraduate AI course, but I had actually independently discovered something very similar when analyzing an unfair game. In this blog post,
I will generalize my formulation of MDPs, which shares many similarities with the traditional setup, but I believe my formulation is better suited for finding theoretically optimal strategies
for games with uncertainty.

### The Traditional Setup

In the traditional setup, there is some state space, \( S \), and an agent who moves around the state space with some degree of randomness and some degree of choice (actions). Different actions lead
to different transition probabilities for the agent. These probabilities are denoted by \( P_a(s, s') \), where \( a \) is the chosen action, \( s \) is the starting state, and \( s' \) is the state transitioned to.
As a side note, and to solidify your understanding of the last sentence, the following is an invariant: \( \sum_{s' \in S} P_a(s,s') = 1 \) for all states \( s \) and actions \( a \).
There is also a notion of a reward function \( R_a(s, s') \) that denotes the utility for the agent of going from state \( s \) to state \( s' \) through action \( a \).
Note that this is the first place where my formulation differs from the traditional.

Under the traditional setup, the goal is then to find a policy (an action for every state) that maximizes the expected value of \( \sum_{t=0}^\infty \gamma^t R_{a_t}(s_t, s_{t+1}) \), where \( 0 \leq \gamma < 1 \). This particular objective was chosen because it favors short-term rewards (the \( \gamma^t \) term decays exponentially to \( 0 \)) while still accounting for all (infinitely many) rewards.

The Value Iteration algorithm can be used to find the policy that maximizes this expected value. The algorithm seeks a function \( V(s) \) that denotes the value of a state (as determined by the expected value of the sum above). Such a function must satisfy the following equality: $$ V(s) = \max_{a} \sum_{s' \in S} P_a(s, s') \Big[ R_a(s,s') + \gamma V(s') \Big] $$ To determine the associated optimal policy, simply use the action \( a \) that attains the maximum in the evaluation of \( V(s) \) for the given state. If you expand out \( V(s) \) to a small fixed depth by plugging in its recursive definition, you will see that it is a very compact way to write out the expected value of the sum we are attempting to maximize. The value iteration algorithm finds \( V(s) \) iteratively, and it is guaranteed to converge provided that \( 0 \leq \gamma < 1 \).
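To make this concrete, here is a minimal sketch of value iteration on a tiny made-up MDP (the states, actions, probabilities, and rewards below are all illustrative, not from any real problem):

```python
# Value iteration on a tiny, made-up two-state MDP.
# P[a][s] maps s' -> P_a(s, s'); R[a][(s, s')] is the reward R_a(s, s').

GAMMA = 0.9

P = {
    "stay": {0: {0: 1.0}, 1: {1: 1.0}},
    "move": {0: {0: 0.3, 1: 0.7}, 1: {0: 0.7, 1: 0.3}},
}
R = {
    "stay": {(0, 0): 0.0, (1, 1): 1.0},
    "move": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0},
}

def q_value(V, s, a):
    # Expected reward plus discounted value of the next state.
    return sum(p * (R[a][(s, s2)] + GAMMA * V[s2])
               for s2, p in P[a][s].items())

def value_iteration(states, actions, tol=1e-9):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(q_value(V, s, a) for a in actions) for s in states}
        # Stop once successive iterates are (nearly) identical.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

V = value_iteration([0, 1], ["stay", "move"])
# The optimal policy picks the action attaining the max in V(s).
policy = {s: max(["stay", "move"], key=lambda a: q_value(V, s, a))
          for s in [0, 1]}
```

Here state 1 pays a reward of 1 for staying put, so the optimal policy stays at 1 (giving \( V(1) = 1/(1-\gamma) = 10 \)) and moves toward 1 from state 0.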

### A Different Formulation

I will now outline my formulation, which I believe is better suited for solving probabilistic games, where the optimal strategy for a given player is the one that maximizes their
probability of winning the game. I will describe the method assuming we are working with a single-player game. I believe it can be extended into a framework for solving multi-player games as well,
but I will not be talking about that in this blog post.

The first difference between my formulation and the traditional one is that mine doesn't assign value based on the actions taken and the states transitioned to. Instead, the value of a state is the probability of eventually reaching a winning state under the optimal strategy. Thus, there is no need for the function \( R_a(s, s') \). The second difference is that my formulation doesn't have \( \gamma \). My method directly optimizes the probability of eventually reaching a winning state - it doesn't matter whether that happens in 5 turns or 5000 turns. My formulation can be expressed in the traditional one by defining \( R_a(s,s') = 1 \) for all winning states \( s' \), defining \( R_a(s, s') = 0 \) for all other states, and setting \( \gamma = 1 \). Unfortunately, when \( \gamma = 1 \), value iteration is no longer guaranteed to converge.

Anyway, we can define a value function in this formulation similar to the one defined earlier. Let \( W \subseteq S \) be the set of all winning states and \( L \subseteq S \) be the set of all losing states. Then we have $$ V(s) = \begin{cases} 1 & s \in W \\ 0 & s \in L \\ \max_a \sum_{s' \in S} P_a(s, s') V(s') & \text{otherwise} \end{cases} $$ This function differs from the previous one in that there are base cases for winning and losing states, and the general case is somewhat simpler. However, this function is also less general than the previous one.

Finding the values of \( V(s) \) can be somewhat tricky, and unfortunately I don't have a general framework for doing it. In the case where the state space forms a directed acyclic graph (i.e., a cycle occurs with probability 0 under every policy), the values of \( V \) can be computed using a dynamic programming algorithm. Luckily, most games aren't going to have cycles, because a cycle would mean there is a chance the game never ends. In those situations, maybe it makes sense to use the original formulation (with \( \gamma < 1 \)) so that infinitely long games have value \( 0 \). In situations where the state space is infinite, like in this game, you will have to use techniques specific to the problem you are dealing with. I do not know of any general way to handle those types of problems.
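Here is a sketch of that dynamic programming algorithm on an invented toy game (the rules, target, and probabilities are mine, purely for illustration): you must reach a score of 4 within 3 turns; each turn, a "safe" action adds 1 point for sure, while a "risky" action adds 3 points with probability 0.4 and nothing otherwise. Because the number of remaining turns strictly decreases, the state space is a DAG and memoized recursion computes \( V \) exactly:

```python
from functools import lru_cache

# Toy game (entirely made up): reach TARGET points within MAX_TURNS turns.
TARGET, MAX_TURNS = 4, 3

# Each action is a list of (probability, points gained) outcomes.
ACTIONS = {
    "safe":  [(1.0, 1)],
    "risky": [(0.4, 3), (0.6, 0)],
}

@lru_cache(maxsize=None)
def V(score, turns_left):
    # Base cases: winning and losing states.
    if score >= TARGET:
        return 1.0
    if turns_left == 0:
        return 0.0
    # turns_left strictly decreases, so the state graph is acyclic and
    # this memoized recursion terminates; V(s) is the win probability
    # under the best action at each state.
    return max(
        sum(p * V(score + gain, turns_left - 1) for p, gain in outcomes)
        for outcomes in ACTIONS.values()
    )

win_probability = V(0, MAX_TURNS)
```

The recursion is a direct transcription of the value function above: the base cases handle \( W \) and \( L \), and the general case takes the best action's expected value over successor states.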

### Conclusion

Anyway, hopefully some of that makes sense. If this is the first time you've seen MDPs, I highly recommend learning more about them, as they are really pretty awesome. If you have seen them before, hopefully you
understand my formulation, and maybe you will find it useful if you are ever working out a strategy for a stochastic game. In a future blog post, I am going to apply this technique to solve Farkle,
a pretty fun multi-player dice game. A couple of years ago, I worked out a simplified version of the game but was unable to solve the actual game
due to limitations in knowledge and computational power.

