Infinitely many states and no memory

Markov Decision Processes (MDPs) are Markov chains plus nondeterminism: some states are random, the others are controlled (nondeterministic). In the pictures, the random states are round, and the controlled states are squares:
The random states come with a distribution over successor states, but in the controlled states a controller chooses a successor state (or a probability distribution over the successor states). For instance, the controller could stay in the leftmost column forever by always choosing to go one state down. Or the controller could go right at some point; in the random state a successor is picked randomly, either the initial state or the state on the right, according to the blue probabilities.
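To make this concrete, here is a small sketch of how such an MDP could be represented in code. The state names and probabilities are placeholders of my own, not the ones from the picture; the point is only the shape of the data: controlled states offer a choice, random states carry a distribution.

```python
import random

# Sketch of an MDP: controlled (square) states map to their allowed successors,
# random (round) states map to a probability distribution over successors.
# All names and probabilities below are illustrative placeholders.
controlled = {
    "c0": ["c1", "r0"],   # e.g. go one state down the column, or go right
    "c1": ["c0", "r0"],   # (the real left column continues downwards forever)
}
randoms = {
    "r0": {"c0": 0.5, "g0": 0.5},   # back towards the start, or on to a green state
    "g0": {"g0": 1.0},              # a green self-loop
}

def step(state, choice=None):
    """One step of the play: the controller picks a successor in controlled
    states, nature samples one in random states."""
    if state in controlled:
        assert choice in controlled[state]
        return choice
    dist = randoms[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```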

What objective should the controller aim at? In this post, the objective will be the following: visit green states infinitely often, and red states only finitely often. Here is the previous MDP with colours:
Since the controller wants to visit (infinitely many) green states, it needs to leave the left column at some point. But once it leaves the left column, there is a risk of visiting the red state before returning to the initial state. Visiting the red state is not a disaster, but it shouldn't happen infinitely often.

How should the controller play then? Let's say a new round starts whenever we return to the initial state. Consider the following strategy: in each round go one state further down the left column before going right. Under that strategy, the controller visits green states infinitely often. One can show that, with probability $1$, the red state will be visited only finitely often. So under that strategy, the controller achieves the objective almost surely, i.e., with probability $1$. Achieving the objective almost surely is the gold standard in this post: we don't settle for less. We call a strategy winning if it achieves the objective almost surely.
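To see why only finitely many visits to the red state occur, here is a sketch under an assumption about the blue probabilities (they live in the picture, not in the text): say that going $n$ states down before turning right bounds the probability of hitting the red state in round $n$ by $2^{-n}$. Then
$$\sum_{n=1}^{\infty} \Pr[\text{red is visited in round } n] \;\le\; \sum_{n=1}^{\infty} 2^{-n} \;=\; 1 \;<\; \infty,$$
so by the Borel–Cantelli lemma (which needs no independence between rounds) the red state is, with probability $1$, visited in only finitely many rounds.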

The described winning strategy (go ever further down) requires infinite memory: each time the controller visits the initial state, it needs to remember how far it went down in the previous round. In contrast, every finite-memory strategy loses with probability $1$. I won't define memory here, only "no memory": A strategy is MD (memoryless deterministic) if the controller assigns to each controlled state $s$ a fixed successor $s'$ that it then chooses whenever it visits state $s$ (regardless of the history of the play). In the example above, every MD strategy loses almost surely (either because it never leaves the left column, or because it visits the red state infinitely often).
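In the toy representation above, an MD strategy is nothing more than a lookup table from controlled states to successors. The winning strategy from the previous paragraph cannot be written in this form, because its choice at the initial state changes from round to round.

```python
# An MD strategy: one fixed successor per controlled state, no history involved.
# (State names refer to the placeholder MDP sketched above.)
md_strategy = {
    "c0": "c1",   # always go down at c0
    "c1": "r0",   # always go right at c1
}

def md_choice(state):
    # The choice depends only on the current state, never on the history.
    return md_strategy[state]
```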

In this post we focus on MDPs where the controller has some winning strategy. Any state that could be visited by this strategy must again have a winning strategy, because otherwise the winning strategy could not afford to visit the state. In the following we ignore all other states, i.e., we can assume that our MDP has only states for which there exists a winning strategy. Under what circumstances is there also a winning strategy that is MD? In this post I'll show:
Consider an MDP where for every state there is a winning strategy. If the MDP does not have white states (i.e., all states are green or red) then there is also an MD winning strategy. The same MD winning strategy works, regardless of which state is the start state.
The example above shows that the no-white-states assumption cannot be dropped. Absence of white states means that visiting red states only finitely often is effectively the only goal: once the play stops visiting red states, every state it visits is green, so green states are automatically visited infinitely often. A promising strategy might be a safety strategy: in each state minimize the probability of ever visiting a red state again. The appeal of a safety strategy is twofold:
  • If a safety strategy succeeds in never visiting a red state again, then clearly it visits red states only finitely often.
  • There are safety strategies that are MD (provided each controlled state has only finitely many successors, which we assume in this post).
Safety strategies do not work though. Consider the following MDP and imagine that after the first random state the controller is unlucky and reaches not the first green self-loop but the second red state.
The controller might reason: "Ok, I've been unlucky and am now in a red state. But if I go down one state, there is a chance of $1/4$ that this was the last red state that I ever see. So let's give it a go." However, with probability $3/4$ the controller will be unlucky again. The controller might keep pushing its luck by playing "down", but the dream of reaching a green sink might move further and further away. Actually, the probability that this strategy (the safety strategy) succeeds is less than 72%. Compare this with the opposite strategy, which always returns to the initial state: that strategy is MD as well and succeeds almost surely! We conclude that playing only for safety is too impatient.
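Where does the 72% come from? Under one reading of the picture (my assumption, consistent with the $1/4$ and $3/4$ above): at the $i$-th random state on the way down, the play escapes to a green self-loop with probability $2^{-i}$ and otherwise runs into the next red state. The success probability of stubbornly playing "down" is then $1-\prod_{i\ge 1}(1-2^{-i})$, which a quick computation puts just above 71%:

```python
# Success probability of the "keep pushing down" safety strategy, assuming the
# i-th escape probability is 2**-i (my reading of the picture, not stated in the text).
fail = 1.0                     # probability of being unlucky at every level so far
for i in range(1, 60):         # factors beyond i = 60 change nothing numerically
    fail *= 1 - 2 ** (-i)
print(1 - fail)                # ~0.7112, i.e. less than 72%
```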

The following picture visualizes the situation more abstractly. There I've arranged the states according to their safety level. By safety level, I mean the probability of never visiting another red state, assuming a safety strategy is played (i.e., a strategy that maximizes the probability of never visiting another red state). A safety strategy never picks a successor state with a smaller safety level, as indicated in grey:
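As an aside: for a finite MDP, the safety levels can be approximated by a standard value iteration that starts at $1$ and pushes values down, with red successors contributing $0$, controlled states taking the maximum over their successors, and random states averaging. The MDPs in this post may be infinite, so the following is only a finite toy sketch, with invented states and probabilities:

```python
# Approximate safety levels on a toy finite MDP by value iteration from above.
# v[s] ~ maximal probability of never visiting a red state strictly after now.
# All states, colours and probabilities below are invented placeholders.
controlled = {"c0": ["c1", "r0"], "c1": ["c0", "r1"]}          # squares
randoms    = {"r0": {"c0": 0.5, "bad": 0.5},                   # circles
              "r1": {"c1": 0.25, "bad": 0.75},
              "bad": {"c0": 1.0}}                              # 'bad' is the red state
red = {"bad"}
states = list(controlled) + list(randoms)

v = {s: 1.0 for s in states}
for _ in range(1000):                      # iterates decrease towards the true levels
    def val(succ):                         # contribution of moving to succ
        return 0.0 if succ in red else v[succ]
    v = {s: max(val(t) for t in controlled[s]) if s in controlled
            else sum(p * val(t) for t, p in randoms[s].items())
         for s in states}
print(v)                                   # an upper bound after finitely many steps
```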
By the example above, a safety strategy is not necessarily winning. What will work better is to fix a safety strategy only for the states with safety level at least $1/3$. In this way we obtain another MDP. For the less safe states, the controller is free to choose any outgoing transition:
Consider the MDP (visualized above) obtained by fixing a safety strategy for the states with safety level at least $1/3$ (the blue area). In this MDP, for every state there is a winning strategy.
Consider any state. Since there is a winning strategy in the original MDP (before having fixed anything), the controller can play this strategy until the play reaches the blue area. At this point the controller switches to a safety strategy (which is consistent with what has been fixed in the blue area). Now the controller has a chance of at least $1/3$ of never visiting red again and thus achieving the objective. If the play does reach a red state, the controller reverts to a strategy that is winning for that red state in the original MDP. The controller follows that strategy until the play again reaches the blue area. At this point the controller switches again to a safety strategy, thus getting a fresh chance of at least $1/3$ of avoiding red forever. Continuing in this way, the controller achieves the objective almost surely. Note that this strategy is not MD, as the controller needs to remember in which phase it is: at any point it either follows a winning strategy of the original MDP, or it follows a safety strategy.
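To spell out the last step: each time the controller switches to the safety strategy inside the blue area, the conditional probability of ever seeing red again is at most $2/3$, so
$$\Pr[\text{red is seen after each of the first } n \text{ switches}] \;\le\; \left(\tfrac{2}{3}\right)^{n} \;\longrightarrow\; 0 \quad (n\to\infty).$$
And if after some red visit the play never returns to the blue area, then from that point on the controller is simply following a winning strategy of the original MDP, so red is seen only finitely often on that event as well. Either way, red is seen only finitely often with probability $1$, and since there are no white states, green is then seen infinitely often.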
This shows that by freezing the blue area with a safety strategy we have not yet screwed up. I still need to define an MD strategy for the non-blue area. But bear with me for a moment and consider a state with even higher safety level, at least $2/3$. Such a state is in the blue area, so a safety strategy has been fixed there. Two things might happen:
  • Either the play remains in the blue area forever and never visits a red state;
  • or it eventually leaves the blue area or visits a red state.
The two cases are visualized here:
The second case (dropping below safety level $1/3$ or visiting a red state) cannot have very large probability: after all, we start in the dark-blue area with safety level at least $2/3$, so if the safety level is very likely to drop below $1/3$, we can't have been very safe to start with. Doing the maths shows that the probability of the second case is at most $1/2$. So starting in the dark-blue area, the probability of avoiding red forever is at least $1/2$.
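Here is the maths, as a sketch. Write $q$ for the probability of the second case. While the play stays in the blue area, the fixed safety strategy preserves the safety level in expectation. At the moment the second case occurs, the remaining chance of never seeing red again is at most $1/3$: it is $0$ if a red state has just been visited, and below $1/3$ if the play has just dropped out of the blue area. Comparing with the starting level of at least $2/3$ gives
$$\tfrac{2}{3} \;\le\; \underbrace{(1-q)\cdot 1}_{\text{first case}} \;+\; \underbrace{q \cdot \tfrac{1}{3}}_{\text{second case}}, \qquad \text{hence} \qquad q \;\le\; \tfrac{1}{2}.$$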

Here is our hope: If we could always return to that dark-blue area almost surely, then each time we arrive there we would get a fresh chance of at least $1/2$ to avoid red forever. In this way we would achieve the objective almost surely. This is illustrated here:
But is this just a pipe dream? After all, in the light-blue area we have already fixed a safety strategy. Perhaps that safety strategy prevents us from reaching the dark-blue area? Perhaps playing for safety in the light-blue area is greedy and short-sighted, and we should rather swallow the bitter pill and visit some red states for the greater benefit of reaching the dark-blue area?

No, we are actually good. Remember that the proposition above says that after freezing the blue area there is still a winning strategy (albeit not necessarily MD) for each state. So the following proposition applies:
Consider an MDP where for each state there is a winning strategy. Then each state has a strategy to almost surely reach a state with safety level at least $2/3$.
Consider an arbitrary start state $s_1$ and an arbitrary strategy. Suppose that $s_1$ has safety level less than $2/3$, so the probability of visiting a red state from $s_1$ is at least $1/3$. It follows that there is a finite number $k_1$ such that the probability of visiting a red state within the next $k_1$ steps is at least $1/4$. After these $k_1$ steps we are in another state, say $s_2$. Suppose that $s_2$ also has safety level less than $2/3$. Then, similarly, there is a number $k_2$ such that the probability of visiting a red state within the next $k_2$ steps is at least $1/4$. Etcetera. It follows that the probability of never reaching safety level at least $2/3$ and visiting only finitely many red states is zero. So every winning strategy almost surely reaches safety level at least $2/3$.
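To make the "Etcetera" quantitative: the probability that the play completes the first $n$ blocks of $k_1, \dots, k_n$ steps without visiting a red state and without ever reaching safety level at least $2/3$ is at most
$$\left(\tfrac{3}{4}\right)^{n} \;\longrightarrow\; 0 \quad (n\to\infty),$$
so the probability of never reaching safety level at least $2/3$ and never visiting red again is $0$. Applying this after an arbitrary finite prefix of the play and taking the union over all prefixes shows that never reaching level $2/3$ while seeing red only finitely often has probability $0$.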
So after freezing the blue area, there is still for every state a strategy to almost surely reach the dark-blue area. It is known that if there is a strategy to achieve a reachability objective almost surely then there is also an MD strategy to achieve the reachability objective almost surely. Reaching the dark-blue area is obviously a reachability objective. So there is an MD strategy to almost surely reach the dark-blue area. As argued above, this strategy, combined with the safety strategy in the blue area, is winning.

In summary, here is our MD winning strategy: in the blue area play an MD strategy for safety (avoiding red states), and in the white area play an MD strategy for reaching the dark-blue area. The key point is that these two MD strategies do not conflict.
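In the toy representation from earlier, the combined strategy is literally a case split on the (precomputed) safety level. Here `safety_level`, `safety_md` and `reach_md` are stand-ins for the safety levels and for the two MD strategies whose existence is argued above; I leave them abstract.

```python
# Sketch of the combined MD winning strategy (names are stand-ins, see above).
def combined_md(state, safety_level, safety_md, reach_md):
    if safety_level[state] >= 1/3:       # blue area: play the MD safety strategy
        return safety_md[state]
    return reach_md[state]               # white area: head for the dark-blue area
```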

The material in this post is from a LICS'17 paper by Richard Mayr, Mahsa Shirmohammadi, Dominik Wojtczak, and myself.
