∫ The reign of expectation terror


Kernel:

  • In a betting game where the expected payoff says "never stop", a randomized policy attains the same expected payoff while stopping in finite expected time.

Humans are tenacious, persistent, and hardly know when to quit, whether it's deciding when to stop playing games and go to sleep, or when to sell a losing stock and cut our losses. In the vernacular of probability theory, this is called an "optimal stopping problem".

I've read a great blog post by Samir Khan: https://patternsofideas.github.io/posts/expectations/. He argued that in some scenarios, when it comes to choosing the optimal stopping point, not even a tool as great as the expected payoff helps.

∂ The game

Imagine a game of sequential bets. At each step we can stop and take all the credit we've collected so far, or we can take a chance to double it; if the chance fails, we lose it all.

{ "node": "r", "transition": { "p": { "node": "2r", "transition": { "p": { "node": "4r" }, "1-p": { "node": "0" } } }, "1-p": { "node": "0" } } }
$$ \begin{align*} E[\text{next payoff} | \text{bet}] &= p \cdot 2r + (1-p) \cdot 0 = 2pr \\ E[\text{next payoff} | \text{stop}] &= r \\ \end{align*} $$

We could do some math to decide whether we should continue at each step. If the chance \(p > 0.5\), the expected payoff favors taking the bet, since \(2pr > r\). Yet if there's anything we learn from probability, it's that we will eventually fail if we never stop: we either walk away empty-handed, or keep betting until we do. So expectation fails us! ... or does it?

The reason we use expectation is to marginalize out all uncontrollable factors, and that reason is pretty solid. If we follow the expectation yet the outcome does not make sense, perhaps the reward function does not truly reflect what we feel?

Let us consider a scenario where we don't actually play the game. We only input a number, say \(k\), that tells the game how many bets we would take; the game then resolves instantly and displays the outcome. When treating the game this way, I feel more comfortable choosing a larger \(k\), since I've now gotten rid of in-game cognitive biases.

$$ \begin{align*} E[\text{payoff} | \text{stop at}\,k] &= r(2p)^k \end{align*} $$
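As a quick sanity check, here's a Monte Carlo sketch in Python (my own, not from Khan's post; the name `play_fixed_k` and the parameter values are made up) comparing the sample mean against \(r(2p)^k\):

```python
import random

def play_fixed_k(r, p, k, rng):
    """One run of the game under the 'take exactly k bets, then stop' policy.

    Returns r * 2**k if all k bets succeed, and 0 as soon as one fails.
    """
    credit = r
    for _ in range(k):
        if rng.random() < p:
            credit *= 2      # bet won: the credit doubles
        else:
            return 0.0       # bet lost: everything is gone
    return credit

rng = random.Random(0)
r, p, k, n = 1.0, 0.6, 5, 200_000

mean_payoff = sum(play_fixed_k(r, p, k, rng) for _ in range(n)) / n
closed_form = r * (2 * p) ** k   # E[payoff | stop at k] = r (2p)^k
win_prob = p ** k                # probability of collecting anything at all

print(f"simulated mean payoff ~ {mean_payoff:.3f}")
print(f"closed form           = {closed_form:.3f}")
print(f"P(collect anything)   = {win_prob:.5f}")
```

The expected payoff grows like \((2p)^k\) while the chance of walking away with anything shrinks like \(p^k\), which is exactly the tension here.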

Humans have many cognitive biases. One active here is called loss aversion: when the stakes are high enough, gaining more might matter less than preventing loss. Thus we are inclined to stop and collect what we already have rather than risk going back to nothing. It seems the bias works against what expectation defines as our best interest.

Still, I would not choose \(k = \infty\), despite the fact that it gives the best expected payoff; there is only a vanishing probability that I would ever collect a handsome reward. Am I wrong? Are there some other cognitive biases at work here? If not, then maybe expectation fails me?

∂ With great expectation comes great policies

While I enjoy Khan's premise and arguments, I have to disagree with his conclusion. The problem with this game is not that expectation fails us, but that we're dealing with infinity. If you think about it, we will never reach infinity in real life, no matter how much time we have. What we should do instead is use the expectation to find better policies, ones that do not require betting to infinity to collect the reward.

Suppose that at each step, I toss a biased coin that comes up heads with probability \(q\). If it's heads, I continue playing; otherwise I stop and collect the current reward. What is the expected payoff of this policy?

{ "node": "r", "transition": { "q": { "node": "r", "transition": { "p": { "node": "2r", "transition": { "q": { "node": "2r", "transition": { "p": { "node": "4r" }, "1-p": { "node": "0" } } }, "1-q": { "node": "2r" } } }, "1-p": { "node": "0" } } }, "1-q": { "node": "r" } } }
$$ \begin{align*} E[\text{payoff} | q] &= r\frac{1-q}{1-q2p} \end{align*} $$
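This closed form follows from a one-step recursion. Writing \(V(r)\) for the expected payoff when the current credit is \(r\) (my notation, not Khan's), one round either stops or doubles the stake, and doubling the stake doubles every future payoff, so \(V(2r) = 2V(r)\):

$$ \begin{align*} V(r) &= (1-q)r + q\left(pV(2r) + (1-p) \cdot 0\right) \\ &= (1-q)r + q2pV(r) \\ \Rightarrow V(r) &= r\frac{1-q}{1-q2p} \quad \text{for} \quad q < \frac{1}{2p} \end{align*} $$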

The expected payoff of this random policy goes to infinity as \(q\) approaches \(\frac{1}{2p}\). Sometimes I secure the reward, sometimes I continue, sometimes I get nothing, but overall the expected payoff diverges just as it does when I never stop under the first policy. Now we no longer leave our fate in the hands of an infinitesimal god, but in the marvel of mathematics.
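For any \(q\) safely below \(\frac{1}{2p}\), the finite closed form can be sanity-checked by simulation (again my own sketch; `play_random_policy` and its `max_steps` safety cap are made-up names):

```python
import random

def play_random_policy(r, p, q, rng, max_steps=10_000):
    """One run under the coin-flip policy: continue with probability q, else stop."""
    credit = r
    for _ in range(max_steps):
        if rng.random() >= q:     # coin came up tails: stop and collect
            return credit
        if rng.random() < p:      # heads: take the bet
            credit *= 2
        else:
            return 0.0            # the bet busted
    return credit                 # safety cap; essentially never hit for q < 1/(2p)

rng = random.Random(1)
r, p, q, n = 1.0, 0.6, 0.4, 200_000   # q = 0.4 < 1/(2p) ~ 0.833

mean_payoff = sum(play_random_policy(r, p, q, rng) for _ in range(n)) / n
closed_form = r * (1 - q) / (1 - q * 2 * p)

print(f"simulated mean payoff ~ {mean_payoff:.3f}")
print(f"r(1-q)/(1-q2p)        = {closed_form:.3f}")
```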

Now, which policy is faster? Faster means I can expect the same reward while enjoying a shorter game. We can show analytically that, at the same expected payoff, the random policy's expected stopping point comes no later than the first policy's.

Given an expected payoff of \(xr > 0\), I could derive the parameters of both policies:
$$ \begin{align*} (2p)^k = x &\rightarrow k = \frac{\log x}{\log 2p} \\ \frac{1-q}{1-q2p} = x &\rightarrow q = \frac{1 - x}{1 - x2p} \end{align*} $$
The expected number of steps taken to stop under these parameters:
$$ \begin{align*} E[\text{steps} | k] &= \frac{p^k - 1}{p - 1} \\ E[\text{steps} | q] &= \frac{q}{1-qp} \end{align*} $$
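Both expected-step formulas can be checked the same way (my own sketch; the parameters here are arbitrary and not matched to equal expected payoff):

```python
import random

def steps_fixed_k(p, k, rng):
    """Number of bets taken under the stop-at-k policy; a bust ends the game early."""
    for i in range(k):
        if rng.random() >= p:
            return i + 1          # the losing bet still counts as a step
    return k

def steps_random(p, q, rng):
    """Number of bets taken under the coin-flip policy."""
    steps = 0
    while rng.random() < q:       # heads: play one more round
        steps += 1
        if rng.random() >= p:     # the bet busted; no more coin tosses
            break
    return steps

rng = random.Random(2)
p, k, q, n = 0.6, 5, 0.4, 200_000

sim_k = sum(steps_fixed_k(p, k, rng) for _ in range(n)) / n
sim_q = sum(steps_random(p, q, rng) for _ in range(n)) / n
exact_k = (p**k - 1) / (p - 1)    # E[steps | k]
exact_q = q / (1 - q * p)         # E[steps | q]

print(f"fixed-k : simulated ~ {sim_k:.3f}, exact = {exact_k:.3f}")
print(f"random  : simulated ~ {sim_q:.3f}, exact = {exact_q:.3f}")
```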
Both policies have equal speed when \(p = 0.5\):
$$ \begin{align*} k = \frac{\log x}{\log 1} = \infty &\rightarrow \frac{p^k - 1}{p - 1} = \frac{0 - 1}{0.5 - 1} = 2 \\ q = \frac{1-x}{1-x} = 1 &\rightarrow \frac{q}{1-qp} = \frac{1}{1 - 0.5} = 2 \end{align*} $$
The random policy is strictly faster when \(\frac{1}{2} < p < 1\) and \(x \rightarrow \infty\):
$$ \begin{align*} \frac{p^k - 1}{p - 1} - \frac{q}{1 - qp} &= \frac{(1-p^k)(p(x+1)-1) - (1-p)(x-1)}{(1-p)(p(x+1) - 1)} \\ &= \frac{(p(x+1)-1) - (1-p)(x-1)}{(1-p)(p(x+1) - 1)} \quad \text{since} \quad p^k \rightarrow 0 \ \text{as} \ k \rightarrow \infty \\ &= \frac{(2p - 1)x}{(1-p)(p(x+1) - 1)} > 0 \end{align*} $$
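To put concrete numbers on the comparison (my own sketch; I treat \(k\) as continuous, a relaxation of the integer-step game):

```python
import math

p, x = 0.6, 100.0                  # win chance, and target expected payoff x*r

# Parameters that give both policies the same expected payoff x*r
k = math.log(x) / math.log(2 * p)  # fixed policy: (2p)^k = x
q = (1 - x) / (1 - x * 2 * p)      # random policy: (1-q)/(1-q2p) = x

# Expected number of steps before stopping
ek = (p**k - 1) / (p - 1)
eq = q / (1 - q * p)

print(f"k = {k:.2f}, q = {q:.4f}")
print(f"E[steps | k] = {ek:.3f}, E[steps | q] = {eq:.3f}")
```

For \(p = 0.6\), the fixed policy expects about 2.5 steps while the random one expects about 1.66 for the same expected payoff, so the random policy indeed wins.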

Surely, things get weird when we deal with limits. But with a little change in perspective, we can still rely on expectation to guide our decisions; the decision just isn't about the expected payoff alone, but also about the expected stopping point. So expectation has yet to fail us. We just have to ask the right question to get the right answer.