3 Comments

I absolutely love this example! Also I want to play Slay the Spire now. And I also love that the implication here is that "optimization" is basically about knowing yourself and having an accurate picture of your abilities and of what you do and (crucially) do not know about reality. I think this is v important in life, and I love looking for ways to optimize for optimization lol.

Slay the Spire is a masterpiece of a game and I recommend it to anyone who doesn't mind being ruined for everything else.

Thanks for this, it's got me thinking about where it might apply in my life.

I'm not an ML expert, but I think this distinction you're talking about maps pretty well to the distinction between on-policy and off-policy action-value functions in reinforcement learning.

Here's an explanation from GPT (didn't find a good one googling):

Suppose you're navigating your character in a video game. [...] You want to maximize points, and you have a map plus a strategy or policy, denoted by π. Each action you take has a potential reward, represented by numerical values.

When using the on-policy action-value function, denoted by Qπ(s, a), you're estimating the value of each action (a) you could take in each possible state (s), given your current game-playing style or strategy (π).

Mathematically, the on-policy action-value function Qπ(s, a) is the expected return (or future accumulated reward) when starting in state s, taking action a, and then following policy π:

Qπ(s, a) = Eπ[Rt | St=s, At=a]

Here Eπ denotes the expectation under π, Rt is the total reward accumulated from time t onward (the return), and St=s, At=a means being in state s and taking action a at time t.

Now, suppose you decide to jump over a pit in the game. You calculate, "If I stick with my strategy (π), what's my expected total score (Rt) after this jump?". The on-policy value function directly uses the current strategy (π) to estimate this value.
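
(Not from GPT: to make that definition concrete, here's a minimal Python sketch that computes Qπ for a fixed policy by iterative policy evaluation. The tiny 3-state "game", its rewards, and the half-and-half policy are all made up for illustration.)

```python
# Minimal sketch: the on-policy action-value function Q^pi(s, a) for a
# made-up 3-state game. States 0 and 1 are ordinary, state 2 is terminal.
# Actions: 0 = "left", 1 = "right". Jumping right from state 1 clears the
# pit and pays +10; every other move costs -1.

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# transition[s][a] = (next_state, reward); state 2 is absorbing with 0 reward
transition = {
    0: {0: (0, -1.0), 1: (1, -1.0)},
    1: {0: (0, -1.0), 1: (2, 10.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},
}

# A fixed (deliberately sloppy) policy pi(a | s): half the time it wanders left.
pi = np.array([
    [0.5, 0.5],  # state 0
    [0.5, 0.5],  # state 1
    [1.0, 0.0],  # state 2 (terminal, never used)
])

# Iterative policy evaluation on Q:
#   Q^pi(s, a) = r(s, a) + gamma * sum_a' pi(a' | s') * Q^pi(s', a')
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = transition[s][a]
            Q_new[s, a] = r + gamma * np.dot(pi[s_next], Q[s_next])
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        Q = Q_new
        break
    Q = Q_new

print("Q^pi:")
print(Q)  # e.g. Q^pi(1, right) = 10, Q^pi(0, right) ≈ 4.2 under this sloppy pi
```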

On the other hand, with the off-policy (optimal) action-value function, denoted by Q*(s, a), you're still playing with your current style but evaluating actions as if you would follow an optimal strategy afterwards. This optimal strategy dictates the 'best' possible actions to take to achieve maximum rewards, without regard to what your current policy recommends.

The off-policy action-value function Q*(s, a) is defined as the maximum expected return when starting in state s, selecting action a, and thereafter following an optimal policy:

Q*(s, a) = max_π Eπ[Rt | St=s, At=a]

In this case, when faced with the same pit in the game, you ponder, "If I were playing ideally, what's my maximum possible total score after this jump?". The off-policy value function does not consider what the current policy would do next; rather, it estimates the value assuming an optimal policy from then on.

To summarize, the on-policy function values actions based on your current playstyle, while the off-policy estimate assumes the best possible actions are taken under optimal play, regardless of your current strategy. Both functions forecast the value of potential actions, but they do so under different hypothetical futures, dictated by either your current strategy or the optimal one.
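
(Again, not part of the GPT answer: here's the same made-up 3-state game evaluated with Q* instead, i.e. value iteration on Q, where the expectation over π's next action is replaced by a max. The toy game is repeated so the snippet runs on its own.)

```python
# Minimal sketch: the optimal action-value function Q*(s, a) for the same
# made-up 3-state game, via value iteration on Q. The only change from the
# Q^pi sketch is the backup: a max over next actions instead of an
# expectation under pi.

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

transition = {
    0: {0: (0, -1.0), 1: (1, -1.0)},
    1: {0: (0, -1.0), 1: (2, 10.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},
}

# Value iteration: Q*(s, a) = r(s, a) + gamma * max_a' Q*(s', a')
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = np.zeros_like(Q_star)
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = transition[s][a]
            Q_new[s, a] = r + gamma * np.max(Q_star[s_next])
    if np.max(np.abs(Q_new - Q_star)) < 1e-8:
        Q_star = Q_new
        break
    Q_star = Q_new

print("Q*:")
print(Q_star)
# Compared with Q^pi above: Q*(1, right) is still 10, but Q*(0, right) = 8
# rather than ~4.2, because Q* assumes you actually take the +10 jump from
# state 1 instead of wandering left half the time.
```

This Q* is the quantity Q-learning estimates from experience while behaving under some other, more exploratory policy, which is exactly what makes it off-policy.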
