Machine Learning – Understanding Softmax Policy Parametrization with Binary State Features


I have a fairly simple "mountain car"-ish problem, where the agent is in some position and must decide whether to go left or right. The state space is continuous and the action space is discrete. I'm using a policy gradient method with softmax parametrization, where the action preferences are linear in the state features, and the state feature vector is built with tile coding. My main problem is that the way I'm setting up the state-action feature vector does not seem to be compatible with the softmax eligibility vector: the weights end up being updated based on the actions alone, without any dependence on what the state (the tile-coded position) is.

To describe it more rigorously:

The state is a normalized real-valued scalar $s \in [0,1]$ and I use tile coding on this (with just one tiling for this example) to give me a binary state feature vector $x(s)$ of size 4. So the normalized state $s$ is just discretized into 4 bins. As an example, if $s = 0.1$ the binary state feature vector would be $x(s) = [1\ 0\ 0\ 0]$, i.e. the first feature (tile) is active.
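For concreteness, a minimal Python sketch of this single-tiling discretization (the helper name `state_features` is just illustrative, not my actual code):

```python
import numpy as np

def state_features(s, n_bins=4):
    """One-hot 'tile coding' of a scalar s in [0, 1] with a single tiling of 4 tiles."""
    idx = min(int(s * n_bins), n_bins - 1)  # clip so s = 1.0 falls into the last bin
    x = np.zeros(n_bins)
    x[idx] = 1.0
    return x

state_features(0.1)  # -> array([1., 0., 0., 0.])
```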

The action $a$ can only take on two values, $left$ or $right$, so $a \in \{left, right\}$. This can be represented by an action feature vector $x(a)$, where $x(a) = [1\ 0] \implies left$ and $x(a) = [0\ 1] \implies right$.

Then I have combined $x(s)$ and $x(a)$ to give me a state-action feature vector $x(s,a)$. So for a given state, e.g. $s=0.1$, and action $a=left$, the state-action feature vector would be $x(s,a) = [(1\ 0\ 0\ 0)\ (1\ 0)]$, where the parentheses are just there to indicate the distinction between the state part and the action part of the state-action feature vector.
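The concatenation then looks roughly like this (the integer encoding `LEFT = 0`, `RIGHT = 1` is just an assumption for the sketch):

```python
LEFT, RIGHT = 0, 1  # assumed integer encoding of the two actions

def state_action_features_concat(s, a, n_bins=4, n_actions=2):
    """Concatenate the state features x(s) with a one-hot action vector x(a)."""
    x_a = np.zeros(n_actions)
    x_a[a] = 1.0
    return np.concatenate([state_features(s, n_bins), x_a])

state_action_features_concat(0.1, LEFT)  # -> array([1., 0., 0., 0., 1., 0.])
```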

Given the above, I calculate the action preferences as follows:
$$
h(s,a,\theta) = \theta^T x(s,a)
$$

where $\theta$ is the weight vector to be learned by the policy gradient algorithm.

The softmax policy is:
$$
\pi(a|s,\theta) = \dfrac{e^{h(s,a,\theta)}}{\sum_{b}{e^{h(s,b,\theta)}}}
$$

where $b$ iterates over all the actions.
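In code, the preferences and the softmax policy look roughly like this (again just a sketch; `softmax_policy` takes the feature construction as an argument so it works with any choice of $x(s,a)$):

```python
def softmax_policy(s, theta, feature_fn, actions=(0, 1)):
    """pi(a | s, theta) for every action, with h(s, a, theta) = theta^T x(s, a)."""
    h = np.array([theta @ feature_fn(s, b) for b in actions])
    h -= h.max()              # subtract the max for numerical stability
    e = np.exp(h)
    return e / e.sum()        # one probability per action
```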

The eligibility vector becomes
$$
\nabla\ln{\pi(a|s,\theta)} = x(s,a) - \sum_{b}{\pi(b|s,\theta)x(s,b)} \tag{1}
$$

where $b$ again iterates over all the actions.
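A sketch of equation (1), reusing the hypothetical `softmax_policy` helper above:

```python
def eligibility_vector(s, a, theta, feature_fn, actions=(0, 1)):
    """Equation (1): x(s, a) minus the policy-weighted average of x(s, b)."""
    pi = softmax_policy(s, theta, feature_fn, actions)
    expected_x = sum(p * feature_fn(s, b) for p, b in zip(pi, actions))
    return feature_fn(s, a) - expected_x
```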

Finally, the weight vector $\theta$ is updated according to
$$
\theta \leftarrow \theta + \nabla\ln{\pi(a|s,\theta)}
$$
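As code, this deliberately simplified update is just the following (a full REINFORCE-style update would also include a step size and the return, see the note below):

```python
def policy_gradient_step(theta, s, a, feature_fn):
    """Simplified update: theta <- theta + grad ln pi(a | s, theta)."""
    return theta + eligibility_vector(s, a, theta, feature_fn)
```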

Now, the $\theta$ vector has the same size as the $x(s,a)$ vector, one weight for each feature. However, the problem I'm having is that when I calculate (1) for a given state $s$ the 'state contributions' cancel out. Again, let me demonstrate by using the same example as above:

Given the state $s=0.1$ and the chosen action $a=left$, (1) is computed as follows:
$$
x(s,a) - \sum_{b}{\pi(b|s,\theta)x(s,b)} = [(1\ 0\ 0\ 0)\,(1\ 0)] - \big(p_1\,[(1\ 0\ 0\ 0)\,(1\ 0)] + p_2\,[(1\ 0\ 0\ 0)\,(0\ 1)]\big) = [(0\ 0\ 0\ 0)\,(p_2\ \ {-p_2})]
$$
As we see, the effect of the state cancels out because, for a given state, the action probabilities sum to $1$, so $p_1 + p_2 = 1$ and the state part of the first term is exactly cancelled by the state parts of the sum (the action part works out to $(1-p_1,\ -p_2) = (p_2,\ -p_2)$). This has the effect that the weight vector stays of the form $\theta = (0,0,0,0,\theta_5,\theta_6)$ for all $s \in [0,1]$: the first four (state) weights are never updated.
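A quick numerical check of this cancellation with the sketch functions above ($\theta$ initialised to zero, so $p_1 = p_2 = 0.5$):

```python
theta = np.zeros(6)
g = eligibility_vector(0.1, LEFT, theta, state_action_features_concat)
print(g)  # -> [ 0.  0.  0.  0.  0.5 -0.5] : the four state components are exactly zero
```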

This means that the algorithm is unable to learn different action preferences in different states. For instance, if the best action is $left$ when $s < 0.5$ and $right$ when $s > 0.5$, the algorithm would not be able to figure that out with this representation.

I have probably misunderstood something about how the state-action feature vector $x(s,a)$ should be designed. For the cases where you only need the state feature vector $x(s)$ it feels more obvious, but when you have to augment the feature vector with actions as well it is a little unclear, and the way I have solved it here does not seem right.


NOTE

This case is simplified relative to the actual case in order to focus on the problem I'm actually curious about, which is the design of $x(s,a)$ when calculating (1). I know the weight update equation is lacking some elements, but it has all the necessary parts to demonstrate the issue at hand.

Best Answer

You propose to formalise state-action feature vectors like $x(s, a) = \left[ x^s_1, x^s_2, x^s_3, x^s_4, x^a_1, x^a_2 \right]$, where all the $x^s_i$ are "state features", and all the $x^a_i$ are "action features". All the features are binary, and you always have exactly one active state feature, and one active action feature.

This could likely work if you had a non-linear function approximator (like a neural network), but will not work with a linear function approximator (with or without a softmax "wrapped around it"). Intuitively, you can see that this will not work out because your chosen representation requires the ability to learn non-linear interactions between the input features, which a linear function naturally cannot do.

Suppose that we have your example, with $x^s_1 = 1$ and $x^a_1 = 1$, and all other features $0$. A linear function can only independently learn weights for each of those separate input features, but cannot learn weights for "combinations" of input features. So it could (in theory, with different learning algorithms) learn a large weight for $x^s_1$ independent of the action being taken, or a large weight for $x^a_1$ independent of the state you're in, but a linear function can never learn something like:

"$x^a_1$ is good if $x^s_1$ is active, but the same action is bad if $x^s_1$ is not active"

That's a non-linear interaction between multiple different input features, and a linear function cannot possibly learn that.
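One way to see this in the question's own notation: with the concatenated features, the gap between the two action preferences is
$$
h(s, left, \theta) - h(s, right, \theta) = \theta^T x(s, left) - \theta^T x(s, right) = \theta_5 - \theta_6,
$$
which does not depend on $s$ at all, so the linear-softmax policy is forced to rank the two actions the same way in every state.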


The most common approach in Reinforcement Learning with linear functions is to first define feature vectors for states, which in your example would look like $x(s) = \left[ x^s_1, x^s_2, x^s_3, x^s_4 \right]$. Then, to obtain a feature vector $x(s, a)$ for a state-action pair, you concatenate multiple such vectors, one for every possible action $a$. You then set all the entries to $0$ except for the chunk corresponding to the chosen action: if $a$ is the $i^{th}$ action, only the $i^{th}$ chunk contains $x(s)$. For example, in your situation you'd have:

  • $x(0.1, left) = \left[ 1, 0, 0, 0, 0, 0, 0, 0 \right]$
  • $x(0.1, right) = \left[ 0, 0, 0, 0, 1, 0, 0, 0 \right]$
  • $x(0.3, left) = \left[ 0, 1, 0, 0, 0, 0, 0, 0 \right]$
  • $x(0.3, right) = \left[ 0, 0, 0, 0, 0, 1, 0, 0 \right]$
  • etc.

With such input features, you already have a separate feature for every possible combination of state-feature + action, which means that something like "$x^a_1$ is good if $x^s_1$ is active, but the same action is bad if $x^s_1$ is not active" becomes linear in the new input space, rather than a non-linear interaction.
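A minimal sketch of this construction in Python, reusing the hypothetical `state_features` helper from the question (with $left = 0$ and $right = 1$ as before):

```python
def state_action_features_blocks(s, a, n_bins=4, n_actions=2):
    """One copy of x(s) per action; only the block belonging to action a is nonzero."""
    x = np.zeros(n_bins * n_actions)
    x[a * n_bins:(a + 1) * n_bins] = state_features(s, n_bins)
    return x

state_action_features_blocks(0.1, 0)  # left  -> [1. 0. 0. 0. 0. 0. 0. 0.]
state_action_features_blocks(0.1, 1)  # right -> [0. 0. 0. 0. 1. 0. 0. 0.]
```

Plugging this feature function (and a correspondingly larger, 8-dimensional $\theta$) into equation (1), the state part no longer cancels, so each tile can learn its own preference for each action.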
