Solved – How exactly to compute Deep Q-Learning Loss Function

Tags: deep-learning, least-squares, loss-functions, q-learning, reinforcement-learning

I have a question about how exactly the loss function of a Deep Q-Learning network is computed during training. I am using a 2-layer feedforward network with a linear output layer and ReLU hidden layers.

  1. Let's suppose I have 4 possible actions. The output of my network for the current state $s_t$ is then $Q(s_t) \in \mathbb{R}^4$. To make it more concrete, let's assume $Q(s_t) = [1.3, 0.4, 4.3, 1.5]$.
  2. Now I take the action $a_t = 2$ corresponding to the value $4.3$, i.e. the 3rd action, and reach a new state $s_{t+1}$.
  3. Next, I compute the forward pass with state $s_{t+1}$ and let's say I obtain the following values at the output layer: $Q(s_{t+1}) = [9.1, 2.4, 0.1, 0.3]$. Also let's say the reward is $r_t = 2$ and $\gamma = 1.0$.
  4. Is the loss given by one of the following? (See the numeric sketch after this list.)

     $\mathcal{L} = (11.1 - 4.3)^2$

     OR

     $\mathcal{L} = \frac{1}{4}\sum_{i=0}^{3} \left([11.1, 11.1, 11.1, 11.1]_i - [1.3, 0.4, 4.3, 1.5]_i\right)^2$

     OR

     $\mathcal{L} = \frac{1}{4}\sum_{i=0}^{3} \left([11.1, 4.4, 2.1, 2.3]_i - [1.3, 0.4, 4.3, 1.5]_i\right)^2$
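To pin down the numbers, here is a minimal NumPy sketch of the three candidates (the variable names are my own shorthand; the network itself is abstracted away into the fixed output vectors above):

```python
import numpy as np

# Values from the example above (network outputs taken as given).
q_t    = np.array([1.3, 0.4, 4.3, 1.5])   # Q(s_t)
q_next = np.array([9.1, 2.4, 0.1, 0.3])   # Q(s_{t+1})
a_t, r_t, gamma = 2, 2.0, 1.0

target = r_t + gamma * q_next.max()        # 2 + 1.0 * 9.1 = 11.1

loss_1 = (target - q_t[a_t]) ** 2                      # option 1: chosen unit only
loss_2 = np.mean((np.full(4, target) - q_t) ** 2)      # option 2: target broadcast to all units
loss_3 = np.mean(((r_t + gamma * q_next) - q_t) ** 2)  # option 3: element-wise targets
print(loss_1, loss_2, loss_3)
```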

Thank you, and sorry I had to write this out in such a basic way… I am confused by all the notation. (I think the correct answer is the second one…)

Best Answer

After reviewing the equations a few more times, I think the correct loss is the following:

$$\mathcal{L} = (11.1 - 4.3)^2$$

My reasoning is that the Q-learning update rule, in the general case, updates the Q-value for only one specific $(s, a)$ pair:

$$Q(s,a) \leftarrow r + \gamma \max_{a'}Q(s',a')$$

This equation means that the update happens for only one specific $(s, a)$ pair at a time; for the neural Q-network, that means the loss is computed only on the single output unit that corresponds to the chosen action.

In the example provided, $Q(s,a) = 4.3$ and the target is $r + \gamma \max_{a'}Q(s',a') = 2 + 1.0 \cdot 9.1 = 11.1$.
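For concreteness, here is a minimal PyTorch-style sketch of this single-unit loss (names like `q_net`, `states`, and `actions` are assumptions for illustration, not from the post); it selects only the output unit of the chosen action and regresses it toward the frozen target:

```python
import torch
import torch.nn.functional as F

# Assumed batch tensors: states (B, state_dim), actions (B,) of dtype long,
# rewards (B,), next_states (B, state_dim); q_net is any module mapping
# states to per-action Q-values of shape (B, n_actions); gamma is a float.
q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t)

with torch.no_grad():  # the target is treated as a constant, not backpropagated
    target = rewards + gamma * q_net(next_states).max(dim=1).values

loss = F.mse_loss(q_sa, target)  # mean over the batch of (target - Q(s_t, a_t))^2
```

A common equivalent trick, for implementations that want a full-vector loss, is to copy the network's own predictions into the target vector and overwrite only the entry of the chosen action; the error on the other units is then exactly zero, so (up to the averaging constant) the gradient matches the single-unit loss above.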
