I am going to contradict @Pieter's answer and say that your problem is that you have too much bias and too little variance. In other words, your network is not complex enough for this task.
To see this, let $Y$ be the true and correct output that your network should return (the target), and let $\hat{Y}$ be the output that your network actually returns. Your loss-function is the mean squared-error, averaged over all examples in your training dataset $\mathcal{D}$ :
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right]
$$
In this loss-function, using your network, we are trying to adjust the probability distribution of $\hat{Y}$ such that it matches the probability distribution of $Y$. In other words, we are trying to make $Y=\hat{Y}$, such that the mean squared-error is $0$. This is the lowest possible value of the mean squared-error:
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq 0
$$
However, from the question How can I prove mathematically that the mean of a distribution is the measure that minimizes the variance?, we know that the mean squared-error actually has a tighter lower-bound, which is when $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$, such that the mean squared-error loss-function becomes
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \mathbb{E}_{\mathcal{D}}[Y])^2\right] = \text{Var}(Y)
$$
Since we know that the variance of $Y$ is non-negative, then the mean squared-error loss-function has the following lower-bounds
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq \text{Var}(Y) \geq 0
$$
In your case, you have reached the lower-bound $\text{Var}(Y)$, since you observe that $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$. This means that the bias (strictly speaking, this is not the correct definition of bias, but it gets the point across.) of $\hat{Y}$ is
$$
(Y - \hat{Y})^2 = (Y - \mathbb{E}_{\mathcal{D}}[Y])^2
$$
The variance of $\hat{Y}$ is
$$
\mathbb{E}_{\mathcal{D}}\left[\left(\hat{Y} - \mathbb{E}_{\mathcal{D}}\left[\hat{Y}\right]\right)^2\right] = \mathbb{E}_{\mathcal{D}}\left[\left(\mathbb{E}_{\mathcal{D}}[Y] - \mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\mathcal{D}}[Y]]\right)^2\right] = 0
$$
Clearly, you have too much bias and too little variance.
So, how do we reach the lower-lower-bound of $0$? We need to increase the variance of $\hat{Y}$ by either adding more parameters to the network or adjusting the network architecture. As discussed in What should I do when my neural network doesn't learn? (highly recommended read), consider over-fitting and then testing your network on a single example by adding many more parameters or by adjusting the network architecture.
If the network no longer predicts the mean on a single example, then you can scale up slowly and start over-fitting and testing the network on two examples, then three examples, and so on. Otherwise, you need to keep adding more parameters/adjusting the network architecture until your network no longer predicts the mean.
Eventually, once you reach a dataset size of around 100 examples, you can start to split your data into training and testing to evaluate the generalization performance of your network. At this point, if it starts to predict the mean again, then make sure that the examples that you are adding to the dataset are similar to the examples that you already worked through in the smaller datasets. In other words, they are normalized and "look" similar. Also, keep in mind that as you add more data to the dataset, you will more likely need to add more parameters for better generalization performance.
Another helpful modification, but not as helpful as what I stated above, that helps in practice, is to slightly adjust the mean squared-error loss function itself. If your mean squared-error loss function is
$$
\mathcal{L}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2
$$
where $N$ is the dataset size, then consider using the following loss function instead:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i)-\log(\hat{y}_i))^2\right]
$$
Where $\alpha$ is a hyperparameter that can be tuned via trial and error. A starting value for $\alpha$ could be $\alpha=5$. The advantage of this loss function over the plain mean squared-error loss function is that the $\log(\cdot)$ function stretches small values in the interval $[0,1]$ away from each other, which means that small differences between $y$ and $\hat{y}$ are amplified, leading to larger gradients. I have personally found this modified loss function to be very helpful in practice.
For this to work well, it is recommended (but not necessary) that $y$ and $\hat{y}$ are each scaled to have values in the interval $[0,1]$. Also, since $\log(0)=-\infty$, and since it is likely that $y$ and $\hat{y}$ will have values very close to $0$, then it is recommended to add a small value $\epsilon$, such as $\epsilon=10^{-9}$, to $y$ and $\hat{y}$ in the loss function as follows:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i + \epsilon)-\log(\hat{y}_i + \epsilon))^2\right]
$$
This loss function may be thought of as the Mean Squared Log-scaled Error Loss.
Best Answer
A neural net with multiple outcomes takes the form $$ \mathbf{Y} = \mathbf{\gamma} + \mathbf{V}_1\Gamma_1 + \epsilon\\ \mathbf{V}_1 = a\left(\gamma_2 +\mathbf{V}_2\Gamma_2\right)\\ \mathbf{V}_2 = a\left(\gamma_3 +\mathbf{V}_3\Gamma_3\right)\\ \vdots \\ \mathbf{V}_{L-1} = a\left(\gamma_L+ \mathbf{X}\Gamma_L\right)\\ $$ If your outcome has the dimension $N\times 8$, then $[\gamma_1, \Gamma_1]$ will have the dimension $(p_{V1}+1) \times 8$.
Which is to say that you'd be assuming that each outcome shares ALL of the parameters in the hidden layers, and only has different parameters for taking the uppermost derived variable and relating it to the outcome.
Is this a realistic assumption for your context?