Unless you restrict the range of your inputs, the sigmoid may be giving you a problem. You won't even be able to learn the function $y=x$ if you have a sigmoid in the middle.
If you have a restricted range, then the input-hidden weights could scale the input values so that they hit the sigmoid at the part of the graph which looks like a line i.e. the part around 0:
Once you have the linear behavior, the hidden-output weights could re-scale the values back. The training process will take care of all this for you - but to test this hypothesis you could just train (and validate) with input values in $[-1,1]$.
Do you know what the network's function looks like? If you plot $(x_1, x_2) \rightarrow y$, does it look anything like a plane? At least in parts?
You could also try training with more data and seeing if the graph gets closer to $x_1+2x_2$.
I am going to contradict @Pieter's answer and say that your problem is that you have too much bias and too little variance. In other words, your network is not complex enough for this task.
To see this, let $Y$ be the true and correct output that your network should return (the target), and let $\hat{Y}$ be the output that your network actually returns. Your loss-function is the mean squared-error, averaged over all examples in your training dataset $\mathcal{D}$ :
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right]
$$
In this loss-function, using your network, we are trying to adjust the probability distribution of $\hat{Y}$ such that it matches the probability distribution of $Y$. In other words, we are trying to make $Y=\hat{Y}$, such that the mean squared-error is $0$. This is the lowest possible value of the mean squared-error:
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq 0
$$
However, from the question How can I prove mathematically that the mean of a distribution is the measure that minimizes the variance?, we know that the mean squared-error actually has a tighter lower-bound, which is when $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$, such that the mean squared-error loss-function becomes
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \mathbb{E}_{\mathcal{D}}[Y])^2\right] = \text{Var}(Y)
$$
Since we know that the variance of $Y$ is non-negative, then the mean squared-error loss-function has the following lower-bounds
$$
\mathbb{E}_{\mathcal{D}}\left[(Y - \hat{Y})^2\right] \geq \text{Var}(Y) \geq 0
$$
In your case, you have reached the lower-bound $\text{Var}(Y)$, since you observe that $\hat{Y} = \mathbb{E}_{\mathcal{D}}[Y]$. This means that the bias (strictly speaking, this is not the correct definition of bias, but it gets the point across.) of $\hat{Y}$ is
$$
(Y - \hat{Y})^2 = (Y - \mathbb{E}_{\mathcal{D}}[Y])^2
$$
The variance of $\hat{Y}$ is
$$
\mathbb{E}_{\mathcal{D}}\left[\left(\hat{Y} - \mathbb{E}_{\mathcal{D}}\left[\hat{Y}\right]\right)^2\right] = \mathbb{E}_{\mathcal{D}}\left[\left(\mathbb{E}_{\mathcal{D}}[Y] - \mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\mathcal{D}}[Y]]\right)^2\right] = 0
$$
Clearly, you have too much bias and too little variance.
So, how do we reach the lower-lower-bound of $0$? We need to increase the variance of $\hat{Y}$ by either adding more parameters to the network or adjusting the network architecture. As discussed in What should I do when my neural network doesn't learn? (highly recommended read), consider over-fitting and then testing your network on a single example by adding many more parameters or by adjusting the network architecture.
If the network no longer predicts the mean on a single example, then you can scale up slowly and start over-fitting and testing the network on two examples, then three examples, and so on. Otherwise, you need to keep adding more parameters/adjusting the network architecture until your network no longer predicts the mean.
Eventually, once you reach a dataset size of around 100 examples, you can start to split your data into training and testing to evaluate the generalization performance of your network. At this point, if it starts to predict the mean again, then make sure that the examples that you are adding to the dataset are similar to the examples that you already worked through in the smaller datasets. In other words, they are normalized and "look" similar. Also, keep in mind that as you add more data to the dataset, you will more likely need to add more parameters for better generalization performance.
Another helpful modification, but not as helpful as what I stated above, that helps in practice, is to slightly adjust the mean squared-error loss function itself. If your mean squared-error loss function is
$$
\mathcal{L}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2
$$
where $N$ is the dataset size, then consider using the following loss function instead:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i)-\log(\hat{y}_i))^2\right]
$$
Where $\alpha$ is a hyperparameter that can be tuned via trial and error. A starting value for $\alpha$ could be $\alpha=5$. The advantage of this loss function over the plain mean squared-error loss function is that the $\log(\cdot)$ function stretches small values in the interval $[0,1]$ away from each other, which means that small differences between $y$ and $\hat{y}$ are amplified, leading to larger gradients. I have personally found this modified loss function to be very helpful in practice.
For this to work well, it is recommended (but not necessary) that $y$ and $\hat{y}$ are each scaled to have values in the interval $[0,1]$. Also, since $\log(0)=-\infty$, and since it is likely that $y$ and $\hat{y}$ will have values very close to $0$, then it is recommended to add a small value $\epsilon$, such as $\epsilon=10^{-9}$, to $y$ and $\hat{y}$ in the loss function as follows:
$$
\mathcal{L}(y,\hat{y}) = \left[\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\right] + \alpha \cdot \left[\frac{1}{N} \sum_{i=1}^N (\log(y_i + \epsilon)-\log(\hat{y}_i + \epsilon))^2\right]
$$
This loss function may be thought of as the Mean Squared Log-scaled Error Loss.
Best Answer
$\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\by}{\mathbf{y}}$
I personally prefer the Monte Carlo approach because of its ease. There are alternatives (e.g. the unscented transform), but these are certainly biased.
Let me formalise your problem a bit. You are using a neural network to implement a conditional probability distribution over the outputs $\by$ given the inputs $\bx$, where the weights are collected in $\theta$:
$$ p_\theta(\by~\mid~\bx). $$
Let us not care about how you obtained the weights $\theta$–probably some kind of backprop–and just treat that as a black box that has been handed to us.
As an additional property of your problem, you assume that your only have access to some "noisy version" $\tilde \bx$ of the actual input $\bx$, where $$\tilde \bx = \bx + \epsilon$$ with $\epsilon$ following some distribution, e.g. Gaussian. Note that you then can write $$ p(\tilde \bx\mid\bx) = \mathcal{N}(\tilde \bx| \bx, \sigma^2_\epsilon) $$ where $\epsilon \sim \mathcal{N}(0, \sigma^2_\epsilon).$ Then what you want is the distribution $$ p(\by\mid\tilde \bx) = \int p(\by\mid\bx) p(\bx\mid\tilde \bx) d\bx, $$ i.e. the distribution over outputs given the noisy input and a model of clean inputs to outputs.
Now, if you can invert $p(\tilde \bx\mid\bx)$ to obtain $p(\bx\mid\tilde \bx)$ (which you can in the case of a Gaussian random variable and others), you can approximate the above with plain Monte Carlo integration through sampling:
$$ p(\by\mid\tilde \bx) \approx \sum_i p(\by\mid\bx_i), \quad \bx_i \sim p(\bx\mid\tilde \bx). $$
Note that this can also be used to calculate all other kinds of expectations of functions $f$ of $\by$:
$$ f(\tilde \bx) \approx \sum_i f(\by_i), \quad \bx_i \sim p(\bx\mid\tilde \bx), \by_i \sim p(\by\mid\bx_i). $$
Without further assumptions, there are only biased approximations.