More generally, an output neuron of a feed-forward neural net is defined as $f\left(\sum_i \Theta_i a_i\right)$, where $f$ is the link function of the output neuron (the identity function in your question) and the other notations are as in your question. The activation values $a_i$ of the neurons in the previous layer depend on the choice of their own link function (say $g$) and on the activation values of the layer before that.
You are right when you say that if all the link functions $g$ in the previous layer are bounded, then the network's output values will be bounded too, even if $f$ is the identity. (One small correction, however: the logistic link is bounded in $(0,1)$, while the hyperbolic tangent is bounded in $(-1,1)$.)
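To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up for illustration) of a one-hidden-layer net with $\tanh$ hidden units and an identity output. No matter how large the input, the output cannot exceed $\sum_i |\Theta_i|$ in absolute value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer network: 3 inputs, 5 hidden units, 1 output.
W = rng.normal(size=(5, 3))   # hidden-layer weights
b = rng.normal(size=5)        # hidden-layer biases
theta = rng.normal(size=5)    # output weights

def predict(x):
    a = np.tanh(W @ x + b)    # bounded link g: each a_i lies in (-1, 1)
    return theta @ a          # identity link f: a weighted sum of the a_i

# However large the input, the output stays within sum(|theta_i|).
for scale in (1, 100, 1e6):
    x = scale * np.ones(3)
    print(predict(x), "bound:", np.abs(theta).sum())
```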
Setting all link functions ($g$ and $f$) of the network to the identity function might seem like a good idea in order to allow extrapolation outside the observed response values in the training set, but that special case of feed-forward neural networks is in fact just a linear regression model (a composition of linear maps is itself linear), in which case you don't need the neural net framework at all!
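To see why, take a single hidden layer with weights $W$, biases $b$, output weights $\Theta$, output bias $c$, and identity links everywhere; the two layers collapse into one linear map:

$$\hat{y} = \Theta^\top (W x + b) + c = (\Theta^\top W)\, x + (\Theta^\top b + c),$$

which is an ordinary linear regression in $x$, however many layers you stack.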
A better solution would be to use unbounded but non-linear link functions for $g$, like the rectifier $g(x) = \max(0, x)$. This keeps the nice "universal approximation" property of the neural net, but with unbounded outputs.
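Continuing the toy snippet above (and reusing its `W`, `b`, and `theta`), the same network with a rectifier in the hidden layer is no longer bounded:

```python
def predict_relu(x):
    a = np.maximum(0, W @ x + b)  # ReLU activations: unbounded above
    return theta @ a              # identity output: no bound on the result

for scale in (1, 100, 1e6):
    x = scale * np.ones(3)
    print(predict_relu(x))        # magnitude now grows with the input scale
```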
Beware, however: predictions will be less reliable in regions of input space that were not represented in the training data. You should probably use regularization to prevent over-fitting and (hopefully) improve extrapolation performance.
Wikipedia provides a synopsis of the universal approximation theorem:

> In the mathematical theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.
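In Cybenko's classical form (for a continuous sigmoidal activation $\sigma$), the statement reads roughly: for any continuous $h$ on a compact set $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$, there exist a width $N$ and parameters $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that

$$\left| h(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon \quad \text{for all } x \in K.$$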
This theorem is the core justification for attempting to model complex, nonlinear phenomena using neural networks. Even though the theorem is very flexible, it doesn't cover everything: in this case, you've defined a discontinuous function, and the universal approximation theorem only applies to continuous functions.
I am not aware of a theorem which allows a neural network to approximate arbitrary, discontinuous functions.
Perhaps if you treated the two cases of your target variable as a categorical outcome and used a cross-entropy loss, you would have success approximating the decision boundary between them.
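A hedged sketch of that idea (the data and architecture below are hypothetical stand-ins, not your actual setup): give the network a sigmoid output, train it with cross-entropy on the two classes, and read the decision boundary off $P(y=1 \mid x) > 0.5$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the label is 1 when the input is positive (a step,
# i.e. a discontinuous target), 0 otherwise.
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)

# One ReLU hidden layer, sigmoid output, trained by plain gradient descent.
W1, b1 = rng.normal(size=(8, 1)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
lr = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(500):
    h = np.maximum(0, X @ W1.T + b1)   # hidden activations, shape (n, 8)
    p = sigmoid(h @ w2 + b2)           # predicted P(y = 1)
    # Gradient of mean cross-entropy w.r.t. the pre-sigmoid output is (p - y)/n.
    d = (p - y) / len(y)
    w2 -= lr * h.T @ d
    b2 -= lr * d.sum()
    dh = np.outer(d, w2) * (h > 0)     # backprop through the ReLU
    W1 -= lr * dh.T @ X
    b1 -= lr * dh.sum(axis=0)

print("train accuracy:", ((p > 0.5) == y).mean())
```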
Best Answer
Using the same reasoning, you could say that a 2nd-degree polynomial can approximate a linear function. That is true, but the polynomial model is more complex.
This means you will need more data to fit it correctly, and it will have a stronger tendency to overfit. So, if the true model is a line and you fit and evaluate a linear model versus a 2nd-degree polynomial, the linear model will likely do better for a given amount of data.
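A quick illustration with synthetic data (the true slope, intercept, noise level, and sample size below are all made up): fit both models to a noisy line and compare held-out error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic truth: y = 2x + 1 plus Gaussian noise, with a small training set.
x = rng.uniform(-1, 1, size=30)
y = 2 * x + 1 + rng.normal(scale=0.5, size=30)
x_test = rng.uniform(-1, 1, size=1000)
y_test = 2 * x_test + 1 + rng.normal(scale=0.5, size=1000)

for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)                        # least-squares fit
    mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: test MSE = {mse:.4f}")
```

With a small training sample, the extra quadratic coefficient mostly fits noise, so the degree-2 test error typically comes out slightly worse.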