More generally an output neuron of a feed-forward neural net is defined as $f(\sum \Theta_i a_i)$, where $f$ is the link function of the output neuron (which was the identity function in your question) and the other notations are as in your question. The activation values of the neurons of the previous layer $a_i$ depend on the choice of their own link function (say $g$) and of the activation values of the layer before that.
You are right that if all the link functions $g$ in the previous layer are bounded, then the network's output values will be bounded too, even if $f$ is the identity. (One small correction, though: the logistic link is bounded in $(0,1)$, while the hyperbolic tangent is bounded in $(-1,1)$.)
Setting all link functions ($g$ and $f$) of the network to the identity function might seem like a good idea in order to allow extrapolation outside the observed response values in the training set, but that special case of feedforward neural networks is in fact just a linear regression model, in which case you don't need the neural net framework at all!
A better solution is to use an unbounded but non-linear link function for $g$, such as the rectifier $g(x) = \max(0, x)$. This keeps the nice "universal approximation" property of the neural net, but with unbounded outputs.
Beware, however: predictions will be less reliable in regions of input space that were not covered by the training data. You should probably use regularization to prevent over-fitting and (hopefully) improve extrapolation.
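Here is a minimal sketch of the idea (the weights below are arbitrary values of my own, not a trained network): a hidden layer with the rectifier link $g$ and an identity output link $f$. Because $g$ is unbounded, the output is unbounded as well.

```python
import numpy as np

# Hand-picked weights for illustration only (not trained):
W1 = np.array([[1.0], [2.0], [-1.0]])  # 1 input -> 3 hidden units
b1 = np.zeros(3)
W2 = np.array([[1.0, 0.5, 1.0]])       # 3 hidden units -> 1 output
b2 = np.zeros(1)

def g(z):
    """Rectifier link: g(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, z)

def predict(x):
    """f is the identity, so the output is W2 g(W1 x + b1) + b2."""
    return W2 @ g(W1 @ np.atleast_1d(x) + b1) + b2

# The output keeps growing with the input instead of saturating:
y_small = predict(1.0)      # [2.0]
y_large = predict(1000.0)   # [2000.0]
```

With a logistic or tanh hidden layer the second prediction could never exceed the bound imposed by the link; here it scales freely with the input.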
Here are those I understand so far. Most of these work best when given values between 0 and 1.
Quadratic cost
Also known as mean squared error, this is defined as:
$$C_{MSE}(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{MSE} = (a^L - E^r)$$
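As a quick sanity check, the cost and its gradient can be computed directly with NumPy (the output and target vectors below are made up):

```python
import numpy as np

# Made-up network output a^L and target E^r for one sample r:
a_L = np.array([0.8, 0.3, 0.9])
E_r = np.array([1.0, 0.0, 1.0])

cost = 0.5 * np.sum((a_L - E_r) ** 2)  # C = 0.5 * sum_j (a^L_j - E^r_j)^2
grad = a_L - E_r                       # elementwise gradient a^L - E^r
```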
Cross-entropy cost
Also known as Bernoulli negative log-likelihood and Binary Cross-Entropy
$$C_{CE}(W, B, S^r, E^r) = -\sum\limits_j [E^r_j \text{ ln } a^L_j + (1 - E^r_j) \text{ ln }(1-a^L_j)]$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{CE} = \frac{(a^L - E^r)}{(1-a^L)(a^L)}$$
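A finite-difference check (with made-up values of my own) confirms that the elementwise gradient formula matches the numerical derivative of the cost:

```python
import numpy as np

# Made-up output and target; outputs must lie strictly in (0, 1):
a_L = np.array([0.8, 0.3])
E_r = np.array([1.0, 0.0])

def cross_entropy(a):
    return -np.sum(E_r * np.log(a) + (1 - E_r) * np.log(1 - a))

grad = (a_L - E_r) / ((1 - a_L) * a_L)   # analytic gradient

# Central-difference estimate of the derivative w.r.t. the first output:
eps = 1e-6
d = np.array([eps, 0.0])
numeric = (cross_entropy(a_L + d) - cross_entropy(a_L - d)) / (2 * eps)
```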
Exponential cost
This requires choosing a parameter $\tau$ that you think will give you the behavior you want. Typically you'll just need to play with it until things work well.
$$C_{EXP}(W, B, S^r, E^r) = \tau\text{ }\exp(\frac{1}{\tau} \sum\limits_j (a^L_j - E^r_j)^2)$$
where $\text{exp}(x)$ is simply shorthand for $e^x$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{2}{\tau}(a^L- E^r)C_{EXP}(W, B, S^r, E^r)$$
I could write out $C_{EXP}$ in full here, but that seems redundant. The point is that the gradient is a vector which then gets multiplied by the scalar $C_{EXP}$.
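Concretely (with $\tau$ and the sample values being arbitrary choices of mine):

```python
import numpy as np

# Arbitrary parameter and sample values for illustration:
tau = 2.0
a_L = np.array([0.8, 0.3])
E_r = np.array([1.0, 0.0])

# C_EXP = tau * exp((1/tau) * sum_j (a^L_j - E^r_j)^2)
C_EXP = tau * np.exp(np.sum((a_L - E_r) ** 2) / tau)

# The gradient is the vector (2/tau)(a^L - E^r) scaled by C_EXP itself:
grad = (2.0 / tau) * (a_L - E_r) * C_EXP
```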
Hellinger distance
$$C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}}\sum\limits_j(\sqrt{a^L_j}-\sqrt{E^r_j})^2$$
You can find more about this here. This cost needs the outputs $a^L_j$ to be positive, and ideally between $0$ and $1$. The same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{\sqrt{a^L}-\sqrt{E^r}}{\sqrt{2}\sqrt{a^L}}$$
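A short check (sample values are made up, kept in $(0,1)$ as required) that the stated gradient agrees with a numerical derivative:

```python
import numpy as np

# Made-up output and target in (0, 1):
a_L = np.array([0.8, 0.3])
E_r = np.array([0.9, 0.1])

def hellinger(a):
    return np.sum((np.sqrt(a) - np.sqrt(E_r)) ** 2) / np.sqrt(2)

grad = (np.sqrt(a_L) - np.sqrt(E_r)) / (np.sqrt(2) * np.sqrt(a_L))

# Central-difference estimate of the derivative w.r.t. the first output:
eps = 1e-6
d = np.array([eps, 0.0])
numeric = (hellinger(a_L + d) - hellinger(a_L - d)) / (2 * eps)
```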
Kullback–Leibler divergence
Also known as Information Divergence, Information Gain, Relative entropy, KLIC, or KL Divergence (See here).
Kullback–Leibler divergence is typically denoted $$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}$$,
where $D_{\mathrm{KL}}(P\|Q)$ is a measure of the information lost when $Q$ is used to approximate $P$. Thus we want to set $P=E^r$ and $Q=a^L$, because we want to measure how much information is lost when we use $a^L_j$ to approximate $E^r_j$. This gives us
$$C_{KL}(W, B, S^r, E^r)=\sum\limits_jE^r_j \log \frac{E^r_j}{a^L_j}$$
The other divergences here use this same idea of setting $P=E^r$ and $Q=a^L$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = -\frac{E^r}{a^L}$$
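In code (the two distributions below are made-up examples), with $P=E^r$ and $Q=a^L$:

```python
import numpy as np

# Made-up target distribution P = E^r and network output Q = a^L:
E_r = np.array([0.9, 0.1])
a_L = np.array([0.7, 0.3])

C_KL = np.sum(E_r * np.log(E_r / a_L))   # information lost using a^L for E^r
grad = -E_r / a_L                        # elementwise gradient w.r.t. a^L
```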
Generalized Kullback–Leibler divergence
From here.
$$C_{GKL}(W, B, S^r, E^r)=\sum\limits_j E^r_j \log \frac{E^r_j}{a^L_j} -\sum\limits_j(E^r_j) + \sum\limits_j(a^L_j)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{a^L}$$
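A sketch with made-up values (note the generalized form does not require the vectors to sum to $1$), again verifying the gradient numerically:

```python
import numpy as np

# Made-up positive values; they need not sum to 1:
E_r = np.array([0.9, 0.1])
a_L = np.array([0.7, 0.4])

def gkl(a):
    return np.sum(E_r * np.log(E_r / a)) - np.sum(E_r) + np.sum(a)

grad = (a_L - E_r) / a_L

# Central-difference estimate of the derivative w.r.t. the first output:
eps = 1e-6
d = np.array([eps, 0.0])
numeric = (gkl(a_L + d) - gkl(a_L - d)) / (2 * eps)
```

The extra linear terms make the divergence vanish exactly when $a^L = E^r$.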
Itakura–Saito distance
Also from here.
$$C_{IS}(W, B, S^r, E^r)= \sum_j \left(\frac {E^r_j}{a^L_j} - \log \frac{E^r_j}{a^L_j} - 1 \right)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{\left(a^L\right)^2}$$
Where $\left(\left(a^L\right)^2\right)_j = a^L_j \cdot a^L_j$. In other words, $\left( a^L\right) ^2$ is simply equal to squaring each element of $a^L$.
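The same numerical-check pattern works here (sample values are arbitrary positive numbers of my own):

```python
import numpy as np

# Made-up strictly positive values:
E_r = np.array([0.9, 0.1])
a_L = np.array([0.7, 0.4])

def itakura_saito(a):
    return np.sum(E_r / a - np.log(E_r / a) - 1)

grad = (a_L - E_r) / a_L ** 2   # divide by the elementwise square of a^L

# Central-difference estimate of the derivative w.r.t. the first output:
eps = 1e-6
d = np.array([eps, 0.0])
numeric = (itakura_saito(a_L + d) - itakura_saito(a_L - d)) / (2 * eps)
```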
Best Answer
None of the above; maxout networks don't follow the architecture you assumed.
From the beginning of the "Description of maxout" section in the paper you linked, which defines maxout:

> Given an input $x \in \mathbb{R}^d$ ($x$ may be $v$, or may be a hidden layer's state), a maxout hidden layer implements the function $$h_i(x) = \max_{j \in [1,k]} z_{ij}$$ where $z_{ij} = x^T W_{\cdots ij} + b_{ij}$, and $W \in \mathbb{R}^{d \times m \times k}$ and $b \in \mathbb{R}^{m \times k}$ are learned parameters.
So, each of the $m$ units computes $k$ different affine combinations of the previous layer, and outputs the max of those $k$ affine functions. Imagine each layer being connected to the previous layer with $k$ different-colored sets of connections, and taking the max over the colors.
Alternatively, you can think of a maxout unit as actually being two layers: each of the previous layer's units is connected to each of $k$ units with the identity activation function, and then a single unit connects those $k$ linear units with a max-pooling activation.
This means that the unit, viewed as a function from $\mathbb R^d$ to $\mathbb R$, is the piecewise max of affine functions. The paper's Figure 1 gives some examples of different functions it might look like:
Each of the dashed lines represents a $W^T x + b$. Any convex function can be approximated arbitrarily well this way, which is pretty nice.
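The single-unit view above can be sketched as follows (dimensions and weights are my own arbitrary choices): $k$ affine functions of the input, with the unit outputting their pointwise max. Since a max of affine functions is convex, the midpoint value never exceeds the average of the endpoint values.

```python
import numpy as np

# One maxout unit: k affine pieces over a d-dimensional input.
d, k = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(k, d))   # one weight vector per affine piece
b = rng.normal(size=k)

def maxout_unit(x):
    z = W @ x + b             # the k affine values z_j = w_j^T x + b_j
    return float(np.max(z))   # the unit outputs their maximum

# Convexity check: f((x1+x2)/2) <= (f(x1) + f(x2)) / 2
x1, x2 = rng.normal(size=d), rng.normal(size=d)
lhs = maxout_unit((x1 + x2) / 2)
rhs = (maxout_unit(x1) + maxout_unit(x2)) / 2
```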