Whether to convert input variables to binary depends on the input variable. You can think of neural network inputs as representing a kind of "intensity": larger values of the input variable represent greater intensity of that input. After all, assuming the network has only one input, a given hidden node of the network is going to learn some function $f(wx + b)$, where $f$ is the transfer function (e.g. the sigmoid) and $x$ the input variable.
This setup does not make sense for categorical variables. If categories are represented by numbers, it makes no sense to apply the function $f(wx + b)$ to them. E.g. imagine your input variable represents an animal, and sheep=1 and cow=2. It makes no sense to multiply sheep by $w$ and add $b$ to it, nor does it make sense for cow to be always greater in magnitude than sheep. In this case, you should convert the discrete encoding to a binary, 1-of-$k$ encoding.
For real-valued variables, just leave them real-valued (but normalize inputs). E.g. say you have two input variables, one the animal and one the animal's temperature. You'd convert animal to 1-of-$k$, where $k$=number of animals, and you'd leave temperature as-is.
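As a minimal sketch of this preprocessing (the animals, temperatures, and the use of scikit-learn's `OneHotEncoder`/`StandardScaler` here are just illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: a categorical variable (animal) and a real-valued one (temperature).
animals = np.array([["sheep"], ["cow"], ["sheep"], ["goat"]])
temps = np.array([[38.5], [38.8], [39.1], [39.5]])

# 1-of-k encode the categorical variable ...
animal_1ofk = OneHotEncoder().fit_transform(animals).toarray()

# ... and normalize (here: standardize) the real-valued one.
temp_scaled = StandardScaler().fit_transform(temps)

# Stack both into the final network input matrix.
X = np.hstack([animal_1ofk, temp_scaled])
print(X.shape)  # (4, k + 1) with k = 3 distinct animals in this toy data
```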
For assessing the complexity of a model, the number of free parameters is a good start: from it you can calculate AIC or BIC. How to get the number of free parameters in a Multi-Layer Perceptron (MLP) neural network is covered here: Number of parameters in an artificial neural network for AIC
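If it helps, here is a rough sketch of that parameter count together with the usual Gaussian-error AIC formula (the layer sizes below are made up for illustration):

```python
import numpy as np

def mlp_num_params(layer_sizes):
    """Free parameters (weights + biases) of a fully connected MLP."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def aic_gaussian(rss, n, k):
    """AIC (up to an additive constant) for a regression model with Gaussian errors."""
    return n * np.log(rss / n) + 2 * k

k = mlp_num_params([10, 20, 20, 1])  # 10 inputs, two hidden layers of 20, 1 output
print(k)  # 661 free parameters
```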
In addition, there are cases where you have a lot of parameters, but they are not "totally free" because of regularization. For example, in linear regression, if you have $1000$ features but only $500$ data points, it is perfectly fine to fit a model with $1000$ coefficients, provided you regularize them with a large regularization parameter. You can search for Ridge Regression or Lasso Regression for details.
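A quick illustration of this $p > n$ setting with ridge regression (the data and the value of `alpha` are arbitrary assumptions, and I'm using scikit-learn's `Ridge` here):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 500, 1000                      # fewer data points than coefficients
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=n)  # only 5 "true" signals

# A large regularization parameter shrinks the 1000 coefficients heavily,
# so the effective complexity is far lower than the raw parameter count.
model = Ridge(alpha=100.0).fit(X, y)
print(np.abs(model.coef_).mean())     # coefficients are small on average
```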
In the neural network case, it is also possible to have a very large network structure (many layers, many neurons) combined with some regularization. In that case, the parameter-counting method mentioned above will not work.
Finally, I would not agree with your statement about random forests. As discussed in Breiman's original paper, increasing the number of trees does not lead to a more complex model or to overfitting. Instead, the out-of-bag (OOB) error converges as the number of trees grows. In practice, if computational power is not a concern, building a random forest with a large number of trees is actually recommended.
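You can see that convergence by tracking the OOB score as trees are added, along these lines (the dataset and tree counts are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The OOB accuracy stabilizes as trees are added, rather than degrading
# the way an overfitting model would.
for n_trees in (25, 100, 500, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0, n_jobs=-1).fit(X, y)
    print(n_trees, round(rf.oob_score_, 3))
```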
To your comment:
Model complexity is an abstract concept and can be defined in different ways. AIC and BIC are two such definitions, and others exist. See this Definition of model complexity in XGBoost as an example.
In addition, two neural networks can have different structures but still have the same complexity. Here is an example: say we are doing polynomial regression. You have two options: a higher-order model with more regularization, or a lower-order model without regularization. They can have the same "complexity" even though their structures differ.
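To make that concrete, here is one way to set it up (the degrees, the `alpha` value, and the toy data are assumptions for illustration, not a prescription):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = np.sin(3 * x).ravel() + 0.1 * rng.normal(size=50)

# Lower-order polynomial without regularization ...
low_order = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x, y)
# ... versus a much higher-order polynomial with heavy ridge regularization.
high_order = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=10.0)).fit(x, y)

# Different structures, yet the fitted functions (and hence the effective
# complexity) can end up very similar.
print(low_order.score(x, y), high_order.score(x, y))
```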
Best Answer
In the second case you are probably writing about the softmax activation function. If that's true, then the sigmoid is just a special case of the softmax function. That's easy to show.
$$ y = \frac{1}{1 + e ^ {-x}} = \frac{1}{1 + \frac{1}{e ^ x}} = \frac{1}{\frac{e ^ x + 1}{e ^ x}} = \frac{e ^ x}{1 + e ^ x} = \frac{e ^ x}{e ^ 0 + e ^ x} $$
As you can see, the sigmoid is the same as a two-output softmax. You can think of it as having two outputs, where one of them has all weights (and bias) equal to zero, so its pre-activation (logit) is always zero, which is the $e^0$ term above.
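As a quick numerical check of that equivalence (the value of $x$ below is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = 1.7                           # arbitrary logit for the "positive" class
p = softmax(np.array([0.0, x]))   # two-output softmax, other logit fixed at 0
print(sigmoid(x), p[1])           # both give the same probability
```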
So the better choice for binary classification is to use one output unit with a sigmoid instead of a softmax with two output units, because it has fewer parameters and will update faster.