1. Should I still scale the input features using feature scaling? What range?
Scaling won't hurt and usually helps. Read this answer from Sarle's neural network FAQ: "Subject: Should I normalize/standardize/rescale the data?".
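As a minimal sketch of what that FAQ recommends, here is a simple min-max rescaling of each input feature into a fixed range (the function name and the sample data are made up for illustration):

```python
import numpy as np

def min_max_scale(X, lo=-1.0, hi=1.0):
    """Linearly rescale each feature (column) of X into [lo, hi]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return lo + (X - x_min) * (hi - lo) / (x_max - x_min)

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_scaled = min_max_scale(X)  # each column now spans [-1, 1]
```

Remember to apply the same per-feature min/max (computed on the training set) to any new data at prediction time.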
2. What transformation function should I use in place of the sigmoid?
You could use either the logistic sigmoid or tanh as the activation function; it doesn't matter, and you don't have to change the learning algorithm. You just have to scale the targets in your training set down to the range of the output-layer activation function ($[0,1]$ or $[-1,1]$), and once the network is trained, scale its output back up to $[-5,5]$. You really don't have to change anything else.
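A hedged sketch of that rescaling step for a tanh output layer (the function names and the $[-5,5]$ target range are just the example values from this answer):

```python
def scale_to_activation(y, y_min=-5.0, y_max=5.0):
    """Map a target from [y_min, y_max] into tanh's output range [-1, 1]."""
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0

def scale_from_activation(a, y_min=-5.0, y_max=5.0):
    """Map a network output in [-1, 1] back to [y_min, y_max]."""
    return (a + 1.0) / 2.0 * (y_max - y_min) + y_min

# Round trip: a target of 3.7 survives scaling down and back up
t = scale_to_activation(3.7)
y = scale_from_activation(t)
```

Train on `scale_to_activation(targets)`, predict with `scale_from_activation(network_output)`; for a logistic sigmoid output, map into $[0,1]$ instead.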
I'm no expert in this field, so I might be wrong; correct me if so.
Consider this neural network (which I suppose is equivalent to yours):

    A---H1
     \ /  \
      X    C
     / \  /
    B---H2
Assume that the activation function of H1, H2, and C is the bipolar sigmoid, which we'll refer to as bsig(x).
We'll also name the connection weights as follows:
A, H1: wa1;
A, H2: wa2;
B, H1: wb1;
B, H2: wb2;
H1, C: wh1;
H2, C: wh2
Now the values of H1, H2, and C can be defined as:
H1 = bsig(wa1 * A + wb1 * B)
H2 = bsig(wa2 * A + wb2 * B)
C = bsig(wh1 * H1 + wh2 * H2)
So, C can be written as:
C = bsig(wh1 * bsig(wa1 * A + wb1 * B) + wh2 * bsig(wa2 * A + wb2 * B))
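That composed expression can be sketched as a forward pass in code (the weights and inputs below are made-up values, not from the original):

```python
import math

def bsig(x):
    """Bipolar sigmoid, equivalent to tanh(x/2); output lies in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

# Made-up weights and inputs for illustration
wa1, wa2, wb1, wb2, wh1, wh2 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7
A, B = 0.9, -0.4

# Hidden layer, then output, exactly as in the equations above
H1 = bsig(wa1 * A + wb1 * B)
H2 = bsig(wa2 * A + wb2 * B)
C = bsig(wh1 * H1 + wh2 * H2)
```

Since bsig is just a shifted/scaled logistic, its output is bounded in $(-1, 1)$, which is why the targets have to be rescaled as described above.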
All you need to do is solve this equation for A or B, depending on which of the two values is unknown (the other must be known).
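One hedged sketch of the first step in that solve: the bipolar sigmoid has a closed-form inverse, which peels the outer bsig(...) off the equation (the function names below are assumptions for illustration, not from the original):

```python
import math

def bsig(x):
    """Bipolar sigmoid: 2/(1 + e^-x) - 1, output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def bsig_inv(y):
    """Inverse bipolar sigmoid for y in (-1, 1): ln((1+y)/(1-y))."""
    return math.log((1.0 + y) / (1.0 - y))

# Applying bsig_inv to C reduces the equation to
#   wh1 * H1 + wh2 * H2 = bsig_inv(C)
# after which the same trick can be applied to the hidden units.
x = 0.37
roundtrip = bsig_inv(bsig(x))  # recovers x
```

The remaining inner equation is still nonlinear in A or B, so in practice a numeric root-finder is the pragmatic way to finish the solve.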
Best Answer
I implemented a simple variant of this. Some example images are included for convenience: https://github.com/iver56/image-regression