Maximizing correlation is useful when the output is highly noisy, i.e. when the relationship between inputs and outputs is very weak. In such a case, minimizing MSE will tend to push the predictions toward the mean of the training outputs (zero, if they are centred), so that the prediction error equals the variance of the training output.
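To see why: for a constant predictor $c$, the training MSE $\frac{1}{n}\sum_i (y_i - c)^2$ is minimized at $c = \bar{y}$, where it equals the sample variance of the outputs:
$$\min_c \frac{1}{n}\sum_i (y_i - c)^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 = \operatorname{Var}(y).$$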
Directly using correlation as the objective function is possible with a full-batch gradient descent approach (simply minimize the negative correlation instead). However, I do not know how to optimize it with an SGD approach, because the cost function and its gradient involve the outputs of all training samples.
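For the full-batch case, here is a minimal sketch of a negative-Pearson-correlation loss (assuming PyTorch; the model and data are made-up placeholders):

```python
import torch

def neg_correlation_loss(pred, target, eps=1e-8):
    # Center both vectors; the loss is minus their Pearson correlation.
    p = pred - pred.mean()
    t = target - target.mean()
    return -(p * t).sum() / (torch.sqrt((p * p).sum()) * torch.sqrt((t * t).sum()) + eps)

# Toy full-batch loop: every step must see ALL training samples, because
# the correlation (and its gradient) depends on the whole training set.
X = torch.randn(256, 10)          # made-up inputs
y = torch.randn(256)              # made-up noisy outputs
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    opt.zero_grad()
    loss = neg_correlation_loss(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()
```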
Another way to maximize correlation is to minimize MSE while constraining the output variance to equal the variance of the training outputs. However, the constraint also involves all outputs, so (in my opinion) there is no way to take advantage of an SGD optimizer.
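Written out, with predictions $\hat{y}_i$, the constrained problem is
$$\min \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2 \quad \text{subject to} \quad \operatorname{Var}(\hat{y}) = \operatorname{Var}(y),$$
and it is the variance constraint that couples all the training samples together.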
EDIT:
If the top layer of the neural network is a linear output layer, we can minimize MSE first and then adjust the weights and bias of that linear layer to maximize the correlation. The adjustment can be done similarly to CCA (https://en.wikipedia.org/wiki/Canonical_analysis).
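As a sketch of what such an adjustment could look like for a single scalar output (assuming NumPy; `H` and `y` are hypothetical names for the last hidden layer's activations and the targets): among all linear combinations of the hidden features, the least-squares fit already attains the maximal correlation with $y$, and neither the bias nor a positive rescaling changes the correlation.

```python
import numpy as np

def refit_output_layer(H, y):
    # H: (n, d) hidden activations on the training set; y: (n,) targets.
    # For one scalar output, the OLS fit over the hidden features has
    # maximal Pearson correlation with y among all linear combinations
    # of those features (the one-sided CCA direction).
    H1 = np.column_stack([H, np.ones(len(H))])   # append a bias column
    w, *_ = np.linalg.lstsq(H1, y, rcond=None)
    pred = H1 @ w
    # Bias shifts and positive rescaling leave the correlation unchanged,
    # so we may also match the targets' mean and variance afterwards.
    pred = (pred - pred.mean()) / pred.std() * y.std() + y.mean()
    return w, pred

# Example: H = np.random.randn(500, 32); y = np.random.randn(500)
# w, adjusted = refit_output_layer(H, y)
```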
Once you take logs, your response is no longer in seconds; in effect, it's unit-free.
When you calculate mean absolute error on the log scale, it, too, is not a measurement in seconds.
Roughly speaking, it's telling you something about the typical size of the percentage error on the original scale.
An MAE(-of-the-logs) of 0.01 would tell you that typically your original values deviate by about 1% from the geometric mean.
Let $z_i=\log(y_i)$. Then an MAE of 0.01 in the logs means that $\frac{1}{n}\sum_i |z_i-\bar{z}|=0.01$. Now on the original scale $\exp(\bar{z})$ is the geometric mean of the $y$-values, $\text{GM}(y)$.
Now consider observations sitting as far away from the mean as the MAE: $z_i=\bar{z}+ 0.01$ and $z_j = \bar{z}- 0.01$. Then
$y_i=\exp(z_i) = \exp(\bar{z}) \times \exp(0.01) = 1.01005\,\text{GM}(y) \approx 1.01\,\text{GM}(y)$
or about 1% above the geometric mean. Similarly
$y_j=\exp(z_j) = \exp(\bar{z}) \times \exp(-0.01) = 0.99005\,\text{GM}(y) \approx 0.99\,\text{GM}(y)$
or about 1% below the geometric mean.
Similarly, an MAE (log scale) of 0.10 would tell you that your original values typically deviate by about 10.5% from the geometric mean (since $\exp(0.10)\approx 1.105$). As you move further away (as the MAE gets bigger), this convenient approximate-percentage relationship changes.
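A quick numeric check of this relationship (a minimal sketch with NumPy):

```python
import numpy as np

# How far a point one MAE away from the mean (on the log scale)
# sits from the geometric mean, as a percentage on the original scale.
for mae in [0.01, 0.10, 0.50]:
    above = np.exp(mae) - 1     # fraction above GM(y)
    below = 1 - np.exp(-mae)    # fraction below GM(y)
    print(f"MAE={mae}: +{above:.1%} / -{below:.1%}")

# MAE=0.01: +1.0% / -1.0%
# MAE=0.1: +10.5% / -9.5%
# MAE=0.5: +64.9% / -39.3%
```

Note how the percentages above and below the geometric mean drift apart as the MAE grows, which is why the approximation only works for small values.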
There's nothing wrong with calculating an MAE on the log scale as long as you don't misinterpret what it is. If you want an MAE on the original scale, you'd need to compute it on that scale (though the fact that you're modelling the logs suggests it may not actually be especially useful on the original scale).
Best Answer
When you multiply your training data by 100, your predictions will also change by a factor of (about) 100. The MSE is the mean of the squared differences between actuals and predictions. If you scale both actuals and (roughly) predictions by a factor of 100, each difference is also scaled by 100, so its square is scaled by 10,000. It works out. The features have nothing to do with this effect.
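A quick sanity check (a minimal sketch with NumPy and made-up actuals and predictions):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # hypothetical actuals
y_pred = np.array([1.1, 1.9, 3.3])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)
mse_scaled = np.mean((100 * y_true - 100 * y_pred) ** 2)

print(mse_scaled / mse)   # 10000.0 -- MSE scales with the square of the factor
```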
The MSE is not a relative measure. It is just the mean of the squared errors. Yes, this is hard to interpret. You may want to look at Mean absolute error OR root mean squared error?
Scaling and normalizing will usually not help (except that scaling will scale the MSE, as above, but that is not helpful). Without knowing much more about your data, the best we can do is suggest How to know that your machine learning problem is hopeless?
This should not happen. The MAE is the mean of the absolute errors. Scaling the actuals (and therefore also the predictions) should scale the MAE by the same factor.
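In symbols, for a scaling factor $c > 0$:
$$\text{MAE}(c\,y, c\,\hat{y}) = \frac{1}{n}\sum_i |c\,y_i - c\,\hat{y}_i| = c \cdot \frac{1}{n}\sum_i |y_i - \hat{y}_i| = c\,\text{MAE}(y, \hat{y}).$$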
This may be helpful - it's written in the context of time series forecasting, but you can apply it in other contexts, too.