Maximizing correlation is useful when the output is highly noisy, in other words, when the relationship between inputs and outputs is very weak. In such a case, minimizing the MSE will tend to make the output close to zero (the mean of centered training outputs), so that the prediction error is the same as the variance of the training output.
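As a quick sanity check (a made-up example, not from the original post): with pure-noise outputs, the MSE-optimal prediction is a constant at the mean, and the achieved MSE is exactly the output variance.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=10_000)  # outputs with no usable signal

# With nothing informative to condition on, the MSE-optimal prediction
# is the constant mean of y, and the achieved MSE equals the variance of y.
const_pred = y.mean()
print(np.mean((y - const_pred) ** 2))  # ~4.0
print(y.var())                         # ~4.0, i.e. the same value
```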
Directly using correlation as the objective function is possible with a gradient descent approach (simply switch to minimizing the negative correlation). However, I do not know how to optimize it with an SGD approach, because the cost function and its gradient involve the outputs of all training samples.
Another way to maximize correlation is to minimize the MSE while constraining the output variance to equal the training output variance. However, the constraint also involves all outputs, so (in my opinion) there is no way to take advantage of an SGD optimizer.
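For the first approach, here is a minimal full-batch sketch (PyTorch is my own choice here; the post does not name a framework, and the data is a hypothetical toy example). The loss is the negative Pearson correlation, and because the mean and norm are computed over the whole training set, every gradient step touches all samples, which is exactly why plain mini-batch SGD is awkward.

```python
import torch

# Toy data (hypothetical): a weak linear signal buried in noise.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = X @ torch.tensor([0.5, -0.2, 0.1, 0.0]) + 2.0 * torch.randn(256)

w = torch.randn(4, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

def neg_correlation(pred, target):
    # Pearson correlation computed over the *entire* training set;
    # the centering and normalization couple all samples together.
    pc = pred - pred.mean()
    tc = target - target.mean()
    return -(pc * tc).sum() / (pc.norm() * tc.norm() + 1e-12)

for step in range(500):
    opt.zero_grad()
    loss = neg_correlation(X @ w, y)  # full batch, not a mini-batch
    loss.backward()
    opt.step()
```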
EDIT:
If the top layer of the neural network is a linear output layer, we can minimize the MSE and then adjust the weights and bias in the linear layer to maximize the correlation. The adjustment can be done similarly to CCA (https://en.wikipedia.org/wiki/Canonical_analysis).
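A hedged sketch of that adjustment, assuming we can extract the activations H of the last hidden layer after MSE training (H and Y below are hypothetical stand-ins): scikit-learn's CCA finds paired linear projections of H and Y with maximal correlation, and the H-side weights then act, roughly in the spirit the post describes, as a correlation-maximizing replacement for the network's linear output layer.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical stand-ins: H = last-hidden-layer activations after
# MSE training, Y = the training targets.
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 16))
Y = H[:, :2] @ rng.normal(size=(2, 3)) + rng.normal(size=(500, 3))

# CCA finds linear maps of H and Y whose projections are maximally
# correlated; the H-side rotation (plus centering) gives the adjusted
# linear readout.
cca = CCA(n_components=Y.shape[1])
cca.fit(H, Y)
H_proj, Y_proj = cca.transform(H, Y)
```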
When I implement a simple linear regression model using scikit-learn in Python, I get an MSE of about 2.037727147668752e-07. However, I noticed that if I multiplied all my features and the value to be predicted by, say, 100, the MSE changed to 0.0024.
When you multiply your training data by 100, your predictions will also change by a factor of (about) 100. The MSE is the mean of the squared differences between actuals and predictions. If you scale both actuals and (roughly) predictions by a factor of 100, the differences are also scaled by 100, so the squared differences are scaled by 10,000. The numbers check out: 2.04e-07 × 10,000 ≈ 0.002, close to the 0.0024 you observed. The features as such have nothing to do with this effect; it is driven by the scaling of the target.
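A quick way to reproduce this with scikit-learn (synthetic data, just to show the effect):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)

# Fit once on the original scale, once with everything multiplied by 100.
mse = mean_squared_error(y, LinearRegression().fit(X, y).predict(X))
mse_scaled = mean_squared_error(
    100 * y, LinearRegression().fit(100 * X, 100 * y).predict(100 * X)
)
print(mse_scaled / mse)  # ~10_000, i.e. the factor of 100 squared
```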
If the MSE is a metric that is to be used on a relative scale, how do I interpret it? Does an error of 0.002 mean that if my actual value is 0.008, my predicted value is 0.008 +/- 0.002, i.e. either 0.006 or 0.01?
The MSE is not a relative measure; it is just the mean of the squared errors, so its units are the squared units of the target. No, an MSE of 0.002 does not translate into predictions of 0.008 +/- 0.002. Yes, this is hard to interpret. You may want to look at the thread Mean absolute error OR root mean squared error?
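For instance (made-up numbers), the RMSE and the MAE bring the error back into the units of the target, which is usually easier to explain:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([0.008, 0.012, 0.010, 0.009])
y_pred = np.array([0.010, 0.009, 0.012, 0.007])

mse = mean_squared_error(y_true, y_pred)    # in squared units of y
rmse = np.sqrt(mse)                         # back in the units of y
mae = mean_absolute_error(y_true, y_pred)   # also in the units of y
print(mse, rmse, mae)
```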
There is a large difference between the actual and predicted values. Are there any specific regression machine learning models that work well for this kind of problem? Will normalizing the data or scaling it help improve performance, and if so, why?
Scaling and normalizing will usually not help (except that scaling will scale the MSE, as above, but that is not helpful). Without knowing much more about your data, the best we can do is suggest the thread How to know that your machine learning problem is hopeless?
I noticed that MAE remained constant regardless of the scale. Is this an absolute error measure?
This should not happen. The MAE is the mean of absolute errors. Scaling the actuals (and therefore also the predictions) should scale the MAE by the same amount.
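You can verify the expected behaviour directly (again with made-up numbers); if your computed MAE really stays constant under scaling, something else in the pipeline is off:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0.008, 0.012, 0.010])
y_pred = np.array([0.010, 0.009, 0.012])

mae = mean_absolute_error(y_true, y_pred)
mae_scaled = mean_absolute_error(100 * y_true, 100 * y_pred)
print(mae_scaled / mae)  # exactly 100, since |100a - 100b| = 100 * |a - b|
```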
What other metric can I use to evaluate the performance of my model?
This may be helpful - it's written in the context of time series forecasting, but you can apply it in other contexts, too.
Best Answer
One of the reasons the MAE is used in time series or forecasting is that non-scientists find it easy to understand. So if you tell your client the MAE is 1.5 units, for example, they can interpret that as the average amount by which the forecast is in error (in absolute units). But if you tell them the MSE, you may well get a blank look, because it has no such interpretation.
I'm not sure what causes the confusion between MAE and mean absolute deviation, but I'd attribute it to a lack of clear definitions or explanations in the specific context where it is used.