Your question is a little vague, but no, variance isn't used because of its association with the normal distribution. Most distributions have at least a mean and a variance. Some do not have a variance. Some have a variance only for certain parameter values; the Student-t, for example, has a finite variance only when its degrees of freedom exceed 2. Some, such as the Cauchy, have no mean and so do not have a variance either.
Just for mental clarification on your side, if a distribution has a mean then $\bar{x}\approx\mu$ for a large sample, but if it does not then $\bar{x}\approx\text{nothing}$. That is, it gravitates nowhere and any calculation just floats around the real number line. It doesn't mean anything. The same is true if you calculate a standard deviation for a distribution that does not have one. It has no meaning.
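A small simulation can make the "gravitates nowhere" point concrete. The sketch below (the function names, seed, and checkpoints are my own choices for illustration) tracks the running sample mean of standard normal draws, which has a mean, against standard Cauchy draws, which does not.

```python
import math
import random


def running_means(draw, n, checkpoints):
    """Return the sample mean after each checkpoint number of draws."""
    targets = set(checkpoints)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += draw()
        if i in targets:
            means.append(total / i)
    return means


def compare(seed=1, n=100_000):
    """Running means for normal draws versus Cauchy draws."""
    rng = random.Random(seed)
    checkpoints = [100, 1_000, 10_000, 100_000]
    normal = running_means(lambda: rng.gauss(0.0, 1.0), n, checkpoints)
    # Standard Cauchy via the inverse CDF: tan(pi * (u - 1/2)) for uniform u.
    cauchy = running_means(
        lambda: math.tan(math.pi * (rng.random() - 0.5)), n, checkpoints
    )
    return checkpoints, normal, cauchy


if __name__ == "__main__":
    for size, norm_mean, cauchy_mean in zip(*compare()):
        print(f"n={size:>7}  normal mean={norm_mean:+.4f}  Cauchy mean={cauchy_mean:+.4f}")
```

The normal column should shrink toward 0 as the sample grows. The Cauchy column has no such target, so rerunning with other seeds will typically show it jumping around no matter how large the sample gets.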
The variance is a property of a distribution. You are correct in that it can be used to scale the problem, but it is deeper than that. In some theoretical frameworks, it is a measure of our ignorance, or more precisely, uncertainty. In others, it measures how large of an effect chance can have on outcomes.
Although variance is a conceptualization of dispersion, it is an incomplete conceptualization. Both skew and kurtosis further explain how the dispersion operates on a problem.
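As a rough illustration of what skew and kurtosis add beyond the variance, here is a minimal sketch that computes the standardized third and fourth moments of a sample. It uses the simple population-style (divide by $n$) formulas, which is an assumption on my part; statistical packages often apply small-sample corrections.

```python
def shape_summary(data):
    """Return (variance, skewness, excess kurtosis) using divide-by-n moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return m2, skewness, excess_kurtosis


if __name__ == "__main__":
    symmetric = [-2, -1, 0, 1, 2]
    lopsided = [-1, -1, -1, -1, 4]
    print(shape_summary(symmetric))
    print(shape_summary(lopsided))
```

The first sample is symmetric, so its skewness is zero. The second has the same mean but a long right tail, which shows up as positive skewness that the variance alone cannot reveal.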
For many problems in a null hypothesis framework of thinking, the Central Limit Theorem makes the discussion of problems simpler and so it doesn't hurt that there is a linkage between the normal distribution, with its very well defined distributional properties, and the use of the standard deviation. However, this is more true for simple problems than complex ones. This is also less true for Bayesian methods which do not use a null hypothesis and which do not depend on the sampling distribution of the estimator.
The average absolute deviation is a valuable tool in parameter free and distribution free methods, but less valuable for the uniform distribution. If you actually had a bounded uniform distribution, then the mean and the variance are known.
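For reference, assuming a continuous uniform distribution on $[a, b]$ (the symbols $a$ and $b$ are mine, not from the question), the mean, the variance, and the average absolute deviation all follow directly from the bounds:
$$\mu = \frac{a+b}{2}$$
$$\sigma^2 = \frac{(b-a)^2}{12}$$
$$\mathbb{E}\,|X-\mu| = \frac{b-a}{4}$$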
Let me give you a uniform distribution problem that may not be as simple as you think. Consider that a new enemy battle tank has appeared on the battlefield. You did not even know it existed, let alone how many of them there are. You want to estimate the total number of tanks.
Tanks have serial numbers on their engines, or used to before someone figured this out. The probability of capturing any one specific serial number is $1/N$, where $N$ is the total number of tanks. Of course you do not know $N$, so this is an interesting problem. You need to know $N$. You can only see the distribution of captured serial numbers, and you cannot know whether the largest number captured belongs to the last tank built. It probably does not.
In that case, the mean and standard deviation provide the most powerful tools to solve the problem, despite the intuition that the standard deviation is a bad estimator.
It will be true that it is a bad estimator for certain problems, but you need to learn them on a case by case basis.
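To make the tank problem concrete, here is a small simulation sketch. It assumes serial numbers run from 1 to $N$ and that captures are a simple random sample without replacement. It tries two textbook-style estimators of $N$: a moment-based one using the sample mean and standard deviation (for a uniform distribution the upper end sits at roughly $\bar{x} + \sqrt{3}\,s$) and the classic maximum-based one ($m + m/k - 1$ for sample maximum $m$ and sample size $k$). Neither is necessarily the method this answer has in mind; the function names and default values are mine.

```python
import math
import random
import statistics


def moment_estimate(serials):
    """Estimate N from the sample mean and SD: mean + sqrt(3) * sd."""
    return statistics.mean(serials) + math.sqrt(3) * statistics.stdev(serials)


def max_estimate(serials):
    """Estimate N from the sample maximum: m + m/k - 1."""
    k = len(serials)
    m = max(serials)
    return m + m / k - 1


def simulate(n_tanks=1000, k=20, trials=2000, seed=0):
    """Capture k serials at random, many times, and collect both estimates."""
    rng = random.Random(seed)
    results = {"moment": [], "max": []}
    for _ in range(trials):
        serials = rng.sample(range(1, n_tanks + 1), k)
        results["moment"].append(moment_estimate(serials))
        results["max"].append(max_estimate(serials))
    return results


if __name__ == "__main__":
    for name, estimates in simulate().items():
        print(name, round(statistics.mean(estimates), 1),
              round(statistics.stdev(estimates), 1))
```

Printing the average and spread of each estimator over many simulated captures lets you check how each behaves for your own choice of $N$ and $k$ rather than taking it on faith.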
Statistical tools are chosen based on needs, rules of math and trade-offs between real world costs and limitations and the demands of the problem. Sometimes that is the variance, but sometimes it is not. The best thing to do is to learn why the rules are designed the way they are and that is too long for a posting here.
I would recommend a good practitioner's book on non-parametric statistics and, if you have had calculus, a good introductory practitioner's book on Bayesian methods.
`fGarch::stdFit` uses the same method as `MASS::fitdistr`: maximum likelihood estimation. However, they use different parametrizations of the likelihood. The parameters in the output are linked in this way:
$$\text{df} = \nu$$
$$\text{mean} = m$$
$$\text{sd} = \sqrt{s^2 \frac{\nu}{\nu-2}}$$
The one on the left (`fGarch::stdFit`) is a parametrization in terms of the moments, while the one on the right (`MASS::fitdistr`) is in terms of the transformation from the usual standard Student-t. They are equivalent, but the moment parametrization may be less confusing (I have often seen the mistake that scale parameter = standard deviation, which is not the case for Student-t, unlike the normal distribution).
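The conversion between the two parametrizations is just the formula above, $\text{sd} = \sqrt{s^2\,\nu/(\nu-2)}$. Here is a minimal, language-agnostic sketch of the arithmetic in Python; the helper names are hypothetical and are not part of either R package.

```python
import math


def scale_to_sd(s, nu):
    """Convert a Student-t scale parameter s to a standard deviation."""
    if nu <= 2:
        raise ValueError("the standard deviation is only finite for nu > 2")
    return s * math.sqrt(nu / (nu - 2.0))


def sd_to_scale(sd, nu):
    """Convert a Student-t standard deviation back to the scale parameter."""
    if nu <= 2:
        raise ValueError("the standard deviation is only finite for nu > 2")
    return sd * math.sqrt((nu - 2.0) / nu)


if __name__ == "__main__":
    # With nu = 4 the standard deviation is sqrt(2) times the scale.
    print(scale_to_sd(2.0, 4.0))
    print(sd_to_scale(scale_to_sd(2.0, 4.0), 4.0))
```

The guard on $\nu$ is there because the ratio $\nu/(\nu-2)$ only makes sense when the variance exists, which is exactly why the scale and the standard deviation should not be confused.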
For the Student-t distribution, the MLE is obtained from a numerical optimization of the likelihood. As such, it requires a starting guess for the parameters, and when you don't provide one, a default is computed for you. `MASS::fitdistr` uses an initial guess for location using the median, one for the scale using interquartile range, and 10 for the degrees of freedom. `fGarch::stdFit` bases its initial guess on the mean, standard deviation, and 4 degrees of freedom.
Since they use the same general method (aside from the initial guess and the parametrization), the results should be essentially the same in general, once you convert from one parametrization to the other. They may vary depending on which initial guess is more appropriate for your data, and on whatever (different) numerical difficulties may arise during the optimization.
Your 4th question feels like a different question which should be asked separately.
“Variance” has a definite meaning. Variance always means the second central moment, and when we estimate or test the variance, we are estimating or testing this quantity.
“Scale” is more general. It refers to spread of the data in some way but without committing to discussing the second central moment. After all, the second central moment might not exist!
I like the definition I’m seeing on Wikipedia for a scale parameter $s$ (and other parameters $\theta$):
$$F(x; s, \theta)=F(x/s;1,\theta)$$
So if some $s$ allows us to stretch or compress the CDF to some standardized CDF, we call it a scale parameter. It might be related to the variance, but maybe not.
https://en.wikipedia.org/wiki/Scale_parameter
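The scale-parameter identity $F(x; s, \theta)=F(x/s;1,\theta)$ can be checked numerically. The sketch below (the function names are mine) tests it for the normal CDF, where the scale happens to equal the standard deviation, and for the Cauchy CDF, where a scale exists even though the variance does not. A location shift is included as a counterexample that fails the identity.

```python
import math


def normal_cdf(x, s=1.0):
    """CDF of a zero-mean normal with scale (standard deviation) s."""
    return 0.5 * (1.0 + math.erf(x / (s * math.sqrt(2.0))))


def cauchy_cdf(x, s=1.0):
    """CDF of a zero-centered Cauchy with scale s (it has no variance)."""
    return 0.5 + math.atan(x / s) / math.pi


def shifted_normal_cdf(x, m=1.0):
    """CDF of a unit-variance normal with location m; m is not a scale."""
    return normal_cdf(x - m)


def behaves_like_scale(cdf, s, xs, tol=1e-12):
    """Check F(x; s) == F(x / s; 1) at every point in xs."""
    return all(abs(cdf(x, s) - cdf(x / s, 1.0)) <= tol for x in xs)


if __name__ == "__main__":
    points = [-3.0, -1.0, 0.0, 0.5, 4.0]
    print(behaves_like_scale(normal_cdf, 2.5, points))
    print(behaves_like_scale(cauchy_cdf, 2.5, points))
    print(behaves_like_scale(shifted_normal_cdf, 2.5, points))
```

The first two checks pass and the third fails, which matches the point that "scale" is about stretching the CDF and need not have anything to do with a second central moment.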