As was already noted in your previous question: first of all, for the method of moments (MM) to work, the distribution needs to have finite higher-order moments, so that the sums necessary for the MM converge.
In this case the MLE indicates that $\nu < 3$, so the third moment does not even exist and it makes no sense at all to use the method of moments. The method of moments is very restrictive, and in this case the MLE approach is giving you a good fit.
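To see the problem concretely, here is a small base-R illustration (the choice $\nu = 2.5$ and the sample sizes are arbitrary): with $\nu < 3$ the third moment of a Student-t does not exist, so the sample skewness never settles down as the sample grows.

```r
# With nu < 3 the third moment of a Student-t does not exist,
# so the sample skewness keeps jumping around as n grows.
set.seed(1)
nu <- 2.5  # assumed for illustration; any nu < 3 shows the same behaviour
for (n in c(1e3, 1e4, 1e5, 1e6)) {
  x <- rt(n, df = nu)
  skew <- mean((x - mean(x))^3) / sd(x)^3
  cat(sprintf("n = %7.0f   sample skewness = %8.3f\n", n, skew))
}
```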
If you want to consider an alternative method, you could use Bayesian inference.
Since the question was updated, I am updating my answer:
The first part (to compute the skewness, why not standardize both the mean and the variance?) is easy: that is precisely how it's done! See the definitions of skewness and kurtosis on Wikipedia.
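For reference, both are simply moments of the standardized variable $(X - \mu)/\sigma$:
$$\text{skewness} = \operatorname{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right], \qquad \text{kurtosis} = \operatorname{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right].$$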
The second part is both easy and hard. On the one hand, we could say that it is impossible to normalize a random variable to satisfy three moment conditions, as the linear transformation $X \to aX + b$ allows only for two. But on the other hand, why should we limit ourselves to linear transformations? Sure, shift and scale are by far the most prominent (maybe because they are sufficient most of the time, say for limit theorems), but what about higher-order polynomials, or taking logs, or convolving a variable with itself? In fact, isn't that what the Box-Cox transform is all about -- removing skew?
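As an aside, here is a minimal R sketch of that last point, using `MASS::boxcox` to pick a skew-removing power transform (the lognormal sample and the intercept-only model are assumptions purely for illustration):

```r
# A minimal sketch: picking a skew-removing power transform with MASS::boxcox.
library(MASS)

set.seed(1)
x <- rlnorm(1000)                           # positive, right-skewed data

bc     <- boxcox(lm(x ~ 1), plotit = FALSE) # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]             # lambda with the highest likelihood

# Apply the Box-Cox transform; lambda near 0 means "take logs"
y <- if (abs(lambda) < 1e-2) log(x) else (x^lambda - 1) / lambda

# The transformed sample skewness should be much closer to 0 than the original
mean((y - mean(y))^3) / sd(y)^3
```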
But in the case of more complicated transformations, I think, the context and the transformation itself become important, so maybe that is why there are no more "moments with names". That does not mean that random variables are not transformed or that the moments are not calculated; on the contrary: you just choose your transformation, calculate what you need, and move on.
My old answer, about why central moments represent shape better than raw moments:
The keyword is shape. As whuber suggested, by shape we want to consider the properties of the distribution that are invariant to translation and scaling. That is, when you consider the variable $X + c$ instead of $X$, you get the same distribution function (just shifted to the right or left), so we would like to say that its shape has stayed the same.
The raw moments do change when you translate the variable, so they reflect not only the shape but also the location. In fact, you can take any random variable and shift it ($X \to X + c$) appropriately to get any value for, say, its third raw moment.
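To see this, expand the third raw moment of the shifted variable:
$$\operatorname{E}\!\left[(X+c)^3\right] = \operatorname{E}[X^3] + 3c\operatorname{E}[X^2] + 3c^2\operatorname{E}[X] + c^3,$$
which is a cubic in $c$ with leading coefficient $1$, so it takes every real value as $c$ ranges over the reals.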
The same observation holds for all odd moments and, to a lesser extent, for even moments (they are bounded from below, and the lower bound does depend on the shape).
The central moments, on the other hand, do not change when you translate the variable, and that is why they are more descriptive of the shape. For example, if an even central moment is large, you know that the random variable has some mass not too close to the mean. Or if an odd central moment is zero, you know that the random variable has some symmetry around the mean.
The same argument extends to scale, which is the transformation $X \to cX$. The usual normalization in this case is division by the standard deviation, and the corresponding moments are called normalized moments, at least on Wikipedia.
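Concretely, the $k$-th normalized moment is the $k$-th central moment divided by the $k$-th power of the standard deviation:
$$\tilde{\mu}_k = \frac{\operatorname{E}\!\left[(X-\mu)^k\right]}{\sigma^k},$$
so that $\tilde{\mu}_3$ is the skewness and $\tilde{\mu}_4$ is the kurtosis.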
Best Answer
`fGarch::stdFit` uses the same method as `MASS::fitdistr`: maximum likelihood estimation. However, they use different parametrizations of the likelihood. The parameters in the output are linked in this way:
$$\text{df} = \nu, \qquad \text{mean} = m, \qquad \text{sd} = \sqrt{s^2 \, \frac{\nu}{\nu-2}}$$
The parametrization on the left (`fGarch::stdFit`) is in terms of the moments, while the one on the right (`MASS::fitdistr`) is in terms of the transformation from the usual standard Student-t. They are equivalent, but the moment parametrization may be less confusing: I have often seen the mistake of equating the scale parameter with the standard deviation, which does not hold for the Student-t, unlike the normal distribution.

For the Student-t distribution, the MLE is obtained from a numerical optimization of the likelihood. As such, it requires a starting guess for the parameters, and when you don't provide one, a default is computed for you. `MASS::fitdistr` uses the median as the initial guess for the location, the interquartile range for the scale, and 10 for the degrees of freedom. `fGarch::stdFit` bases its initial guess on the mean, the standard deviation, and 4 degrees of freedom.

Since they use the same general method (aside from the initial guess and the parametrization), the results should in general be essentially the same once you convert from one parametrization to the other. They may vary depending on which initial guess is more appropriate for your data, and on whatever (different) numerical difficulties arise during the optimization.
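To check the equivalence numerically, here is a minimal sketch on simulated data (the sample size and true parameters are arbitrary choices): fit with both functions, then map `MASS::fitdistr`'s $(m, s, \text{df})$ into `fGarch::stdFit`'s moment parametrization.

```r
# A minimal sketch comparing the two fits on simulated data and converting
# MASS::fitdistr's (m, s, df) into fGarch's (mean, sd, nu) parametrization.
library(MASS)
library(fGarch)

set.seed(42)
x <- rstd(5000, mean = 0, sd = 1, nu = 5)  # simulate from fGarch's "std"

fit1 <- stdFit(x)          # moment parametrization: mean, sd, nu
fit2 <- fitdistr(x, "t")   # transformation parametrization: m, s, df

m  <- fit2$estimate["m"]
s  <- fit2$estimate["s"]
df <- fit2$estimate["df"]

# Convert (m, s, df) to (mean, sd, nu); the sd only exists for df > 2
converted <- c(mean = unname(m),
               sd   = unname(sqrt(s^2 * df / (df - 2))),
               nu   = unname(df))

fit1$par    # should essentially match `converted`
converted
```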
Your 4th question feels like a different question which should be asked separately.