Solved – How to calculate mean, median, mode, std dev from distribution


How do I calculate the mean, variance, median, standard deviation and mode (modus) of a distribution? Suppose I randomly generate numbers from a normal distribution, specifying the mean as 24.2 and the standard deviation as 2.2:

> dist = rnorm(n=1000, mean=24.2, sd=2.2)

Then I can do the following:

Mean:

> mean(dist)
[1] 24.17485

Variance:

> var(dist)
[1] 4.863573

Median:

> median(dist)
[1] 24.12578

Standard deviation:

> sqrt(var(dist))
[1] 2.205351

Mode aka Modus (taken from here):

> names(sort(-table(dist)))[1]
[1] "17.5788181686221"

1. Is this the whole magic, or is there something else that I have not realized?
2. Can I somehow visualize my bell-shaped normal distribution with vertical lines representing the mean, median, and so on?
3. What do those attributes say about the distribution?

PS: code is in R

Best Answer

First a general comment on the mode:

You should not use that approach to get the mode of (at least notionally) continuously distributed data. You're unlikely to have any repeated values: unless you have a truly huge sample it would be a minor miracle, and even then various numeric issues could make it behave in somewhat unexpected ways. With every value occurring exactly once, all the counts tie at 1, so you'll generally just get the minimum value that way. It would be one way to find one of the global modes in discrete or categorical data, but I probably wouldn't do it that way even then. Here are several other approaches to get the mode for discrete or categorical data:

x = rpois(30,12.3)

tail(sort(table(x)),1)   #1: category and count; if multimodal this only gives one

w=table(x); w[max(w)==w] #2: category and count; this can find more than one mode

which.max(table(x))      #3: category and *position in table*; only finds one mode

If you just want the value and not the count or position, names() will extract it from any of those results.
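As a small illustration of pulling out just the values with names(), here is a sketch on a made-up vector (the data are hypothetical and chosen so that two values tie for the mode):

```r
# Made-up discrete data in which 3 and 7 each occur three times
x <- c(3, 7, 7, 2, 7, 3, 3, 9)

w <- table(x)                                # counts per distinct value
modes <- as.numeric(names(w)[w == max(w)])   # names() gives the values as strings
modes                                        # both tied modes: 3 and 7
```

Note that names() returns character strings, so as.numeric() is needed if you want to do arithmetic with the result.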

To identify modes (there can be more than one local mode) for continuous data in a basic fashion, you could bin the data (as with a histogram) or you could smooth it (using density for example) and attempt to find one or more modes that way.
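A minimal sketch of the smoothing approach, assuming the default kernel and bandwidth of density() are acceptable (they are a choice, and a different bandwidth can move the estimate); the seed and the name mode_est are only for illustration:

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 24.2, sd = 2.2)   # same parameters as in the question

d <- density(x)                           # kernel density estimate on a grid
mode_est <- d$x[which.max(d$y)]           # grid point where the estimate peaks
mode_est
```

This finds only the single highest peak; locating several local modes would need a search for all local maxima of d$y.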

Fewer histogram bins will make your estimate of a mode less subject to noise, but the location won't be pinned down to better than the bin-width (i.e. you only get an interval). More bins may allow more precision within a bin, but noise may make it jump around across many such bins; a small change in bin-origin or bin width may produce relatively large changes in mode. (There's the same bias-variance tradeoff all over statistics.)
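To see the effect of the bin count, one could compare the fullest bin under a coarse and a fine binning. This is only a sketch: modal_bin is a hypothetical helper written for this illustration, and the bin counts 10 and 100 are arbitrary (hist() treats a single number for breaks as a suggestion).

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 24.2, sd = 2.2)

# Hypothetical helper: the interval covered by the bin with the highest count
modal_bin <- function(x, breaks) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  i <- which.max(h$counts)                # first bin attaining the maximum count
  c(lower = h$breaks[i], upper = h$breaks[i + 1])
}

coarse <- modal_bin(x, breaks = 10)       # few, wide bins: stable but imprecise
fine   <- modal_bin(x, breaks = 100)      # many, narrow bins: precise but noisy
coarse
fine
```

Rerunning with a different seed should leave the coarse interval roughly where it was, while the fine interval may move around noticeably.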

Note that summary will give you several basic statistics.

[You should use sd(x) rather than sqrt(var(x)); it's clearer for one thing]
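For instance, on a sample like the one in the question (the seed is arbitrary), summary() reports the minimum, quartiles, median, mean and maximum in one call, and sd() agrees with sqrt(var()):

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 24.2, sd = 2.2)

s <- summary(x)                           # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s

same <- all.equal(sd(x), sqrt(var(x)))    # the two expressions give the same number
same
```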

In respect of q.2 yes; you could certainly show mean and median of the data on a display such as a histogram or a box plot. See here for some examples and code that you should be able to generalize to whatever cases you need.
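One possible way to draw such a display in base R is sketched below; the number of bins, colours and line types are arbitrary choices, not anything prescribed by the answer.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 24.2, sd = 2.2)

# Histogram on the density scale, with a smoothed density curve on top
hist(x, breaks = 30, freq = FALSE,
     main = "Sample with mean and median", xlab = "x")
lines(density(x))

# Vertical lines marking the sample mean and the sample median
abline(v = mean(x),   col = "red",  lwd = 2)
abline(v = median(x), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))
```

For a symmetric distribution such as the normal, the two lines should fall almost on top of each other; in skewed data they separate.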

Related Solutions

Solved – Standardized residuals in R’s lm output

If you look at the code for plot.lm (by typing stats:::plot.lm), you see these snippets in there (the comments are mine; they're not in the original):

r <- residuals(x)                                # <---  r contains residuals

...

if (any(show[2L:6L])) {
    s <- if (inherits(x, "rlm")) 
        x$s
    else if (isGlm) 
        sqrt(summary(x)$dispersion)   
    else sqrt(deviance(x)/df.residual(x))        #<---- value of s
    hii <- lm.influence(x, do.coef = FALSE)$hat  #<---- value of hii

...

    r.w <- if (is.null(w)) 
        r                                        #<-- r.w  for unweighted regression
    else sqrt(w) * r
    rs <- dropInf(r.w/(s * sqrt(1 - hii)), hii)  # <-- std. residual in plots

So - if you don't use weights - the code clearly defines its standardized residuals to be the internally studentized residuals defined here:

http://en.wikipedia.org/wiki/Studentized_residual#How_to_studentize

which is to say:

$${\widehat{\varepsilon}_i\over \widehat{\sigma} \sqrt{1-h_{ii}\ }}$$

(where $\widehat{\sigma}^2={1 \over n-m}\sum_{j=1}^n \widehat{\varepsilon}_j^{\,2}$, and $m$ is the column dimension of $X$).
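As a check on the formula, the same quantity can be computed by hand for an unweighted fit and compared with rstandard(). The model below, fitted to the built-in cars data, is just an example chosen for this sketch.

```r
fit <- lm(dist ~ speed, data = cars)          # example model on a built-in dataset

e <- residuals(fit)                           # raw residuals
h <- lm.influence(fit, do.coef = FALSE)$hat   # leverages h_ii
s <- sqrt(deviance(fit) / df.residual(fit))   # estimate of sigma, as in plot.lm

rs_manual <- e / (s * sqrt(1 - h))            # internally studentized residuals
agree <- all.equal(unname(rs_manual), unname(rstandard(fit)))
agree
```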
