How to calculate mean, variance, median, standard deviation and modus from a distribution?
Suppose I randomly generate numbers that form a normal distribution, with the mean specified as m=24.2 and the standard deviation as sd=2.2:
> dist = rnorm(n=1000, m=24.2, sd=2.2)
Then I can do the following:
Mean:
> mean(dist)
[1] 24.17485
Variance:
> var(dist)
[1] 4.863573
Median:
> median(dist)
[1] 24.12578
Standard deviation:
> sqrt(var(dist))
[1] 2.205351
Mode aka Modus (taken from here):
> names(sort(-table(dist)))[1]
[1] "17.5788181686221"
- Is this the whole magic, or is there something else that I did not realize?
- Can I somehow visualize my bell-shaped normal distribution with vertical lines representing (mean, median…)?
- What do those attributes say about the distribution?
PS: code is in R
Best Answer
First a general comment on the mode:
You should not use that approach to get the mode of (at least notionally) continuously distributed data. You're unlikely to have any repeated values (unless you have truly huge samples it would be a minor miracle, and even then various numeric issues could make it behave in somewhat unexpected ways), so you'll generally just get the minimum value that way.
It would be one way to find one of the global modes in discrete or categorical data, but I probably wouldn't do it that way even then. Here are several other approaches to get the mode for discrete or categorical data:
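As a minimal sketch of what such approaches could look like (the sample x, the seed and the Poisson parameters are illustrative assumptions, not taken from the question), each of the following works from a frequency table of the data:

```r
# Hypothetical discrete sample; seed and parameters are illustrative only
set.seed(1)
x <- rpois(30, 12.3)

counts <- table(x)

# 1: value and count of one mode (only one is returned if several tie)
tail(sort(counts), 1)

# 2: value and count of every mode tied for the highest frequency
counts[counts == max(counts)]

# 3: value and its *position in the table*; only finds one mode
which.max(counts)
```

The second form is the only one of the three that reports ties, which may matter for small samples.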
If you just want the value and not the count or position, names() will get it from those.
To identify modes (there can be more than one local mode) for continuous data in a basic fashion, you could bin the data (as with a histogram) or you could smooth it (using density for example) and attempt to find one or more modes that way.
Fewer histogram bins will make your estimate of a mode less subject to noise, but the location won't be pinned down to better than the bin-width (i.e. you only get an interval). More bins may allow more precision within a bin, but noise may make it jump around across many such bins; a small change in bin-origin or bin width may produce relatively large changes in mode. (There's the same bias-variance tradeoff all over statistics.)
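Both ideas can be sketched in a few lines of base R. This is only an illustration under assumptions: the seed, the default bandwidth of density() and the choice of breaks = 20 are arbitrary, and each returns a single global mode estimate rather than all local modes.

```r
# Simulated data as in the question; the seed is an illustrative assumption
set.seed(42)
dist <- rnorm(n = 1000, mean = 24.2, sd = 2.2)

# Smoothing: location of the highest point of a kernel density estimate
d <- density(dist)
mode_kde <- d$x[which.max(d$y)]

# Binning: midpoint of the fullest histogram bin (breaks = 20 is only a suggestion to hist)
h <- hist(dist, breaks = 20, plot = FALSE)
mode_hist <- h$mids[which.max(h$counts)]

c(kde = mode_kde, hist = mode_hist)
```

Changing breaks (or the bw argument of density()) and rerunning is a quick way to see how sensitive the estimate is to the amount of smoothing.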
Note that summary will give you several basic statistics.
[You should use sd(x) rather than sqrt(var(x)); it's clearer for one thing.]
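For instance, on simulated data like that in the question (the seed here is an illustrative assumption), the two calls could be used as follows:

```r
# Simulated data as in the question; the seed is an illustrative assumption
set.seed(42)
dist <- rnorm(n = 1000, mean = 24.2, sd = 2.2)

# Minimum, quartiles, median, mean and maximum in one call
summary(dist)

# Sample standard deviation, equivalent to sqrt(var(dist))
sd(dist)
all.equal(sd(dist), sqrt(var(dist)))
```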
In respect of q.2 yes; you could certainly show mean and median of the data on a display such as a histogram or a box plot. See here for some examples and code that you should be able to generalize to whatever cases you need.
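One possible way to do this with base graphics is sketched below. The seed, colours, line types and legend placement are arbitrary choices for illustration, not something prescribed by the question.

```r
# Simulated data as in the question; the seed is an illustrative assumption
set.seed(42)
dist <- rnorm(n = 1000, mean = 24.2, sd = 2.2)

m   <- mean(dist)
med <- median(dist)
s   <- sd(dist)

# Histogram on the density scale so a normal curve can be overlaid
hist(dist, freq = FALSE, main = "Simulated normal sample", xlab = "value")

# Normal density using the sample mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE)

# Vertical lines marking the mean and the median
abline(v = m,   col = "red",  lwd = 2)
abline(v = med, col = "blue", lwd = 2, lty = 2)

legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))
```

For a symmetric sample like this one, the two lines will sit almost on top of each other; on skewed data they separate visibly.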