Solved – Median and mode

discrete data, median, mode

Consider grouped data with frequencies, and more specifically discrete data.

Suppose several distinct values $x_i$ all have the same frequency. Which is the right answer: that there is no mode, or that every modality is a mode?

Example:

value       frequency
 0           42
 1           42
 2           42
 3           42
 4           42
 5           42
 6           42

So which is right: no mode, or multiple modes?

For the median, some of the literature computes it as the average of the two $x_i$ on either side of $F = 0.5$ (the last with $F < 0.5$ and the first with $F > 0.5$), or gives the interval $[a,b]$. Other sources say that the median is the first $x_i$ with $F \ge 0.5$, because in the discrete case the $x_i$ are integers and the average may not be an integer. Also, the median is the middle value, so why should we create a fictitious value (the average)? To develop this point:

a) To determine the median of a discrete statistical series, sort the values of the variable in ascending order. When the total number $N$ is odd, the median is the value at rank $(N+1)/2$. When $N$ is even, the median is half the sum of the values at ranks $N/2$ and $N/2 + 1$.

This point is clear.

b) To determine the median of a discrete statistical series where each value $x_n$ is assigned a frequency, compute the increasing cumulative frequencies and take the first value at which they reach or exceed half of the total $N$. (I want to verify this last point.)

From the previous example: for an even number of ungrouped values, it is clear that we average the two middle values. For the grouped data above (values 0 to 6, each with frequency 42), is the median the average of the two modalities 2 and 3, or the first modality with $N_i \ge N/2$ (= 126)? In this example that gives $x = Me = 2$ or $3$ ($42 + 42 + 42 = 126$). Also, suppose we get a non-integer value, 1.5 for example: is it logical to say we have 1.5 children? That is why I prefer this definition: "To determine the median of a discrete statistical series where each value $x_n$ is assigned a frequency, compute the increasing cumulative frequencies until they exceed half of the total $N$."

Best Answer

Your question raises small questions of terminology and more interesting questions of how to think about data. I will stick with your question and focus on discrete variables. Most of what I say carries over to continuous variables, but with some need for re-wording and/or some differences in procedures.

First off, the mode is usually introduced and defined as just the most common value, namely the value with the highest frequency. That is the strictest sense of the term mode. When data are discrete, we look at frequencies and identify the value with the highest frequency.

When there are ties for the highest frequency, then so also there are ties for mode. With these invented data

value       frequency 
 0            1
 1           42 
 2            1
 3            1
 4           42 
 5            1 
 6            1

there is clearly a tie, with two modes at 1 and 4. However, had the frequencies been 42 and 41 it's my guess that most experienced users of statistics would still say that there were two modes, regardless of the rule that the mode is the value with the highest frequency. So, it is also true that a mode is a value with a pronounced peak in a frequency distribution, i.e. with a frequency notably higher than neighbouring values. (It's possible and common for a mode to be either the minimum or the maximum.)

Don't ask for a precise rule, or even rule of thumb, on what counts as pronounced or notable; it's what is obvious when you graph it and the decision comes quickly with a little experience.
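The two senses of "mode" described above can be compared with a minimal Python sketch. The function names `strict_modes` and `local_peaks` are hypothetical, and a strict local maximum is used only as a crude stand-in for a "pronounced" peak, since the answer deliberately gives no precise rule for that.

```python
def strict_modes(freq):
    """Values whose frequency equals the maximum frequency (ties included)."""
    top = max(freq.values())
    return sorted(v for v, f in freq.items() if f == top)


def local_peaks(freq):
    """Values whose frequency is strictly higher than both neighbours.

    Assumption: a missing neighbour at either end counts as frequency 0,
    so the minimum or maximum value can qualify as a peak.
    """
    values = sorted(freq)
    counts = [0] + [freq[v] for v in values] + [0]
    return [v for i, v in enumerate(values, start=1)
            if counts[i - 1] < counts[i] > counts[i + 1]]


# The invented table from the answer, a near-tie variant (42 vs 41),
# and the flat table from the question.
tied = {0: 1, 1: 42, 2: 1, 3: 1, 4: 42, 5: 1, 6: 1}
near_tie = {**tied, 4: 41}
flat = {v: 42 for v in range(7)}

print(strict_modes(tied))      # [1, 4]
print(strict_modes(near_tie))  # [1]
print(local_peaks(near_tie))   # [1, 4]
print(strict_modes(flat))      # [0, 1, 2, 3, 4, 5, 6]
print(local_peaks(flat))       # []
```

On the flat table from the question the two functions disagree: the strict rule returns every value and the peak rule returns none. That may be why both "all values are modes" and "there is no mode" are heard for such data.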

The importance and interest of modes often lies in what they indicate, which is sometimes that there are qualitatively different groups being mixed together in the sample, such as men and women or healthy and sick people. Sometimes there are physical reasons for having two modes. In some climates, there are two common states of cloudiness of the sky, almost cloud-free days and clouded-over days.

I've not seen the term modality being used except occasionally as meaning the number of modes. Statistical people certainly talk about bimodality, meaning that the data are bimodal, or have two modes; or multimodality, meaning that the data are multimodal, or have many modes. Some of these terms are a little unnecessary and arguably relict from a time when there was a stronger inclination among scholars and scientists to invent words based on Latin and Greek roots, but they are quite often used.

The second part of your question I read as asking whether the median should be computed as the average of two modes. I may be missing something here, but I guess you are mixing in a quite different question. The computation of medians has nothing to do with modes at all. It's just the convention that with an even number of values, you should report the median as the mean of the two middle values. That's a convention, but it is taught in introductory statistics courses as a rule to be followed. With grouped data, the principle is still the same. It's quite possible with discrete data that interpolation will cause the median to be reported as a value that is not observable, e.g. 2.5 children. That's not something to worry about.
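As a rough illustration, here is a minimal Python sketch of the median of a frequency table under both conventions mentioned in the question: the mean of the two middle values, and the first value whose cumulative frequency reaches $N/2$. The function name `grouped_median` and its `average_middle` flag are hypothetical. Note that the table as printed in the question has seven values, so $N = 7 \times 42 = 294$ and $N/2 = 147$. The figure 126 quoted there matches a six-value table (0 to 5), so both are shown.

```python
def grouped_median(freq, average_middle=True):
    """Median of discrete data given as {value: frequency}.

    average_middle=True  -> mean of the two middle values when N is even
    average_middle=False -> first value whose cumulative frequency >= N/2
    """
    values = sorted(freq)
    n = sum(freq.values())
    # 1-based ranks of the middle observation(s); they coincide when N is odd.
    lo_rank, hi_rank = (n + 1) // 2, n // 2 + 1

    def value_at(rank):
        # Walk the cumulative frequencies until they reach the given rank.
        running = 0
        for v in values:
            running += freq[v]
            if running >= rank:
                return v

    lo, hi = value_at(lo_rank), value_at(hi_rank)
    return (lo + hi) / 2 if average_middle else lo


seven_values = {v: 42 for v in range(7)}  # table as printed, N = 294
six_values = {v: 42 for v in range(6)}    # assumed variant, N = 252

print(grouped_median(seven_values))                       # 3.0
print(grouped_median(seven_values, average_middle=False))  # 3
print(grouped_median(six_values))                          # 2.5
print(grouped_median(six_values, average_middle=False))    # 2
```

The two conventions differ only when $N$ is even and the cumulative frequency hits $N/2$ exactly at the boundary between two values, as in the six-value variant. In the seven-value table both middle ranks (147 and 148) fall on the value 3.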

Back to terminology: I'd assert that modality is not another word for mode. Still less can it be used to refer to any value.

EDIT: I tried to pitch my original answer in a way that should help others apart from the OP. I focused on what seemed the more interesting question of what a mode is, and downplayed a question about the median which seems confused on a point well covered in just about every elementary text. I've not tried to keep pace with repeated edits of the original question that put more emphasis on how to compute medians.

Related Solutions

Solved – Identifying modes in floating point data

I saw you said you prefer Python, but there are a bunch of R libraries for this, see Highest Density Region function: http://cran.r-project.org/web/packages/hdrcde/hdrcde.pdf

Your second iteration of looking for the median wouldn't work, as your modes would balance each other. You would be better off finding the points of steepest ascent in the CDF.

Solved – Unimodality Test for Discrete Distribution

The implication of the question is that these datasets tabulate counts of values drawn independently from a discrete distribution defined on an ordered set of values such as $1,2,\ldots, 10.$ When that is the case, these counts have a multinomial distribution.

If by "mode" we mean a strict local maximum height in the graph (padding the left and right of the graph with zeros), or something like that, and if the counts are all relatively large (more than 5 or so ought to do), then an attractive method to assess the number of modes in the underlying distribution is with bootstrapping. The problem this solves is that the number of modes in the distribution might differ from the number of modes in the data. By reconstructing the experiment from the distribution defined by the data, we can see to what extent the number of modes might vary. This is "bootstrapping."

Carrying out the bootstrapping is easy: write a function to compute the number of modes in a graph and another one to repeatedly sample from the graph's data and apply that function to each sample. Tabulate the results. Example R code is below. When given a dataset like the second one in the question, it plots a chart of the bootstrapped mode frequencies.

In 676 of 1000 bootstrap samples there were two modes; in 293 there were three; and in 31 there were four. This indicates the data are consistent with an underlying distribution with two or perhaps three modes. There is some possibility of four. The likelihood of more than four is tiny.

These results intuitively make sense, because in the dataset the frequencies of the values $8,9,10$ were close and relatively small. It is possible the true frequency of $9$ is less than those of either $8$ or $10,$ causing there to be modes at $1,8,$ and $10.$ The bootstrapping gives us a sense of how much variation in modes is likely based on the random variation implied by the assumed sampling scheme.

The results for the first set of data are always two modes. That is because the variation among counts in the thousands or tens of thousands is so small that it is extremely unlikely these data came from a distribution with any other modes besides the obvious ones at $1$ and $8.$

```r
#
# Compute strict modes.
# Input consists of the counts in the data, in order, including any zeros.
#
n.modes <- function(x) {
  n <- length(x)+1
  i <- c(0, x) < c(x, 0)
  sum(i[-n] & !i[-1])
}
#
# Bootstrap the mode count in a dataset.
#
n.modes.boot <- function(x, n.boot=1e3) 
    tabulate(apply(rmultinom(n.boot, sum(x), x), 2, n.modes), ceiling(length(x)/2+1))
#
# Plot the bootstrap results.
#
library(ggplot2)
n.modes.plot <- function(f) {
  X <- data.frame(Frequency=f / sum(f))
  X$Count <- factor(1:nrow(X))
  X <- subset(X, Frequency > 0)
  ggplot(X, aes(Count, Frequency, fill=Count)) + geom_col(show.legend=FALSE)
}
#
# Show some examples.
#
x <- c(70, 30,20,40,60,70,110,170,180,165)
f <- n.modes.boot(x)
print(n.modes.plot(f))
```
