Tukey’s Hinges: Grouping Data

descriptive-statistics, distributions, exploratory-data-analysis

I am having trouble figuring out how to group data using the quartiles from a box-and-whiskers plot, or alternatively using Tukey's hinges. I understand that the quartiles and Tukey's hinges are not the same thing, and that there are different interpretations of Tukey's hinges. I ran a calculation in SPSS, and the output gave me the weighted quartiles and, below that, Tukey's hinges. My question is whether you can use the hinges or the weighted quartile values to group the data. For example, let's say this is your set of data (this is not the data set I'm working with, just a simple example to illustrate my question):

1
2
2.5
2.5
2.5
3
3
4
5
5
6
7
7.5
7.5
8
9
9
10

Let's then say your quartiles and Tukey's hinges are 25% = 2.5, 50% = 5, 75% = 7.5 (these are made-up values and may not be the actual results for this data; I am using them only to explain my question).

Now let's say you want to divide the data into 4 groups using the quartiles and/or Tukey's hinges. Which group do the values that sit exactly on a hinge or "division point" between the groups go in? When I was reading up on Tukey's hinges, Tukey stated that the hinge is a point of division in the data, but that is still vague when trying to group data. So, would the groups be like this:

Group A
1
2
2.5
2.5
2.5

Group B
3
3
4
5
5

Group C
6
7
7.5
7.5

Group D
8
9
9
10

Or can you exclude the "hinge" values and group the data like this?

Group A
1
2

Group B
3
3
4

Group C
6
7

Group D
8
9
9
10
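The two candidate groupings above can be reproduced with a short sketch. This is only an illustration, not SPSS behaviour: `group_by_cuts` and its `drop_ties` flag are hypothetical names, and the cut points are the made-up values 2.5, 5 and 7.5 from the example. With `drop_ties=False` a value equal to a cut point goes into the lower group (right-closed intervals); with `drop_ties=True` such values are left out.

```python
from bisect import bisect_left

# Example data and the made-up cut points from the question.
data = [1, 2, 2.5, 2.5, 2.5, 3, 3, 4, 5, 5, 6, 7, 7.5, 7.5, 8, 9, 9, 10]
cuts = [2.5, 5, 7.5]  # must be sorted in ascending order


def group_by_cuts(values, cuts, drop_ties=False):
    """Split values into len(cuts) + 1 groups (hypothetical helper).

    A value equal to a cut point is placed in the lower group,
    or skipped entirely when drop_ties is True.
    """
    groups = [[] for _ in range(len(cuts) + 1)]
    for x in sorted(values):
        if drop_ties and x in cuts:
            continue  # exclude values that sit exactly on a division point
        # bisect_left returns the index of the first cut >= x,
        # so x == cut lands in the group below that cut.
        groups[bisect_left(cuts, x)].append(x)
    return groups


keep_ties = group_by_cuts(data, cuts)                   # first grouping
without_ties = group_by_cuts(data, cuts, drop_ties=True)  # second grouping
```

Whichever rule is used, the point is that it has to be stated explicitly, because the cut values alone do not say where ties belong.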

I am having a tough time finding good explanations and research papers I could use to back up why I choose one methodology over another. This project is for an internship, none of the data I've written out in this e-mail is being used in my analysis. I am simply confused about dividing groups based on Tukey's Hinges. I would appreciate your thoughts. Thanks!

Best Answer

You are going to have problems dividing a set of data into four equal parts if the number of data points is not a multiple of $4$.

One approach might be to duplicate some of the data points, depending on how many you have:

  • $4n$ points: just divide into four sets of $n$ points by rank.
  • $4n-1$ points: duplicate the median, including it in both the second and third sets.
  • $4n-2$ points: duplicate the first and third quartile points, including the first quartile point in both the first and second sets and the third quartile point in both the third and fourth sets.
  • $4n-3$ points: duplicate the median and the first and third quartile points, including each of them in the two sets it separates.

In every case you again have four sets of $n$ points by rank. There are other approaches.

In your example with $18$ data points (the $4n-2$ case with $n = 5$), that would give four equally sized subsets of

  • 1st group: $1, 2, 2.5, 2.5, 2.5$
  • 2nd group: $2.5, 3, 3, 4, 5$
  • 3rd group: $5, 6, 7, 7.5, 7.5$
  • 4th group: $7.5, 8, 9, 9, 10$
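The duplication rule described above can be sketched in a few lines of Python. This is a minimal illustration under my own naming (`four_equal_groups` is a hypothetical helper, not a library function): it works out how many points are missing from a multiple of four and lets neighbouring groups share a boundary point accordingly.

```python
def four_equal_groups(values):
    """Split values into four rank-ordered groups of equal size,
    duplicating boundary points when len(values) is not a multiple of 4.
    """
    xs = sorted(values)
    m = len(xs)
    n = -(-m // 4)           # ceil(m / 4): size of every group
    shortfall = 4 * n - m    # how many points must be duplicated (0 to 3)

    # Which boundaries share a point: (Q1 point, median, Q3 point).
    shared = {
        0: (False, False, False),  # 4n points: nothing duplicated
        1: (False, True, False),   # 4n-1 points: duplicate the median
        2: (True, False, True),    # 4n-2 points: duplicate Q1 and Q3 points
        3: (True, True, True),     # 4n-3 points: duplicate all three
    }[shortfall]

    groups, start = [], 0
    for k in range(4):
        groups.append(xs[start:start + n])
        start += n
        if k < 3 and shared[k]:
            start -= 1  # next group begins with the point just used
    return groups


data = [1, 2, 2.5, 2.5, 2.5, 3, 3, 4, 5, 5, 6, 7, 7.5, 7.5, 8, 9, 9, 10]
groups = four_equal_groups(data)
```

For the 18 example values this reproduces the four groups listed above: the 5th-ranked point (2.5) and the 14th-ranked point (7.5) each appear in two groups, while the two 5s in the second and third groups are different data points.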

Quantiles (note the change from r to n: quartiles are just one special case) are difficult to define easily. Wikipedia gives 10 estimate types, while Eric Langford gives 15 methods in the Journal of Statistics Education.
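To see how much the choice of definition matters, here is a small sketch comparing three of them on the example data. It uses Python's standard `statistics.quantiles` with its two built-in methods (`'exclusive'` and `'inclusive'`), plus a hypothetical helper `tukey_hinges` that assumes the usual textbook definition of hinges: the medians of the lower and upper halves, with the overall median included in both halves when the number of points is odd. None of these is claimed to match what SPSS calls "weighted" quartiles.

```python
from statistics import median, quantiles

data = [1, 2, 2.5, 2.5, 2.5, 3, 3, 4, 5, 5, 6, 7, 7.5, 7.5, 8, 9, 9, 10]


def tukey_hinges(values):
    """Lower hinge, median, upper hinge (assumed textbook definition)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2  # half size; the median is shared if len is odd
    return median(xs[:half]), median(xs), median(xs[-half:])


hinges = tukey_hinges(data)
exclusive = quantiles(data, n=4, method="exclusive")  # library default
inclusive = quantiles(data, n=4, method="inclusive")

print("Tukey hinges:       ", hinges)
print("quantiles exclusive:", exclusive)
print("quantiles inclusive:", inclusive)
```

For these 18 values the hinges come out as 2.5, 5 and 7.5, while the two `statistics.quantiles` methods agree on the median but each differs from the hinges at one end (the exclusive method at the third quartile, the inclusive method at the first). Any grouping built on such cut points therefore depends on which definition produced them.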