[Math] Relationship between subset medians and the median

data analysismedianstatistics

Suppose we have a set of data $A = \{a_1, a_2, \dots, a_n\}$ and $B = \{b_1, b_2, \dots, b_n \}$. So there are $2n$ elements in total. Further suppose the median of $A$ and $B$ is $a$ and $b$, respectively. I'm wondering if whether there is any relationship between the median $m$ of $A \cup B$ and $a$, $b$?

Now suppose we divide the data sets into $A_L$, $A_R$, $B_L$, and $B_R$ so that $A_L$ contains all the elements of $A$ that are less than or equal to the median $a$, $A_R$ contains all the elements that are greater than or equal to $a$, and with a similar definition for $B$.

Can we narrow down the search for the actual median by knowing $a$ and $b$? For example, can we decide if $m$ lies in any one or more of $A_L$, $A_R$, $B_L$, or $B_R$ (but not all)?

I'm not sure if this question makes sense since I thought of it randomly but I am extremely curious to know if there is a real answer or to understand why there isn't.

Best Answer

Yes, the median $m$ of $A\cup B$ (taken as a disjoint union of multisets) always lies in the interval between $a$ and $b$. Suppose without loss of generality that $a \le b$. Switching the roles of $A$ and $B$ will yield the same result in the other case of $b < a$.

Every element of $B_L$ is $\le b$, and every element of $A_L$ is $\le a \le b$. The cardinality of each of these (multi)sets is at least $(n+1)/2$, so there are at least $n+1$ elements in $A\cup B$ that are $\le b$. By definition, $m \le b$. Likewise every element in $A_R$ and $B_R$ is $\ge a$, so at least $n+1$ elements are $\ge a$ and thus $m \ge a$. So $m \in [a,b]$.

This allows us to consider $A_R$ and $B_L$ only in searching for the median. It's possible that the median is contained in $A_L$, but only if it's exactly equal to $a$ in which case it's already in $A_R$. Likewise it could be in $B_R$ but it still suffices to check in $B_L$.

A very similar observation is one of the keys to the famous deterministic linear-time algorithm for median finding, but there you split the set into many more parts ($n/5$ I believe) and find the median of their respective medians.

Related Question