Solved – Determining an optimal discretization of data from a continuous distribution


Suppose you have a data set $Y_{1}, \ldots, Y_{n}$ from a continuous distribution with unknown density $p(y)$ supported on $[0,1]$, but $n$ is large enough that a kernel density estimate (for example), $\hat{p}(y)$, is quite accurate. For a particular application I need to transform the observed data into a finite number of categories, yielding a new data set $Z_{1}, \ldots, Z_{n}$ with an implied mass function $g(z)$.

A simple example would be $Z_{i} = 0$ when $Y_{i} \leq 1/2$ and $Z_{i} = 1$ when $Y_{i} > 1/2$. In this case the induced mass function would be

$$ \hat{g}(0) = \int_{0}^{1/2} \hat{p}(y) dy, \ \ \ \hat{g}(1) = \int_{1/2}^{1} \hat{p}(y)dy$$

The two "tuning parameters" here are the number of groups, $m$, and the $(m-1)$ length vector of thresholds $\lambda$. Denote the induced mass function by $\hat{g}_{m,\lambda}(y)$.

I'd like a procedure that answers, for example, "What is the best choice of $m, \lambda$ so that increasing the number of groups to $m+1$ (and choosing the optimal $\lambda$ there) would yield a negligible improvement?". I feel like perhaps a test statistic can be created (maybe with the difference in KL divergence or something similar) whose distribution can be derived. Any ideas or relevant literature?

Edit: I have evenly spaced temporal measurements of a continuous variable and am using an inhomogeneous Markov chain to model the temporal dependence. Frankly, discrete-state Markov chains are much easier to handle, and that is my motivation. The observed data are percentages. I'm currently using an ad hoc discretization that looks very good to me, but I think this is an interesting problem where a formal (and general) solution is possible.

Edit 2: Actually minimizing the KL divergence would be equivalent to not discretizing the data at all, so that idea is totally out. I've edited the body accordingly.

Best Answer

I'm going to share the solution I came up with for this problem a while back; it is not a formal statistical test, but it may provide a useful heuristic.


Consider the general case where you have continuous observations $Y_{1}, Y_{2}, ..., Y_{n}$; without loss of generality suppose the sample space of each observation is the interval $[0,1]$. A categorization scheme will depend on a number of categories, $m$, and the locations of the thresholds that divide the categories, $0 < \lambda_{1} < \lambda_{2} < \cdots < \lambda_{m-1} < 1$.

Denote the categorized version of $Y_{i}$ by $Z_{i}(m, {\boldsymbol \lambda})$, where ${\boldsymbol \lambda} = \{ \lambda_{1}, \lambda_{2}, \cdots, \lambda_{m-1} \}$. Thinking of the discretization as a partitioning of the original data into classes, the variance of $Y_{i}$ can be decomposed into between-group and within-group variation, for a fixed value of $m, {\boldsymbol \lambda}$:

\begin{equation} {\rm var}(Y_{i}) = {\rm var} \Big( E(Y_{i} | Z_{i}(m, {\boldsymbol \lambda})) \Big) + E \Big( {\rm var}(Y_{i} | Z_{i}(m, {\boldsymbol \lambda})) \Big). \end{equation}
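As a sanity check, both terms of this decomposition can be computed empirically for any fixed categorization. The following sketch (my own illustration, using a synthetic sample and arbitrary thresholds) verifies that they add up to the total sample variance:

```python
# Illustrative sketch: compute the empirical between-group term var(E[Y|Z]) and
# within-group term E[var(Y|Z)] for a fixed categorization, and check that they
# sum to the total variance of the sample.
import numpy as np

def variance_decomposition(y, thresholds):
    """Empirical between- and within-group variance for the given thresholds."""
    y = np.asarray(y)
    edges = np.concatenate(([0.0], np.asarray(thresholds, dtype=float), [1.0]))
    z = np.digitize(y, edges[1:-1])                      # category label Z_i
    groups = [y[z == k] for k in range(len(edges) - 1)]
    weights = np.array([g.size for g in groups]) / y.size
    means = np.array([g.mean() if g.size else 0.0 for g in groups])
    variances = np.array([g.var() if g.size else 0.0 for g in groups])
    between = np.sum(weights * (means - y.mean()) ** 2)  # var( E[Y | Z] )
    within = np.sum(weights * variances)                 # E( var[Y | Z] )
    return between, within

rng = np.random.default_rng(1)
y = rng.beta(2, 5, size=1000)                            # hypothetical sample on [0, 1]
between, within = variance_decomposition(y, [0.25, 0.5])
print(between + within, y.var())                         # the two should match
```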

A given categorization is successful at producing homogeneous groups if there is relatively little within-group variance, quantified by $E \Big( {\rm var}(Y_{i} | Z_{i}(m, {\boldsymbol \lambda})) \Big)$. Therefore, we seek a parsimonious grouping that attributes most of the variation in $Y_{i}$ to the ${\rm var} \Big( E(Y_{i} | Z_{i}(m, {\boldsymbol \lambda})) \Big)$ term. In particular, we want to choose $m$ so that adding further levels does not appreciably increase the within-group homogeneity. With this in mind, we define the optimal ${\boldsymbol \lambda}$ for a fixed value of $m$ to be

\begin{equation} {\boldsymbol \lambda}^{\star}_{m} = {\rm argmin}_{\boldsymbol \lambda} E \Big( {\rm var}(Y_{i} | Z_{i}(m, {\boldsymbol \lambda})) \Big) \end{equation}
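One simple way to approximate ${\boldsymbol \lambda}^{\star}_{m}$ in practice is a brute-force search over candidate cut points, minimizing the empirical within-group variance. The sketch below is my own illustration; the grid resolution and helper names are arbitrary choices, and the enumeration is only practical for small $m$:

```python
# Illustrative sketch of lambda*_m via grid search: enumerate sorted combinations
# of interior cut points and keep the one minimizing the empirical within-group
# variance E[ var(Y | Z) ].
import itertools
import numpy as np

def within_group_variance(y, thresholds):
    """Empirical E[ var(Y | Z) ] for the categorization induced by `thresholds`."""
    y = np.asarray(y)
    edges = np.concatenate(([0.0], np.asarray(thresholds, dtype=float), [1.0]))
    z = np.digitize(y, edges[1:-1])                      # category label Z_i
    total = 0.0
    for k in range(len(edges) - 1):
        group = y[z == k]
        if group.size:
            total += (group.size / y.size) * group.var()
    return total

def best_thresholds(y, m, grid_size=50):
    """Approximate lambda*_m = argmin_lambda E[ var(Y | Z(m, lambda)) ]."""
    grid = np.linspace(0.0, 1.0, grid_size + 2)[1:-1]    # interior candidate cuts
    best, best_val = None, np.inf
    for cand in itertools.combinations(grid, m - 1):
        val = within_group_variance(y, cand)
        if val < best_val:
            best_val, best = val, cand
    return np.array(best), best_val

rng = np.random.default_rng(2)
y = rng.beta(2, 5, size=1000)                            # hypothetical sample on [0, 1]
lam_star, wgv = best_thresholds(y, m=3)
print(lam_star, wgv)
```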

A rough diagnostic for determining which choice of $m$ is adequate is to look at the dropoff in $E \Big( {\rm var}(Y_{i} | Z_{i}(m, {\boldsymbol \lambda}^{\star}_{m} )) \Big)$ as a function of $m$: this trajectory is monotonically non-increasing, and once it stops decreasing sharply you can see that you are gaining relatively little precision by including more categories. This heuristic is similar in spirit to how a "scree plot" is sometimes used to judge how many principal components explain "enough" of the variation.
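Continuing the sketch above, the diagnostic can be drawn by plotting the minimized within-group variance against $m$ and looking for the elbow; this reuses the hypothetical `best_thresholds` helper and `y` from the previous block, with a coarse grid to keep the search fast:

```python
# Usage continuation of the sketch above (assumes `y` and `best_thresholds`
# from the previous block): plot the minimized within-group variance vs. m.
import matplotlib.pyplot as plt

ms = list(range(2, 7))
dropoff = [best_thresholds(y, m, grid_size=20)[1] for m in ms]

plt.plot(ms, dropoff, marker="o")
plt.xlabel("number of categories m")
plt.ylabel("min over lambda of E[ var(Y | Z) ]")
plt.title("Scree-style diagnostic for choosing m")
plt.show()
```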
