It is not wise to transform the variables individually, because they belong together (as you noticed), nor to use k-means, because the data are counts (you could, but k-means is better suited to continuous attributes such as length, for example).
In your place, I would compute the chi-square distance (well suited to counts) between every pair of customers, based on the count variables. Then do hierarchical clustering (for example, with the average-linkage or complete-linkage method, which do not compute centroids and therefore do not require Euclidean distance) or some other clustering method that works with an arbitrary distance matrix.
Copying example data from the question:
customer | count_red | count_blue | count_green
---------|-----------|------------|------------
c0       |        12 |          5 |           0
c1       |         3 |          4 |           0
c2       |         2 |         21 |           0
c3       |         4 |          8 |           1
Consider the pair c0 and c1 and compute the chi-square statistic for their 2x3 frequency table. Take its square root (just as you do when computing the usual Euclidean distance). That is your distance. The closer the distance is to 0, the more similar the two customers are.
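As a sketch (my own illustrative code, not part of the original answer; the function name `chi_sq_distance` is just an assumption), this pairwise distance can be computed with NumPy:

```python
import numpy as np

def chi_sq_distance(row_a, row_b):
    """sqrt of the chi-square statistic of the 2 x p frequency table
    formed by two customers' count rows."""
    table = np.array([row_a, row_b], dtype=float)
    table = table[:, table.sum(axis=0) > 0]        # drop all-zero columns
    n = table.sum()
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / n)
    chi_sq = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi_sq)

c0 = [12, 5, 0]
c1 = [3, 4, 0]
print(round(chi_sq_distance(c0, c1), 3))  # 1.275, matching the matrix below
```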
It may bother you that the row sums in your table differ, which affects the chi-square distance when you compare c0 with c1 vs. c0 with c2. In that case, compute the (square root of the) phi-square distance: Phi-sq = Chi-sq/N, where N is the combined total count of the two rows (customers) currently being compared. It is thus a distance normalized with respect to the overall counts.
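A minimal NumPy sketch of this normalization (again my own illustrative code; `phi_sq_distance` is a hypothetical name):

```python
import numpy as np

def phi_sq_distance(row_a, row_b):
    """sqrt(Chi-sq / N): chi-square distance normalized by the combined
    total count N of the two rows being compared."""
    table = np.array([row_a, row_b], dtype=float)
    table = table[:, table.sum(axis=0) > 0]        # drop all-zero columns
    n = table.sum()
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / n)
    chi_sq = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi_sq / n)

print(round(phi_sq_distance([12, 5, 0], [3, 4, 0]), 3))  # 0.260
```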
Here is the matrix of sqrt(Chi-sq) distances between your four customers:
.000 1.275 4.057 2.292
1.275 .000 2.124 .862
4.057 2.124 .000 2.261
2.292 .862 2.261 .000
And here is the matrix of sqrt(Phi-sq) distances:
.000 .260 .641 .418
.260 .000 .388 .193
.641 .388 .000 .377
.418 .193 .377 .000
So, the distance between any two rows of the data is the (square root of the) chi-square or phi-square statistic of their 2 x p frequency table (p is the number of columns in the data). If any column in the current 2 x p table is entirely zero, drop that column and compute the distance from the remaining nonzero columns (this is fine, and it is what SPSS does, for example, when computing the distance). The chi-square distance is actually a weighted Euclidean distance.
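The full distance matrix can be reproduced with a short NumPy sketch (my own code, not from the answer; it drops all-zero columns as just described and should reproduce the sqrt(Chi-sq) matrix above):

```python
import numpy as np

# the four customers' count rows from the example table
data = np.array([[12, 5, 0],
                 [3, 4, 0],
                 [2, 21, 0],
                 [4, 8, 1]], dtype=float)

def chi_sq(a, b):
    """Chi-square statistic of the 2 x p table formed by rows a and b."""
    t = np.array([a, b])
    t = t[:, t.sum(axis=0) > 0]    # drop columns that are zero in both rows
    n = t.sum()
    e = t.sum(axis=1, keepdims=True) * t.sum(axis=0, keepdims=True) / n
    return ((t - e) ** 2 / e).sum()

m = len(data)
dist = np.array([[np.sqrt(chi_sq(data[i], data[j])) for j in range(m)]
                 for i in range(m)])
print(np.round(dist, 3))
```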
My experience with the mlogit package is still rather limited, but if I read Croissant's vignette correctly (see the beginning of Sec. 1.2, Model description, page 7), the alt variable in your model is specified as alternative-specific with a generic coefficient, and NOT as an individual-specific covariate; those variables are placed between the pipes.
I've had luck with setting epsilon to half the smallest non-zero value, replacing all 0 values with epsilon and all 1 values with 1 - epsilon, and then applying the logit transformation.
This method keeps the original form of the logit transformation, but allows 0 and 1 to be transformed to values that match the overall shape of the intended transformation (note the black dots in the figure at raw = 0 and 1). In particular, it preserves the property that 0.5 is transformed to 0 and that the remaining values are symmetric.
On the other hand, adding the smallest non-zero value to every observation, as described in the paper, changes the shape of the curve and destroys the symmetry.
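A minimal NumPy sketch of this epsilon recipe (my own illustrative code; `safe_logit` is a hypothetical name, and the input is assumed to contain at least one non-zero value):

```python
import numpy as np

def safe_logit(p):
    """Logit with 0 clamped to epsilon and 1 to 1 - epsilon, where
    epsilon is half the smallest non-zero value; preserves the
    symmetry of the transformation around 0.5."""
    p = np.asarray(p, dtype=float)
    eps = p[p > 0].min() / 2.0            # half the smallest non-zero value
    p = np.where(p == 0, eps, p)
    p = np.where(p == 1, 1 - eps, p)
    return np.log(p / (1 - p))

raw = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(safe_logit(raw))   # symmetric about 0; 0.5 maps to exactly 0
```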