Solved – Linear regression on grouped data

categorical data, correlation, linear model

I have a sample of 200,000 projects. Each project is described by its size ($S$) and by whether it has active users ($U$). $S$ is always greater than zero, whereas $U$ is a binary variable (1 if the project has active users, 0 otherwise).

I would like to check whether $S$ and $U$ are related to each other.

Since I was not able to find a direct relation between $S$ and $U$, I sorted the projects by size in ascending order, split them into 10 groups of 20,000 projects each, and counted the number of projects with active users in each group.
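For concreteness, the grouping step described above could be sketched in Python roughly as follows. The function name `count_active_by_group` and the toy data are hypothetical, used only for illustration; they are not the real 200,000-project sample.

```python
def count_active_by_group(sizes, active, n_groups=10):
    """Sort projects by size, split them into n_groups consecutive
    groups of (nearly) equal length, and count the projects with
    active users (U = 1) in each group."""
    # Indices of the projects, ordered by ascending size.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    n = len(order)
    counts = []
    for g in range(n_groups):
        start = g * n // n_groups
        stop = (g + 1) * n // n_groups
        counts.append(sum(active[i] for i in order[start:stop]))
    return counts


# Toy data (hypothetical): six projects, three groups of two.
toy_sizes = [5, 1, 4, 2, 3, 6]
toy_active = [1, 0, 1, 0, 0, 1]
toy_counts = count_active_by_group(toy_sizes, toy_active, n_groups=3)
print(toy_counts)
```

With 10 groups, this yields only 10 data points for any subsequent regression or correlation, however many projects went in.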

The number of projects with active users seems to increase from one group to the next, so I would like to know how to proceed. Does it make sense to run a linear regression between the total project size per group and the number of projects with active users per group? Or should I use a correlation test?


Best Answer

(I can't comment yet.) I'm assuming you have already tried a t-test or a non-parametric equivalent comparing $S$ between the two values of $U$, or that those tests don't answer your question. In addition to your linear regression approach, you might also consider a logistic regression of $U$ on $S$ using the ungrouped data, which would avoid the possibly arbitrary grouping step.
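As a rough illustration of the logistic regression idea, here is a minimal sketch that fits $P(U = 1)$ as a function of size with a hand-written Newton-Raphson loop, using only the Python standard library. Several things here are assumptions and not part of the question: the data are simulated, $\log S$ is used as the predictor instead of $S$, and the "true" coefficients are arbitrary. In practice one would probably use a statistics package rather than this hand-rolled fit.

```python
import math
import random


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson.

    Returns the intercept b0, the slope b1, and the standard error of b1.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0    # Fisher information matrix entries
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: (information matrix)^-1 times the gradient.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Variance of b1 is the matching diagonal entry of the inverse information.
    se1 = math.sqrt(i00 / det)
    return b0, b1, se1


# Simulated data (assumption): log S ~ Normal(3, 1), so S = exp(log S) > 0.
random.seed(0)
TRUE_B0, TRUE_B1 = -3.0, 0.5     # arbitrary values chosen for the simulation
log_sizes = [random.gauss(3.0, 1.0) for _ in range(5000)]
active = [
    1 if random.random() < 1.0 / (1.0 + math.exp(-(TRUE_B0 + TRUE_B1 * ls))) else 0
    for ls in log_sizes
]

b0, b1, se1 = fit_logistic(log_sizes, active)
z = b1 / se1                     # Wald z-statistic for the slope
print(f"intercept={b0:.3f} slope={b1:.3f} se={se1:.3f} z={z:.2f}")
```

A slope whose z-statistic is far from zero would suggest that the probability of having active users changes with size, and the conclusion does not depend on where any group boundaries are drawn.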
