Solved – What type of regression should be used in predicting Click Through Rate

interpretation, modeling, r-squared, regression

I'm looking for a model to predict CTR (click-through rate).

I have the following data:
For each ad I know the number of impressions, clicks and some other attributes (which are mainly dummy variables).

The CTR per ad is calculated as follows: #clicks / #impressions.

I have two questions regarding predicting CTR:

  1. I am wondering which model should be used to predict the CTR. I tried a linear regression, but the R-squared is very low (around 10%-15%). A logistic regression is not an option as my dependent variable is not a 0/1 variable.

  2. When I run a linear regression with clicks as dependent variable and impressions, etc. as explanatory variables, my R-squared suddenly is around 85-95%. How is it possible that this differs so much from taking CTR as dependent variable?

EDIT:
I followed the approach from kjetil, which works perfectly.

Best Answer

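You should try logistic regression. Let $x=\text{number of clicks}$ and $n=\text{number of impressions}$. Then $\text{CTR}=x/n$, and by modeling that proportion directly you lose information: a CTR of 0.1 from 10 impressions is treated the same as a CTR of 0.1 from 10,000 impressions, although the second is far more precise. Logistic regression does not require a 0/1 response; it can be fitted to counts of successes (clicks) and failures (impressions without a click). A logistic regression (possibly quasibinomial) gets at least the variance structure correct. In R you could do something like: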

mod <- glm(cbind(x, n - x) ~ size + etc..., data = your_data_frame, family = quasibinomial())
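As a minimal sketch of that call on made-up data (the data frame `ads`, the dummy `size_large`, and the coefficients used to simulate the clicks are all assumptions for illustration, not taken from the question):

```r
# Simulated data: every name and number here is hypothetical.
set.seed(1)
n_ads <- 500
ads <- data.frame(
  n          = rpois(n_ads, lambda = 2000) + 1,      # impressions per ad (at least 1)
  size_large = rbinom(n_ads, size = 1, prob = 0.5)   # a dummy attribute
)
true_p <- plogis(-4 + 0.5 * ads$size_large)          # assumed true click probability
ads$x  <- rbinom(n_ads, size = ads$n, prob = true_p) # clicks per ad

# The response is a two-column matrix: (clicks, impressions without a click).
mod <- glm(cbind(x, n - x) ~ size_large,
           data = ads, family = quasibinomial())

summary(mod)    # coefficients are on the log-odds scale
exp(coef(mod))  # the same coefficients as odds ratios
```

Because the response carries both clicks and non-clicks, ads with many impressions automatically get more weight than ads with few.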

A similar post, with an answer and an example, is "Count explanatory variable, proportion dependent variable".
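To spell out what "variance structure" means here (a sketch using $x$ and $n$ as defined above; the symbols $p$ for an ad's click probability, $\phi$ for the dispersion parameter and $\beta_0, \beta_1$ for the coefficients are introduced only for this illustration): if $x \sim \text{Binomial}(n, p)$, then

$$\operatorname{E}\left[\frac{x}{n}\right] = p, \qquad \operatorname{Var}\left[\frac{x}{n}\right] = \frac{p(1-p)}{n},$$

so the observed CTR of an ad with few impressions is much noisier than that of an ad with many, whereas ordinary linear regression on CTR assumes the same variance for every ad. The quasibinomial family keeps the logit link and the same mean-variance shape, but allows extra spread through $\phi$:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 \cdot \text{size} + \dots, \qquad \operatorname{Var}\left[\frac{x}{n}\right] = \phi\,\frac{p(1-p)}{n}.$$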

EDIT

Good that you have tried some of my suggestions. Here are some answers to your further edits:

  1. Look at the line of the output that begins "Dispersion parameter for quasibinomial family": does there seem to be a substantial reduction?

  2. Look at the "Deviance Residuals:" section of the output. Maybe most of the difference is in the extremes, but you could extract all of the residuals with resid(your_glm_object, type="deviance") and then plot the residuals of the two models against each other, or compare their histograms.

  3. There is a version of AIC for quasi-models called QAIC (and similarly for BIC, I suppose). Ben Bolker has written a short paper about using it in R; it also lists the R packages that implement QAIC.

  4. Getting predictions from these models: use something like predict(your_glm_object, type="response", newdata=your_data_frame_with_new_data). For details see ?predict.glm.
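The four points above can be sketched together on simulated data. Everything in this sketch is an assumption made for illustration: the data frame `ads`, the dummies `size_large` and `video`, the two models being compared, and the hand-rolled `qaic()` helper. The helper uses one common form of the definition (binomial log-likelihood divided by a shared dispersion estimate taken from the larger model, plus twice the parameter count), not any particular package's implementation.

```r
# Simulated, overdispersed click data: all names and numbers are hypothetical.
set.seed(2)
n_ads <- 500
ads <- data.frame(
  n          = rpois(n_ads, lambda = 2000) + 1,     # impressions per ad
  size_large = rbinom(n_ads, size = 1, prob = 0.5), # dummy attribute
  video      = rbinom(n_ads, size = 1, prob = 0.3)  # dummy attribute
)
# Per-ad noise on the log-odds scale creates extra-binomial variation.
eta   <- -4 + 0.5 * ads$size_large + 0.8 * ads$video + rnorm(n_ads, sd = 0.3)
ads$x <- rbinom(n_ads, size = ads$n, prob = plogis(eta))

small <- glm(cbind(x, n - x) ~ size_large,
             data = ads, family = quasibinomial())
big   <- glm(cbind(x, n - x) ~ size_large + video,
             data = ads, family = quasibinomial())

# 1. Dispersion parameter of each model (the value printed by summary()).
disp_small <- summary(small)$dispersion
disp_big   <- summary(big)$dispersion

# 2. Deviance residuals of both models, compared directly.
res_small <- resid(small, type = "deviance")
res_big   <- resid(big,   type = "deviance")
if (interactive()) {
  plot(res_small, res_big); abline(0, 1)  # points off the line differ between models
  hist(res_big)
}

# 3. QAIC by hand: quasi-families have no likelihood, so refit as binomial
#    and scale the log-likelihood by a common dispersion estimate c_hat.
qaic <- function(quasi_fit, c_hat) {
  binom_fit <- glm(formula(quasi_fit), data = quasi_fit$data, family = binomial())
  k <- length(coef(quasi_fit)) + 1        # +1 for the estimated dispersion
  as.numeric(-2 * logLik(binom_fit) / c_hat + 2 * k)
}
qaic_small <- qaic(small, c_hat = disp_big)
qaic_big   <- qaic(big,   c_hat = disp_big)

# 4. Predicted CTR (a probability) for new ads.
new_ads  <- data.frame(size_large = c(0, 1), video = c(1, 1))
pred_ctr <- predict(big, newdata = new_ads, type = "response")
```

The lower QAIC indicates the preferred model, and `pred_ctr` is on the CTR scale because of `type = "response"`; without it, `predict()` returns log-odds.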