I'm looking for a model to predict CTR (click-through rate).
I have the following data:
For each ad I know the number of impressions, clicks and some other attributes (which are mainly dummy variables).
The CTR per ad is calculated as follows: #clicks / #impressions.
I have two questions regarding predicting CTR:
- I am wondering which model should be used to predict the CTR. I tried a linear regression, but the R-squared is very low (around 10%-15%). A logistic regression is not an option as my dependent variable is not a 0/1 variable.
- When I run a linear regression with clicks as dependent variable and impressions, etc. as explanatory variables, my R-squared suddenly is around 85-95%. How is it possible that this differs so much from taking CTR as dependent variable?
EDIT:
I followed the approach from kjetil, which works perfectly.
Best Answer
You should try logistic regression. Let $x=\text{number of clicks}$ and $n=\text{number of impressions}$. Then $\text{CTR}=x/n$, and by modeling that proportion directly you lose information: the proportion alone no longer says how many impressions it is based on. A logistic regression (possibly quasibinomial) gets at least the variance structure correct. In R, such a model can be fitted with glm() and a binomial or quasibinomial family.
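A minimal sketch of such a fit, assuming a data frame with one row per ad. The data are simulated and the names `ads`, `impressions`, `clicks`, `has_image` and `is_mobile` are hypothetical stand-ins for the real columns; only the shape of the `glm()` call matters here.

```r
# Simulated stand-in for the ad data; all column names are hypothetical.
set.seed(1)
n_ads <- 200
ads <- data.frame(
  impressions = sample(500:5000, n_ads, replace = TRUE),
  has_image   = rbinom(n_ads, 1, 0.5),  # example dummy attribute
  is_mobile   = rbinom(n_ads, 1, 0.3)   # example dummy attribute
)
true_ctr   <- plogis(-4 + 0.5 * ads$has_image - 0.3 * ads$is_mobile)
ads$clicks <- rbinom(n_ads, size = ads$impressions, prob = true_ctr)

# The response is a two-column matrix:
# successes (clicks) and failures (impressions without a click).
fit <- glm(cbind(clicks, impressions - clicks) ~ has_image + is_mobile,
           family = quasibinomial, data = ads)

summary(fit)  # coefficients are on the log-odds scale
```

Written this way, each ad contributes according to its number of impressions, so an ad with 5000 impressions carries more weight than one with 500. That information is lost when CTR itself is used as the response in a linear regression.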
A similar post, with an answer and an example, is "Count explanatory variable, proportion dependent variable".
Good that you have tried some of my suggestions. Here are some answers to your further edits:
- Look at this part of the output: *Dispersion parameter for quasibinomial family*. Does there seem to be a substantial reduction?
- Look at the *Deviance Residuals:* part of the output. Maybe most of the difference is in the extremes, but you could extract all of the residuals with
  resid(your_glm_object, type="deviance")
  and then plot the residuals of the models against each other, or compare their histograms.
- There is a version of AIC for quasi-models called QAIC (and similarly for BIC, I suppose). Ben Bolker has written a short paper about this in R, which also lists the R packages that implement QAIC.
- To get predictions from these models, use something like
  predict(your_glm_object, type="response", newdata=your_data_frame_with_new_data)
  For details see ?predict.glm.
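The checks above can be sketched together as follows. This is only an illustration on simulated data: the data frame `ads`, its columns, and the two models `fit_small` and `fit_large` are hypothetical and stand in for whichever fits are being compared.

```r
# Simulated stand-in data; all names are hypothetical.
set.seed(2)
n_ads <- 200
ads <- data.frame(
  impressions = sample(500:5000, n_ads, replace = TRUE),
  has_image   = rbinom(n_ads, 1, 0.5),
  is_mobile   = rbinom(n_ads, 1, 0.3)
)
ads$clicks <- rbinom(n_ads, ads$impressions,
                     plogis(-4 + 0.5 * ads$has_image - 0.3 * ads$is_mobile))

# Two competing quasibinomial fits: one without and one with is_mobile.
fit_small <- glm(cbind(clicks, impressions - clicks) ~ has_image,
                 family = quasibinomial, data = ads)
fit_large <- update(fit_small, . ~ . + is_mobile)

# Estimated dispersion parameter of each fit (also printed by summary()).
disp <- c(small = summary(fit_small)$dispersion,
          large = summary(fit_large)$dispersion)
print(disp)

# Deviance residuals of both fits, plotted against each other.
res_small <- resid(fit_small, type = "deviance")
res_large <- resid(fit_large, type = "deviance")
plot(res_small, res_large,
     xlab = "Deviance residuals, smaller model",
     ylab = "Deviance residuals, larger model")
abline(0, 1, lty = 2)  # points on this line have equal residuals in both fits

# Predicted CTR for new ads; type = "response" returns probabilities.
new_ads  <- data.frame(has_image = c(0, 1), is_mobile = c(1, 0))
pred_ctr <- predict(fit_large, type = "response", newdata = new_ads)
print(pred_ctr)
```

The new data only need the explanatory variables; clicks and impressions are not required for prediction. Multiplying a predicted CTR by a planned number of impressions gives an expected number of clicks.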