I would suggest you look at books on categorical data analysis (cf. Alan Agresti's Categorical Data Analysis, 2002) for better explanation and understanding of ordered logistic regression. All the questions that you ask are basically answered by a few chapters in such books. If you are only interested in R
related examples, Extending Linear Models in R by Julian Faraway (CRC Press, 2008) is a great reference.
Before I answer your questions, ordered logistic regression is a case of multinomial logit models in which the categories are ordered. Suppose we have $J$ ordered categories and that for individual $i$, with ordinal response $Y_i$,
$p_{ij}=P(Yi=j)$ for $j=1,..., J$. With an ordered response, it is often easier to work with the cumulative probabilities, $\gamma_{ij}=P(Y_i \le j)$. The cumulative probabilities are increasing and invariant to combining adjacent categories. Furthermore, $\gamma_{iJ}=1$, so we need only model $J–1$ probabilities.
Now we want to link $\gamma_{ij}$s to covariates $x$. In your case, Sat
has 3 ordered levels: low
, medium
, high
. It makes more sense to treat them as ordered rather than unordered. The remaining variables are your covariates. The specific model that you are considering is the proportional odds model and is mathematically equivalent to:
$$\mbox{logit } \gamma_j(x_i) = \theta_j - \beta^T x_i, j = 1 \ldots J-1$$
$$\mbox{where }\gamma_j(x_i)=P(Y_i \le j | x_i)$$
It is so called because the relative odds for $Y \le j$ comparing $x_1$ and $x_2$ are:
$$\left(\frac {\gamma_j(x_1)}{1-\gamma_j(x_1)}\right) / \left(\frac {\gamma_j(x_2)}{1-\gamma_j(x_2)}\right)=\exp(-\beta^T (x_1-x_2))$$
Notice, the above expression does not depend on $j$. Of course, the assumption of proportional odds does need to be checked for a given dataset.
Now, I will answer some (1, 2, 4) questions.
How can one understand if the model
gave a good fit? summary(house.plr)
shows Residual Deviance 3479.149 and
AIC (Akaike Information Criterion?) of
3495.149. Is that good? In the case those are only useful as relative
measures (i.e. to compare to another
model fit), what is a good absolute
measure? Is the residual deviance
approximately chi-squared distributed?
Can one use "% correctly predicted" on
the original data or some
cross-validation? What is the easiest
way to do that?
A model fit by polr
is a special glm
, so all the assumptions that hold for a traditional glm
hold here. If you take care of the parameters properly, you can figure out the distribution. Specifically, to test the if the model is good or not you may want to do a goodness of fit test, which test the following null (notice this is subtle, mostly you want to reject the null, but here you don't want to reject it to get a good fit):
$$H_o: \mbox{ current model is good enough }$$
You would use the chi-square test for this. The p-value is obtained as:
1-pchisq(deviance(house.plr),df.residual(house.plr))
Most of the time you'd hope to obtain a p-value greater than 0.05 so that you don't reject the null to conclude that the model is good fit (philosophical correctness is ignored here).
AIC should be high for a good fit at the same time you don't want to have a large number of parameters. stepAIC
is a good way to check this.
Yes, you can definitely use cross validation to see if the predictions hold. See predict
function (option: type = "probs"
) in ?polr
. All you need to take care of is the covariates.
What information does pr contain? The
help page on profile is generic, and
gives no guidance for polr
As pointed by @chl and others, pr
contains all the information needed for obtaining CIs and other likelihood related information of the polr fit
. All glm
s are fit using iteratively weighted least square estimation method for the log likelihood. In this optimization you obtain a lot of information (please see the references) which will be needed for calculating Variance Covariance Matrix, CI, t-value etc. It includes all of it.
How does one interpret the t values for each coefficient? Unlike some model >fits, there are no P values here.
Unlike normal linear model (special glm
) other glm
s are don't have the nice t-distribution for the regression coefficients. Therefore all you can get is the parameter estimates and their asymptotic variance covariance matrix using the max-likelihood theory. Therefore:
$$\text{Variance}(\hat \beta) = (X^T W X)^{-1}\hat \phi$$
Estimate divided by its standard error is what BDR and WV call t-value (I am assuming MASS
convention here). It is equivalent to t-value from normal linear regression but does not follow a t-distribution. Using CLT, it is asymptotically normally distributed. But they prefer not to use this approx (I guess), hence no p-values. (I hope I am not wrong, and if I am, I hope BDR is not on this forum. I further hope, someone will correct me if I am wrong.)
There are many thousands of tests one can apply to inspect a logistic regression model, and much of this depends on whether one's goal is prediction, classification, variable selection, inference, causal modeling, etc. The Hosmer-Lemeshow test, for instance, assesses model calibration and whether predicted values tend to match the predicted frequency when split by risk deciles. Although, the choice of 10 is arbitrary, the test has asymptotic results and can be easily modified. The HL test, as well as AUC, have (in my opinion) very uninteresting results when calculated on the same data that was used to estimate the logistic regression model. It's a wonder programs like SAS and SPSS made the frequent reporting of statistics for wildly different analyses the de facto way of presenting logistic regression results. Tests of predictive accuracy (e.g. HL and AUC) are better employed with independent data sets, or (even better) data collected over different periods in time to assess a model's predictive ability.
Another point to make is that prediction and inference are very different things. There is no objective way to evaluate prediction, an AUC of 0.65 is very good for predicting very rare and complex events like 1 year breast cancer risk. Similarly, inference can be accused of being arbitrary because the traditional false positive rate of 0.05 is just commonly thrown around.
If I were you, your problem description seemed to be interested in modeling the effects of the manager reported "obstacles" in investing, so focus on presenting the model adjusted associations. Present the point estimates and 95% confidence intervals for the model odds ratios and be prepared to discuss their meaning, interpretation, and validity with others. A forest plot is an effective graphical tool. You must show the frequency of these obstacles in the data, as well, and present their mediation by other adjustment variables to demonstrate whether the possibility of confounding was small or large in unadjusted results. I would go further still and explore factors like the Cronbach's alpha for consistency among manager reported obstacles to determine if managers tended to report similar problems, or, whether groups of people tended to identify specific problems.
I think you're a bit too focused on the numbers and not the question at hand. 90% of a good statistics presentation takes place before model results are ever presented.
Best Answer
Look at function
axis
help page. Something like the following should work: