I was wondering whether ANOVA is based on a linearity assumption. If yes, how can I check the linearity between observations? I know that using a plot would be useful, but it is difficult to find the degree of a polynomial trend. Is there another way to check the linearity assumption?
Solved – The best way to check the linearity assumption in ANOVA (except using scatter plots)
linearity
Related Solutions
Frank Harrell's recommendations get to the heart of the matter in general, but there may be issues with your dataset that are also giving you problems.
It seems that there must be outliers of x1 and x2 that are beyond the limits of the values displayed in your plots. I see no other way for almost all the martingale residuals to be at or below 0 and for the lowess fits to rise at both ends in the ways that they do. Your first task is to figure out what's going on with that issue and clean up the data if necessary. Examine the actual distributions of both those predictor variables and the martingale residuals (available with the residuals() function applied to a coxph object), making sure that you can see all the data points. That might indicate some problems with the data set that, once fixed, could remove the nonlinearity issue.
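For example, here is a minimal R sketch of that first check; mydata, time, status, x1, and x2 are placeholder names standing in for your actual data frame and variables:

```r
library(survival)

## Hypothetical refit of a model like the one described in the question.
fit <- coxph(Surv(time, status) ~ x1 + x2, data = mydata)
res <- residuals(fit, type = "martingale")

## Look at the full distributions of the predictors and the residuals,
## so that extreme x1/x2 values beyond the plotted range are not missed.
summary(mydata$x1)
summary(mydata$x2)
range(res)
hist(mydata$x1); hist(mydata$x2); hist(res)
```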
If you still want to allow for nonlinearity with respect to those predictors, restricted cubic splines are almost certainly the best choice. Although the math might be complicated, just think of a restricted cubic regression spline as the simplest smooth curve that best fits the data subject to some simple constraints: the numbers and locations of the knots on the x-axis, and linear rather than polynomial relations at the ends of the data range to prevent awkward behavior. Chapter 2 of Harrell's course notes discusses how to choose numbers and positions of knots. With 30,000 data points you can use many knots, but 5 are often quite adequate. Specifying the number of knots to rcs() and using the default positions typically works well.
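For instance, a sketch of such a fit (same placeholder names as above; the choice of 5 knots is just illustrative, and the rms setup with datadist() is shown further below):

```r
library(rms)

## 5-knot restricted cubic splines at the default quantile-based knot
## locations; only the number of knots is specified.
fit_rcs <- cph(Surv(time, status) ~ rcs(x1, 5) + rcs(x2, 5),
               data = mydata, x = TRUE, y = TRUE)
```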
If you are not directly interested in the relations of x1 or x2 to outcome and are simply controlling for them to analyze your predictor of main interest, then you don't have to worry much about the initially confusing display of p-values and hazard ratios for each of the terms associated with the splines. If you do care about their relations to outcome, take advantage of the facilities provided by anova() in the rms package to get an overall estimate of their significance including all those terms. Yes, there is a learning curve in starting to use that package. In particular, it's important to save the output from the datadist() function as run on your (cleaned) data frame (e.g., dd <- datadist(mydata)), and then to specify that as an option (e.g., options(datadist="dd")) before running any of the linear modeling functions like cph(). I recall my initial hesitancy at starting to use that package and only wish that I had started even earlier than I did.
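Putting those pieces together as a hedged sketch, still with the placeholder mydata:

```r
library(rms)

## Save the datadist() summary and register it before modeling.
dd <- datadist(mydata)
options(datadist = "dd")

fit_rcs <- cph(Surv(time, status) ~ rcs(x1, 5) + rcs(x2, 5),
               data = mydata, x = TRUE, y = TRUE)

## Pooled Wald tests for each predictor, including a separate test of
## the nonlinear (spline) components.
anova(fit_rcs)

## Adjusted estimate of the shape of the x1-outcome relation.
plot(Predict(fit_rcs, x1))
```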
Quoting from Harrell's Regression Modeling Strategies, second edition, page 494:
When correlations among predictors are mild, plots of estimated predictor transformations without adjustment for other predictors (i.e., marginal transformations) may be useful. Martingale residuals may be obtained quickly by fixing $\hat\beta = 0$ for all predictors. Then smoothed plots of predictor against residual may be made for all predictors.
So there is nothing necessarily wrong with examining martingale residuals from a null model for linearity, provided that "correlations among predictors are mild." As noted in Table 20.3 on page 494, there are several acceptable ways to use martingale residuals in this context, depending on whether you want to estimate transformations or check for nonlinearity, and whether you wish to adjust for other predictors in the process.
The apparent differences among the cited references represent different ways the martingale residuals were calculated and used.
If you display residuals for a model null in the continuous predictor of interest, then you will display the shape of the relation of the predictor to outcome. So if that's a reasonably straight line you have close to a linear relation, as in the first example you cite (and in the zoomed-out picture for your data). If you don't get a straight line, the shape of the curve might suggest a useful functional form to try for that predictor. As the martingale residuals can't go above 1, the actual value at each point along the curve is not a hazard ratio, but the shape of the curve indicates the general shape of the relation of hazard to the predictor. That's what you do when you try to estimate the functional form of the relation.
If instead you check the relation of the martingale residuals to the predictor from a model including an estimated coefficient for that predictor, then you hope for a flat horizontal smoothed line, indicating that the single linear coefficient for the predictor adequately represents the contribution of the predictor to outcome. That's what you do when you test for linearity in the predictor, as in the 2nd and 3rd references.
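As a short sketch of both approaches (placeholder names again, using only the survival package):

```r
library(survival)

## (1) Model null in x1: the smooth of the martingale residuals against
##     x1 traces the shape of the x1-outcome relation.
fit0 <- coxph(Surv(time, status) ~ 1, data = mydata)
res0 <- residuals(fit0, type = "martingale")
plot(mydata$x1, res0)
lines(lowess(mydata$x1, res0), col = "red")

## (2) Model with a linear term for x1: a flat smooth near 0 suggests
##     that the single linear coefficient is adequate.
fit1 <- coxph(Surv(time, status) ~ x1, data = mydata)
res1 <- residuals(fit1, type = "martingale")
plot(mydata$x1, res1)
lines(lowess(mydata$x1, res1), col = "red")
abline(h = 0, lty = 2)
```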
Nevertheless, particularly with this many data points, starting with restricted spline fits for continuous predictors might provide a simpler and more general way to both test for and adjust for nonlinearities. If the relation of a predictor to outcome is linear, then the higher-order spline terms will have non-significant coefficients. If the relation is non-linear, you can adjust for any reasonable degree of nonlinearity by adding more knots. You ultimately should calibrate the full model (checking linearity with respect to the overall linear predictor, and correcting for optimism), as with the calibrate() function in Harrell's rms package in R.
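A sketch of that last step; the follow-up time of 365 is purely illustrative, and note that calibrate() on a cph fit needs surv = TRUE and a time.inc matching u:

```r
library(rms)

fit_cal <- cph(Surv(time, status) ~ rcs(x1, 5) + rcs(x2, 5),
               data = mydata, x = TRUE, y = TRUE,
               surv = TRUE, time.inc = 365)

## Bootstrap calibration at t = 365, with optimism correction.
cal <- calibrate(fit_cal, u = 365, B = 200)
plot(cal)
```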
Best Answer
There is no formal linearity assumption regarding variables in a linear regression and there can also be non-linear interaction terms between different categorical variables in an ANOVA. For ANOVAs as well as regressions, there are assumptions of homoskedasticity and normality of residuals (and some others).
Those assumptions are equivalent to the response variable being linearly related to a continuous predictor in a regression. If there were a non-linear relationship that your model doesn't capture, the residuals would not be homoskedastic: they would be larger in those regions where the real trend diverges from your fitted line.
You could still add a higher-order term, perhaps the square of the variable, to your regression's predictors and see if that solves your homoskedasticity problem. For example,
$$ Y = \beta X^2 + \epsilon$$
is a perfectly acceptable linear regression. The linearity lies in the $\beta$ parameter, not in the variable $X$. As long as $\epsilon$ is normal and homoskedastic, the assumptions of the regression are met. An example of what you cannot do would be
$$Y = X^{\sin(\beta)} + \epsilon$$
since it is not linear in $\beta$, nor is it easy to render linear (as you sometimes can by taking the log of all terms).
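As a small simulated illustration (all names made up for the example): fitting only a linear term to a quadratic relationship leaves an obvious pattern in the residuals, while regressing on the squared predictor removes it.

```r
set.seed(1)
x <- runif(200, -2, 2)
y <- 3 * x^2 + rnorm(200)   # true relation is in x^2 but linear in beta

fit_lin  <- lm(y ~ x)        # misspecified: residuals show a U-shaped pattern
fit_quad <- lm(y ~ I(x^2))   # linear regression on the transformed predictor

plot(fitted(fit_lin),  resid(fit_lin))    # patterned, unequal spread
plot(fitted(fit_quad), resid(fit_quad))   # roughly structureless residuals
```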
In an ANOVA with only categorical predictors, the assumption is that the residuals (after subtracting a constant value per category) are normal within each category and have the same variance across categories.
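A brief sketch of those checks, assuming a hypothetical data frame df with a numeric outcome y and a factor group:

```r
## One-way ANOVA with a categorical predictor.
fit <- aov(y ~ group, data = df)

## Normality of the residuals and homogeneity of variance across groups.
qqnorm(residuals(fit)); qqline(residuals(fit))
bartlett.test(y ~ group, data = df)
```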