Solved – Regression with a categorical predictor such as day of week

categorical-data, categorical-encoding, regression

I need a little help to move in the right direction. It's a long time since I studied any stats and the jargon seems to have changed.

Imagine that I have a set of car-related data such as

  • Journey time from town A to town B
  • Distance from town A to town B
  • Engine size
  • Driver's shoe size
  • Make and model of car
  • Day of week

I want to predict journey time.

I imagine there's a strong correlation between time and distance, probably a weaker one with engine size, and none with shoe size. Presumably multiple regression analysis / ANOVA is the tool to use. But how do I include day of week, since just coding it as Sunday=1, Monday=2, etc. feels very wrong?

Having used Excel's regression tool, for example, how do I interpret the results? Presumably if R is close to 1 this is good (although if there are many data items it seems it can be small yet still significant). But some sources refer to r-squared, which seems to be the SD, so a value close to zero would be good. The output also shows the t Stat, P-value, F, and Significance F, whatever they may be. Can anyone recommend a good reference source?

Best Answer

What you need is a solid review of regression methodology. However, these questions are sufficiently basic (don't take that the wrong way) that even a good overview of basic statistics would probably benefit you. Howell has written a very popular textbook that provides a broad conceptual foundation without requiring dense mathematics. It may well be worth your time to read it. It is not possible to cover all of that material here. However, I can try to get you started on some of your specific questions.

First, days of the week are included via a coding scheme. The most popular is 'reference category' coding (typically called dummy coding). Let's imagine that your data are represented in a matrix, with your cases in rows and your variables in columns. In this scheme, if a categorical variable has 7 levels (e.g., the days of the week), you would add 6 new columns. You would pick one level as the reference category, generally the one thought of as the default; this choice is often informed by theory, context, or the research question. I have no idea which would be best for days of the week, but it also doesn't matter much, so you could just pick any old one. Once you have the reference category, you assign the remaining levels to your 6 new variables and simply indicate whether each one obtains for a given case. For example, say you pick Sunday as the reference category; your new columns / variables would be Monday through Saturday. Every observation that took place on a Monday would be indicated with a $1$ in the Monday column and a $0$ elsewhere, and likewise for observations on Tuesdays and so on. Note that no case can get a $1$ in 2 or more columns, and that observations that took place on Sunday (the reference category) get $0$'s in all of the new variables. There are many other coding schemes possible, and the link does a good job of introducing them.

You can test whether the day of the week matters by comparing the nested model with all 6 new variables dropped against the full model with all 6 included (a code sketch of this follows below). Note that you should not rely on the individual tests reported in standard output for this purpose, as they are not independent and have intrinsic multiple-comparison problems.
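To make the coding scheme and the nested-model test concrete, here is a minimal sketch in Python. The data and the column names (`journey_time`, `distance`, `day`) are invented purely for illustration, and it assumes pandas and statsmodels are available:

```python
# Minimal sketch of reference-category (dummy) coding and the nested-model
# F-test described above. All data here are synthetic; column names are
# hypothetical, not from the original question.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "distance": rng.uniform(5, 100, n),
    "day": rng.choice(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"], n),
})
df["journey_time"] = 10 + 1.2 * df["distance"] + rng.normal(0, 5, n)

# C(day, Treatment(reference="Sun")) makes Sunday the reference category;
# the formula machinery builds the 6 indicator columns (Mon-Sat) for you.
full = smf.ols('journey_time ~ distance + C(day, Treatment(reference="Sun"))',
               data=df).fit()
reduced = smf.ols("journey_time ~ distance", data=df).fit()

# One F-test of all 6 day indicators at once (the nested-model comparison),
# rather than 6 separate, non-independent t-tests from the standard output.
print(anova_lm(reduced, full))
```

Calling `full.summary()` on such a model would produce the kind of output the question asks about: per-coefficient $t$-statistics and $p$-values, $R^2$, and an overall $F$-test.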

It has been a long time since I've looked at how Excel does statistics, and I don't remember it very clearly, so someone else may be able to help you more there. This page seems to have some information about the specifics of regression in Excel. I can tell you a little more about the statistics typically reported in regression output:

  • An $r$ value close to $1$ indicates that the response variable can be almost completely determined by the values of the predictor variables. Clearly this would be a large effect, but it is not a priori clear that this is 'good'--that is an entirely different and philosophically thorny issue.
  • It is not clear what is meant by '$r$' here, given that you are doing multiple regression (where $r$ is not typically reported). $r$ is a measure of linear, bivariate association; that is, it applies to straight-line relationships between (only) 2 variables. It is possible, however, to compute $r$ between the predicted values from your model and the observed response values. In that case you are back to 2 variables (and if your model is appropriately specified, the relationship should be linear). This version is called the 'multiple $r$'; it is what Excel's output labels 'Multiple R'.
  • R-squared ($R^2$) is simply the square of $r$ (i.e., $r\times r$); it is not the standard deviation. It also tends towards $1$, not $0$, as the relationship becomes more deterministic. Thus, if you think an $r$ close to $1$ is 'good', you should think an $R^2$ close to $1$ is 'good' as well. (The connection between the multiple $r$ and $R^2$ is illustrated in the sketch after this list.) However, you should know that the multiple $r$ (and multiple $R^2$) is highly biased in multiple regression: the more predictors you add to your model, the higher these statistics will climb, whether or not there is any real relationship. Thus you should be cautious about interpreting them.
  • Output will often list $t$-statistics for the individual predictors and an $F$-statistic for the model as a whole, in order to assess 'significance'. These are test statistics computed from your data, and they follow known distributions once the degrees of freedom are specified.
  • By comparing the realized value (that is, the value you found) against the known distribution, you can determine the probability of finding a value as extreme or more extreme than yours if the null hypothesis is true. That probability is the $p$-value.
  • The $t$-value is used when you are testing only one parameter, whereas the $F$-value can be used to test multiple parameters at once (e.g., as I discussed above regarding days of the week). A small $p$-value associated with the $F$ suggests that at least one of the tested parameters matters. Another way to think about it is, 'does the model with all the parameters tested by the $F$ included do a better job of predicting the response than the null model?'
  • What Excel labels 'Significance F' is the $p$-value associated with the overall model $F$-statistic, so a value below the conventional $.05$ threshold would count as 'significant'.
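As a quick illustration of the relationship between the multiple $r$ and $R^2$ mentioned above, here is a self-contained sketch. The data are synthetic and the variable names are illustrative only; it assumes nothing beyond numpy:

```python
# Sketch: fit y on X by ordinary least squares, then show that the squared
# correlation between the fitted and observed values equals the usual R^2.
# All data here are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(0, 1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot            # the usual definition of R^2

multiple_r = np.corrcoef(fitted, y)[0, 1]  # r between fitted and observed
print(np.isclose(multiple_r ** 2, r_squared))  # True
```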

One last point worth emphasizing is that this process cannot be divorced from its context. To do a good job of analyzing data, you must keep your background knowledge and the research question in mind. I alluded to this above regarding the choice of the reference category. For example, you note that shoe size should not be relevant, but for the Flintstones it probably was! I include this point because it often seems to be forgotten.