Linear regression with categorical variables: what are the estimates of reference levels?

categorical data, regression coefficients

I've been trying with limited success to understand 'reference' levels when doing linear regressions on categorical variables in R. This post:

Effect of reference factor on T-test significance for linear regression in R

helped a bit, but there is still a lot I don't get. I wonder if someone could please answer my questions below and/or point me to relevant literature/posts.

Here's the example I used. I simulated some data about temperatures (Temp) measured at the same time of the day but in different months (Month) and places (Place). There are 12 months {Jan…Dec} and 6 places {A…F}. I assigned a constant value to each Month and Place, and calculated Temp as Month + Place + a random error. Here's what came out:

   Month Place  Temp
1    May     E 23.39
2    Dec     B 14.06
3    Nov     E 12.56
4    Nov     F 10.32
5    Aug     C 24.22
6    Jul     A 32.13
7    Oct     E 16.83
8    Feb     E  7.22
9    Jul     E 29.74
10   Aug     F 26.95
11   Jun     D 23.77
12   Jun     B 31.13
13   Dec     D  7.71
14   Sep     B 29.00
15   Jun     C 22.56
16   May     F 20.82
17   Nov     C  6.39
18   May     B 30.33
19   Aug     A 32.13
20   May     A 26.00
21   Apr     D 23.48
22   Jan     F  2.21
23   Oct     B 18.58
24   Mar     B 24.65
25   Dec     C  5.29
26   Jan     B 12.13
27   May     C 20.44
28   May     D 22.52
29   Oct     F 13.12
30   Jan     A  5.56
31   Apr     A 22.79

So I ran lm, with the intercept suppressed:

m1 <- lm(Temp~0+Month+Place,data=tdata)

sm1 <- summary(m1)

sm1

Call:
lm(formula = Temp ~ 0 + Month + Place, data = tdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9146 -0.7258  0.0000  0.6939  1.5626 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
MonthApr  24.3526     1.0980  22.179 2.64e-12 ***
MonthAug  30.9107     0.9721  31.797 1.87e-14 ***
MonthDec  10.3452     1.1508   8.990 3.44e-07 ***
MonthFeb   8.2012     1.7111   4.793 0.000286 ***
MonthJan   6.5432     0.9721   6.731 9.61e-06 ***
MonthJul  31.4256     1.0949  28.703 7.68e-14 ***
MonthJun  27.1452     1.1508  23.589 1.14e-12 ***
MonthMar  20.5690     1.6879  12.186 7.68e-09 ***
MonthMay  25.3779     0.8503  29.846 4.48e-14 ***
MonthNov  13.2277     1.1315  11.690 1.31e-08 ***
MonthOct  16.4136     1.1315  14.506 7.92e-10 ***
MonthSep  24.9190     1.6879  14.763 6.28e-10 ***
PlaceB     4.0810     0.9910   4.118 0.001045 ** 
PlaceC    -5.6213     0.9910  -5.672 5.76e-05 ***
PlaceD    -2.4352     1.0431  -2.335 0.034977 *  
PlaceE    -0.9812     1.0299  -0.953 0.356896    
PlaceF    -3.8106     0.9550  -3.990 0.001342 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.366 on 14 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9959 
F-statistic: 438.7 on 17 and 14 DF,  p-value: 3.789e-16

The Estimates are pretty close to the values I used.
However, Place A has no Estimate or statistics.
I understand this is a necessity of the method used. But suppose that I need to make a report for an end user who wants to know about all places and all months.
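A minimal sketch of why Place A has no row, assuming the 31 rows printed above are the whole data set (Temp is rounded to 2 decimals there, so the numbers may differ slightly from the summary shown). With the intercept suppressed, R keeps all 12 Month dummies and treatment-codes Place, so the first Place level (A) becomes the baseline with an implied coefficient of 0. The names `new_A` and `pred_jan_A` are only illustrative.

```r
# Data copied from the table in the question (31 rows).
tdata <- data.frame(
  Month = factor(c("May","Dec","Nov","Nov","Aug","Jul","Oct","Feb","Jul","Aug","Jun",
                   "Jun","Dec","Sep","Jun","May","Nov","May","Aug","May","Apr","Jan",
                   "Oct","Mar","Dec","Jan","May","May","Oct","Jan","Apr")),
  Place = factor(c("E","B","E","F","C","A","E","E","E","F","D","B","D","B","C","F",
                   "C","B","A","A","D","F","B","B","C","B","C","D","F","A","A")),
  Temp  = c(23.39,14.06,12.56,10.32,24.22,32.13,16.83,7.22,29.74,26.95,23.77,31.13,
            7.71,29.00,22.56,20.82,6.39,30.33,32.13,26.00,23.48,2.21,18.58,24.65,
            5.29,12.13,20.44,22.52,13.12,5.56,22.79)
)

m1 <- lm(Temp ~ 0 + Month + Place, data = tdata)

# 12 Month coefficients + 5 Place coefficients; there is no "PlaceA".
coef_names <- names(coef(m1))
print(coef_names)

# Treatment contrasts for Place: the row for level A is all zeros,
# i.e. A is the baseline that the other places are compared against.
print(contrasts(tdata$Place))

# Each Month coefficient is therefore the expected Temp at Place A
# in that month. Check it for January:
new_A <- data.frame(
  Month = factor("Jan", levels = levels(tdata$Month)),
  Place = factor("A",   levels = levels(tdata$Place))
)
pred_jan_A <- predict(m1, newdata = new_A)
print(c(prediction = unname(pred_jan_A),
        MonthJan   = unname(coef(m1)["MonthJan"])))
```

Read this way, the "estimate for Place A" is not a missing number: it is built into the Month rows, and the PlaceB to PlaceF rows are differences from Place A.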

Q1: What Estimate value should I put in that report for Place A? Zero? What error, t-value, Pr? Does it make sense at all to ask this?

Before using R, I used to run this kind of regression in Maxima (function lsquares_estimates). It was much more complicated, and I could not easily calculate errors and other statistics, but it showed me something interesting.
I ran the exact same problem. One 'dependent equation' was eliminated, and I got non-numerical coefficients as the answer, i.e. a free real parameter (%R1) was added to or subtracted from each coefficient, like:

[[Jan=2.732253167457174-1.0*%R1,
  Feb=4.386938346498089-1.0*%R1,
  Mar=16.76057521120221-1.0*%R1,
  Apr=20.53682857932274-1.0*%R1,
  May=21.56761011969353-1.0*%R1,
  Jun=23.33233046418485-1.0*%R1,
  Jul=27.61310220652083-1.0*%R1,
  Aug=27.09988869583722-1.0*%R1,
  Sep=21.11123314105077-1.0*%R1,
  Oct=12.60181221694821-1.0*%R1,
  Nov=9.416826587161458-1.0*%R1,
  Dec=6.529998842026064-1.0*%R1,
  A=%R1+3.812202188790789,
  B=%R1+7.892475858976436,
  C=%R1-1.80953066617677,
  D=%R1+1.37794521871563,
  E=%R1+2.828665973718834,
  F=%R1]]

I guess this is equivalent to saying that F has been considered as a 'reference', and once we choose the value of F, all other values are determined.
In other cases I got more than one dependent equation eliminated, and multiple %R's were present in the solution (multiple references?).
At the time I looked a bit in the literature, and I found (I think on Wikipedia, but I can't find it again) that in such cases one can look for the values of the %R's such that the numerical range of the coefficients is minimal.
So in my example, if I take the RHS of the solution as if it were a vector, calculate its squared norm (the sum of its squared components):

18.0*%R1^2-359.1752780077565*%R1+4061.64136679127

and differentiate it with respect to %R1 I get:

36.0*%R1-359.1752780077565

When this derivative is 0, the sum of the squared coefficients is minimised, which happens for:

%R1=9.977091055771005

which, substituted in the original solution, yields:

[[Jan=-7.244837888313856,
  Feb=-5.590152709272942,
  Mar=6.783484155431175,
  Apr=10.55973752355171,
  May=11.5905190639225,
  Jun=13.35523940841382,
  Jul=17.6360111507498,
  Aug=17.12279764006619,
  Sep=11.13414208527974,
  Oct=2.624721161177185,
  Nov=-0.5602644686095726,
  Dec=-3.447092213744966,
  A=13.78929324456182,
  B=17.86956691474747,
  C=8.16756038959426,
  D=11.35503627448666,
  E=12.80575702948986,
  F=9.97709105577103]]

Here the coefficients are all different from the ones I had used, but the ranking is correct within each explanatory variable. So this solution would be satisfactory if the end user only wanted to compare Months among themselves and Places among themselves, but not if they wanted to know whether Place or Month matters more in determining the temperature. For that I guess R's result is better (although I still have a lot to understand about it).

Q2: What do you think about the above method to somehow 'normalise' the coefficients so that all of them have a defined value? Is it statistically valid? Could one apply this in R? And would it then be possible to calculate the error, t-value and p-value for each (again, assuming it makes sense)?
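As a rough sketch of how the Maxima procedure could be reproduced in R: minimising the sum of squared coefficients over the free parameter amounts to taking the minimum-norm least-squares solution, which can be obtained from a pseudoinverse of the full dummy matrix (all 12 Month and all 6 Place columns, none dropped). The version below builds the pseudoinverse from base R's `svd()`. It assumes the 31 printed rows are the whole data set, and the names `X`, `keep` and `beta_min` are only illustrative. If that assumption holds, the result should be close to the Maxima values above, up to the rounding of Temp.

```r
# Data copied from the table in the question (31 rows).
tdata <- data.frame(
  Month = factor(c("May","Dec","Nov","Nov","Aug","Jul","Oct","Feb","Jul","Aug","Jun",
                   "Jun","Dec","Sep","Jun","May","Nov","May","Aug","May","Apr","Jan",
                   "Oct","Mar","Dec","Jan","May","May","Oct","Jan","Apr")),
  Place = factor(c("E","B","E","F","C","A","E","E","E","F","D","B","D","B","C","F",
                   "C","B","A","A","D","F","B","B","C","B","C","D","F","A","A")),
  Temp  = c(23.39,14.06,12.56,10.32,24.22,32.13,16.83,7.22,29.74,26.95,23.77,31.13,
            7.71,29.00,22.56,20.82,6.39,30.33,32.13,26.00,23.48,2.21,18.58,24.65,
            5.29,12.13,20.44,22.52,13.12,5.56,22.79)
)
y <- tdata$Temp

# Full dummy coding: one column per Month (12) and per Place (6).
X <- cbind(model.matrix(~ 0 + Month, tdata),
           model.matrix(~ 0 + Place, tdata))

# Pseudoinverse solution via SVD, dropping singular values that are
# numerically zero (the matrix has 18 columns but rank 17).
s    <- svd(X)
keep <- s$d > max(dim(X)) * max(s$d) * .Machine$double.eps
beta_min <- drop(s$v[, keep] %*% (crossprod(s$u[, keep], y) / s$d[keep]))
names(beta_min) <- sub("^(Month|Place)", "", colnames(X))
print(round(beta_min, 3))

# The fitted values are the same as for the usual treatment-coded model.
m1 <- lm(Temp ~ 0 + Month + Place, data = tdata)
print(all.equal(drop(X %*% beta_min), unname(fitted(m1)),
                check.attributes = FALSE))
```

The 18 numbers in `beta_min` depend on the arbitrary minimum-norm constraint, so they are not individually estimable. Only the differences between levels of the same factor and the Month + Place sums stay the same whichever constraint is chosen.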

Sorry for the long post; I am a beginner in statistics and I can't do technical jargon shortcuts.

Best Answer

If you have two factors in your model, you cannot consider one level of one factor (Place A in your example) without also specifying which level of the other factor you mean. In your example the 12 Month coefficients are the estimates for Place A, one for each month, and the Place coefficients are differences from Place A. If you change the order of the terms so that you get estimates for all 6 places, they will all refer to the same single month (presumably April, the first level alphabetically). Trying to estimate all the months and all the places simultaneously is, as you suspect, impossible.
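To illustrate, here is a small sketch assuming the 31 rows printed in the question are the whole data set (Temp is rounded there, so the values may differ slightly from the posted summary). It refits the model with the terms swapped and then uses `predict()` to tabulate an estimate and standard error for every Month and Place combination, which may be a more useful report for an end user than a single "Place A" number. The names `m2`, `grid` and `report` are only illustrative.

```r
# Data copied from the table in the question (31 rows).
tdata <- data.frame(
  Month = factor(c("May","Dec","Nov","Nov","Aug","Jul","Oct","Feb","Jul","Aug","Jun",
                   "Jun","Dec","Sep","Jun","May","Nov","May","Aug","May","Apr","Jan",
                   "Oct","Mar","Dec","Jan","May","May","Oct","Jan","Apr")),
  Place = factor(c("E","B","E","F","C","A","E","E","E","F","D","B","D","B","C","F",
                   "C","B","A","A","D","F","B","B","C","B","C","D","F","A","A")),
  Temp  = c(23.39,14.06,12.56,10.32,24.22,32.13,16.83,7.22,29.74,26.95,23.77,31.13,
            7.71,29.00,22.56,20.82,6.39,30.33,32.13,26.00,23.48,2.21,18.58,24.65,
            5.29,12.13,20.44,22.52,13.12,5.56,22.79)
)

m1 <- lm(Temp ~ 0 + Month + Place, data = tdata)  # baseline: Place A
m2 <- lm(Temp ~ 0 + Place + Month, data = tdata)  # baseline: Month Apr

# m2 reports all 6 places, each one for the baseline month (Apr).
print(round(coef(m2)[paste0("Place", LETTERS[1:6])], 3))

# Both parameterisations describe the same fit.
print(all.equal(fitted(m1), fitted(m2)))

# Estimate and standard error for each of the 12 x 6 = 72 combinations.
grid   <- expand.grid(Month = levels(tdata$Month),
                      Place = levels(tdata$Place))
pr     <- predict(m1, newdata = grid, se.fit = TRUE)
report <- cbind(grid, Estimate = pr$fit, Std.Error = pr$se.fit)
print(head(report))
```

Every row of `report` is a Month + Place sum, which is well defined whichever level is chosen as the reference.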
