Please consider the following OLS model:
As a rule of thumb, a VIF score of max. 5 is considered acceptable collinearity in most fields. So for the better part I believed this model to be quite nice and significant (R-squared .516).
However, by adding only 2 variables (at the bottom of the model), many values change significantly. For instance, GCI-505
goes from being significant at the 10% level (p=.008) to being insignificant (p=.677). INCID
reacts similarly, whereas HOF_PDI
and HOF_LTO
have a reverse change:
My question is: what does this mean for my model? It doesn't seem very robust, although I though that it should have been with a significant F-score, and not too high VIF scores. What am I missing / lacking in knowledge?
[edit 1]
Some explaining of the variables: the model tries to explain the extent to which a company is behaving socially responsible. The dataset holds 4,000 companies from 35 countries.
First, each company has a CSR score (the DV), a marketvalue and bookvalue (MKT_AVG_LN
and PTBV_AVG_LN)
, a number of treaties on responsible investment they have signed (SIGN
), and a number of incidents that have been reported in the media (INCID
).
Second, each company is headquartered in one of 35 countries. These countries all have certain properties, for instance from which legal family their legal system stems and the average level of education. These country-specific values have been included as a variable for each company. For instance, a company with LAW_FRA
-dummy set to 1 is headquartered in a country with a french legal system. When a company has LAW_SR
set to 5.45
, this means that the shareholder rights index for the country the company in headquartered in, is 5.45
. LAW_FRA
, LAW_SCA
and LAW_ENG
(not included in the model) are dummies for legal family. The rest of the variables are continuous.
There are possibly 2 problems with my dataset
- I have "elevated" country-variables to a company level;
- the number of companies from each country is very unequally divided (for instance, 40% comes from the USA whereas only 1% of companies is headquartered in Singapore. This means that if USA has a
LAW_SR
score of5.45
, 40% of companies in the dataset have aLAW_SR
score of5.45
whereas only a few companies will have aLAW_SR
value of3.00
(theLAW_SR
score for Singapore).
[edit 2]
In SPSS Mixed Models has the following dialogue:
I am confused about which variables are factor, covariate and residual weight. As I believe continuous company-variables, such as a companies' market value and number of signed treaties (MKT_AVG_LN
and SIGN
) are factors, and company variables that are lifted from a country-level (for instance, LAW_SR
, which is a numerical value of the relative strength of shareholders under company law of a country) are covariates. Is this correct?
Also, I have a few dummy variables. For instance, each company operates under either French, German, English or Scandinavian law. Here, LAW_FRA
, LAW_GER
, LAW_ENG
, LAW_SCA
are dummy variables indicating for each company the origin of company law in the country where the company is headquartered. These dummy values are derived from the variable LAW_FAMILY
, which for each company is set to either FRA
, GER
, ENG
or SCA
. In mixed models, should I discard the dummy variables and use LAW_FAMILY
as a covariate also?
Best Answer
The fact that additional variables make large changes in your model doesn't mean there is collinearity nor does it mean it isn't "robust" (although I guess that depends on what you mean by "robust"). Nor does a reasonable VIF and a high F mean that additional variables won't have an effect. Only completely independent variables will have no effect on the other coefficients, and variables can be a long way from independent without having high VIF.
It means that controlling for those additional variables makes a large difference in the other relationships.
You haven't said what any of these variables are, so it's hard to be specific. However, let's imagine simpler models:
Model 1: log(income) as effect of race/ethnic group Model 2: log(income) as effect of race/ethnic group + age
The coefficients in both models will likely all be significant (if it's a reasonable sample size). But the coefficient of race/ethnicity will (I bet) drop in the 2nd model because the average age of different ethnic groups is different, and income tends to be related to age.
EDIT in response to edit in post:
Since you have country level and company level variables, you should not use OLS regression as the data are not independent. You need to account for this. One way is with mixed models (aka multilevel models). I don't know how
SPSS
does this, but inSAS
it would bePROC MIXED
and inR
it would benlme
orlme4
.