Solved – What regression analysis should I perform on the data and why

fixed-effects-model, generalized linear model, least squares, regression, spss

I am a law student researching which factors influence the CSR (corporate social responsibility, GSE_RAW) behavior of companies. As my studies didn't offer any statistics courses, I'm having trouble understanding what type of statistical analysis I should perform on my data. After describing the data, I hope some of you can tell me more about this.

Two groups of possible factors / variables influencing CSR have been identified: company-specific and country-specific.

First, company-specific variables are

  • MKT_AVG_LN: the market value of the company
  • SIGN: the number of CSR treaties the company has signed
  • INCID: the number of reported CSR incidents the company has been involved in

Second, each of the 4,000 companies in the dataset is headquartered in one of 35 countries. For each country, I have gathered some country-specific data, among others:

  • LAW_FAM: the legal family the country's legal system stems from (either French, English, Scandinavian, or German)
  • LAW_SR: the relative protection the country's company law gives to shareholders (for instance, in case of company default)
  • LAW_LE: the relative effectiveness of the country's legal system (a higher value means more effective and thus, for instance, less corrupt)
  • COM_CLA: a measurement for the intensity of internal market competition
  • GCI_505: measurement for the quality of primary education
  • GCI_701: measurement for the quality of secondary education
  • HOF_PDI: power distance (higher value means more hierarchical society)
  • HOF_LTO: country time orientation (higher means more long-term orientation)
  • DEP_AVG: the country's GDP per capita
  • CON_AVG: the country's average inflation over the 2008-2010 timeframe

To analyze this data, I "raised" the country-level data to the company level. For instance, if Belgium has a COM_CLA value of 23, then all Belgian companies in the dataset have their COM_CLA value set to 23. The variable LAW_FAM is split up into 4 dummy variables (LAW_FRA, LAW_SCA, LAW_ENG, LAW_GER), and each company gets a 1 on exactly one of these dummies.
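The "raising" step described above can be sketched in a few lines of plain Python. This is only an illustration, not the procedure actually used in SPSS: the `COUNTRY` key, the function name `raise_to_company_level`, and the two-country lookup table are assumptions made for the example (the numbers are taken from the sample rows shown in the question).

```python
# Country-level lookup table. The country labels "A" and "B" and the
# COUNTRY key are assumptions for this sketch; the values mirror the
# sample rows in the question.
country_data = {
    "A": {"LAW_FAM": "SCA", "LAW_SR": 34, "LAW_LE": 65, "COM_CLA": 53},
    "B": {"LAW_FAM": "FRA", "LAW_SR": 18, "LAW_LE": 40, "COM_CLA": 27},
}

# Company-level records, each tagged with its headquarters country.
companies = [
    {"COMPANY": 1, "COUNTRY": "A", "MKT_AVG_LN": 1.54, "INCID": 55},
    {"COMPANY": 2, "COUNTRY": "A", "MKT_AVG_LN": 1.44, "INCID": 16},
    {"COMPANY": 3, "COUNTRY": "A", "MKT_AVG_LN": 0.11, "INCID": 2},
    {"COMPANY": 4, "COUNTRY": "B", "MKT_AVG_LN": 0.38, "INCID": 12},
    {"COMPANY": 5, "COUNTRY": "B", "MKT_AVG_LN": 1.98, "INCID": 114},
]

LEGAL_FAMILIES = ("FRA", "SCA", "ENG", "GER")


def raise_to_company_level(companies, country_data):
    """Copy each country's values onto its companies and dummy-code LAW_FAM."""
    rows = []
    for company in companies:
        country = country_data[company["COUNTRY"]]
        row = dict(company)  # keep COUNTRY so the grouping is not lost
        for name, value in country.items():
            if name != "LAW_FAM":
                row[name] = value
        # One 0/1 indicator per legal family; exactly one of them is 1.
        for family in LEGAL_FAMILIES:
            row["LAW_" + family] = 1 if country["LAW_FAM"] == family else 0
        rows.append(row)
    return rows


rows = raise_to_company_level(companies, country_data)
for row in rows:
    print(row)
```

Two points are worth noting. Keeping a country identifier on every row preserves the information about which companies belong together, which any analysis that accounts for the grouping will need. And in a regression with an intercept, only three of the four LAW_FAM dummies can be entered at once; the omitted one acts as the reference category.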

This all results in a dataset like this:

COMPANY MKT_AVG_LN ... INCID ... LAW_FRA LAW_SCA ... LAW_SR LAW_LE COM_CLA ... etc
----------------------------------------------------------------------------------
   1      1.54          55          0       1          34     65     53
   2      1.44          16          0       1          34     65     53
   3      0.11           2          0       1          34     65     53
   4      0.38          12          1       0          18     40     27
   5      1.98         114          1       0          18     40     27
   .       .             .          .       .           .      .      .
   .       .             .          .       .           .      .      .
 4,000    0.87           9          0       1           5     14     18

Here, companies 1 to 3 are from the same country A, and 4 and 5 from country B.

My dependent variable (DV), GSE_RAW, is a numerical rating of each company's CSR behavior, assigned by a rating agency.

  • I believe the country-level variables are also called "categorical" variables, as many companies share the same value for these variables (in the example above, companies 1 to 3 all share the same values for LAW_FRA to COM_CLA). I believe I have found out that "categorical" variables are also known as fixed factors. Is all this true?
  • I believe an OLS regression analysis is not the proper model here because of the categorical (country-level) variables. It has been proposed to use "Generalized Linear Models" (GLM), using the country-level variables as (fixed?) "factors" and the company-level variables as "covariates". Is this correct? And as a subquestion: why exactly is OLS not appropriate because of the country-level variables? What is it that they do in the OLS calculations that throws the regression off?

[edit 1] I am using SPSS for statistical analysis

[edit 2] Here is my attempt to create a GLM using this data. However, I keep getting the "you haven't specified a custom model" message. Do I have to select all 4 variables here (because I want a beta and significance level for all 4 of them to construct a regression model)? And if so, why do I have to do this twice? I already said in a previous dialogue box that DEP_AVG and CON_AVG are fixed factors and that SIGN and INCID are covariates. Why would I, for instance, insert INCID here as a covariate, but not include it in the model building dialogue? Also, I really don't understand the output I'm getting, since it is very different from ordinary OLS output (the only output I'm slightly comfortable with).

  • Am I now doing the right analysis?
  • How can I get a regression model from this?


Best Answer

Whether a variable is categorical depends only on the variable, not on any "sharing" of common values. In your case, LAW_FAM is categorical because it has four discrete categories: FRA, SCA, ENG, GER. In particular, LAW_FAM is nominal: the categories have no ordering. You could have several countries which happen to have exactly the same DEP_AVG, but that doesn't make DEP_AVG a categorical variable.

I would suggest that you look at Multilevel/Hierarchical Models, since you have hierarchical data: country-level data and company-level data nested within countries.
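One rough way to check how much the nesting matters is the intraclass correlation (ICC): the share of the variance in the outcome that sits between countries rather than between companies within a country. The sketch below computes the one-way ANOVA estimate of the ICC in plain Python. The function name `icc_oneway` and the ratings are invented for illustration; they are not taken from the actual dataset.

```python
def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation, ICC(1).

    `groups` maps a group label (here: a country) to the list of outcome
    values (here: hypothetical GSE_RAW ratings) of its members.
    """
    k = len(groups)
    sizes = [len(values) for values in groups.values()]
    n_total = sum(sizes)
    grand_mean = sum(sum(values) for values in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in values)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted group size; equals the common group size for balanced data.
    n0 = (n_total - sum(s * s for s in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Invented ratings: companies within a country look alike,
# while the countries themselves differ a lot.
ratings = {
    "A": [10, 12, 11],
    "B": [20, 22, 21],
    "C": [30, 29, 31],
}
icc = icc_oneway(ratings)
print(round(icc, 3))
```

An ICC near 0 means companies in the same country are no more alike than companies in different countries, so ignoring the grouping costs little. An ICC well above 0 means the observations are not independent, which is the assumption plain OLS relies on for its standard errors, and a multilevel model with a country-level random effect is the usual remedy.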

Your post is very good: you include enough details to help us help you. One more thing that would also help us point you in the right direction is to know what software you will be using for your analysis.

EDIT: You ask about Generalized Linear Models, which are chosen for specific kinds of dependent variables. For example, if you wanted to predict a binary categorical variable, you would use logistic regression, which is one kind of GLM.
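To make the relation concrete, here is the general form of a GLM in generic notation. The symbols are my own and do not come from the question: y_i is the dependent variable for observation i, x_{i1}, ..., x_{ip} are the predictors, and g is the link function.

```latex
% General form: a link function g connects the mean of y to a linear predictor
g\bigl(\mathrm{E}[y_i]\bigr) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}

% Logistic regression: binary y_i with p_i = \Pr(y_i = 1), logit link
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}

% Identity link with normally distributed errors: ordinary linear regression
\mathrm{E}[y_i] = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
```

With a numerical dependent variable such as GSE_RAW, the identity-link case applies, so a GLM on its own gives the same model as ordinary linear regression and does not address the nesting of companies within countries.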