Solved – getting different intercept values in R and Java for simple linear regression

Tags: java, r, regression

I have linear regression code written in R and need to do the same thing in Java, for which I used the Apache Commons Math library. I fed the same data to the R code and the Java code, but got a different intercept value. I cannot figure out what mistake I have made in the code.

R Code:

test_trait <- c( -0.48812477 , 0.33458213, -0.52754476, -0.79863471, -0.68544309, -0.12970239,  0.02355622, -0.31890850,0.34725819 , 0.08108851)
geno_A <- c(1, 0, 1, 2, 0, 0, 1, 0, 1, 0)
geno_B <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0) 
fit <- lm(test_trait ~ geno_A*geno_B)
fit

R Output:

Call:
lm(formula = test_trait ~ geno_A * geno_B)

Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B  
    -0.008235      -0.152979      -0.677208       0.096383 

Java Code (includes EDIT1):

package linearregression;
import org.apache.commons.math3.stat.regression.SimpleRegression;
public class LinearRegression {
    public static void main(String[] args) {

        double[][] x = {{1,0},
                        {0,0},
                        {1,0},
                        {2,1},
                        {0,1},
                        {0,0},
                        {1,0},
                        {0,0},
                        {1,0},
                        {0,0}
        };

        double[]y = { -0.48812477,
                       0.33458213,
                      -0.52754476,
                      -0.79863471,
                      -0.68544309,
                      -0.12970239,
                       0.02355622,
                      -0.31890850,
                       0.34725819,
                       0.08108851
        };
        SimpleRegression regression = new SimpleRegression(true);
        regression.addObservations(x,y);

        System.out.println("Intercept: \t\t"+regression.getIntercept());
        // EDIT 1 -----------------------------------------------------------
        System.out.println("InterceptStdErr: \t"+regression.getInterceptStdErr());
        System.out.println("MeanSquareError: \t"+regression.getMeanSquareError());
        System.out.println("N: \t\t\t"+regression.getN());
        System.out.println("R: \t\t\t"+regression.getR());
        System.out.println("RSquare: \t\t"+regression.getRSquare());
        System.out.println("RegressionSumSquares: \t"+regression.getRegressionSumSquares());
        System.out.println("Significance: \t\t"+regression.getSignificance());
        System.out.println("Slope: \t\t\t"+regression.getSlope());
        System.out.println("SlopeConfidenceInterval: "+regression.getSlopeConfidenceInterval());
        System.out.println("SlopeStdErr: \t\t"+regression.getSlopeStdErr());
        System.out.println("SumOfCrossProducts: \t"+regression.getSumOfCrossProducts());
        System.out.println("SumSquaredErrors: \t"+regression.getSumSquaredErrors());
        System.out.println("XSumSquares: \t\t"+regression.getXSumSquares());
        // EDIT 1 ends here -------------------------------------------------

    }
}

Java Output:

Intercept:      -0.08732359363636362

Java Output of EDIT1:

Intercept:                -0.08732359363636362
InterceptStdErr:          0.17268454347538026
MeanSquareError:          0.16400973355415271
N:                        10
R:                        -0.3660108396736771
RSquare:                  0.13396393475863017
RegressionSumSquares:     0.20296050132281976
Significance:             0.2982630977579106
Slope:                    -0.21477287227272726
SlopeConfidenceInterval:  0.4452137360615129
SlopeStdErr:              0.193067188937234
SumOfCrossProducts:       -0.945000638
SumSquaredErrors:         1.3120778684332217
XSumSquares:              4.4

I would greatly appreciate your help. Thanks!

Best Answer

As there seems to be a misunderstanding about the statistical aspect of the procedures you used, here are some hints:

  • Your R model is actually a multiple linear regression in which geno_A and geno_B are treated as numeric variables. The formula geno_A*geno_B expands to both main effects plus their interaction, which is why you get four parameter estimates. Make sure you really want geno_A treated as numeric rather than categorical; otherwise you will have to recode it as dummy variables.
  • Your Java code does not fit this model, for two reasons. First, you didn't include the interaction term, which is simply the element-wise product of geno_A and geno_B. Second, SimpleRegression fits a single predictor only, so you should use OLSMultipleLinearRegression instead. Your own output shows this: XSumSquares is 4.4, which is the centred sum of squares of geno_A alone, so the intercept and slope you got come from regressing test_trait on geno_A only.
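
To make the "interaction is a product" point concrete, here is a minimal sketch that builds the predictor rows for the interaction model. The class and method names (InteractionDesign, withInteraction) are made up for this illustration and are not part of any library.

```java
import java.util.Arrays;

public class InteractionDesign {

    /**
     * Builds one row per observation: {a[i], b[i], a[i] * b[i]}.
     * The third column is the interaction term (hypothetical helper).
     */
    public static double[][] withInteraction(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("a and b must have the same length");
        }
        double[][] x = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            x[i] = new double[] {a[i], b[i], a[i] * b[i]};
        }
        return x;
    }

    public static void main(String[] args) {
        double[] genoA = {1, 0, 1, 2, 0, 0, 1, 0, 1, 0};
        double[] genoB = {0, 0, 0, 1, 1, 0, 0, 0, 0, 0};
        // Print each row as [geno_A, geno_B, geno_A:geno_B]
        for (double[] row : withInteraction(genoA, genoB)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With the data from the question, only the fourth observation (geno_A = 2, geno_B = 1) has a non-zero interaction value, namely 2.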

Here is your code rewritten as LinearRegression.java to fit the following two models: a simple additive model and a model with interaction. Its output agrees with R.
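
A minimal sketch of such a rewrite follows. It is not necessarily the original LinearRegression.java; it assumes Commons Math 3 is on the classpath, and the class name and the helpers design and fit are illustrative.

```java
import java.util.Locale;
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

public class InteractionRegressionSketch {

    public static final double[] Y = {
        -0.48812477, 0.33458213, -0.52754476, -0.79863471, -0.68544309,
        -0.12970239, 0.02355622, -0.31890850, 0.34725819, 0.08108851};
    public static final double[] GENO_A = {1, 0, 1, 2, 0, 0, 1, 0, 1, 0};
    public static final double[] GENO_B = {0, 0, 0, 1, 1, 0, 0, 0, 0, 0};

    /**
     * Predictor rows {a, b} or, with interaction, {a, b, a*b}.
     * No column of ones is added here: the library estimates the intercept itself.
     */
    public static double[][] design(double[] a, double[] b, boolean interaction) {
        double[][] x = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            x[i] = interaction
                    ? new double[] {a[i], b[i], a[i] * b[i]}
                    : new double[] {a[i], b[i]};
        }
        return x;
    }

    /** Ordinary least squares; returns {intercept, beta1, beta2, ...}. */
    public static double[] fit(double[] y, double[][] x) {
        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(y, x); // an intercept is included by default
        return ols.estimateRegressionParameters();
    }

    private static void print(String label, double[] beta) {
        StringBuilder sb = new StringBuilder(label).append(":");
        for (double b : beta) {
            // Locale.ROOT forces a decimal point regardless of system locale
            sb.append(String.format(Locale.ROOT, " %.6f", b));
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        print("y = int + genoA + genoB", fit(Y, design(GENO_A, GENO_B, false)));
        print("y = int + genoA + genoB + genoA:genoB", fit(Y, design(GENO_A, GENO_B, true)));
    }
}
```

The coefficients come back in the order intercept, geno_A, geno_B, and (for the second model) geno_A:geno_B, so they can be compared directly with the columns of the lm output.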

Java

% javac -cp commons-math3-3.1.1.jar LinearRegression.java && java -cp commons-math3-3.1.1.jar:. LinearRegression
First model: y = int + genoA + genoB
Intercept: -0,032   beta1: -0,105   beta2: -0,605

Second model: y = int + genoA + genoB + genoA:genoB
Intercept: -0,008   beta1: -0,153   beta2: -0,677   beta3: 0,096

R

> lm(test_trait ~ geno_A + geno_B)

Call:
lm(formula = test_trait ~ geno_A + geno_B)

Coefficients:
(Intercept)       geno_A       geno_B  
   -0.03233     -0.10479     -0.60492  

> lm(test_trait ~ geno_A * geno_B)

Call:
lm(formula = test_trait ~ geno_A * geno_B)

Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B  
    -0.008235      -0.152979      -0.677208       0.096383  