Solved – Difference Between Linear Regression in Machine Learning and Statistical Model

machine learning, spark-mllib

My understanding was that the major difference between machine learning and statistical modeling is that the latter "assumes" a certain distribution for the data, and that the modeling paradigm and the statistical results we obtain (e.g. p-values, F-statistics, t-statistics) follow from that assumption. In machine learning, by contrast, we don't worry about the distribution of the data and are more interested in prediction.

While going through the MLlib documentation, I found that for linear regression we specify a distribution. But MLlib is a machine learning package. So I have the following questions:

1) Is my understanding of the difference between ML and statistical methods wrong?

2) Is Spark using statistical modeling for linear regression and GLMs?

Thanks!

Note: There are a lot of wonderful posts about the difference between machine learning and statistical methods, but this question is specifically about Spark MLlib.

Best Answer

  1. Unfortunately, the dichotomy you describe is invalid. ML models (almost always) define a response distribution. For example, the extremely popular gradient boosting machine library XGBoost defines particular learning objectives (e.g. linear, logistic, Poisson, Cox).
  2. The implementation of linear regression and GLMs in Spark's MLlib is definitely based on standard statistical theory for linear models. For example, quoting directly from the comments for LinearRegressionWithSGD in pyspark/mllib/regression.py: "Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation f(weights) = 1/(2n) ||A weights - y||^2 which is the mean squared error." That is, this is a standard linear regression algorithm for a Gaussian response. The implementation of a particular algorithm may be optimised so that it works for very large datasets (see for example this excellent thread on "Why use gradient descent for linear regression, when a closed-form math solution is available?"), but the theory behind the algorithm is exactly the same.
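
To illustrate the last point, here is a minimal sketch in plain Python (not MLlib code). It fits the same one-feature linear model twice: once with the closed-form least squares solution, and once by minimising f(weights) = 1/(2n) ||A weights - y||^2 iteratively. The function names, the simulated data, the learning rate and the iteration count are all assumptions made for the illustration, and the iterative fit uses full-batch gradient descent rather than the stochastic variant, to keep the example deterministic. Minimising this squared-error objective is equivalent to maximising a Gaussian likelihood with constant variance, which is why both fits should land on essentially the same coefficients.

```python
import random


def fit_closed_form(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_gradient_descent(xs, ys, learning_rate=0.1, n_iter=5000):
    """Minimise f(w) = 1/(2n) * ||A w - y||^2 by full-batch gradient descent."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Residuals of the current fit: (A w - y).
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        # Gradient of f with respect to the intercept and the slope.
        grad_b0 = sum(residuals) / n
        grad_b1 = sum(r * x for r, x in zip(residuals, xs)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Simulated data with a Gaussian response (assumed true model: y = 2 + 3x + noise).
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + random.gauss(0.0, 0.5) for x in xs]

ols = fit_closed_form(xs, ys)
gd = fit_gradient_descent(xs, ys)
print("closed form:      intercept=%.4f slope=%.4f" % ols)
print("gradient descent: intercept=%.4f slope=%.4f" % gd)
```

The optimiser changes how the estimate is computed, not what is being estimated: both routines target the same least squares solution, so the statistical interpretation of the fitted coefficients is the same in either case.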