My understanding was that the major difference between machine learning and statistical modelling is that the latter assumes a certain distribution for the data, and that this assumption determines both the modelling paradigm and the statistical results we obtain (e.g. p-values, F-statistics, t-statistics). In machine learning, by contrast, we don't worry about the distribution of the data and are more interested in prediction.
When I was going through the MLlib docs, I found that for linear regression we specify a distribution. But MLlib is a machine learning package. So I have the following questions:
1) Is my understanding of the difference between ML and statistical methods wrong?
2) Is Spark using statistical modelling for linear regression and GLMs?
Thanks!
Note: There are a lot of wonderful posts on the difference between machine learning and statistical methods, but this question is specifically about Spark MLlib.
Best Answer
The comments on LinearRegressionWithSGD in pyspark/mllib/regression.py say: "Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation f(weights) = 1/(2n) ||A weights - y||^2 which is the mean squared error."
In other words, this is a standard linear regression algorithm for a Gaussian response. The implementation of a particular algorithm might be optimised so that it works for very large datasets (see, for example, this excellent thread on "Why use gradient descent for linear regression, when a closed-form math solution is available?"), but the theory behind the algorithm is exactly the same.
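To make that last point concrete, here is a minimal sketch in plain Python. It is not Spark's implementation. It fits the same least squares objective, f(a, b) = 1/(2n) * sum((a + b*x - y)^2), in two ways: with the closed-form solution and with gradient descent. Minimising this objective is the same as maximising the likelihood under Gaussian noise, so both routes should arrive at the same estimate. Everything here is an assumption made for illustration: the function names closed_form_ols and gradient_descent_ols are hypothetical, the data is synthetic (true intercept 2, true slope 3, noise standard deviation 0.5), and I use full-batch gradient descent rather than the stochastic variant to keep the example deterministic.

```python
import random


def closed_form_ols(xs, ys):
    """Least squares fit of y = a + b*x using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def gradient_descent_ols(xs, ys, step=0.5, iterations=2000):
    """Minimise f(a, b) = 1/(2n) * sum((a + b*x - y)^2) by full-batch gradient descent."""
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(iterations):
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        grad_a = sum(residuals) / n                               # df/da
        grad_b = sum(r * x for r, x in zip(residuals, xs)) / n    # df/db
        a -= step * grad_a
        b -= step * grad_b
    return a, b


# Synthetic data with Gaussian noise (illustrative values only).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

a_cf, b_cf = closed_form_ols(xs, ys)
a_gd, b_gd = gradient_descent_ols(xs, ys)

print("closed form:      intercept=%.4f slope=%.4f" % (a_cf, b_cf))
print("gradient descent: intercept=%.4f slope=%.4f" % (a_gd, b_gd))
```

The two printed lines should agree to several decimal places. The optimiser is an implementation detail chosen for scale, while the model being fitted is ordinary linear regression either way.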