Is it OK to call model learning in machine learning an “estimator”?

Tags: estimation, machine learning, terminology

The Spark documentation contains this:

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.

But Wikipedia says:

An "estimator" or "point estimate" is a statistic (that is, a function of the data) that is used to infer the value of an unknown parameter in a statistical model. The parameter being estimated is sometimes called the estimand. It can be either finite-dimensional (in parametric and semi-parametric models), or infinite-dimensional (semi-parametric and non-parametric models). If the parameter is denoted $\theta$ then the estimator is traditionally written by adding a circumflex over the symbol: $\widehat {\theta }$. Being a function of the data, the estimator is itself a random variable; a particular realization of this random variable is called the "estimate". Sometimes the words "estimator" and "estimate" are used interchangeably.

I would have called the machine learning model itself the "estimator". Are the two definitions above close enough, or are there important distinctions to be aware of?

Best Answer

Spark's documentation refers to the internal names of Spark's procedures and objects, not to theoretical concepts. In statistics, an estimator is basically a procedure that, given the data, estimates something and returns an estimate.
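To make the naming in the Spark quote concrete, here is a minimal plain-Python sketch of the fit()/transform() pattern. The classes `MeanEstimator` and `MeanModel` are hypothetical and are not part of Spark's API; they only mimic the roles the documentation describes, and plain lists stand in for a DataFrame.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class MeanModel:
    """Hypothetical fitted object: stores the estimate, plays the 'Transformer' role."""
    mean: float

    def transform(self, data):
        # Use the stored estimate to centre new data.
        return [x - self.mean for x in data]


class MeanEstimator:
    """Hypothetical 'Estimator' in the Spark sense: an object with a fit() method."""

    def fit(self, data):
        # The statistical estimator applied here is the sample mean;
        # the number it returns is the estimate, which the model keeps.
        return MeanModel(mean=fmean(data))


model = MeanEstimator().fit([1.0, 2.0, 3.0, 6.0])
centred = model.transform([3.0, 4.0])
```

In this sketch the class name "Estimator" is a software role (something with fit()), while the statistical estimator is the rule "take the sample mean" that fit() happens to apply.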

It is hard to answer your question without knowing what you mean by "model". A statistical model describes the phenomenon of interest in terms of probability theory (note that there are also other kinds of models, e.g. mechanistic models). It is not yet an estimator of anything, just an abstract description of the problem. You need an estimator to estimate the parameters of your model (this also holds for so-called non-parametric models). In machine learning, people often refer to the algorithms used to estimate something as models, but in general a model is a theoretical description of the problem, and an estimator is a procedure that generates estimates of its parameters. An estimator is a general procedure, not its implementation, so a function foo() in software XYZ is not an estimator but rather an implementation of some estimator.
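The distinction between model, estimator, implementation, and estimate can be sketched in a few lines of Python. Everything here is an illustrative assumption: the model is taken to be independent Normal observations with mean 5 and standard deviation 2, the estimator is the sample mean, and `sample_mean` and `draw_sample` are hypothetical helper names.

```python
import random
from statistics import fmean


def sample_mean(data):
    """One implementation of the sample-mean estimator of a population mean."""
    return fmean(data)


def draw_sample(rng, n, mu=5.0, sigma=2.0):
    # Assumed model for illustration: independent Normal(mu, sigma) observations.
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(0)

# The same estimator applied to different samples gives different estimates,
# which is why the estimator itself is treated as a random variable.
estimates = [sample_mean(draw_sample(rng, n=50)) for _ in range(1000)]

spread = max(estimates) - min(estimates)
average_estimate = fmean(estimates)
```

Each element of `estimates` is one estimate; the rule "average the observations" is the estimator; `sample_mean` is merely one implementation of it, and the Normal assumption inside `draw_sample` is the model.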
