Machine Learning – Maximizing Likelihood vs. Minimizing Cost: A Comprehensive Comparison

machine-learning, optimization

I keep coming across two different kinds of optimization:

  1. Cases where you maximize the likelihood of the data directly (for example CRF learning, or EM).
  2. Cases where you minimize some cost function (for example, fitting least squares)

I also have noticed that people use gradient methods for solving each of these two kinds of problems.

For maximization, the gradient update rule looks like this. The intuition is that, since you want to maximize, you climb the hill of the objective by stepping in the direction of the gradient.

$$\lambda_{i+1} = \lambda_i + \frac{\partial f(x)}{\partial \lambda_i}$$

For minimization, you want to minimize the cost function, so you subtract the gradient and roll down the hill of the objective.

$$\lambda_{i+1} = \lambda_i - \frac{\partial f(x)}{\partial \lambda_i}$$
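To make the two update rules concrete, here is a minimal sketch in Python (my own toy objective, not from the question) with an explicit step size eta; the equations above implicitly use a step size of 1. Gradient ascent on f and gradient descent on -f produce exactly the same iterates.

```python
# Toy illustration of the two update rules with an explicit step size eta.
# f(lam) = -(lam - 3)**2 is concave with its maximum at lam = 3,
# so -f(lam) is convex with its minimum at the same point.

def grad_f(lam):
    """Gradient of f(lam) = -(lam - 3)**2."""
    return -2.0 * (lam - 3.0)

eta = 0.1          # step size (learning rate)
lam_max = 0.0      # iterate for gradient *ascent* on f
lam_min = 0.0      # iterate for gradient *descent* on -f

for _ in range(100):
    lam_max = lam_max + eta * grad_f(lam_max)      # climb the hill of f
    lam_min = lam_min - eta * (-grad_f(lam_min))   # roll down the hill of -f

print(lam_max, lam_min)  # both approach 3.0
```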

It also seems like some optimization packages ask you to flip the sign of a maximization problem to get a minimization problem instead. Example:

> Note that since minimize only minimizes functions, the sign parameter is introduced to multiply the objective function (and its derivative) by -1 in order to perform a maximization.
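As a concrete sketch of that sign flip (my own made-up example, not taken from any package documentation): with scipy.optimize.minimize you maximize a likelihood by handing the optimizer the negated log-likelihood. Rather than a literal sign parameter, the sketch below simply negates inside the objective.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: 100 coin flips with true p = 0.7.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=100)

def neg_log_likelihood(p):
    """Negative Bernoulli log-likelihood; minimize() only minimizes,
    so we flip the sign to perform a maximization."""
    p = p[0]
    return -(np.sum(x) * np.log(p) + np.sum(1 - x) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=[0.5], bounds=[(1e-6, 1 - 1e-6)])
print(result.x[0], x.mean())  # the MLE matches the sample mean
```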

  1. Is minimization more canonical?
  2. Am I getting this right? That is, is my description of the machine learning landscape correct?
  3. How do I know when I should minimize a cost function versus maximize a likelihood (or a log-likelihood)?

My first thought was that maximizing the log-likelihood is for unsupervised learning (where you cannot construct a cost function because there are no labels), but CRF learning, which is supervised, maximizes the log-likelihood directly too.

Best Answer

You already know a lot. Two observations.

First, take linear regression. Minimizing the squared error turns out to be equivalent to maximizing the likelihood under the assumption of Gaussian noise. Loosely, one could say that minimizing the squared error is the intuitive method, while maximizing the likelihood is a more formal approach that allows proofs using properties of, for example, the normal distribution. The two can produce the same outcome.
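To spell out that equivalence, here is a standard derivation sketch, assuming i.i.d. Gaussian noise with known variance $\sigma^2$. For the linear model $y_i = x_i^\top \beta + \varepsilon_i$, the log-likelihood is

$$\log L(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2.$$

The first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $\beta$, so maximizing $\log L(\beta)$ over $\beta$ is exactly the same as minimizing the squared-error cost $\sum_i (y_i - x_i^\top \beta)^2$.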

Second, whether you minimize or maximize is, AFAIK, often arbitrary: minimizing the negative of a function is the same as maximizing the function itself. That so many routines are written in minimization mode is largely a matter of convention rather than anything deep, and for reasons of parsimony and readability it has become the standard.
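As a quick numerical sketch of both observations together (simulated data and parameter values of my own choosing): fitting a line by minimizing the squared error and by minimizing the negative Gaussian log-likelihood gives the same coefficients, up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated regression data (made up for illustration).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=200)

def squared_error(beta):
    """The cost-function view: sum of squared residuals."""
    return np.sum((y - X @ beta) ** 2)

def neg_gaussian_log_lik(beta, sigma=0.5):
    """The likelihood view, negated so that minimize() can handle it."""
    resid = y - X @ beta
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_cost = minimize(squared_error, x0=np.zeros(2)).x
beta_mle = minimize(neg_gaussian_log_lik, x0=np.zeros(2)).x
print(beta_cost, beta_mle)  # the two sets of coefficients agree
```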