Machine Learning – Misuse of ‘Conditioned On’ and ‘Parametrized By’

Tags: machine-learning, terminology

Say $X$ depends on $\alpha$. Rigorously speaking,

  • if $X$ and $\alpha$ are both random variables, we could write $p(X\mid\alpha)$;

  • however, if $X$ is a random variable and $\alpha$ is a parameter, we have to write $p(X; \alpha)$.
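As a concrete illustration of the two notations (my own hypothetical example, not part of the original question), consider a Gaussian location model for $X$:

  • treating $\alpha$ as a fixed, unknown parameter, one writes $p(x;\alpha)=\mathcal{N}(x;\alpha,1)$, and no distribution is placed on $\alpha$ at all;

  • treating $\alpha$ as a random variable with a prior $p(\alpha)$, one writes $p(x\mid\alpha)=\mathcal{N}(x;\alpha,1)$ as a genuine conditional of the joint $p(x,\alpha)=p(x\mid\alpha)\,p(\alpha)$.

The functional form of the density of $x$ at a given value of $\alpha$ is identical in both cases; only the status of $\alpha$ differs.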

I have noticed several times that the machine learning community seems to ignore this difference and abuse the terminology.

Take the famous LDA model, for example, where $\alpha$ is the Dirichlet parameter rather than a random variable.

[Image: equation from the LDA paper, written with $p(\theta\mid\alpha)$]
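For reference, the per-document joint distribution in the original LDA paper (Blei, Ng and Jordan, 2003) is usually written along the lines of

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \;=\; p(\theta\mid\alpha)\prod_{n=1}^{N} p(z_n\mid\theta)\,p(w_n\mid z_n,\beta),$$

where $\alpha$ and $\beta$ are corpus-level parameters rather than random variables.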

Shouldn't it be $p(\theta;\alpha)$? Yet many people, including the original authors of the LDA paper, write it as $p(\theta\mid\alpha)$.

Best Answer

I think this is more about Bayesian/non-Bayesian statistics than about machine learning vs. statistics.

In Bayesian statistics, parameters are modelled as random variables, too. If you have a joint distribution for $X,\alpha$, then $p(X \mid \alpha)$ is a conditional distribution, whatever the physical interpretation of $X$ and $\alpha$. If one considers only fixed values of $\alpha$, or otherwise does not put a probability distribution over $\alpha$, the computations performed with $p(X; \alpha)$ are exactly the same as those one would perform with $p(X \mid \alpha)$ in a model with a prior $p(\alpha)$.

Furthermore, one can at any point decide to extend a model with fixed values of $\alpha$ into one with a prior distribution over $\alpha$. To me at least, it seems strange that the notation for the distribution-given-$\alpha$ should change at that point, which is why some Bayesians prefer the conditioning notation even when not all parameters have (yet?) been defined as random variables.
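A minimal numerical sketch of that point, using a toy Gaussian model of my own rather than anything from the question (and assuming scipy is available): the function that evaluates the density of the data at a given $\alpha$ is literally the same whether $\alpha$ is read as a fixed parameter or as a conditioning variable, and going Bayesian only multiplies in a prior factor.

    import numpy as np
    from scipy.stats import norm

    x = np.array([0.3, -1.2, 0.8])   # observed data

    # Log-density of the data at a given alpha.  The same function serves as
    # p(x; alpha) (alpha a fixed parameter) and as p(x | alpha) (alpha a
    # random variable we condition on) -- the numbers do not change.
    def log_lik(alpha):
        return norm.logpdf(x, loc=alpha, scale=1.0).sum()

    print(log_lik(0.5))              # "non-Bayesian" computation, p(x; 0.5)

    # Extending the model with a prior p(alpha) = N(0, 1) only adds a factor;
    # the likelihood term above is reused unchanged.
    def log_joint(alpha):
        return log_lik(alpha) + norm.logpdf(alpha, loc=0.0, scale=1.0)

    print(log_joint(0.5))            # log p(x | 0.5) + log p(0.5)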

The argument about whether one can write $p(X ; \alpha)$ as $p(X \mid \alpha)$ has also arisen in the comments of Andrew Gelman's blog post Misunderstanding the $p$-value. For example, Larry Wasserman held the opinion that $\mid$ is not allowed when there is no conditioning from a joint distribution, while Andrew Gelman held the opposite opinion.
