I am studying the Perceptron algorithm. Right now I am trying to understand whether the choice of initialization of $\theta$ affects the algorithm's ability to converge. I have experimented by applying the algorithm to a linearly separable training set with different initial $\theta$ values, one of them zero and the others non-zero, and the algorithm converged every time. However, I can't come up with a way of generalizing this pattern mathematically, i.e. a proof of why the algorithm always converges.
[Math] Does choice of initial Theta affect Perceptron’s convergence
machine learning
Related Solutions
In terms of linear separability: using a bias allows the hyperplane that separates the feature space into two regions to avoid passing through the origin. Without a bias, any such hyperplane must pass through the origin, and that may prevent the separability we want.
Simple example: suppose we have two inputs $x$ and $y$ that can take on the values $0$ or $1$, and we want the output to be a $1$ when both inputs are $1$ (an $\land$ logic circuit, basically). This means that the separating hyperplane cannot go through the origin of the feature space (where $x$ and $y$ are $0$). But without using a bias (or, equivalently, by having a threshold of $0$), you can't move the hyperplane (the set of points $(x,y)$ for which $x\cdot w_1 + y \cdot w_2=0$ given some weights $w_1$ and $w_2$) away from the origin, because for any weights, the origin is a solution to this equation.
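To make this concrete, here is a minimal sketch (my own code, not from any particular library) that trains a perceptron on the $\land$ truth table with and without a bias feature. Without the bias, the weight vector can never produce a strictly negative score at the origin, so the algorithm never converges:

```python
import numpy as np

# AND truth table: the output is 1 only when both inputs are 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])  # labels in {-1, +1}

def train(X, y, use_bias, epochs=100):
    """Plain perceptron; use_bias=False forces the hyperplane through the origin."""
    if use_bias:
        X = np.hstack([X, np.ones((len(X), 1))])  # constant feature acts as the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or exactly on the boundary)
                w += yi * xi        # standard perceptron update
                mistakes += 1
        if mistakes == 0:           # a clean pass over the data: converged
            return w, True
    return w, False

for use_bias in (True, False):
    w, converged = train(X, y, use_bias)
    print(f"bias={use_bias}: converged={converged}, w={w}")
```

With the bias feature the loop finds a separator within a few epochs; without it, the point $(0,0)$ is misclassified on every pass, for exactly the reason given above.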
Having studied this, I would remark that there is quite a lot going on in the EM algorithm if it is unfamiliar to you. Yours is an interesting question - some of the presentations I have seen of the EM algorithm don't explicitly specify how imputation of missing data is handled.
Having seen the presentations in some of the more well-known machine learning/statistics textbooks and tutorial papers, I would say that the clearest way to see how imputation of missing values occurs in the EM algorithm is through expected sufficient statistics.
Missing data imputation using the EM algorithm.
You are entirely correct that the EM algorithm is for maximum-likelihood estimation in the presence of latent variables (which can be defined to be missing data), and that imputation/inference of these latent variables is a subroutine for parameter estimation.
And yes, you are correct that you use current estimates of the parameters $\theta^{k}$ (or an initial guess when $k=0$) to impute what the latent variables might be.
There are two senses in which the imputation of the missing data occurs:
In the E-step, when you compute the conditional distribution of the missing data $Z$, given the observed variables $X = x$, and current parameter estimates $\theta = \theta^{k}$, that is $p_{Z | X}(Z | X = x, \theta = \theta^{k})$. This is imputation in a probabilistic sense, and using Bayesian semantics, you are computing a posterior distribution on the values that missing data $Z$ can take given observed variables and current parameter estimates.
However, if you mean imputation in the sense of "filling in the missing data" with an appropriate point prediction, that occurs both in the second part of the E-step and in the M-step.
In the E-step, when you come to compute the expected complete log-likelihood under the conditional distribution $p_{Z | X}(Z | X = x, \theta = \theta^{k})$, you will need to compute expected sufficient statistics. This requires not only filling in the missing values $Z$ with the mean of the conditional distribution/posterior $p_{Z | X}(Z | X = x, \theta = \theta^{k})$, but also the variance of that posterior.
In the M-step, these expected sufficient statistics are used to update your parameters.
If you are unfamiliar with the expected sufficient statistics viewpoint, I refer you to Section 11.6, "Fitting models with missing data" (pp. 374-376), of Machine Learning: A Probabilistic Perspective by Kevin Murphy, which shows how to deal with the specific context of missing-data EM in the multivariate Normal case rather than the bivariate case (the multivariate case seems to be the norm in most of the main ML textbooks).
Lastly, I would remark that your confusion may be arising from a faulty partitioning of your data into missing and observed variables; and the formulae you have specified in brackets are incorrect due to this faulty partitioning. Again, I refer you to Murphy, but here is the gist of what happens for multivariate Normal distributions. The exposition I've added is from my handwritten notes as Murphy is not entirely clear, in my view, on how the partitioning works.
Our data matrix is, say, $\mathbf{Y} \in \mathbb{R}^{D \times N}$, where the $n$th observation of $D$ input variables is a $D$-dimensional vector $y_n \in \mathbb{R}^D$. The case where $y_n$ is a random vector with a multivariate Normal distribution, $y_n \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^{D}$ and $\Sigma \in \mathbb{R}^{D \times D}$, is often used to show how this works analytically.
Each $y_n$ will have missing components, which we denote $z_n$, and observed components, which we denote $x_n$ (you can think of $x_n$ and $z_n$ as sub-vectors of $y_n$); that is, we partition each $y_n$ into an observed sub-vector $x_n$ and a missing-data sub-vector $z_n$. We assume that $x_n$ and $z_n$ have a joint Normal distribution, meaning that we can specify the following:
$$y_n = \begin{bmatrix} x_n \\ z_n \end{bmatrix} \sim N \left( \begin{bmatrix} \mu_{n, x} \\ \mu_{n, z} \end{bmatrix}, \begin{bmatrix} \Sigma_{n, xx} & \Sigma_{n, xz} \\ \Sigma_{n, zx} & \Sigma_{n, zz} \end{bmatrix} \right)$$
where $\mu_{n, x}$ and $\mu_{n, z}$ are sub-vectors of $\mu$, and $\Sigma_{n, xx}, \Sigma_{n, xz}, \Sigma_{n, zx}, \Sigma_{n, zz}$ are sub-matrices of $\Sigma$, using the dimensions defined by the partition of $y_n$ into $x_n$ and $z_n$.
We can now use standard formulae for multivariate Normal conditional distributions to specify the posterior used in the E-step:
$$z_n \mid x_n, \mu, \Sigma \sim N(m_n, V_n)$$
where the posterior mean and variance are:
$$m_n = \mu_{n, z} + \Sigma_{n, zx}\Sigma_{n, xx}^{-1}(x_n - \mu_{n, x})$$
$$V_n = \Sigma_{n, zz} - \Sigma_{n, zx}\Sigma_{n, xx}^{-1}\Sigma_{n, xz}$$
We now use the posterior mean $m_n$ and variance $V_n$ to compute the expected sufficient statistics $\mathbb{E}[y_n]$ and $\mathbb{E}[y_n y_n^T]$ (the observed components of $y_n$ are left as they are; the missing components are filled in using $m_n$ and $V_n$). These are used to compute the $Q(\theta, \theta^{k})$ function in the E-step, and to compute the maximum-likelihood parameter updates $\mu^{k+1}$ and $\Sigma^{k+1}$ in the M-step.
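To make this concrete, here is a minimal numpy sketch of one EM iteration under the formulae above (the function name and the NaN encoding of missing entries are my own choices, not from any particular library):

```python
import numpy as np

def em_step_mvn(Y, mu, Sigma):
    """One EM update for y_n ~ N(mu, Sigma) with missing entries marked as NaN.
    Y has shape (N, D), one observation y_n per row."""
    N, D = Y.shape
    E_y = np.zeros((N, D))      # E[y_n]
    E_yy = np.zeros((N, D, D))  # E[y_n y_n^T]
    for n in range(N):
        obs = ~np.isnan(Y[n])   # positions of the observed sub-vector x_n
        mis = np.isnan(Y[n])    # positions of the missing sub-vector z_n
        x_n = Y[n, obs]
        # E-step posterior p(z_n | x_n, mu, Sigma) = N(m_n, V_n):
        K = Sigma[np.ix_(mis, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
        m_n = mu[mis] + K @ (x_n - mu[obs])
        V_n = Sigma[np.ix_(mis, mis)] - K @ Sigma[np.ix_(obs, mis)]
        # Expected sufficient statistics: observed entries stay as they are,
        # missing entries are filled with the posterior mean ...
        E_y[n, obs] = x_n
        E_y[n, mis] = m_n
        # ... and E[y_n y_n^T] also needs the posterior variance in the
        # missing-missing block.
        E_yy[n] = np.outer(E_y[n], E_y[n])
        E_yy[n][np.ix_(mis, mis)] += V_n
    # M-step: maximum-likelihood updates from the expected sufficient statistics.
    mu_new = E_y.mean(axis=0)
    Sigma_new = E_yy.mean(axis=0) - np.outer(mu_new, mu_new)
    return mu_new, Sigma_new
```

Iterating `em_step_mvn` from an initial guess $(\mu^{0}, \Sigma^{0})$ until the parameters stop changing gives the maximum-likelihood estimates, and the final `E_y` rows are the imputed data.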
Further references.
Whilst the expected sufficient statistics viewpoint, in my opinion, is excellent for seeing how data imputation works in the EM algorithm, it is not as appropriate for understanding why we use the expected complete log-likelihood in the first place, nor why we can iteratively maximise it as a surrogate for the observed-data/incomplete log-likelihood. The clearest exposition I have seen on this aspect of the EM algorithm is in Chapter 11 of an unpublished book called "Probabilistic Graphical Models" by the legendary Michael Jordan, which you can find here.
Best Answer
As long as the data set is linearly separable, the perceptron algorithm will always converge, making at most $ \frac{R^2}{\gamma^2} $ updates, where $R$ is a bound on the norm of the training points and $\gamma$ is the margin achieved by some unit-norm separator. The initialization does not matter.
The proof is a standard one, explained in any university ML course (not super trivial to come up with, but simple to understand by reading the actual proof). See these notes, for example:
http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf
I found it by googling "perceptron convergence proof", hope it helps.
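In case the link goes stale: the skeleton of the argument in those notes assumes $\theta$ is initialized to zero, $\|x_i\| \le R$ for all training points, and some unit-norm $\theta^*$ separating the data with margin $\gamma$, i.e. $y_i (x_i \cdot \theta^*) \ge \gamma$ for all $i$. Each update on a mistake is $\theta_k = \theta_{k-1} + y_i x_i$, and after the $k$-th mistake two bounds hold:

$$\theta_k \cdot \theta^* = \theta_{k-1} \cdot \theta^* + y_i (x_i \cdot \theta^*) \geq \theta_{k-1} \cdot \theta^* + \gamma \geq k\gamma$$

$$\|\theta_k\|^2 = \|\theta_{k-1}\|^2 + 2 y_i (\theta_{k-1} \cdot x_i) + \|x_i\|^2 \leq \|\theta_{k-1}\|^2 + R^2 \leq kR^2$$

where the middle term in the second line is non-positive precisely because the point was misclassified. Combining the two via Cauchy-Schwarz, $k\gamma \leq \theta_k \cdot \theta^* \leq \|\theta_k\| \leq \sqrt{k} R$, hence $k \leq R^2/\gamma^2$.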
In fact, the beauty of the proof above is that it shows that if you initialize $\theta$ to zero, the number of errors is bounded by $R^2/\gamma^2$. So, in a way, it shows that if you start with the zero vector and keep updating, you converge after a finite number of mistakes.
Whether the initialization affects the speed of convergence depends on the data set and the order in which you see your points.
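For what it's worth, the experiment you describe is easy to script. Here is a small sketch (the synthetic data and the particular initializations are invented for illustration) that counts the updates needed from zero and non-zero starting points:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linearly separable toy set: label points with a fixed hyperplane and
# keep only points with a comfortable margin, so gamma is not tiny.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
scores = X @ true_w
keep = np.abs(scores) > 0.5
X, y = X[keep], np.sign(scores[keep])

def perceptron(X, y, theta0, max_epochs=1000):
    """Run the perceptron from theta0; return theta and the number of updates."""
    theta = theta0.astype(float)
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (theta @ xi) <= 0:  # mistake: apply the update rule
                theta += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:               # a full clean pass: converged
            return theta, updates
    return theta, updates

for theta0 in (np.zeros(3), np.ones(3), 10 * rng.normal(size=3)):
    _, n_updates = perceptron(X, y, theta0)
    print(f"theta0 = {np.round(theta0, 2)} -> converged after {n_updates} updates")
```

On separable data every initialization converges; only the update count changes with the starting point and the order in which the points are visited.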