You can see k-means as a special version of the EM algorithm, which may help a little.
Say you are estimating a multivariate normal distribution for each cluster, with the covariance matrix fixed to the identity matrix for all of them but a variable mean $\mu_i$, where $i$ is the cluster's index. Clearly, if the parameters $\{\mu_i\}$ are known, you can assign each point $p$ its maximum likelihood cluster (i.e. the $\mu_i$ for which the distance to $p$ is minimal). The EM algorithm for this problem is almost equivalent to k-means.
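A quick check of why the maximum likelihood assignment is just the nearest mean: with the covariance fixed to the identity, the log-density of a point $p$ under cluster $i$ depends on $\mu_i$ only through the squared distance,

$$\log \mathcal{N}(p \mid \mu_i, I) = -\tfrac{1}{2}\lVert p - \mu_i \rVert^2 - \tfrac{d}{2}\log(2\pi),$$

so $\arg\max_i \log \mathcal{N}(p \mid \mu_i, I) = \arg\min_i \lVert p - \mu_i \rVert$.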
The other way around, if you know which points belong to which cluster, you can estimate the optimal $\mu_i$. The closed form solution (which finds a global optimum) basically says that to find the maximum likelihood models $\{\hat\mu_i\}$ you integrate over all possible assignments of points to clusters. Since each point can go to either cluster, even with just thirty points and two clusters there are $2^{30} \approx 10^9$ such possible assignments, so this is infeasible to calculate.
Instead, we can take some guess at the hidden parameters (or the model parameters) and iterate the two steps, with the possibility of ending up in a local maximum. If you allow each cluster to take partial responsibility for a point, you end up with EM; if you just assign each point its optimal cluster, you get k-means.
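To make the parallel concrete, here is a minimal NumPy sketch (identity covariances and equal mixing weights assumed; empty clusters and convergence checks are ignored for brevity) in which the two variants differ only in the assignment step:

```python
import numpy as np

def cluster(X, k, hard=True, n_iter=100, seed=0):
    """Toy k-means / EM hybrid: identity covariances, equal weights.

    hard=True  -> each point fully assigned to its nearest mean (k-means).
    hard=False -> soft responsibilities proportional to the Gaussian
                  density exp(-||x - mu||^2 / 2) (EM for this model).
    """
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]  # initial guess for the means
    for _ in range(n_iter):
        # squared distances of every point to every mean: shape (n, k)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        if hard:
            # hard assignment: one-hot responsibility for the nearest mean
            r = np.zeros_like(d2)
            r[np.arange(len(X)), d2.argmin(1)] = 1.0
        else:
            # soft assignment: normalized Gaussian likelihoods
            # (row minimum subtracted for numerical stability; it cancels)
            r = np.exp(-0.5 * (d2 - d2.min(1, keepdims=True)))
            r /= r.sum(1, keepdims=True)
        # update step: responsibility-weighted means
        mu = (r.T @ X) / r.sum(0)[:, None]
    return mu, r
```

With `hard=True` this is exactly Lloyd's iteration; with `hard=False` it is EM for a mixture of unit-variance Gaussians with equal weights.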
So, the executive summary: in probabilistic terms, there is a global solution, but it requires you to enumerate all possible clusterings. Clearly, the same is true if you optimize an objective function directly: you could enumerate all solutions and take the one that maximizes the objective, but the number of candidates is exponential in the size of your data.
Use Gaussian Mixture Modeling (aka: EM Clustering) instead.
It allows different variances per cluster, depending on your model, and with the most complex models it even allows a different covariance matrix per cluster.
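For example, with scikit-learn's `GaussianMixture` (just one concrete interface; the data below is a random placeholder), the `covariance_type` parameter selects how flexible each cluster's shape is:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

# 'spherical': one variance per component (closest to k-means assumptions)
# 'diag':      a per-feature variance vector per component
# 'full':      a complete covariance matrix per component
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)   # hard labels, comparable to k-means output
resp = gmm.predict_proba(X)   # soft per-component responsibilities
```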
Within-cluster variance is a simple-to-understand measure of compactness (there are others, too).
So basically, the objective is to find the most compact partitioning of the data set into $k$ partitions.
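Written out in the standard formulation, with clusters $C_1,\dots,C_k$ and cluster means $\mu_i$, k-means minimizes the total within-cluster sum of squares:

$$\min_{C_1,\dots,C_k} \; \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert^2 .$$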
K-means, in the Lloyd version, actually originated from 1-d PCM (pulse-code modulation) data as far as I know. So assume you have a really bad telephone line, and someone is beeping a number of tones at you: how do you assign the frequencies to a scale of, say, 10 tones? Well, you can ask the other party to just send a bulk of data, store the frequencies in a list, and then try to split the list into 10 bins such that these bins are somewhat compact and separated. Even when the frequencies are distorted by the transmission, there is a good chance they will still be separable with this approach.
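A minimal sketch of that 1-d scenario, with a made-up tone scale and noise level purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
tones = np.linspace(400, 2200, 10)               # hypothetical 10-tone scale (Hz)
sent = rng.choice(tones, size=500)               # the bulk of data being sent
observed = sent + rng.normal(0, 30, size=500)    # distortion on the bad line

# Lloyd's iteration in 1-d: split the frequency list into 10 compact bins
centers = np.sort(rng.choice(observed, 10, replace=False))
for _ in range(50):
    nearest = np.abs(observed[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([observed[nearest == i].mean() if np.any(nearest == i)
                        else centers[i]           # leave empty bins in place
                        for i in range(10)])

print(np.sort(centers))  # recovered bin centers, close to the original tones
```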
This is also why k-means usually comes out best when you evaluate clusterings with any other measure of compactness: they are just two different measures of a similar concept.