This is just a soft-assignment, or probabilistic, $k$-means.
I can't find anything on the web to help you more right now, but this is typically done with some kind of Expectation-Maximization method.
EDIT to answer the comments
The empirical idea is the following: assume first that your membership degrees are hard assignments, i.e. that they take values in $\{0,1\}$. In this case, you consider the within-class similarities only for those points that actually belong to a given class. The first term of your objective function may be written as
$$
\sum_{i = 1}^N \sum_{k = 1}^K \sum_{j = 1}^N u_{ik}d_{ij}u_{jk} = \sum_{k = 1}^K \sum_{i = 1}^N \sum_{j = 1}^N u_{ik}d_{ij}u_{jk} = \sum_{k = 1}^K \sum_{i \in \mathcal{C}_k} \sum_{j \in \mathcal{C}_k} d_{ij},
$$
where $\mathcal{C}_k$ denotes the set of indices of the elements (hard) assigned to cluster $k$. So, continuing, you can consider the within-class similarity of a cluster $k$ as $D_k := \frac{1}{2}\sum_{i \in \mathcal{C}_k}\sum_{j \in \mathcal{C}_k} d_{ij}$ (note the $\frac{1}{2}$, as every pairwise term is counted twice, but this is not really an issue).
Hence, in case of hard assignments, by minimizing your objective function, you are actually minimizing the total within class distance:
$$
J = \sum_{k}D_k.
$$
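To make the hard-assignment identity above concrete, here is a minimal NumPy sketch (the data, the squared-Euclidean choice of $d_{ij}$, and the labels are all hypothetical) checking that the triple sum over memberships equals the sum of within-cluster pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2
X = rng.normal(size=(N, 2))
# pairwise dissimilarities d_ij (squared Euclidean, an assumed choice)
d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# hard assignments: u[i, k] = 1 iff point i belongs to cluster k
labels = np.array([0, 0, 0, 1, 1, 1])
u = np.eye(K)[labels]

# objective as the triple sum: sum_i sum_k sum_j u_ik d_ij u_jk
J_triple = np.einsum('ik,ij,jk->', u, d, u)

# same quantity as sum over clusters of the within-cluster pairwise sums
# (each pair counted twice, i.e. sum_k 2 D_k with the 1/2 convention)
J_clusters = sum(d[np.ix_(labels == k, labels == k)].sum() for k in range(K))

assert np.isclose(J_triple, J_clusters)
```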
Assume now that we are dealing with soft assignments. Soft assignment is essentially a way of encoding how much uncertainty we have in our assignments. In other words, we are not sure whether a certain input point $i$ should belong to class $k$ or class $l$. To this end, I will write in terms of probability, as to me membership degrees are nothing other than probabilities (this has been a massive subject of discussion in my former group, which was heavy on fuzzy logic!)
Now let $p_{ik}$ denote the probability that point $i$ belongs to cluster $k$. I will write your objective function a little differently:
$$
J = \sum_{k = 1}^K \sum_{i = 1}^N p_{ik} D_{ik},
$$
where $D_{ik} = \sum_{j = 1}^N p_{jk}d_{ij}$ is the contribution of $i$ to cluster $k$.
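The soft-assignment objective $J = \sum_k \sum_i p_{ik} D_{ik}$ with $D_{ik} = \sum_j p_{jk} d_{ij}$ can be sketched in a few lines of NumPy (the random data and Dirichlet-sampled memberships are assumptions for illustration), checking that it matches the original quadratic form:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 2
X = rng.normal(size=(N, 2))
d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# soft memberships p[i, k]: each row sums to 1 (probabilistic reading)
p = rng.dirichlet(np.ones(K), size=N)

# D[i, k] = sum_j p_jk d_ij : contribution of point i to cluster k
D = d @ p
# J = sum_i sum_k p_ik D_ik
J = (p * D).sum()

# equivalent to the original triple sum sum_i sum_k sum_j p_ik d_ij p_jk
J_quad = np.einsum('ik,ij,jk->', p, d, p)
assert np.isclose(J, J_quad)
```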
In this case, if $p_{ik}$ is small (i.e. you don't think that point $i$ should be part of cluster $k$), then this point does not contribute much to the global within-class similarities: you give a lower weight to the distances between point $i$ and all the other points $j$ when looking at cluster $k$.
Similarly, if some point $j$ shouldn't be part of cluster $k$ (i.e. $p_{jk}$ is small), then it shouldn't contribute too much to the within-class distances either.
For the sake of understanding, consider two points $1$ and $2$ that you strongly believe belong to cluster $1$, with probability / belief $.9$, and only with belief $.1$ to cluster $2$. The distance $d_{12}$ is then counted with weight $.9 \times .9 = .81$ (a strong weight) in $D_1$, but only with weight $.1 \times .1 = .01$ in $D_2$.
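The two-point example above amounts to nothing more than multiplying the two membership probabilities; a tiny check:

```python
# beliefs of points 1 and 2 over the two clusters (from the example)
p1 = [0.9, 0.1]
p2 = [0.9, 0.1]

# weight on d_12 within each cluster's D_k is the product of memberships
w_cluster1 = p1[0] * p2[0]  # 0.9 * 0.9
w_cluster2 = p1[1] * p2[1]  # 0.1 * 0.1

assert abs(w_cluster1 - 0.81) < 1e-12
assert abs(w_cluster2 - 0.01) < 1e-12
```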
Usually, you want to minimize the within-class distances while maximizing the between-class distances.
Best Answer
This is a special case of the theorem where it is assumed that no $c_k=x_i$.
The original paper this formula appeared in is:
The article can be found here:
and the theorem is Theorem 3, (a) Case 1 on page 44.