You can normalize the vectors in each cluster by their lengths and add them up, then normalize the sum. The result will be a unit vector in the direction of the centroid (a.k.a. prototype) vector. As far as the spherical k-means algorithm is concerned, the length of the centroid vector does not matter and is not used. This is because to calculate the cosine distance between each cluster member and the centroid, both vectors are normalized by their lengths. See the following excerpt from this article:
If you really need a centroid vector with a representative length, you can take the average of the lengths of the cluster members and multiply it by the unit centroid vector. But this would be entirely your choice and would have nothing to do with the k-means algorithm (you could use any other type of average, arithmetic or geometric, or just the length of the average vector, to compute the representative centroid length).
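The steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code from any particular spherical k-means implementation: the function names (`spherical_centroid`, `rescaled_centroid`) are made up for this example, vectors are plain lists of floats, and no cluster member is the zero vector.

```python
import math

def norm(v):
    """Euclidean length of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    n = norm(v)
    return [x / n for x in v]

def spherical_centroid(members):
    """Unit vector in the direction of the sum of the normalized members.

    Hypothetical helper: normalize each member, add them up, then
    normalize the sum. Fails if the normalized members cancel out exactly.
    """
    total = [0.0] * len(members[0])
    for v in members:
        for i, x in enumerate(unit(v)):
            total[i] += x
    return unit(total)

def rescaled_centroid(members):
    """Optional step: unit centroid times the arithmetic mean member length.

    This rescaling is a free choice and is not part of spherical k-means.
    """
    mean_len = sum(norm(v) for v in members) / len(members)
    return [mean_len * x for x in spherical_centroid(members)]

# Two members of very different lengths, at a right angle to each other.
cluster = [[3.0, 0.0], [0.0, 1.0]]
c = spherical_centroid(cluster)   # points along the 45-degree diagonal
r = rescaled_centroid(cluster)    # same direction, length (3 + 1) / 2 = 2
```

Because each member is normalized before summing, the longer vector does not pull the centroid toward itself: the result lies halfway between the two directions. Summing the raw vectors instead would give the direction of `[3, 1]`, which is much closer to the first member.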
The formula posted by Vijay Rajan is effectively the same (except giving a centroid vector of non-unit length), but note that in that formula too the vectors must be normalized to unit length before applying the formula. When calculated properly, the centroid does indeed "bisect" the angle between the vectors. (I don't currently have the forum privilege to make this a comment on their response.)
- When analyzing non-independent observations (e.g. two eyes of same person) in regression, is mixed effect model the way to go?
In short: Yes.
Mixed models are capable of modelling the dependence or structure introduced in the data by the study design. In your example of measuring both eyes, you can use a mixed model with a random effect for individual, since individuals have two eyes and thus cause the dependence by being in the data twice.
However, you still cannot consider pseudoreplications to be true replicates in a mixed model. In many cases you can make more effective use of them in a mixed model, but the number of true replicates hasn't magically increased by changing the type of model.
That being said, the repeated measures you are describing are very common in medical research and can be modelled just fine with a mixed model.
- Mixed effect models are all regression based. How would I go about doing the equivalent of t-test or mann whitney u test while accounting for non-independence issue?
You can easily perform the equivalent of a $t$-test using a (mixed) regression model:
library(lme4)
lmer(y ~ x + (1 | rand))
Where x is a two-level factor. The first group of x will be the intercept, and significance of x as an explanatory variable means there is a significant difference between the two groups.
As for the Mann-Whitney U test, I'm not sure you could do a test based on ranks with a mixed model. However, you probably don't need to, since you can use either a generalized linear mixed model (e.g. glmer(..., family = 'poisson')) or a non-linear mixed model (see the nlme package).
Although the nlme package is great, I would recommend that you not jump to non-linear models too quickly, because a GLMM is often easier to interpret and in many cases there is a logical choice for the theoretical distribution of the data-generating process in clinical research.
Alternatively, you could look into Bayesian hierarchical modelling, which is actually quite similar to mixed models, albeit a bit more difficult if you are not familiar with Bayesian statistics.
There are numerous models that try to capture dependence or hierarchy. I am not familiar with "Cluster-correlated robust estimates of variance", but a mixed model with nested structure is essentially a hierarchical model.
Ties (exact same distances) are not programming errors.
You could break them with a random generator, but that could cause an infinite loop in theory (or at least some extra iterations).
Or you simply leave the current assignment unchanged in such cases, and then you are fine.
Another option would be to always assign to the "first" cluster, which should also be stable.
It does not really make a difference as long as you make a deterministic decision.
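The two deterministic rules mentioned above (keep the current assignment on a tie, otherwise take the "first" cluster) can be combined in one small function. This is a minimal sketch in plain Python under assumptions of my own: the function name `assign` and its arguments are hypothetical, and a tie means exactly equal distances.

```python
def assign(distances, current=None):
    """Pick a medoid index for one point, breaking ties deterministically.

    distances: distances from the point to each medoid, in a fixed order.
    current:   index of the medoid the point is assigned to now, or None.
    """
    best = min(distances)
    # Indices of all medoids tied for the minimum (exact equality).
    tied = [i for i, d in enumerate(distances) if d == best]
    # Rule 1: if the current medoid is among the tied ones, change nothing.
    if current in tied:
        return current
    # Rule 2: otherwise fall back to the "first" (lowest-index) medoid.
    return tied[0]

# Medoids 1 and 2 are equally close to the point.
first_choice = assign([2.0, 1.0, 1.0])             # no current assignment
kept_choice = assign([2.0, 1.0, 1.0], current=2)   # already on a tied medoid
```

Since the same inputs always give the same output, repeated iterations cannot flip a tied point back and forth between clusters, which is the risk with random tie-breaking.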
Beware that the k-means-style approach to k-medoids works quite poorly.