The difference between multimodal and multivariate

Tags: distributions, mode, multivariate analysis, terminology

Can somebody explain to me the difference between "multimodal" and "multivariate"?

For example, I have a dataset which contains different information. All information objects are connected together by a timestamp. Is this dataset multimodal or multivariate? If I create an algorithm for clustering these data, should I call this algorithm multimodal or multivariate?

Best Answer

Put very simply, "multi-modal" refers to a dataset (variable) whose distribution has more than one mode (peak), whereas "multi-variate" refers to a dataset in which there is more than one variable.

Here is a simple demonstration, coded with R:

set.seed(5104)
# multimodal: equal mixture of two normals centered at -2 and 2
x1mm = c(rnorm(50, mean=-2), rnorm(50, mean=2))
# unimodal: a single normal
x1um = rnorm(100, mean=0.5, sd=sqrt(3))
plot(density(x1mm), main="multimodal data")
plot(density(x1um), main="unimodal data")

# bivariate: a second variable y that depends linearly on x1um, plus noise
y = .5*x1um + rnorm(100)
plot(x1um, y, xlab="X", ylab="Y", main="bivariate data")

That's the gist of it. When you have response and regressor variables, and you want to fit a model that maps them, the use of "multivariate" depends on the nature of the mapping. When there is only one response and one covariate, we say this is simple regression; if there is more than one covariate, we say it is multiple regression; and if there is more than one response variable, we call it multivariate regression. In your case, I gather you are interested in clustering / unsupervised learning, so these distinctions don't really apply.
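The three regression terms above can be sketched in R as follows. This is only an illustration on simulated data: the variable names (x1, x2, y1, y2) and the coefficients are made up for the example and are not part of the question's dataset.

```r
# Simulated data; all names and coefficients here are hypothetical.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- 0.5 * x1 + rnorm(n)
y2 <- -0.3 * x1 + 0.8 * x2 + rnorm(n)

# Simple regression: one response, one covariate
fit_simple <- lm(y1 ~ x1)

# Multiple regression: one response, more than one covariate
fit_multiple <- lm(y1 ~ x1 + x2)

# Multivariate regression: more than one response, fitted together
fit_multivariate <- lm(cbind(y1, y2) ~ x1 + x2)

class(fit_multivariate)      # includes "mlm" when the response is a matrix
dim(coef(fit_multivariate))  # one column of coefficients per response
```

Note that only the last fit is "multivariate" in the regression sense; the number of covariates on the right-hand side does not change that.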

However, the clustering aspect makes this a little more interesting. In order to cluster successfully, you generally want your data to be multimodal in the full data space. Clustering identifies the latent groupings by finding a partition that separates the data into unimodal subsets, each more coherent than the original (unpartitioned) superset.
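As a rough illustration of that idea, here is a minimal sketch that partitions a bimodal sample with k-means. The choice of k-means and of two clusters is an assumption made for the example; it is not necessarily the method best suited to the dataset in the question.

```r
set.seed(5104)
# Bimodal by construction: equal mixture of normals centered at -2 and 2
x <- c(rnorm(50, mean = -2), rnorm(50, mean = 2))

# Partition into two clusters (k = 2 is assumed known here)
km <- kmeans(x, centers = 2, nstart = 10)

sort(km$centers[, 1])   # one center per mode, on either side of the overall mean
table(km$cluster)       # cluster sizes

# Each subset is more coherent (less spread out) than the full sample
within_sd <- tapply(x, km$cluster, sd)
within_sd
sd(x)
```

Plotting density(x[km$cluster == 1]) and density(x[km$cluster == 2]) should show each subset with a single dominant peak, in contrast to the two peaks in density(x).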