Solved – Categorize statistical tests into univariate and multivariate methods

multivariate analysis, terminology, univariate

I am not sure whether the following tests/methods belong to the category of univariate or multivariate tests.

Univariate tests/methods:
t-test, ANOVA, ANCOVA, univariate linear regression (Y = a + bX)

Multivariate tests/methods:
MANOVA, multivariate linear regression (Y=a+bX+cX+…),
cluster analysis, partial least squares discriminant analysis (PLS-DA),
principal component analysis

Can anyone confirm whether these tests/methods are assigned to the right categories?

Best Answer

Usage of these terms is not completely consistent across statistical science and has also changed over time.

I'd call a method univariate if it can be applied to single variables and when applied to two or more variables yields separate results, meaning that the results for one variable are completely unaffected by which other variables you have or have not chosen. Means, medians and standard deviations are simple examples. Regression can be a univariate method in so far as a regression with one variable alone should return the mean of that variable as the predicted or fitted value. (It is an interesting and good small test of your favourite software to check that to be true.)
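The software check suggested above can be sketched as follows. This is a minimal illustration, assuming NumPy as the "favourite software"; the helper name `fit_intercept_only` and the data are made up for the example.

```python
import numpy as np

def fit_intercept_only(y):
    """Least-squares fit of y on a constant column; returns the fitted values."""
    y = np.asarray(y, dtype=float)
    design = np.ones((y.size, 1))  # design matrix holding only an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

y = np.array([2.0, 4.0, 4.0, 7.0, 13.0])  # made-up data
fitted = fit_intercept_only(y)

# Every fitted value should equal the sample mean of y.
print(fitted, y.mean())
```

If the fitted values differ from the mean by more than rounding error, something other than ordinary least squares is being fitted.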

I'd call a method bivariate if it requires two variables and/or is applied to two variables together, results depending on both variables. Correlation requires two variables. Regression in the sense of $Y = a + bX$ is applied to two variables together and is, in that sense only, a bivariate method, although in practice the term "bivariate" seems not often used for such a flavour of regression: it is unnecessary rather than incorrect. More generally, "bivariate" is often redundant as a term as it is generally obvious that you have two variables or a collection of bivariate results (e.g. a correlation matrix for several variables).

The term multivariate is the most interesting. Over several decades the term has morphed with the emphasis shifting from whether you have multiple variables all put into one method to whether you have multiple response (outcome, target, dependent) variables or not, which has come to seem more crucial. A case in point is that multiple regression is the name often given when there is one response only, while multivariate regression is an appropriate name only when there are two or more responses. The distinction often provokes small corrections on this list. Principal component analysis is an example of what would generally be described as a multivariate method. In principle, principal component analysis can be applied to just one variable, and returns just that variable as the single principal component; making that point has not often seemed interesting or necessary. A different point is that the term "multiple" seems to be fading away slowly, as in essence not worth flagging. (Yes, we have lots of predictors, so?)
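The remark about principal component analysis on a single variable can be checked directly. The sketch below assumes PCA is computed from the singular value decomposition of the column-centred data, so the variable comes back centred and possibly with its sign flipped; `pca_scores` and the data are hypothetical names and values used only for illustration.

```python
import numpy as np

def pca_scores(data):
    """Principal component scores via SVD of the column-centred data."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal axes (unit-length loading vectors).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt.T

x = np.array([[1.0], [3.0], [4.0], [8.0]])  # one made-up variable, four observations
scores = pca_scores(x)
centred_x = x - x.mean(axis=0)

# With one variable, the single component is that variable (centred, sign arbitrary).
print(np.allclose(np.abs(scores), np.abs(centred_x)))
```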

The division is one of practice rather than principle, and a test case is to ask statistical people around whether multiple regression is a multivariate method. I think you would get many no answers, with the exceptions mentioning multivariate regression as distinct. If you got a mix of yes and no, then that underlines my starting point that usage is not completely consistent across statistical science.
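The distinction between multiple and multivariate regression can be made concrete with a small sketch. It assumes ordinary least squares via NumPy and simulated data; the variable names are invented for the example. The only structural difference shown is the shape of the response: a vector in the multiple case, a matrix with one column per response in the multivariate case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Shared design matrix: an intercept column plus two simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Multiple regression: ONE response, several predictors.
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n)
beta_multiple, *_ = np.linalg.lstsq(X, y, rcond=None)      # shape (3,)

# Multivariate regression: TWO responses fitted together.
y2 = X @ np.array([0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=n)
Y = np.column_stack([y, y2])
B_multivariate, *_ = np.linalg.lstsq(X, Y, rcond=None)     # shape (3, 2)

print(beta_multiple.shape, B_multivariate.shape)
```

For plain least squares, each column of `B_multivariate` matches the coefficients from regressing that response on its own; the joint treatment matters mainly for inference that uses the covariance between responses.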

Furthermore, these terms are often used broadly and casually and not much depends on their exact meaning. A common example is that multivariate methods are often segregated off in texts or courses, and statistical people range from those who never use them in practice to those who use almost nothing else, but here "multivariate" is still just a label and some people might include regression in several flavours for a mix of reasons.

The example of t-tests for comparing means also shows how much depends on local terminology or is a matter of convention. You could say that the focus is on comparing two means for the same response variable, with a second variable indicating a group for unpaired data, or the response variable being organized in two columns for paired data. Depending on what your software does and what it calls things, that might be thought to involve one variable or two, but I don't see that anything depends on which label you use.

Related Solutions

Solved – Soft-thresholding vs. Lasso penalization

What I'll say holds for regression, but should be true for PLS also. So it's not a bijection: depending on how strongly you enforce the constraint in the $\ell_1$ formulation, you will have a variety of 'answers', while the second solution admits only $p$ possible answers (where $p$ is the number of variables). In other words, there are more solutions in the $\ell_1$ formulation than in the 'truncation' formulation.

Solved – Are PLS-DA and PLS-LDA the same

No, they are not the same.

In PLS-DA, the Y matrix consists of 0/1 indicator variables, where each column represents a class. To illustrate, assume you have 6 samples in 3 groups of 2 samples each; your Y matrix would then have 6 rows and 3 columns, with a 1 in the column of each sample's group and 0 elsewhere.
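As a minimal sketch of such an indicator matrix, assuming the six samples are ordered by group and the groups are coded 0, 1, 2 (both assumptions, as is the use of NumPy):

```python
import numpy as np

# Group membership of the 6 samples: two samples per group, three groups.
labels = np.array([0, 0, 1, 1, 2, 2])

# Row i of the identity matrix is the indicator vector for class i,
# so indexing by the labels yields one 0/1 row per sample.
Y = np.eye(3, dtype=int)[labels]
print(Y)
```

Each row contains exactly one 1, marking the class that sample belongs to.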

The results obtained from PLS-DA are in the same form as the Y matrix, regardless of the number of latent variables used.

In PLS-LDA, however, the X scores obtained from PLS (with the desired number of latent variables) are used as the input to LDA. It is very similar to PCA-LDA, where PCA is used for dimension reduction prior to LDA; PLS-LDA applies the same logic with PLS scores.

Reference: Chemometrics for Pattern Recognition, Richard G. Brereton
