Solved – high dimensional data in data mining

data mining, high-dimensional

I am currently studying the effect of high dimensionality on clustering. For my experiments I want to use the KDD dataset from UCI, which contains 42 features.
Is KDD high-dimensional data? More generally, is there a threshold on the number of dimensions beyond which data can be called high-dimensional?

Best Answer

One way to define high dimension is the case where there are more regressors/predictors than observations. If $p$ denotes the number of regressors and $n$ the number of observations, high dimension is when $p > n$, or even $p \gg n$. If I remember correctly, penalized regressions (ridge, lasso) were introduced partly to tackle this issue (classical OLS does not give a unique solution in this setting).

Edit: As asked, here are some details on what I said. I apologize that what I'm describing is more relevant in a supervised framework. This definition (which can of course be seen as subjective) comes from the following. Consider a classical linear regression $Y = X\beta + \epsilon$. The OLS estimator is $\hat\beta = (X'X)^{-1}X'Y$, which is valid only if $(X'X)^{-1}$ exists. But if $X$ has dimensions $(n, p)$ with $n < p$, then the $p \times p$ matrix $X'X$ has rank at most $n$, so it is not full rank and cannot be inverted, and $\hat\beta$ is no longer defined as above. So switching from $n > p$ to $p > n$ is not trivial. Now, multiple linear regression is widely used (especially in econometrics, but also in epidemiology when studying genes...), so I think this is a convenient way to define "high dimension" (though admittedly a subjective one), because you need to do something different from what you usually do.
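The rank argument can be checked numerically. The following is only a minimal sketch using NumPy: the random Gaussian design matrix, the sizes, and the helper names `gram_rank` and `ridge_estimate` are assumptions made for illustration, not something from the original answer. It shows that $X'X$ has full rank $p$ when $n > p$ but rank at most $n$ when $n < p$, and that adding a ridge penalty $\lambda I$ with $\lambda > 0$ makes the system solvable again.

```python
import numpy as np


def gram_rank(n, p, seed=0):
    """Rank of X'X for a random n-by-p Gaussian design matrix (illustrative data)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    return np.linalg.matrix_rank(X.T @ X)


def ridge_estimate(X, y, lam):
    """Ridge estimator: solves (X'X + lam * I) beta = X'y, for lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# n > p: X'X is p-by-p and (almost surely) full rank, so OLS is well defined.
print("n=50, p=10, rank of X'X:", gram_rank(n=50, p=10))

# n < p: X'X is still p-by-p, but its rank cannot exceed n, so it is singular.
print("n=10, p=50, rank of X'X:", gram_rank(n=10, p=50))

# With a ridge penalty the p > n system has a unique solution again.
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 50))
y = rng.standard_normal(10)
beta = ridge_estimate(X, y, lam=1.0)
print("ridge coefficient vector shape:", beta.shape)
```

The value `lam=1.0` is arbitrary here; in practice the penalty is usually chosen by cross-validation.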

For clustering it may be seen differently: with k-nearest neighbors, the curse of dimensionality sets in long before $p > n$...
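One common way to illustrate this is distance concentration: as $p$ grows, the nearest and farthest neighbors of a point end up at almost the same distance, even though $n$ is still much larger than $p$. The sketch below is a rough illustration under assumptions that are not in the original answer: points drawn uniformly in $[0,1]^p$, Euclidean distance, and a hypothetical helper `relative_contrast` that computes (max distance - min distance) / min distance from a single query point. The value $p = 42$ is included only to echo the feature count in the question; real data with correlated features can behave quite differently from uniform noise.

```python
import numpy as np


def relative_contrast(n, p, seed=0):
    """(max - min) / min of Euclidean distances from one query to n uniform points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, p))   # n points uniform in the unit cube [0, 1]^p
    query = rng.random(p)         # one query point from the same distribution
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# n is fixed at 1000, so n > p (or n = p) in every case below.
for p in (2, 42, 1000):
    print(f"p={p:5d}  relative contrast = {relative_contrast(n=1000, p=p):.3f}")
```

A shrinking contrast means that "nearest" carries less and less information, which is the sense in which distance-based methods can suffer well before $p$ exceeds $n$.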

