Currently I am studying the effect of high dimensionality on clustering. For experimental purposes I want to use the KDD dataset from UCI, which contains 42 features.
Is KDD high-dimensional data, or what is the threshold number of dimensions beyond which we can say data is high dimensional?
Solved – high dimensional data in data mining
Related Solutions
First some background:
R is a good choice, with many clustering methods available across different packages. These include hierarchical clustering, partitioning clustering, model-based clustering, and cluster-wise regression.
Connectivity based clustering or Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In this method, at different distances, different clusters will form, which can be represented using a dendrogram.
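For example, agglomerative hierarchical clustering is available in base R via hclust; here is a minimal sketch on the iris data (the column choice and the complete-linkage method are illustrative, not prescribed by the answer):

d <- dist(iris[,1:4])                 # Euclidean distance matrix over the four measurements
hc <- hclust(d, method="complete")    # complete-linkage agglomeration
plot(hc)                              # dendrogram of the merge hierarchy
groups <- cutree(hc, k=3)             # cut the tree into 3 clusters
table(groups, iris$Species)           # compare against the true species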
Centroid-based clustering: In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances to the cluster centers are minimized.
km <- kmeans(iris[,1:4], 3)                         # k-means with k = 3 on the four iris measurements
plot(iris[,1], iris[,2], col=km$cluster)            # color points by assigned cluster
points(km$centers[,c(1,2)], col=1:3, pch=8, cex=2)  # overlay the cluster centers
table(km$cluster, iris$Species)                     # compare clusters against the true species
Distribution-based clustering: the clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects most likely belonging to the same distribution. A key problem with this approach is overfitting, unless constraints are put on the model complexity.
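A minimal sketch of distribution-based (model-based) clustering in R; the mclust package is my choice here, the answer itself does not prescribe one. Its BIC-based model selection helps guard against the overfitting mentioned above:

library(mclust)
fit <- Mclust(iris[,1:4])             # fits Gaussian mixtures, selects the number of components by BIC
summary(fit)
table(fit$classification, iris$Species)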
Density-based clustering: in density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise and border points.
In density-based clustering, a cluster extends along the density distribution. Two parameters are important: "eps" defines the radius of the neighborhood of each point, and "minpts" is the minimum number of neighbors within the "eps" radius. The basic algorithm, called DBSCAN, proceeds as follows:
First scan: For each point, compute the distance to all other points. Increment the point's neighbor count for each distance smaller than "eps".
Second scan: For each point, mark it as a core point if its neighbor count is greater than "minpts".
Third scan: For each core point, if it is not already assigned to a cluster, create a new cluster and assign it to this core point as well as to all of its neighbors within the "eps" radius.
Unlike other clustering methods, density-based clustering can leave some points as outliers (data points that do not belong to any cluster). On the other hand, it can detect clusters of arbitrary shape (they do not have to be circular at all).
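A naive base-R sketch of those three scans (my own illustration of the simplified description above, not the full DBSCAN expansion logic; the function name is hypothetical):

naive_dbscan <- function(x, eps, minpts) {
  D <- as.matrix(dist(x))               # first scan: all pairwise distances
  ncount <- rowSums(D <= eps) - 1       # neighbors within eps, excluding the point itself
  core <- ncount > minpts               # second scan: mark core points
  cl <- rep(0L, nrow(x))                # 0 = unassigned, i.e. outlier
  next_id <- 0L
  for (i in which(core)) {              # third scan: grow clusters from core points
    if (cl[i] == 0L) {
      next_id <- next_id + 1L           # core point not yet assigned: open a new cluster
      cl[i] <- next_id
    }
    nb <- which(D[i,] <= eps)           # neighbors within eps (includes i)
    cl[nb[cl[nb] == 0L]] <- cl[i]       # assign unassigned neighbors to this cluster
  }
  cl
}

In practice you would use an existing implementation, such as the one in the fpc package: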
library(fpc)
# sampleiris was undefined in the original; here we assume it is a random
# subsample of the iris data
set.seed(123)
sampleiris <- iris[sample(nrow(iris), 40), ]
# eps is radius of neighborhood, MinPts is no of neighbors
# within eps
cluster <- dbscan(sampleiris[,-5], eps=0.6, MinPts=4)
plot(cluster, sampleiris)
plot(cluster, sampleiris[,c(1,4)])
# Notice points in cluster 0 are unassigned outliers
table(cluster$cluster, sampleiris$Species)
With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering.
For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering, which also looks for arbitrarily rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes. Examples of such clustering algorithms are PROCLUS, P3C, and STATPC.
To your question:
The package ‘orclus’ is available in R to perform subspace clustering and classification. The following example is from the package manual:
# definition of a function for parameterized data simulation
sim.orclus <- function(k = 3, nk = 100, d = 10, l = 4, sd.cl = 0.05, sd.rest = 1, locshift = 1){
### input parameters for data generation
# k         number of clusters
# nk        observations per cluster
# d         original dimension of the data
# l         subspace dimension where the clusters are concentrated
# sd.cl     (univariate) standard deviations for data generation (within cluster subspace concentration)
# sd.rest   standard deviations in the remaining space
# locshift  parameter of a uniform distribution to sample different cluster means
x <- NULL
for(i in 1:k){
# cluster centers
apts <- locshift*matrix(runif(l*k), ncol = l)
# sample points in original space
xi.original <- cbind(matrix(rnorm(nk * l, sd = sd.cl), ncol=l) + matrix(rep(apts[i,], nk), ncol = l, byrow = TRUE),
matrix(rnorm(nk * (d-l), sd = sd.rest), ncol = (d-l)))
# subspace generation
sym.mat <- matrix(nrow=d, ncol=d)
for(m in 1:d){
for(n in 1:m){
sym.mat[m,n] <- sym.mat[n,m] <- runif(1)
}
}
subspace <- eigen(sym.mat)$vectors
# transformation
xi.transformed <- xi.original %*% subspace
x <- rbind(x, xi.transformed)
}
clids <- rep(1:k, each = nk)
result <- list(x = x, cluster = clids)
return(result)
}
# simulate data of 2 classes where class 1 consists of 2 subclasses
simdata <- sim.orclus(k = 3, nk = 200, d = 15, l = 4, sd.cl = 0.05, sd.rest = 1, locshift = 1)
x <- simdata$x
y <- c(rep(1,400), rep(2,200))
res <- orclass(x, y, k = 3, l = 4, k0 = 15, a = 0.75)
res
# compare results
table(res$predict.train$class, y)
You may also be interested in HDclassif (An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data).
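A minimal sketch of its use on the simulated data above (hddc is the package's clustering function; the parameter choice here is illustrative):

library(HDclassif)
fit <- hddc(simdata$x, K = 3)         # model-based clustering designed for high-dimensional data
table(fit$class, simdata$cluster)     # compare against the simulated cluster labels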
Any algorithm that relies on a distance metric in a high-dimensional space will suffer from the curse of dimensionality. In effect, all your observations are going to appear "far" from one another, with relatively little variation in the distance measurement, making the clustering very weak. You'd be much better off selecting informative features, and using only those to construct your distance metric for k-means. The ideal number of features is very much problem dependent, so there's no set guideline on a pre-defined number.
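This concentration of distances is easy to see empirically; a small sketch of my own (uniform random data, but any distribution shows a similar pattern):

set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(100 * d), ncol = d)      # 100 random points in d dimensions
  dists <- as.vector(dist(x))                # all pairwise Euclidean distances
  cat("d =", d, " sd/mean of distances =", round(sd(dists)/mean(dists), 3), "\n")
}

The ratio sd/mean of the pairwise distances shrinks as d grows, which is exactly the "relatively little variation in the distance measurement" described above.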
Best Answer
A way to see high dimension is when there are more regressors/predictors than observations. If $p$ denotes the number of regressors and $n$ the number of observations, high dimension is when $p > n$, or even $p \gg n$. If I remember well, penalized regressions (ridge, lasso) were introduced partly to tackle this issue (classical OLS in this setting does not give a unique solution).
Edit: As asked, some details about what I said. And I apologize for the fact that what I am talking about is more relevant in a supervised framework. This definition (which can of course be considered subjective) is a consequence of the following. If you consider a classical linear regression $Y = X\beta + \epsilon$, then you have the OLS estimator $\hat\beta = (X'X)^{-1}X'Y$, which is valid only if $(X'X)^{-1}$ exists. But if $\dim(X) = (n, p)$ with $n < p$, then $X'X$ is not of full rank, so it cannot be inverted, and the estimator $\hat\beta$ above no longer exists. So switching from $n > p$ to $p > n$ is not trivial. Since multiple linear regression is widely used (especially in econometrics, for instance, but also in epidemiology when studying genes), I think this is a convenient way to define "high dimension" (though admittedly subjective), because it marks the point where you need to do something different from what you usually do.
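A quick numerical illustration of my own (the values of n and p are arbitrary):

set.seed(1)
n <- 10; p <- 20
X <- matrix(rnorm(n * p), n, p)   # more predictors than observations
y <- rnorm(n)
qr(crossprod(X))$rank             # rank of X'X is at most n = 10, far below p = 20
coef(lm(y ~ X))                   # lm returns NA for the coefficients it cannot identify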
For clustering, maybe it can be seen differently: with k-nearest neighbors, the curse of dimensionality is reached long before $p > n$...
Some references :
- The Elements of Statistical Learning (http://statweb.stanford.edu/~tibs/ElemStatLearn/), where you can find such a definition and a lot of other things.
- For an econometrics perspective, see for instance this presentation (http://www.nber.org/econometrics_minicourse_2013/Chernozhukov.pdf) or this paper (http://arxiv.org/pdf/1105.2454v4.pdf). High-dimensional problems are a very dynamic field of econometrics at the moment.