I'm working on an assignment for my "Modeling and Optimization" course, and I have a doubt about the first question:
What is the dimensionality of the data? What are the min, median, max,
mean, standard deviation and percentage missing data of each feature?
I can calculate those, but I'm not sure about the "dimensionality" of the data. Here's a sample of my dataset:
Sample  mcg   gvh   alm   mit   erl  pox  vac   nuc   Class1  Class2
1       0.58  0.61  0.47  0.13  0.5  0    0.48  0.22  MIT     non-CYT
2       0.43  0.67  0.48  0.27  0.5  0    0.53  0.22  MIT     non-CYT
3       0.64  0.62  0.49  0.15  0.5  0    0.53  0.22  MIT     non-CYT
4       0.58  0.44  0.57  0.13  0.5  0    0.54  0.22  NUC     non-CYT
5       0.42  0.44  0.48  0.54  0.5  0    0.48  0.22  MIT     non-CYT
6       0.51  0.4   0.56  0.17  0.5  0.5  0.49  NA    CYT     CYT
I've been told that dimensionality usually refers to the attributes or columns of the dataset. But in this case, does it include Class1 and Class2? And does dimensionality mean the number of columns, or the names of the columns?
Best Answer
Your assumption is correct, and you are also noticing subtleties. In a perfect world, the number of columns is the number of dimensions of a data set. However, some columns are similar, some are correlated, some are duplicates in some way, and some are junk or otherwise useless, so the actual number of dimensions can be unknown. It's a knotty problem. In your case I would go with your first assumption: dimensionality is the number of columns, not their names.
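If it helps to see the counting spelled out, here is a minimal sketch in Python (standard library only) run on the six sample rows from the question. It rests on one assumption that the assignment may or may not share: Sample is treated as a row identifier and Class1/Class2 as labels, so none of the three is counted as a feature. The names RAW, NON_FEATURES and summarize are made up for this illustration. If your course counts the class columns as dimensions, remove them from NON_FEATURES, though the numeric summaries would then need separate handling for text columns.

```python
import statistics

# The six sample rows from the question; "NA" marks a missing value.
RAW = """\
Sample mcg gvh alm mit erl pox vac nuc Class1 Class2
1 0.58 0.61 0.47 0.13 0.5 0 0.48 0.22 MIT non-CYT
2 0.43 0.67 0.48 0.27 0.5 0 0.53 0.22 MIT non-CYT
3 0.64 0.62 0.49 0.15 0.5 0 0.53 0.22 MIT non-CYT
4 0.58 0.44 0.57 0.13 0.5 0 0.54 0.22 NUC non-CYT
5 0.42 0.44 0.48 0.54 0.5 0 0.48 0.22 MIT non-CYT
6 0.51 0.4 0.56 0.17 0.5 0.5 0.49 NA CYT CYT
"""

# Assumption: these columns are an identifier and labels, not features.
NON_FEATURES = {"Sample", "Class1", "Class2"}


def summarize(raw):
    """Return (number of rows, feature names, per-feature statistics)."""
    header, *rows = [line.split() for line in raw.strip().splitlines()]
    features = [name for name in header if name not in NON_FEATURES]
    summary = {}
    for name in features:
        col = header.index(name)
        cells = [row[col] for row in rows]
        # Statistics are computed over the non-missing values only.
        values = [float(c) for c in cells if c != "NA"]
        summary[name] = {
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),  # sample standard deviation (n - 1)
            "pct_missing": 100 * (len(cells) - len(values)) / len(cells),
        }
    return len(rows), features, summary


n_rows, features, summary = summarize(RAW)
print(f"{n_rows} samples x {len(features)} features")
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Under that assumption the sample comes out as 6 samples x 8 features. Note that erl has a standard deviation of zero in these six rows, which is a small instance of the point above: a column that never varies adds a dimension on paper but carries no information, at least within this sample.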