Solved – small sample size, large number of variables (most categorical) – how to proceed

I would be grateful for general guidance about analyzing a data set that is problematic for me because of its small sample size and its large number of categorical variables. I realize this question is a bit vague, but that's because I'm not sure what to do. I'd like to come up with a good descriptive characterization of the data, and see if there's any way to make meaningful inferences.

Sample size = 13, number of variables = 93 (81 categorical, only 12 numerical). In the past I've worked mostly with numeric data and large sample sizes, so I'm not sure how to proceed.

Given the small sample, I don't feel I can make any assumptions about normality.

Other than generating basic descriptive statistics (mean, std dev for the numeric data, and tables for the categorical data) what else can I do to meaningfully summarize this data?

Is it possible to generate reliable confidence intervals for the means via some nonparametric method with such limited data?

In addition to descriptive statistics, I am also wondering about simple linear regression. From what I have read, automatic selection of relevant variables, say via stepwise regression, is questionable to start with, and especially with such a small sample won't be reliable. Also, I'm not so much interested in prediction as in exploring the relationships between a numeric response variable and the rest of the data.

I also worry about detecting collinearity.

With numeric data I could generate a correlation matrix, but I'm not sure this makes sense with such a small sample, nor am I certain how to do the equivalent for the large set of categorical variables. So I'm not sure if there is any automatic or semi-automatic way for statistics to guide me to the relevant variables as a start, so that I can take it from there. I.e., other than manually considering various combinations of independent variables, is there another way?

I'm using R.

Best Answer

I guess you have some working hypothesis. I would work as in the "good old days", when there was no cheap computing power and no powerful stats programs at hand: regroup categorical variables that fit together on substantive grounds linked to your working hypothesis, not because they correlate.

In any case with n=13 you can run an exploratory analysis, but not something more complicated.
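As a rough illustration of what such an exploratory pass could look like in R, here is a minimal sketch. The data frame `d` and its columns (`score`, `regime`, `region`) are made up for the example and are not from the question; only base R functions are used.

```r
# Hypothetical toy data: 13 cases, one numeric and two categorical variables.
set.seed(1)
d <- data.frame(
  score  = round(rnorm(13, mean = 50, sd = 10), 1),
  regime = factor(rep(c("democracy", "dictatorship"), length.out = 13)),
  region = factor(sample(c("north", "south", "east"), 13, replace = TRUE))
)

# Numeric variables: quartiles rely on fewer assumptions than mean and sd.
num_vars <- names(d)[sapply(d, is.numeric)]
num_summary <- sapply(d[num_vars], quantile, probs = c(0.25, 0.5, 0.75))

# Categorical variables: one frequency table per variable.
cat_vars <- names(d)[sapply(d, is.factor)]
cat_tables <- lapply(d[cat_vars], table)

# Number of levels actually observed per categorical variable; a variable
# with a single observed level carries no information at n = 13.
n_levels_used <- sapply(cat_tables, function(t) sum(t > 0))

print(num_summary)
print(cat_tables)
print(n_levels_used)
```

With 81 categorical variables, scanning `n_levels_used` first may help decide which variables are worth regrouping at all.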

Once you have somewhat reduced the number of variables and obtained more meaningful "factors", you can work as in qualitative data analysis: take a spreadsheet, put the cases in the rows and sort them on some major, substantive issue, then put your regrouped categorical variables in the columns and sort them, too, on the basis of your knowledge of the research field. Do you see a pattern? Maybe you have to eliminate further variables. Are your conceptions now confirmed?

If you work in political science (for instance) the cases might come from two different sub-samples: from countries that are democracies and others that are run by dictators. So, you will do a sub-sorting for democracies and another one for dictatorships.
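Since the asker uses R, the spreadsheet-style sorting and sub-sorting described above could be sketched roughly as follows. The data frame `cases` and its columns (`country`, `regime`, `press_freedom`, `gdp_growth`) are invented for illustration; the column order stands in for an ordering chosen from subject knowledge.

```r
# Hypothetical toy data: 13 cases with a grouping variable, an ordered
# categorical variable, and one numeric variable.
cases <- data.frame(
  country = paste0("C", 1:13),
  regime  = rep(c("democracy", "dictatorship"), length.out = 13),
  press_freedom = factor(
    c("high", "low", "medium", "low", "high", "medium", "high",
      "low", "medium", "low", "high", "low", "medium"),
    levels = c("low", "medium", "high"), ordered = TRUE
  ),
  gdp_growth = c(2.1, 0.5, 1.8, -0.3, 3.0, 1.1, 2.4,
                 0.2, 1.5, -1.0, 2.8, 0.9, 1.7)
)

# Sort the rows on the major substantive issue first (regime), then on the
# ordered categorical variable within each regime.
sorted <- cases[order(cases$regime, cases$press_freedom), ]

# Arrange the columns in an order chosen from knowledge of the field.
sorted <- sorted[, c("country", "regime", "press_freedom", "gdp_growth")]

# Sub-sorting: one table per sub-sample, each keeping the row order above.
by_regime <- split(sorted, sorted$regime)

print(by_regime$democracy)
print(by_regime$dictatorship)
```

Reading the two printed tables side by side is the R equivalent of eyeballing the sorted spreadsheet for a pattern.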

You can find more of these ideas in Miles & Huberman, 1994, Qualitative Data Analysis. Sage.

A statistical program that might help you, because it is very visual and you have so few cases, is the free ViSta from Forrest Young. See http://www.uv.es/visualstats/Book/

I am also somewhat puzzled by the people who asked you to work with this data set. Do they know their stats?

If you explain the type of data set you are working with, you may get better answers.

I wish you good luck, and data with an easy structure and no missing values!
