How many cases are needed for an extrapolation?

Tags: r, sample, sample-size, small-sample

In social science it is common to use sample weights to post-adjust the sample so that it matches a given base population. There are algorithms to calculate how many cases you need for a representative sample of the population. But often we would like to analyse a bit deeper than just the sample as a whole, e.g. male vs. female or age groups. Theoretically, if I'm not wrong, we need a representative number of cases for each field. So if I need 360 cases in the whole sample, I need 360 * 2 = 720 cases for the analysis of male vs. female (360 for each field).

If we use weights, the question is how many cases the extrapolation is based on. How many cases do I need at least to extrapolate from my sample? I have the following table with observed cases per field:

                       Male                 ||            Female
                Education                   ||      Education
   age   |  low  | middle| high  |  sum     ||  low  | middle|  high |  sum  |  SUM
---------+-------+-------+-------+----------++-------+-------+-------+-------+------
18 - 24  |   26  |   60  |   23  |  109     ||   33  |   45  |   20  |   98  |  207
25 - 39  |    6  |   78  |   94  |  178     ||    8  |  114  |  147  |  269  |  447
40 - 49  |    1  |  105  |  134  |  240     ||    6  |  124  |  172  |  302  |  542
50 - 59  |    6  |   63  |  117  |  186     ||    8  |  117  |  146  |  271  |  457
60 - 69  |    3  |   48  |  110  |  161     ||   17  |   99  |  122  |  238  |  399
70+      |    7  |   30  |   75  |  112     ||   41  |   89  |   58  |  188  |  300
---------+-------+-------+-------+----------++-------+-------+-------+-------+-------
SUM      |   49  |  384  |  553  |  986     ||  113  |  588  |  665  | 1366  | 2352

If we look at the males with a low education level in the different age groups, I doubt that we can extrapolate from them, but I can't back that up right now. Even if we take all low-educated males together (49 cases), I don't think the number of cases is enough to extrapolate from.

What I have to do is analyse injuries by sex, education level and age group. But because of this distribution table I don't know whether the results are representative of the base population, which is about 2 million people overall.

So here my questions:

  • How can I figure out whether a number of cases is large enough to be extrapolated?
  • Do I really need, as mentioned above, the full number of cases per field, e.g. 360 males aged 18-24 with a low education level, to analyse this group?
  • For which categories (e.g. age groups, sex, or combinations) can I run representative calculations?

Maybe there is also a way to do the extrapolation in R, and something like a test of how good my extrapolation is…

Thanks for all your help.
Dominik

Best Answer

There is no absolute threshold for this. Neither is there a test. It really depends on you, on how far you want to go, and on how much confidence you have in your data.

You could have a look at the precision of the estimates by computing confidence intervals or coefficients of variation. This could give you some guidance, or could help you to make an assessment.
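To get a feel for what such precision measures look like at the cell sizes in the question's table, here is a minimal Python sketch. It assumes simple random sampling within each cell and ignores the weights and any design effect, so the real intervals would typically be wider. The injury proportion of 0.2 is a made-up value for illustration, and `proportion_precision` is a hypothetical helper, not a function from any survey package.

```python
import math

def proportion_precision(p, n, z=1.96):
    """Return (standard error, coefficient of variation, CI half-width)
    for an estimated proportion p from n cases, assuming simple random
    sampling and the normal approximation."""
    se = math.sqrt(p * (1 - p) / n)
    return se, se / p, z * se

# Unweighted cell sizes taken from the table in the question.
cells = {
    "male, 40-49, low education": 1,
    "male, 18-24, low education": 26,
    "male, low education (all ages)": 49,
    "male, 40-49, high education": 134,
    "all males": 986,
}

ASSUMED_P = 0.2  # hypothetical injury proportion, for illustration only

for label, n in cells.items():
    se, cv, half_width = proportion_precision(ASSUMED_P, n)
    print(f"{label:32s} n={n:4d}  CV={cv:6.1%}  95% CI: +/- {half_width:.3f}")
```

The normal approximation is itself unreliable for very small cells, which is one more reason to treat cells with a handful of cases with caution. Whether a given CV is acceptable remains a judgment call.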

As far as R is concerned, you might have a look at the Official Statistics task view. It contains several packages to handle survey data.

You might also be interested in what is called "small area estimation". These are techniques to estimate parameters from small (sub)populations. Small area estimation is covered in Sharon Lohr's textbook. A more complete reference on the topic is Rao (2003).

Follow-up

Regarding the threshold question, I would like to quote from Rao's 2003 book mentioned above:

A domain (area) is regarded as large if the domain specific sample is large enough to support direct estimates of adequate precision.

If you have a look at texts on small area estimation, they will hardly get more precise on the size issue. This isn't really a surprise. If you have a rather homogeneous (sub)population, you can get away with rather small samples. In the extreme case, where a (sub)population is perfectly homogeneous, a sample size of 1 will be enough to obtain perfectly representative estimates. If the (sub)populations become more heterogeneous, you will need larger sample sizes, ceteris paribus. That's why I said in the beginning that the answer depends on the situation at hand and that coefficients of variation and confidence intervals can provide some guidance.
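The link between heterogeneity and sample size can be illustrated with the textbook rule for estimating a mean under simple random sampling: choose n so that z * sd / sqrt(n) is at most the desired margin of error. The following Python sketch is only an illustration under that assumption; the standard deviations and the margin of 0.2 are made-up values, and `required_n` is a hypothetical helper.

```python
import math

def required_n(sd, margin, z=1.96):
    """Smallest n with z * sd / sqrt(n) <= margin, assuming simple
    random sampling and a normal approximation for the mean."""
    if sd == 0:
        return 1  # perfectly homogeneous: a single case is enough
    return max(1, math.ceil((z * sd / margin) ** 2))

# Same target margin, increasingly heterogeneous (sub)populations.
for sd in (0.0, 0.5, 1.0, 2.0):
    print(f"sd={sd:3.1f} -> n >= {required_n(sd, margin=0.2)}")
```

Doubling the standard deviation roughly quadruples the required number of cases, which is why no single threshold can apply to every domain.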
