Solved – How many cases are needed for an extrapolation

rsamplesample-sizesmall-sample

In social science it is comon to use sample weight to post adjust the sample so that it fits to a given basci population. There are algorithms to calculate how many cases you need to have a representive sample of the population. But often we would like to analyse a bit deeper than just the sample as a whole e.g. male vs. female or age groups. Theoretically – if I'm not wrong – we need for each field a representive number of cases. So if I need 360 cases in the whole sample, I need for the analysis of male vs. female 360 * 2 (one for each field)

If we use weights, the question is on who many cases we the extrapolation is based. But how many cases I need at least to extrapolate my sample? I have the following table with observed cases per field:

                       Male                 ||            Female
                Education                   ||      Education
   age   |  low  | middle| high  |  sum     ||  low  | middle|  high |  sum  |  SUM
---------+-------+-------+-------+----------++-------+-------+-------+-------+------
18 - 24  |   26  |   60  |   23  |  109     ||   33  |   45  |   20  |   98  |  207
25 - 39  |    6  |   78  |   94  |  178     ||    8  |  114  |  147  |  269  |  447
40 - 49  |    1  |  105  |  134  |  240     ||    6  |  124  |  172  |  302  |  542
50 - 59  |    6  |   63  |  117  |  186     ||    8  |  117  |  146  |  271  |  457
60 - 69  |    3  |   48  |  110  |  161     ||   17  |   99  |  122  |  238  |  399
70+      |    7  |   30  |   75  |  112     ||   41  |   89  |   58  |  188  |  300
---------+-------+-------+-------+----------++-------+-------+-------+-------+-------
SUM      |   49  |  384  |  553  |  986     ||  113  |  588  |  665  | 1366  | 2352

If we take a look at the males with a low education in the different agegroups, I have doubt, that we can extrapolate them. But I can't argue that right now. Even if we take just the low educated males at all, I don't think that the amount of cases are enough to extrapolate them.

What I have to do is to analyse injuries in sex, education leavel and age groups. But because of this distribution table I don't know, if the reslaults are representive for the basic population wich is over all about 2 million people.

So here my questions:

How I can figure out if a number of cases is big enough to get extrapolated
Do I really need, as I mentioned above, the amount of samples per field e.g. 360 males in the age of 18-24 with a low education level to analyse this group?
On which categories (e.g. age groups, sex or combined) I can run representive calculations?

Maybe there is also something with which I can do the extrapolation with R and something like a test how good my extrapolation is…

Thanks for y'all help.
Dominik

Best Answer

There is no absolute threshold for this. Neither is there a test. It really depends on you, on how far you want to go, and on how much confidence you have in your data.

You could have a look at the precision of the estimates by computing confidence intervals or coefficients of variation. This could give you some guidance, or could help you to make an assessment.

As far as R is concerned, you might have a look at the Official Statistics task view. It contains several packages to handle survey data.

You might also be interested in what is called "small area estimation". These are techniques to estimate parameters from small (sub)populations. Small area estimation is covered in Sharon Lohr's textbook. A more complete reference on the topic is Rao (2003).

Follow-up

Regarding the threshold question, I would like to quote from Rao's 2003 book mentioned above:

A domain (area) is regarded as large if the domain specific sample is large enough to support direct estimates of adequate precision.

If you have a look at texts on small area estimation, they will hardly get more precise on the size issue. This isn't really a surprise. If you have a rather homogenous (sub)population, you can get away with rather small samples. In the extreme case, where a (sub)population is perfectly homogenous, a sample size of 1 will be enough to obtain perfectly representative estimates. If the (sub)populations become more heterogeneous, you will need larger sample sizes, ceteris paribus. That's why I said in the beginning that the answer depends on the situation at hand and that coefficients of variation and confidence intervals can provide some guidance.

Related Solutions

Sample Size for Correlation Analysis – Determine Optimal Sizes for Overall and Sub-Group Analyses

When it comes to sample size, bigger is better, but we often have to take what we get. With the smaller sample sizes, your estimates of the correlation are going to become extremely noisy, and comparisons between different estimates (which I expect is your primary goal in the subsets analyses) are going to be particularly noisy.

This online tutorial on standard errors (as a pdf) contains formulas for the SE of the correlation coefficient (as well as of the Fisher transformation of the correlation, which is a better scale to be measuring the SE). You'll see that the scales by approximately $1/\sqrt{n}$.

For a correlation of about 0.5, the SE with a sample size of 200 will be about 0.06; with a sample size of 50 it will be about double that.

Solved – How to be sure the sample size is large enough for conditional logistic regression

Though I am generally familiar with the technique and its utility in performing analyses with stratified and matched samples, I have not used conditional logistic regression models in my own work. Therefore, I can give you an example of a simulation-based power analysis approach in R that demonstrates how you can build in some uncertainty about parameter estimates and features of the data into your power analysis. Hopefully, this approach provides enough of a framework that you can tweak it for your needs.

To simplify, let's say that I am interested in examining the relation between a single binomial risk variable, $x_1$ and the probability that $y$=1. And, I would like to control for scores on the continuous variable $x_2$ when testing my hypothesis about the relation between $x_1$ and $y$.

So essentially my logistic regression model of interest looks something like the following (there is more than one way to represent this model of course):

$$P(y_i=1)=\frac{1}{1+e^{-(\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+r_i)}}$$

I am most interested in my power to detect a significant effect for my estimate of $\beta_1$ as a function of sample size:

sample.N<-seq(20, 250, by=10)

However, I am a bit uncertain about a number of features that could influence my statistical power.

Oftentimes researchers ignore various sources of uncertainty when performing a power analysis. I happen to think we can do better, though it does take more thought. To start, we can probably come up with some reasonable estimates for something like a base prevalence rate for our risk group ($x_1$) variable.

I am going to begin by assuming a relatively low risk factor rate of 35% (i.e., $P(x_1=1)=.35$). But I am not 100% confident of that estimate, so I build some uncertainty into my power analysis. Specifically, I am going to use a $Beta(35, 65)$ distribution to draw estimated values for $P(x_1=1)$.

prob.x1<-rbeta(1, 35, 65)
x1<-rbinom(n=sample.N[n], size = 1, prob = prob.x1)

I am also going to assume that there is some association between $x_1$ group membership and $x_2$ scores. To build this feature into my data I use a standardized difference score of .3 (which represents a small effect). I am, however, a little uncertain as to how good of an estimate that .3 is. So, I use a beta distribution to allow my analysis to consider $d$ values that are smaller than .3 (but still positive) and slightly larger (ranging up toward a moderate effect).

d.x1.x2<-rbeta(1, 15, 35)
x2<-rep(NA, sample.N[n])
x2[x1==1]<-rnorm(length(x2[x1==1]), mean = d.x1.x2) 
x2[x1==0]<-rnorm(length(x2[x1==0]), mean = 0)

I may have some general ideas about the coefficients I might expect to see in my model above. To start, $\hat{\beta_0}$ is the predicted log of the odds that $y=1$ when $x_1=0$ and $x_2=0$. It is difficult to think in terms of log odds, so instead I first specify an estimated value for $P(y=1|x_1=0, x_2=0)$, which again I am not 100% certain about. I use a $Beta(25, 75)$ distribution for which $E[Beta(25,75)]=.25$, which means that I generally believe that the probability that $y=1$ is .25 when $x_1=0$ and $x_2$ = 0.

p.y<-rbeta(1, 25, 75)
b0<-log(p.y/(1-p.y))#I then convert that value into logits

As for $\hat{\beta_1}$ and $\hat{\beta_2}$, I will posit conditional odds ratios of 2.5 and 1.75 respectively, meaning that in my approach to considering uncertainty I will draw from normal distributions of $\hat{\beta_1} \backsim N(ln(2.5), .5)$ and $\hat{\beta_2} \backsim N(ln(1.75), .5)$.

b1<-rnorm(n=1, mean=log(2.5), sd=.5)
b2<-rnorm(n=1, mean=log(1.75), sd=.5)

And finally, I should consider that my model is unlikely to fit the data perfectly. I know that my model residuals should have a mean of 0 and some variance, I am just uncertain about how much variance. So again, I build that uncertainty into the model by drawing standard deviations for my simulated data's residuals from a gamma distribution.

sd.e<-sqrt(rgamma(1, shape = 12, scale = 1/12)) 
e<-rnorm(n=sample.N[n], mean = 0, sd=sd.e)

Now I need to put this all together to create a vector $y$.

y<-ifelse(1/(1+exp(-(b0+b1*x1+b2*x2+e)))>=.50,1,0)
fit<-glm(y~x1+x2, family = 'binomial')

And I can save the $p$ value for my estimate of $\beta_1$ from the model.

p.val<-coef(summary(fit))[2,4]

Putting it all together in a single place with some annotation we get (note it will take a few minutes to run):

library(testit)
sample.N<-seq(20, 250, by=10)
n.sims<-10000
DF<-data.frame()
for(n in 1:length(sample.N)){
  power.vec<-vector()
  for(s in 1:n.sims){
    #Allow for uncertainty around my population estimate of P(x=1)
    #The mean of a beta(35,65) distribution is .35 (i.e., 35/(35+65)
    prob.x1<-rbeta(1, 35, 65)

    #------------------------------------------------------------------------------

    #Draw a random sample of size sample.N[n] for x1 based on population probability
    x1<-rbinom(n=sample.N[n], size = 1, prob = prob.x1)

    #------------------------------------------------------------------------------

    #Specify standardized difference on x2 as a function of x1
    #I assume a moderate effect size of d =.3, but assume that I may be wrong about it 
    #Again I am uncertain about this value so I give it a distribution:
    d.x1.x2<-rbeta(1, 15, 35)

    #------------------------------------------------------------------------------

    #And to use that difference to draw from two populations that have a standardized difference of .3:
    x2<-rep(NA, sample.N[n])
    x2[x1==1]<-rnorm(length(x2[x1==1]), mean = d.x1.x2) 
    x2[x1==0]<-rnorm(length(x2[x1==0]), mean = 0)

    #------------------------------------------------------------------------------
    #Now I specify estimates for b0, b1, and b2
    #b0 is a little tricky and it represents the probability in the model y=1 when x1=0 and x2=0
    #I start by specifying P(y=1)=.25 with some uncertainty
    p.y<-rbeta(1, 25, 75)
    b0<-log(p.y/(1-p.y))#I then convert that value into logits

    #for b1 and b2 I assume a normal distribution for each estimate
    #further I assume a larger coefficient for 
    b1<-rnorm(n=1, mean=log(2.5), sd=.5)
    b2<-rnorm(n=1, mean=log(1.75), sd=.5)

    #------------------------------------------------------------------------------

    #Finally I need to consider that my model will have some error here
    #I know that I am going to have some amount of error... but again not sure a specific value
    #So I choose to specify a standard deviation of my residuals using a gamma distribution
    sd.e<-sqrt(rgamma(1, shape = 12, scale = 1/12))
    #And now I can generate a residual for each case:
    e<-rnorm(n=sample.N[n], mean = 0, sd=sd.e)
    y<-ifelse(1/(1+exp(-(b0+b1*x1+b2*x2+e)))>=.50,1,0)
    fit<-glm(y~x1+x2, family = 'binomial')
    if(as.numeric(has_warning(glm(y~x1+x2, family = 'binomial')))==1){
      power.vec[s]<-NA
    }
    #get the p-value for b1
    else{
      p.val<-coef(summary(fit))[2,4]
      power.vec[s]<-ifelse(p.val<.05, 1, 0)
    }
  }
  temp.vec<-c(sample.N[n], mean(power.vec, na.rm = T), length(na.omit(power.vec)))
  DF<-rbind(DF, temp.vec)
}
colnames(DF)<-c('N', 'Power', 'Sim_success')
DF

Which returns:

     N     Power Sim_success
1   20 0.1121205        9026
2   30 0.2880463        9679
3   40 0.4113418        9875
4   50 0.5090398        9956
5   60 0.5645420        9978
6   70 0.5986381        9986
7   80 0.6480480        9990
8   90 0.6685011        9994
9  100 0.6968697        9999
10 110 0.7133280        9994
11 120 0.7343734        9999
12 130 0.7521000       10000
13 140 0.7569000       10000
14 150 0.7685000       10000
15 160 0.7843569        9998
16 170 0.7768777        9999
17 180 0.8012000       10000
18 190 0.8066000       10000
19 200 0.8130000       10000
20 210 0.8247000       10000
21 220 0.8196820        9999
22 230 0.8277000       10000
23 240 0.8361000       10000
24 250 0.8484848        9999

According to this analysis it would appear that a sample of around 180 would yield a power of .8 to detect a significant effect for my estimate of $\beta_1$.

I can also plot this power curve using the code below:

library(ggplot2)
g1<-ggplot(aes(x=N, y=Power), data = DF)
g2<-g1+stat_smooth(se=F, method = 'loess')+geom_hline(yintercept = .8, lty='dashed', color='red')
g3<-g2+xlab('Sample Size')+ylab('Power')
g4<-g3+ggtitle('Power Curve')
g4

which results in:

Now I understand that this analysis is not 100% in line with your problem, but I believe the simulated data can be tweaked to match your analysis and data. Note that I have made a lot of non-trivial decisions in specifying just how uncertain I am about certain features of the data and estimates for the model. These decisions should be based on your knowledge of the subject matter and existing literature, whereas my choices were relatively arbitrary.

Best Answer

Related Solutions

Sample Size for Correlation Analysis – Determine Optimal Sizes for Overall and Sub-Group Analyses

Solved – How to be sure the sample size is large enough for conditional logistic regression

Related Question