Solved – Which statistical tests can be used to check whether a yearly increase/decrease in counts is statistically significant

r, regression, statistical significance

I have read posts about similar problems on the forum, and they suggest using regression to see whether there is a significant increase or decrease. I have data for 3 years and their counts, i.e. the number of students attending school X. I would like to know if the increase in admissions is statistically significant. This is what I did:

x = data.frame("year" = c(1,2,3), "count" = c(100,120,150))
reg = lm(count ~., x)
summary(reg)

The results are as follows:

Call:
lm(formula = count ~ ., data = x)

Residuals:
     1      2      3 
 1.667 -3.333  1.667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   73.333      6.236   11.76   0.0540 .
year          25.000      2.887    8.66   0.0732 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.082 on 1 degrees of freedom
Multiple R-squared:  0.9868,    Adjusted R-squared:  0.9737 
F-statistic:    75 on 1 and 1 DF,  p-value: 0.07319

The results indicate that there is no statistically significant difference between years. Am I doing this correctly? And how can I compare year 1 with year 3?

Thank you,

Best Answer

You can use an independent two-sample $t$-test to determine if the count in year three is significantly higher than in years one and two combined.

Break your dataset into two samples: one containing the first two years and the other containing the third year. From sample one you can estimate the mean and standard deviation. From sample two you can only estimate the mean, because a single observation carries no information about spread. Therefore we have to assume that the variability is the same from year to year (just as in your regression model).

For sample one we have $$\bar{x}_1 = \dfrac{1}{2} \left( 100 + 120 \right) = 110 $$ and $$ s_1^2 = \dfrac{1}{2 - 1} \left[ (100 - 110)^2 + (120 - 110)^2 \right] = 200. $$ From sample two we have $$ \bar{x}_2 = 150. $$ Because sample two contributes no variance estimate, the pooled standard deviation is just $$ s_p = s_1 = \sqrt{200}. $$

Our $t$-statistic is then $$ \dfrac{110 - 150}{\sqrt{200} \sqrt{\frac{1}{2} + 1}} \approx -2.31, $$ with $2 + 1 - 2 = 1$ degree of freedom. The R command 2*pt(-2.31, df = 1) returns a two-sided p-value of 0.26, so the increase does not appear to be statistically significant.
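The arithmetic above can be reproduced in a few lines of R. This is only a sketch of the hand calculation; the variable names (sample_one, sample_two, s_p, t_stat, p_two_sided) are chosen here for illustration and are not part of the original answer.

```r
# Counts from the question
sample_one <- c(100, 120)   # years 1 and 2
sample_two <- 150           # year 3

n1 <- length(sample_one)
n2 <- length(sample_two)
df <- n1 + n2 - 2           # 1 degree of freedom

# Pooled standard deviation: only sample one contributes a variance estimate,
# because sample two holds a single observation
s_p <- sqrt((n1 - 1) * var(sample_one) / df)

# Equal-variance two-sample t-statistic and its two-sided p-value
t_stat <- (mean(sample_one) - sample_two) / (s_p * sqrt(1 / n1 + 1 / n2))
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

t_stat        # about -2.31
p_two_sided   # about 0.26
```

With a single degree of freedom the $t$ distribution has very heavy tails, which is why even a $t$-statistic of this size gives a large p-value.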

Edit:

As pointed out by whuber in the comments, the choice of which years to compare must not be influenced by the respective counts in those years. Otherwise the $t$-test is invalid.
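A small simulation can illustrate why this matters. The sketch below is not from the original thread: it assumes, purely for illustration, that the three yearly counts are independent normal draws with the same mean (120) and standard deviation (10), so that there is no real difference between years. The helper t_one_vs_two, the seed, and the number of replications are likewise hypothetical choices. It compares how often the test rejects at the 5% level when the comparison is fixed in advance (year 3 versus years 1 and 2) and when the year to single out is picked after looking at the data.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# t-statistic for a single year's count against the other two years,
# using the same equal-variance formula as in the answer
t_one_vs_two <- function(single, pair) {
  (mean(pair) - single) / (sd(pair) * sqrt(1 / 2 + 1))
}

n_sim <- 20000
crit <- qt(0.975, df = 1)   # two-sided 5% critical value with 1 df

reject_fixed <- logical(n_sim)
reject_picked <- logical(n_sim)
for (i in seq_len(n_sim)) {
  counts <- rnorm(3, mean = 120, sd = 10)   # assumed null data: no trend
  t_all <- sapply(1:3, function(k) t_one_vs_two(counts[k], counts[-k]))
  reject_fixed[i] <- abs(t_all[3]) > crit       # comparison chosen beforehand
  reject_picked[i] <- max(abs(t_all)) > crit    # comparison chosen from the data
}

rate_fixed <- mean(reject_fixed)    # should be close to the nominal 0.05
rate_picked <- mean(reject_picked)  # should be clearly above 0.05
```

Under these assumptions, the pre-specified comparison should reject at roughly the nominal rate, while choosing the most extreme year after seeing the counts should reject noticeably more often, even though no year truly differs.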