Solved – Does it make sense for p-value to decrease with more data points

multiple regression, p-value

I'm running a multiple linear regression on a set of sports data. When I run the regression on one season, which has 380 data points and which I thought was a fair amount, I get quite a high p-value on one of my independent variables. However, when I run the regression on all my data points (I have more than 3000 data points in total), the p-value decreases from .97 to .02. As I add more data points, the p-value decreases further. My question is: is my variable really significant or am I just decreasing the p-value by adding more data points?

Best Answer

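Let's say that your independent variable is $x_i$ and its regression coefficient is $\beta_i$, estimated by $\hat\beta_i$. The two-sided p-value for $\beta_i$ is $P(t<-|t^*|)+P(t>|t^*|)$, where $t$ follows a Student's $t$ distribution with $n-q$ degrees of freedom and $t^*=\frac{\hat\beta_i}{\sqrt{(X'X)^{-1}_{ii}\frac{RSS}{n-q}}}$. Here $RSS$ is the residual sum of squares, $n$ is the number of observations, and $q$ is the number of estimated coefficients.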

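The p-value is large when $|t^*|$ is small and small when $|t^*|$ is large. As $n$ grows, the standard error in the denominator of $t^*$ shrinks: $RSS/(n-q)$ settles near the error variance, while $(X'X)^{-1}_{ii}$ decreases roughly like $1/n$. So unless the true coefficient is exactly zero, $|t^*|$ tends to get larger and the p-value tends to decrease just because $n$ grows.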
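To see how sample size alone drives the standard error and the p-value, here is a minimal sketch on simulated data, not the sports data from the question. It makes several assumptions for illustration: a single predictor plus an intercept (so $q=2$ and the slope's entry of $(X'X)^{-1}$ reduces to $1/S_{xx}$), a small true slope of 0.05, standard normal noise, and a normal approximation to the $t$ distribution for the p-value, which is reasonable only when $n-q$ is large. The function names `slope_t_and_p` and `simulate` are hypothetical.

```python
import math
import random


def slope_t_and_p(x, y):
    """OLS of y on an intercept and x.

    Returns (slope, std_error, t_stat, p_value) for the slope.
    """
    n = len(x)
    q = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # With one predictor and an intercept, the slope's diagonal entry
    # of (X'X)^{-1} equals 1 / sxx.
    std_error = math.sqrt((rss / (n - q)) / sxx)
    t_stat = slope / std_error
    # Two-sided p-value using a normal approximation to the t distribution
    # (an assumption that is only reasonable for large n - q).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, std_error, t_stat, p_value


def simulate(n, true_slope, seed=0):
    """Draw n simulated points with the given true slope and fit the model."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
    return slope_t_and_p(x, y)


if __name__ == "__main__":
    for n in (380, 3000, 30000):
        slope, se, t, p = simulate(n, true_slope=0.05)
        print(f"n={n:6d}  slope={slope:.4f}  se={se:.4f}  t={t:.2f}  p={p:.4g}")
```

The standard error should fall roughly in proportion to $1/\sqrt{n}$, so the same small effect becomes easier to detect as $n$ grows. If `true_slope` is set to 0, the p-value does not shrink systematically with $n$, because under a true null hypothesis p-values stay spread over the whole interval from 0 to 1.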

This is why "in large samples is more appropriate to choose a size of 1% or less rather than the 'traditional' 5%" (M. Verbeek, A Guide to Modern Econometrics, 3rd edition, §2.5.7, p. 32). If you choose a significance level of 1%, your coefficient is not statistically significant when $p=0.02$.
