Solved – Does the p-value in the incremental F-test determine how many trials I expect to get correct?

anova, f-test, multicollinearity, polynomial

I've implemented an incremental F-test program that evaluates the fit of an unrestricted model $M_{UR}$ against a restricted model $M_R$ using the F statistic $F = \frac{SSE_{R} - SSE_{UR}}{SSE_{UR}}\cdot\frac{n-p-1}{j}$. In this instance, I'm comparing a polynomial of order $p$ against the restricted model of order $p-1$, which necessarily makes $j = 1$.
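For reference, here is a minimal sketch of that comparison in Python; the use of numpy.polyfit for the fits and scipy.stats.f for the p-value, as well as the function name, are illustrative choices and not necessarily how my program is written:

```python
# Sketch of the incremental F-test between a degree-(p-1) fit (restricted)
# and a degree-p fit (unrestricted); there is j = 1 restriction.
import numpy as np
from scipy.stats import f as f_dist

def incremental_f_test(x, y, p):
    sse_r  = np.sum((y - np.polyval(np.polyfit(x, y, p - 1), x)) ** 2)  # restricted SSE
    sse_ur = np.sum((y - np.polyval(np.polyfit(x, y, p), x)) ** 2)      # unrestricted SSE
    j = 1
    f_stat = (sse_r - sse_ur) / sse_ur * (len(y) - p - 1) / j
    p_value = f_dist.sf(f_stat, j, len(y) - p - 1)
    return f_stat, p_value
```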

To validate this program, I create data from randomly generated polynomials, add Gaussian noise, and check whether the incremental F-test recovers the correct polynomial order (i.e. if the data come from a $3^{rd}$-order polynomial, I expect to get order $3$). In detail, the framework is as follows (a minimal code sketch is given after the list):

For i = 1 : n_trials:
    1. Randomly choose a polynomial order between 2 and 10
    2. Populate the coefficients of this polynomial with values between -5 and 5
    3. Evaluate this polynomial at abscissa values X = [0,0.01,0.02,...3.00]
    4. Add Gaussian noise with standard deviation 0.01 to each output of P(X)
    5. For p = 3:10 :
           a. Fit the tuples (X,P(X)) using polynomials of order p-1 and p
           b. Compare the two fits using the F-test; if the test fails, exit the loop.
              If it passes, increase p to p+1 and continue
    6. Return the last polynomial order p-1 that passed the F-test (at significance level 0.05)
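A self-contained sketch of this loop (Python/numpy; the constants mirror the description above, and the compact F-test helper here is a variant of the earlier sketch that returns only the p-value):

```python
# Sketch of the validation loop described above; all values follow the
# description rather than the original program.
import numpy as np
from scipy.stats import f as f_dist

def incremental_f_test(x, y, p):
    # Degree p-1 (restricted) vs degree p (unrestricted), j = 1; returns the p-value.
    sse = lambda d: np.sum((y - np.polyval(np.polyfit(x, y, d), x)) ** 2)
    f_stat = (sse(p - 1) - sse(p)) / sse(p) * (len(y) - p - 1)
    return f_dist.sf(f_stat, 1, len(y) - p - 1)

def estimate_order(x, y, alpha=0.05, max_order=10):
    for p in range(3, max_order + 1):
        if incremental_f_test(x, y, p) > alpha:   # degree-p term not justified: stop
            return p - 1
    return max_order

rng = np.random.default_rng(0)
x = np.arange(0, 3.001, 0.01)                     # X = [0, 0.01, ..., 3.00]
n_trials, n_errors = 3000, 0
for _ in range(n_trials):
    true_order = int(rng.integers(2, 11))                 # random order in {2, ..., 10}
    coeffs = rng.uniform(-5, 5, size=true_order + 1)      # coefficients in [-5, 5]
    y = np.polyval(coeffs, x) + rng.normal(0, 0.01, x.size)  # Gaussian noise, sd 0.01
    n_errors += estimate_order(x, y) != true_order
print(f"{n_errors} misidentified orders out of {n_trials}")
```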

Having done this for $n_{trials} = 3000$, I find that the algorithm misidentifies the order roughly $200$ to $300$ times. However, if I've chosen a significance level of $0.05$, shouldn't I expect errors only about $5\%$ of the time, i.e. $0.05\cdot3000 = 150$?

I also noticed that, if I change the range of X from $[0, 0.01, … ,3.00]$ to $[0, 0.1, … , 30.0]$, the F-test fails much more frequently, even though the number of data points is the same between the two experiments! Is this an artifact of the multicollinearity problem with polynomials?

Best Answer

There are a lot of issues here. The question specifically is about the difference in performance based on the range of values of $x$. This is easily explained. These tests compare the variation in the residuals to the variation captured by the fits. A polynomial of degree $d$ with coefficients bounded in absolute value by $k$ (equal to $5$ here) can have a range over the interval $[0, u]$ at least equal to $k\left(u + u^2 + \cdots + u^d\right) = k\,u\left(u^{d}-1\right)/\left(u-1\right)$. When you change $u$ from $3$ to $30$, the change in potential ranges is huge: for $d=10$, the maximum in the first case is on the order of $3^{11}$, and in the second case it is $10^{10}$ times as great. At that point the noise (whose standard deviation is a tiny $0.01$) is inconsequential. Thus, even when the coefficient of $x^{10}$ is incredibly tiny, it will have an important (and therefore detectable) effect on the data.
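A quick sanity check of those numbers (Python; this just evaluates the geometric-series bound given above for $k=5$, $d=10$):

```python
# Maximum spread k*u*(u**d - 1)/(u - 1) from the bound above, for k = 5, d = 10.
k, d = 5, 10
bound = lambda u: k * u * (u**d - 1) / (u - 1)
print(bound(3), bound(30))       # ~4.4e5 versus ~3.1e15
print(bound(30) / bound(3))      # ratio on the order of 1e10
```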

Here is a plot of ten of your random polynomials (all of order $10$). Note the astronomical scale on the y-axis and observe how the highest term dominates the values.

Figure 1

You ought to consider a different universe of models. For instance, use polynomials of the form

$$p(x) = \sum_{i=0}^d \alpha_i \left(\frac{x}{u}\right)^i$$

defined on the range $[0,u]$. Here is a collection of them, once more with the coefficients varying randomly in $[-5,5]$ and all still of tenth order:

Figure 2

A rigorous test would add noise with a standard deviation comparable to the variation in the polynomial values: around $10$ or so.
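Here is a sketch of that setup (Python/numpy; the choices of $u$, the coefficient range, and the noise level follow the suggestions above, but the specific values are only illustrative):

```python
# Generate one rescaled polynomial p(x) = sum_i alpha_i * (x/u)**i on [0, u],
# with coefficients in [-5, 5], then add noise of comparable scale (sd ~ 10).
import numpy as np

rng = np.random.default_rng(0)
u, d = 3.0, 10
x = np.arange(0, u + 1e-9, 0.01)
alpha = rng.uniform(-5, 5, size=d + 1)
y = sum(a * (x / u) ** i for i, a in enumerate(alpha))  # values stay within ~±55
y_noisy = y + rng.normal(0, 10, size=x.size)            # noise commensurate with the signal
```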

There are other concerns here: please read the replies by @gung and @jbowman. Consider, too, that you are using a restricted version of forward stepwise regression, and do some research on the pros and cons of that approach to model building. Finally, note that unless theory specifically indicates a polynomial model and suggests its order, fitting polynomials to data can be a deceptively poor approach: even a little overfitting can produce grossly bad models, because higher-degree polynomials can (and often do) vary wildly between the data points and extrapolate horribly.
