Without the numbers I can only give the general advice: graph it. You have a 2 x 3 x 2 x 2 x 2 design, so that's not that hard. Take one of your 2-level variables and calculate its effect across everything else; that's what you'll plot. Within each panel you can then show a 3 x 2 (or 2 x 3) interaction, and you'd have 4 panels with the remaining two factors crossing them, one dividing the panels horizontally and the other vertically.
Stare at that for a while.

You may need several permutations. Change what's on the x-axis, what defines the panels and the line types, and which effect you plot, through as many iterations as you need until you come to an understanding. If you've got some theoretical reasons for directions of causality, take the highest-level factor and make it the panels, and the lowest-level one your line types or x-axis.

Hopefully you'll come up with something. It can take a lot of work and a long time to see the nature of the relationship, but going through several versions of the graphs will often let you see what's going on.
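For instance, here's a minimal sketch of one such graph in R with ggplot2. Everything here is invented for illustration: `dat` is your data, `y` the response, `f1` the 2-level factor whose effect you plot, and `f2` through `f5` the remaining factors.

```r
# Hypothetical sketch: plot the effect of one 2-level factor (f1) across
# the rest of a 2 x 3 x 2 x 2 x 2 design. All names are made up.
library(dplyr)
library(ggplot2)

eff <- dat %>%
  group_by(f2, f3, f4, f5) %>%                    # the remaining four factors
  summarise(f1_effect = mean(y[f1 == "high"]) -   # simple effect of f1
                        mean(y[f1 == "low"]),
            .groups = "drop")

ggplot(eff, aes(x = f2, y = f1_effect, group = f5, linetype = f5)) +
  geom_line() +
  facet_grid(f3 ~ f4)   # 4 panels: f3 divides rows, f4 divides columns
```

Swapping which factors sit on the x-axis, in the line types, and in the panels is just a matter of permuting the names in `aes()` and `facet_grid()`, so iterating is cheap.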
(I recently had to do the same thing for a 10 x 3 x 2 x 2 x 2 design. It took months of looking at graphs off and on, and letting them incubate, before the nature of the 5-way came through.)
UPDATE: I just noticed your question on splitting by proficiency. What you described is an absolute no-no. An interaction in one proficiency group and none in another does not tell you that those groups differ; an interaction with proficiency would. So leaving proficiency out of your interactions gives you much less information. Your interactions mean something, and the meaning is in the data, not in more tests.
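A minimal sketch of the distinction in R, with made-up names (`score`, `a`, `b`, `prof`, `dat`):

```r
# What was described: separate ANOVAs per proficiency group. Comparing the
# significance of a:b across these fits does NOT test a group difference.
fit_high <- aov(score ~ a * b, data = subset(dat, prof == "high"))
fit_low  <- aov(score ~ a * b, data = subset(dat, prof == "low"))

# What actually tests it: put proficiency in the interaction and look at
# the a:b:prof term.
fit_full <- aov(score ~ a * b * prof, data = dat)
summary(fit_full)
```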
As I ask most of my students: if you're going to test everything afterwards, what was the point of doing an ANOVA in the first place?
Assuming equal $n$s [but see note 2 below] for each treatment in a one-way layout, and that the pooled SD from all the groups is used in the $t$ tests (as is done in usual post hoc comparisons), the maximum possible $p$ value for a pairwise $t$ test is $2\Phi(-\sqrt{2}) \approx .1573$ (here, $\Phi$ denotes the $N(0,1)$ cdf). Thus, no $p_t$ can be as high as $0.5$. Interestingly (and rather bizarrely), the $.1573$ bound holds not just for $p_F=.05$, but for any significance level we require of $F$.
The justification is as follows: for a given range of the sample means, $\max_{i,j}|\bar y_i - \bar y_j| = 2a$, the largest possible $F$ statistic is achieved when half of the $\bar y_i$ lie at one extreme and the other half at the other. This is the case where $F$ looks the most significant given that no two means differ by more than $2a$.
So, without loss of generality, suppose that $\bar y_. = 0$, so that $\bar y_i = \pm a$ in this boundary case. And again without loss of generality, suppose that $MS_E = 1$, since we can always rescale the data to this value. Now consider $k$ means (where $k$ is even for simplicity [but see note 1 below]); then $F = \frac{\sum n\bar y_i^2/(k-1)}{MS_E} = \frac{kna^2}{k-1}$. Setting $p_F = \alpha$, so that $F = F_\alpha = F_{\alpha,k-1,k(n-1)}$, we obtain $a = \sqrt{\frac{(k-1)F_\alpha}{kn}}$. When all the $\bar y_i$ are $\pm a$ (and still $MS_E = 1$), each nonzero $t$ statistic is $t = \frac{2a}{1\cdot\sqrt{2/n}} = \sqrt{\frac{2(k-1)F_\alpha}{k}}$. This is the smallest possible maximum $t$ value when $F = F_\alpha$.
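A quick R check of this boundary case, for one arbitrary choice of $k$ and $n$; it just re-traces the algebra above, so nothing here is new:

```r
# Boundary case for k = 4 groups of n = 10 at alpha = .05, with MS_E = 1.
k <- 4; n <- 10; alpha <- 0.05
F_alpha <- qf(1 - alpha, df1 = k - 1, df2 = k * (n - 1))
a <- sqrt((k - 1) * F_alpha / (k * n))       # half-range of the group means
t_stat <- 2 * a / sqrt(2 / n)                # t for a pair at opposite extremes
c(t = t_stat,
  p_t = 2 * pt(-t_stat, df = k * (n - 1)))   # largest possible (smallest) p_t
```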
So you can just try different cases of $k$ and $n$, compute $t$, and find its associated $p_t$. But notice that, for given $k$, $F_\alpha$ is decreasing in $n$ [but see note 3 below]; moreover, as $n\rightarrow\infty$, $(k-1)F_{\alpha,k-1,k(n-1)} \rightarrow \chi^2_{\alpha,k-1}$, so $t \ge t_{min} = \sqrt{2\chi^2_{\alpha,k-1}/k}$. Note that $\chi^2_{k-1}/k = \frac{k-1}{k}\cdot\chi^2_{k-1}/(k-1)$ has mean $\frac{k-1}{k}$ and SD $\frac{k-1}{k}\cdot\sqrt{\frac{2}{k-1}}$. So $\lim_{k\rightarrow\infty}t_{min} = \sqrt{2}$, regardless of $\alpha$, and the result I stated in the first paragraph above is obtained from asymptotic normality.
It takes a long time to reach that limit, though. Here are the results (computed using R) for various values of $k$, using $\alpha = .05$:
```
    k    t_min    max p_t     [ really min(max |t|) and max(min p_t) ]
    2    1.960    .0500
    4    1.977    .0481       <-- note: < .05 !
   10    1.840    .0658
  100    1.570    .1164
 1000    1.465    .1428
10000    1.431    .1526
```
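These values come straight from the limiting formula above, so a few lines of R reproduce them:

```r
# Limiting (n -> infinity) minimum |t| and the corresponding two-sided
# normal p value, for alpha = .05 and various k.
alpha <- 0.05
k <- c(2, 4, 10, 100, 1000, 10000)
t_min <- sqrt(2 * qchisq(1 - alpha, df = k - 1) / k)
data.frame(k, t_min = round(t_min, 3), max_p_t = round(2 * pnorm(-t_min), 4))
```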
A few loose ends...
- When $k$ is odd: the maximum $F$ statistic still occurs when the $\bar y_i$ are all $\pm a$; however, there will be one more mean at one end of the range than at the other, making the grand mean $\pm a/k$, and you can show that the factor $k$ in the $F$ statistic is replaced by $k - \frac{1}{k}$. The same substitution occurs in the denominator of the expression for $t$, making $t$ slightly larger and hence decreasing $p_t$.
- Unequal $n$s: the maximum $F$ is still achieved with the $\bar y_i = \pm a$, with the signs arranged to balance the sample sizes as nearly equally as possible. Then the $F$ statistic for the same total sample size $N = \sum n_i$ will be the same as or smaller than it is for balanced data. Moreover, the maximum $t$ statistic will be larger, because it will be the one based on the groups with the largest $n_i$. So we cannot obtain larger $p_t$ values by looking at unbalanced cases.
- A slight correction: I was so focused on trying to find the minimum $t$ that I overlooked the fact that we are trying to maximize $p_t$, and it is less obvious that a larger $t$ with fewer df won't be less significant than a smaller one with more df. However, I verified that this is the case by computing the values for $n = 2, 3, 4, \ldots$ until the df are high enough to make little difference (a sketch of this check follows the list). For the case $\alpha = .05$, $k \ge 3$, I did not see any cases where the $p_t$ values failed to increase with $n$. Note that $df = k(n-1)$, so the possible df are $k, 2k, 3k, \ldots$, which get large fast when $k$ is large. So I'm still on safe ground with the claim above. I also tested $\alpha = .25$, and the only case I observed where the $.1573$ threshold was exceeded was $k = 3, n = 2$.
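Here is the kind of sweep I mean, sketched in R for one choice of $k$ (repeating it over other values of $k$ and $\alpha$ is the same idea):

```r
# For fixed k, check that p_t increases with n (alpha = .05, k = 3 shown).
alpha <- 0.05; k <- 3
n   <- 2:100
df2 <- k * (n - 1)
t_n <- sqrt(2 * (k - 1) * qf(1 - alpha, k - 1, df2) / k)
p_t <- 2 * pt(-t_n, df2)
all(diff(p_t) > 0)   # TRUE here: p_t is monotone increasing in n
```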
First, multicollinearity means that there is a linear relationship among your independent variables. Correlation is neither a necessary nor a sufficient condition for collinearity (although, with only 3 IVs, it is very hard to have one without the other; with more IVs, it is entirely possible).
Second, if you are deciding between ridge and lasso, I would go with ridge regression here. See this thread for some notes on ridge regression with categorical variables. Ridge regression deliberately biases the parameter estimates in order to reduce their variance, but it won't (usually) remove variables entirely. The lasso removes some variables from the equation altogether, and that probably isn't what you want here, especially if the interaction is important.
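If you go that route, here is a minimal ridge sketch with the glmnet package (`alpha = 0` gives ridge, `alpha = 1` the lasso); the variable names are invented:

```r
# Hypothetical ridge fit: f1, f2 are factors, x3 numeric, y the response.
library(glmnet)
X <- model.matrix(~ f1 * f2 + x3, data = dat)[, -1]  # dummy-code the factors
fit <- cv.glmnet(X, dat$y, alpha = 0)                # cross-validated ridge
coef(fit, s = "lambda.min")                          # shrunken coefficients
```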
Third, I think partial least squares is a better solution to collinearity than principal components, because PLS also takes the relationship with the dependent variable into account. However, with only three independent variables you are likely to get a single component, and I doubt that will give you a useful result. Also, see this thread for some notes on PLS with categorical variables.
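For completeness, a PLS sketch with the pls package (again with invented names; with three IVs you may well end up keeping a single component):

```r
# Hypothetical PLS fit; the formula interface dummy-codes the factors.
library(pls)
fit <- plsr(y ~ f1 * f2 + x3, data = dat, validation = "CV")
summary(fit)   # explained variance per component, plus CV error
```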
Finally, have you considered regression trees and their offshoots such as random forests?