What makes a test statistic "extreme" depends on your alternative, which imposes an ordering (or at least a partial order) on the sample space - you seek to reject those cases most consistent (in the sense being measured by a test statistic) with the alternative.
When you don't really have an alternative to give you something to be most consistent with, you're essentially left with the likelihood to give the ordering, as most often seen in Fisher's exact test. There, the probability of the outcomes (the 2x2 tables) under the null orders the test statistic (so that 'extreme' means 'low probability').
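As a rough sketch of that ordering in code (pure Python; the table values are hypothetical), the two-sided Fisher p-value sums the null (hypergeometric) probabilities of every table, with the same margins, that is no more probable than the one observed:

```python
from math import comb

def hypergeom_pmf(k, M, n, N):
    """P(top-left cell = k) for a 2x2 table with total M,
    first-row total n, and first-column total N."""
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]]:
    'at least as extreme' = 'no more probable under the null'."""
    M = a + b + c + d
    n, N = a + b, a + c
    support = range(max(0, N + n - M), min(n, N) + 1)
    p_obs = hypergeom_pmf(a, M, n, N)
    # Sum the probabilities of all tables at least as "extreme"
    # (a small tolerance guards against floating-point ties).
    return sum(hypergeom_pmf(k, M, n, N) for k in support
               if hypergeom_pmf(k, M, n, N) <= p_obs * (1 + 1e-9))
```

For the classic "lady tasting tea" table `[[3, 1], [1, 3]]` this gives 34/70 ≈ 0.486, because the tables with top-left cell 0, 1, 3 and 4 are all no more probable than the observed one.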
If you were in a situation where the far left (or far right, or both) of your bimodal null distribution was associated with the kind of alternative you were interested in, you wouldn't seek to reject a test statistic of 60. But if you're in a situation where you don't have an alternative like that, then 60 is unusual - it has low likelihood; a value of 60 is inconsistent with your model and would lead you to reject.
[This would be seen by some as one central difference between Fisherian and Neyman-Pearson hypothesis testing. By introducing an explicit alternative, and a ratio of likelihoods, a low likelihood under the null won't necessarily cause you to reject in a Neyman-Pearson framework (as long as it performs relatively well compared to the alternative), while for Fisher, you don't really have an alternative, and the likelihood under the null is the thing you're interested in.]
I'm not suggesting either approach is right or wrong here - you go ahead and work out for yourself what kind of alternatives you seek power against, whether it's a specific one, or just anything that's unlikely enough under the null. Once you know what you want, the rest (including what 'at least as extreme' means) pretty much follows from that.
Your understanding is mostly correct. Let $X$ be a random variable that follows the same distribution as your test statistic under the null hypothesis. The p value is the probability that a randomly drawn $X$ is at least as large as the test statistic you computed. If that probability is very low, then that is good reason to believe that the null hypothesis does not hold.
You just need to be careful about the difference in terminology between p value and significance level. A significance level is a pre-specified cutoff p value, below which you reject the null hypothesis and above which you do not have enough evidence to reject the null hypothesis. The p value itself is just a probability-valued function of the test statistic that gets smaller as the test statistic gets more extreme (for a right-tailed test, one minus the CDF of the null distribution, evaluated at the observed statistic).
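A minimal sketch of that distinction, assuming a hypothetical right-tailed z-test where the statistic is standard normal under the null:

```python
from statistics import NormalDist

null = NormalDist(mu=0, sigma=1)  # assumed null distribution of the statistic
observed = 1.96                   # hypothetical observed z-statistic

# The p value is the tail probability beyond the observed statistic,
# i.e. 1 - CDF(observed), so it shrinks as the statistic grows.
p_value = 1 - null.cdf(observed)  # ~ 0.025

alpha = 0.05                      # pre-specified significance level
reject = p_value < alpha          # the decision compares the two
```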
So the significance level does not determine the probability of rejecting the null hypothesis. The significance level is the largest p value that you would still count as sufficient evidence to reject the null. When you set a significance level, you are setting an upper bound: if the p value falls below it, you consider the observed test statistic too extreme to believe it was randomly drawn from the null distribution.
You might have been confused by someone talking about type 1 error rates and such. All that means is that if you run the experiment many times, the null hypothesis is true every time, and you set your significance level to $\alpha$, you will reject the null hypothesis $\alpha \times 100$% of the time purely due to random chance. Understanding this can help you set reasonable $\alpha$ levels if you do plan to do null hypothesis testing.
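You can see that long-run rate in a quick simulation (a sketch, assuming the same hypothetical one-sided z-test; the trial count and seed are arbitrary):

```python
import random
from statistics import NormalDist

random.seed(0)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value (~1.645)

trials = 20_000
rejections = 0
for _ in range(trials):
    z = random.gauss(0, 1)  # statistic drawn with the null actually true
    if z > z_crit:
        rejections += 1

# The observed rejection rate hovers near alpha = 0.05,
# purely due to random chance.
print(rejections / trials)
```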
Best Answer
The significance level ($\alpha$) is the rate at which you make Type I errors when the null hypothesis is true (or, for composite hypotheses, the maximum rate under the null).
You choose that rate.
Then any test statistic that's more extreme than the one that cuts off $\alpha$ in the tail (i.e. a test statistic more in keeping with the alternative) will cut off a smaller area. That area is the p-value. So when the p-value is small, it means your sample yields a test statistic inside the rejection region.
(the picture is similar for two-tailed tests, but then yellow and green areas occur in both tails)
To actually get that rate of rejection when the null is true, you need to reject in that proportion of most-extreme cases under the null -- so if your test statistic cuts off a smaller area (green) than the significance level, it's in the region of sample arrangements (in this case, those with unusually large means) that will lead you to reject.
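The picture's logic can be checked directly: "p-value at most $\alpha$" and "statistic at or beyond the critical value" describe the same event. A sketch, again assuming a hypothetical one-sided z-test:

```python
from statistics import NormalDist

null = NormalDist()
alpha = 0.05
z_crit = null.inv_cdf(1 - alpha)  # cuts off area alpha in the upper tail

for z in (1.0, 1.7, 2.5):         # arbitrary example statistics
    p = 1 - null.cdf(z)           # area cut off beyond the observed statistic
    # The statistic lands in the rejection region exactly when p <= alpha.
    assert (p <= alpha) == (z >= z_crit)
```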