Hypothesis Testing – How to Compare Means with ‘Actual’ Size Instead of ‘Sample’ Size?

games, hypothesis testing, mean, modeling, statistical significance

From Chess SE: What is white's increased advantage in chess90 as compared to chess870? (Chess960 can be split into two subsets, chess90 and chess870.)


Intro:

It is known in standard chess that white has an advantage over black. This holds true for the variant chess960 (…sort of), where the pieces behind the pawns are shuffled subject to certain restrictions, resulting in 960 possible starting positions.

For the 960 possible starting positions (SPs), I notice that in some cases you have to give up castling rights on one side in order to gain castling rights on the other side. There are 90 such SPs. Consider splitting the 960 SPs into two collections of 870 and 90; call these collections, respectively, chess870 and chess90. I conjectured that chess90, on average, gives more of an 'advantage' to white than chess870 does. Now, what is meant by advantage?

One way that advantage is measured is the evaluation an engine gives of the starting position. The evaluation convention (for any position, not just the starting position) is: 0 means drawn, and positive/negative means, respectively, that white/black is favoured.

In chess960, one such set of evaluations is the Sesse evaluations: the evaluations by an engine (I think some older version of Stockfish) of each chess960 position up to a 'depth' of around 39. (More here.) For example, the standard SP (RNBQKBNR) is evaluated at 0.22.


And now:

I confirmed my conjecture: the average Sesse eval for chess870 is 0.1790, while the average for chess90 is 0.1913. The percentage change from chess870 to chess90 is

$(0.1913-0.1790)/0.1790 \approx 0.0687$, i.e. about $6.87\%$, almost $7\%$.

However, it's been pointed out to me that this percentage change is not necessarily 'significant'. For example, the percentage change from 0.01 to 0.03 is 200%, but that difference is hardly meaningful, in the sense that evaluations anywhere from 0.00 to 0.05 hardly show any advantage for white.

By Mobeus Zoom:

This percentage doesn't tell the whole picture or even close: there's a 200% difference between +0.01 and +0.03 but they're hardly distinguishable as chess engine evaluations. A better way would be to hypothesis test. (That said, it's an interesting question you ask and have started analysing.)


Questions:

Question 1: While I kinda have a feeling the measure of 'significance' here is gonna depend largely on something qualitative, like when we consider an evaluation to mean something advantageous for white…

Is there a way to test for (quantitative) statistical significance here, such as with hypothesis testing?

Note: I'm not (necessarily) asking what mathematical or statistical ways there might be in general to measure significance, so I don't think this should necessarily be closed as off-topic. I'm asking (but not necessarily asking only) whether there exists a way to use statistics here.

Question 2: About hypothesis testing, it's been like half a decade since I've done statistics, but how exactly is hypothesis testing relevant here?

Maybe $H_0: \mu_1 = \mu_2;\ H_1: \mu_1 < \mu_2$ or something, but from what I understand, hypothesis testing goes like this: we get sample means $\hat \mu_1$ and $\hat \mu_2$ and then compare them to make guesses about the true means $\mu_1 = E[X_1]$ and $\mu_2 = E[X_2]$, which come from underlying random variables $X_1$ and $X_2$, respectively. But what exactly are $X_1$ and $X_2$? I don't see any randomness here unless we do something like…

I guess:

$X_1$ is the evaluation of a starting position picked uniformly at random from chess870, and $X_2$ is the same for chess90. Then the sample means come from drawing samples of some appropriate sizes from each of the populations chess870 and chess90.
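
For concreteness, here's a minimal sketch of that two-sample test in Python (my sketch, not part of the original question). The arrays evals_870 and evals_90, and the simulated values in them, are placeholders; in practice they'd hold the actual Sesse evaluations:

```python
# Two-sample Welch t-test for H0: mu_870 = mu_90 vs H1: mu_870 < mu_90.
# evals_870 / evals_90 are simulated stand-ins for the real Sesse evaluations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
evals_870 = rng.normal(loc=0.1790, scale=0.05, size=870)  # placeholder data
evals_90 = rng.normal(loc=0.1913, scale=0.05, size=90)    # placeholder data

# Welch's t-test (equal_var=False) avoids assuming the two collections
# have the same variance; alternative="less" makes the test one-sided.
t_stat, p_value = stats.ttest_ind(evals_870, evals_90,
                                  equal_var=False, alternative="less")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
```

Welch's version is used here because there's no particular reason to assume the two collections have equal variance.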

However:

  • A. I recall that in hypothesis testing, or statistics in general, we don't usually know the true means, yet we have them right here as $\mu_1=0.1790$ and $\mu_2=0.1913$. What I understand of hypothesis testing is that we don't know the true means and so we're making do with limited data. Sure, we can take a random sample of, say, 5% of the points from each population (4.5, rounded to 5, points for chess90 and 43.5, rounded to 44, points for chess870) and then do hypothesis testing on those points, but then why don't we just increase this to 100% of the points from each? And if we do take a random 5% sample from each, then what, we test for significance over, say, 10 random samples and see how many times we get a significant p-value? (A sketch of this idea follows the list below.)

  • B. I don't see how exactly any of the aforementioned tests the significance of the 6.87%. Or is the point not to test the percentage change of the true means but rather to test for significance between whatever the sample means turn out to be?
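
A hedged sketch of the repeated-subsampling idea from point A (again with simulated placeholder data standing in for the real evaluations):

```python
# Point A's idea: repeatedly draw 5% subsamples from each collection,
# test each pair, and count how often p < 0.05.
# Placeholder populations again stand in for the real Sesse evaluations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
evals_870 = rng.normal(0.1790, 0.05, size=870)  # placeholder population
evals_90 = rng.normal(0.1913, 0.05, size=90)    # placeholder population

n_repeats = 10
n_significant = 0
for _ in range(n_repeats):
    s870 = rng.choice(evals_870, size=44, replace=False)  # ~5% of 870
    s90 = rng.choice(evals_90, size=5, replace=False)     # ~5% of 90
    _, p = stats.ttest_ind(s870, s90, equal_var=False, alternative="less")
    n_significant += p < 0.05

print(f"{n_significant} of {n_repeats} subsamples gave p < 0.05")
```

With only 5 points from chess90, each individual test has very little power, which is partly why stopping at a 5% sample seems arbitrary when the full population is available.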

Best Answer

1. Don't confuse statistical significance and practical significance

Your result is statistically significant if the probability of finding a difference as large as the one you have (or larger), under the null hypothesis that the advantage to white is the same under chess870 and chess90, is less than some preset criterion, conventionally $p \lt .05$.

It is practically significant if the difference you did find, +0.1913 versus +0.1790, is large enough for anyone to care about. That isn't a statistical question.
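
One concrete way to operationalize the statistical-significance definition above (a sketch of mine, not something this answer prescribes) is a permutation test: under the null hypothesis the chess870/chess90 labels are exchangeable, so we can reshuffle the labels over all 960 evaluations and see how often a mean difference at least as large as the observed one arises by chance. Placeholder data again:

```python
# Permutation test of the null "the label chess870/chess90 makes no
# difference to the evaluation": shuffle labels, recompute the mean
# difference, and see how often it reaches the observed difference.
# evals_870 / evals_90 are placeholders for the real Sesse evaluations.
import numpy as np

rng = np.random.default_rng(2)
evals_870 = rng.normal(0.1790, 0.05, size=870)  # placeholder data
evals_90 = rng.normal(0.1913, 0.05, size=90)    # placeholder data

pooled = np.concatenate([evals_90, evals_870])
observed = evals_90.mean() - evals_870.mean()

n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # in-place relabelling of all 960 evaluations
    count += pooled[:90].mean() - pooled[90:].mean() >= observed

p_value = (count + 1) / (n_perm + 1)  # add-one-smoothed permutation p-value
print(f"observed diff = {observed:.4f}, permutation p = {p_value:.4f}")
```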

2. With all the data, you don't need to care about statistical significance

You're in an unusual situation because you can evaluate the advantage for literally every possible starting position. If you're totally happy with your definition of advantage (that is, the number generated by this one evaluation engine), then your results are a statement of fact: chess90 starting positions on average yield a greater white advantage than chess870 positions.

Things are more complicated if you want to generalise, for instance to other evaluation engines. If you can only obtain results from some evaluation engines, statistical significance once again comes into play, as you're trying to draw conclusions about evaluation engines in general based on information from a limited sample.
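
If you did want to generalise over engines, one hedged possibility (purely illustrative; the per-engine numbers below are made up) is to treat each engine's chess90-minus-chess870 mean difference as a single observation and run a one-sample t-test against zero:

```python
# A purely illustrative way to generalise across engines: treat each
# engine's (chess90 mean - chess870 mean) as one observation and test
# whether the mean difference across engines exceeds zero.
# The five differences below are made-up numbers, not real engine output.
import numpy as np
from scipy import stats

engine_diffs = np.array([0.0123, 0.0090, 0.0151, 0.0047, 0.0110])

t_stat, p_value = stats.ttest_1samp(engine_diffs, popmean=0.0,
                                    alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
```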
