I have two measurements of something. You can think of them both as their own curve with known error on each point. They look quite similar, and if I calculate the R² value it comes out to >0.9. But what I want to be able to calculate is a P value comparing the two curves (i.e., what is the probability that the difference I'm looking at is just due to noise?). Now, I could easily do a Student's t-test at each point and come up with a P value at each point, sure. But is there some way to come up with an overall P value that uses all the points and not just one? Thanks very much for any help.
Solved – Comparison of two curves
Tags: p-value, probability
Related Solutions
You can find a related question here.
Your method is quite similar to a standard one, namely the "reliability plot".
Say that you choose a target class among the two in a binary classification problem (1 and 0). Given $N$ records in the test set indexed by $i$, let $P_i$ be the true label of record $i$; for a binary classification problem it is either $1$ or $0$. Let $p_i$ instead be the probability assigned by the model. You can then compute the mean squared error, or Brier score, as $$BS = \frac{1}{N}\sum_{i=1}^N\big(P_i-p_{i}\big)^2$$
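For concreteness, here is a minimal Python sketch of the Brier score above; the example data is made up for illustration:

```python
import numpy as np

def brier_score(P, p):
    """Mean squared difference between true labels P (0/1) and predicted probabilities p."""
    P = np.asarray(P, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.mean((P - p) ** 2)

# Hypothetical example with four test records:
print(brier_score([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.8]))  # 0.0625
```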
This score can be written as the sum of two terms, calibration and refinement. The calibration component captures how well the model's probabilities match the true frequencies in the data, while the refinement component captures how well the model discriminates between the classes.
If your model outputs just $k$ distinct probabilities, the data is effectively partitioned into $k$ subsets. Let $p_j$ be the probability for subset $j$, $j=1,\dots,k$. You can compute the empirical frequency $r_j$ of the corresponding group $j$ in the test set, i.e. of the records for which the model returns $p_j$ as output. If $N_j$ is the total number of test-set records predicted as $p_j$, you compute $r_j$ by dividing the number of records of class 1 in the group by $N_j$. Then $$C=\frac{1}{N}\sum_{j=1}^{k}N_j(r_j-p_j)^2$$ $$R=\frac{1}{N}\sum_{j=1}^{k}N_jr_j(1-r_j)$$
In case your model outputs too many distinct probabilities, you can set up $k$ bins and proceed in the same way as above, as in the sketch below.
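A minimal Python sketch of the binned decomposition, assuming $k$ equal-width bins and using the mean predicted probability within each bin as $p_j$ (note that with binning, $BS = C + R$ holds only approximately unless all records in a bin share a single probability):

```python
import numpy as np

def calibration_refinement(P, p, k=10):
    """Decompose the Brier score of labels P (0/1) and probabilities p
    into calibration C and refinement R using k equal-width bins."""
    P = np.asarray(P, dtype=float)
    p = np.asarray(p, dtype=float)
    N = len(P)
    bins = np.minimum((p * k).astype(int), k - 1)  # bin index for each record
    C = R = 0.0
    for j in range(k):
        mask = bins == j
        N_j = mask.sum()
        if N_j == 0:
            continue
        p_j = p[mask].mean()  # mean predicted probability in bin j
        r_j = P[mask].mean()  # empirical frequency of class 1 in bin j
        C += N_j * (r_j - p_j) ** 2
        R += N_j * r_j * (1 - r_j)
    return C / N, R / N
```

A well-calibrated model has $C$ close to zero, while a small $R$ indicates bins that are nearly pure, i.e. good discrimination.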
Your $P_{bin}$ looks similar to $r_j$, but it is not clear how you defined right-decisions-in-bin. If your model outputs probabilities, you have to set a threshold to make a decision before you can talk about right decisions.
You might be looking for the two-sample K-S (Kolmogorov–Smirnov) test.
The MATLAB Statistics Toolbox has an implementation, kstest2:
kstest2(x1,x2) returns a test decision for the null hypothesis that the data in vectors x1 and x2 are from the same continuous distribution, using the two-sample Kolmogorov-Smirnov test. The alternative hypothesis is that x1 and x2 are from different continuous distributions.
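If you work in Python instead, scipy.stats.ks_2samp performs the same two-sample test; a minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=200)  # sample from N(0, 1)
x2 = rng.normal(0.3, 1.0, size=200)  # sample from a shifted distribution

# Null hypothesis: x1 and x2 come from the same continuous distribution.
result = stats.ks_2samp(x1, x2)
print(result.statistic, result.pvalue)  # reject H0 if pvalue is small
```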
Best Answer
Define the error as the difference between the real and the observed value.
Suppose your errors at different points are independent and normally distributed (i.e., no systematic error).
If you know the standard deviation of each error, you also know the standard deviation of the difference between the two curves at each point: for independent errors the variances add, so $\sigma_{\text{diff}} = \sqrt{\sigma_1^2+\sigma_2^2}$. Now you have a vector of differences with a known standard deviation for each one. Divide each difference by its standard deviation, and you have a vector of normalized values, each with standard deviation 1. The null hypothesis is that they are distributed as $N(0,1)$. Test it with any goodness-of-fit test against the standard normal, for example a one-sample Kolmogorov–Smirnov test.
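A minimal Python sketch of this procedure, with made-up curves y1, y2 and per-point error standard deviations s1, s2; the chi-square variant at the end is one alternative way to collapse the normalized differences into a single P value:

```python
import numpy as np
from scipy import stats

# Hypothetical inputs: two curves measured at the same points,
# with known per-point error standard deviations.
y1 = np.array([1.0, 2.1, 2.9, 4.2])
s1 = np.array([0.1, 0.1, 0.2, 0.2])
y2 = np.array([1.1, 2.0, 3.1, 4.0])
s2 = np.array([0.1, 0.2, 0.1, 0.2])

# Standard deviation of each pointwise difference (independent errors).
s_diff = np.sqrt(s1**2 + s2**2)
z = (y1 - y2) / s_diff  # under H0, each z_i ~ N(0, 1)

# Overall P value: one-sample K-S test against the standard normal.
result = stats.kstest(z, "norm")
print(result.pvalue)

# Alternative: the sum of squared z-values is chi-square with n degrees of freedom.
chi2_stat = np.sum(z**2)
print(stats.chi2.sf(chi2_stat, df=len(z)))
```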