The concept you're calling 'exceptionality' is simply a combined variable (via a weighted average) from two or more variables standardized to a Z-score. If there were a way of observing 'exceptionality' as sampled data, you could potentially fit a (standardized) multiple regression with your variables to find the best weights to use.
Let's consider two random variables $A$ and $B$, which are standardized to $Z_A$ and $Z_B$ respectively (meaning each follows a standard normal distribution, i.e. mean of 0 and variance of 1).
The weighted average of $Z_A$ and $Z_B$, where $w_A$ and $w_B$ are the respective weights for $Z_A$ and $Z_B$, is then:
$$
W = \frac{w_A}{w_A+w_B} \cdot Z_A + \frac{w_B}{w_A+w_B} \cdot Z_B
$$
Note that $w_A$ and $w_B$ are constants, whereas $Z_A$ and $Z_B$ are random variables.
Therefore, the expected value of $W$ is as follows:
$$
\text{E}(W) = \frac{w_A}{w_A+w_B} \cdot \text{E}(Z_A) + \frac{w_B}{w_A+w_B} \cdot \text{E}(Z_B) = 0
$$
The variance of $W$, assuming the independence of $Z_A$ and $Z_B$, is:
$$
\text{Var}(W) = \left(\frac{w_A}{w_A+w_B}\right)^2 \cdot \text{Var}(Z_A) + \left(\frac{w_B}{w_A+w_B}\right)^2 \cdot \text{Var}(Z_B) \\ = \left(\frac{w_A}{w_A+w_B}\right)^2 + \left(\frac{w_B}{w_A+w_B}\right)^2
$$
Depending on how unequal the weights $w_A$ and $w_B$ are, the variance of $W$ falls in the interval $[0.5, 1)$: it is minimized at $0.5$ when the two weights are equal, and approaches $1$ as one weight dominates the other. Although the mean is 0, the variance is not 1, so $W$ does not follow a standard normal distribution and therefore cannot be treated as a $Z$-score.
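A quick simulation illustrates this (a sketch in Python; the weights $w_A = 2$, $w_B = 1$ are arbitrary examples, since only their ratio matters):

```python
import random

# Hypothetical weights; only their ratio matters
w_A, w_B = 2.0, 1.0
p = w_A / (w_A + w_B)            # normalized weight on Z_A

random.seed(42)
n = 200_000
# W = p*Z_A + (1-p)*Z_B with independent standard-normal Z_A and Z_B
W = [p * random.gauss(0, 1) + (1 - p) * random.gauss(0, 1) for _ in range(n)]

mean_W = sum(W) / n
var_W = sum((w - mean_W) ** 2 for w in W) / (n - 1)
print(mean_W, var_W)   # mean near 0, variance near p**2 + (1-p)**2 = 5/9
```

With these weights $p = 2/3$, so the variance lands near $4/9 + 1/9 = 5/9 \approx 0.56$, inside $[0.5, 1)$ as claimed.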
To make inferences like "a value of $W$ (the weighted average of the Z-scores) $= 1$ is greater than ~84% of observations", you must first standardize $W$ by dividing it by its standard deviation. The Z-score of $W$ is then:
$$
Z_W = \frac{\frac{w_A}{w_A+w_B} \cdot Z_A + \frac{w_B}{w_A+w_B} \cdot Z_B}{\sqrt{\left(\frac{w_A}{w_A+w_B}\right)^2 + \left(\frac{w_B}{w_A+w_B}\right)^2}}
$$
A value of $1$ for $Z_W$ would indicate that it's greater than ~84% of observations of $Z_W$.
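To see why the standardization matters, compare the percentile of $Z_W = 1$ with the percentile of a *raw* $W = 1$ (a short Python sketch, again with the arbitrary example weights $w_A = 2$, $w_B = 1$):

```python
from statistics import NormalDist
import math

w_A, w_B = 2.0, 1.0                      # hypothetical weights
p = w_A / (w_A + w_B)
sd_W = math.sqrt(p ** 2 + (1 - p) ** 2)  # standard deviation of W

# Phi(1): fraction of a standard normal below 1 (~84%)
print(NormalDist().cdf(1))
# A raw W of 1 is really 1/sd_W standard deviations above the mean,
# so its true percentile is Phi(1/sd_W), which is higher than 84%:
print(NormalDist().cdf(1 / sd_W))
```

In other words, treating an unstandardized $W$ as if it were a Z-score understates how extreme it actually is.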
Please let me know if you have any follow-up questions.
If you generate $z$-statistics by converting the $W$-statistics from a Wilcoxon signed-rank test, that corrects for the non-normality of the data. For example, for $N_r \ge 10$, a $z$-statistic can be calculated as $z = \frac{W}{\sigma_W}$, where $\sigma_W = \sqrt{\frac{N_r(N_r + 1)(2N_r + 1)}{6}}$. That $z$ is then the number of standard deviations of change on the ranked-data scale. To see whether this difference is significant, one converts the $z$-statistic into a probability of no difference; if that probability is small, e.g., $p<0.05$, one accepts the alternative hypothesis of a significant difference as more likely.
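As a minimal sketch of that conversion in Python (standard library only; the marks below are invented for illustration, and the large-sample normal approximation is used without tie or zero corrections):

```python
from statistics import NormalDist
import math

def wilcoxon_z(before, after):
    """z-statistic from the sum-of-signed-ranks form of Wilcoxon's W
    (large-sample normal approximation; zero differences dropped,
    tied |d| given average ranks)."""
    d = [b - a for a, b in zip(before, after) if b != a]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to tied |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    W = sum(math.copysign(r, di) for r, di in zip(ranks, d))
    sigma_W = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = W / sigma_W
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

before = [52, 60, 45, 70, 65, 58, 62, 55, 68, 50]   # hypothetical marks
after  = [60, 66, 50, 78, 70, 65, 61, 63, 75, 58]
z, p = wilcoxon_z(before, after)
print(z, p)    # z ≈ 2.70, p ≈ 0.007: a significant improvement
```

A stats package's Wilcoxon routine does the same ranking and conversion (often with exact small-sample tables or continuity corrections as well).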
In other words, the above is just a description of what the Wilcoxon signed-rank test, available in nearly every stats package, does to calculate a probability. For Excel or R, this link may help.
The second part of your question relates to what was actually learned. If so, you might consider looking not at the mark as a percentage, but at $100 - \text{mark}$, i.e., the fraction of material the student did not know. For example, consider a student who gets a 96 on a final exam before taking the course, and a 98 (on different but similar questions) after it. What that student did not know went from (very approximately) 4% of the course material down to 2%, a twofold improvement (but a very noisy one). Similarly, a student whose mark went from 50% to 75% improved just as much proportionately, but still likely does not know as much of the material as the first student did before taking the course.
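The arithmetic of the two hypothetical students above, on the $100 - \text{mark}$ scale:

```python
# Improvement measured on the "what was not yet known" scale, 100 - mark.
# Both hypothetical students below halve the material they don't know:
for pre, post in [(96, 98), (50, 75)]:
    print((100 - pre) / (100 - post))   # prints 2.0 in both cases
```

Measured this way, 96 → 98 and 50 → 75 are the same proportional improvement, even though the raw gains (2 points vs. 25 points) look very different.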
Summarizing, what answer you obtain depends on how you pose the question. If you want a more exact answer than the above, refine the question a bit more, and I (or someone else) may take a stab at it.
EDIT The OP has refined the question a bit. It now appears to be focused on improvement of grades, where the information content of courses is arbitrarily assigned equal weight independent of course content, difficulty of material, students' prior knowledge of the material, and so forth. We still do not know what is on the tests, at least not explicitly. In a classroom environment, a typical assumption is that the first end-of-semester test covers first-semester work, and the end-of-second-semester test covers work for both semesters.

In such a case, change of grades is the only measure available, and improvement is measured as change in grade point average (GPA), irrespective of which courses are being taken and what the student knows, does not know, or has or has not learned. The changes in GPA can be ranked from best to worst, with the best meaning greatest improvement in GPA; that is a measure important to some people within the context of some school, but not very meaningful in any other context. One can take the change of GPA scores in each semester, irrespective of how many courses they represent or what those courses are, and compare them using a one-sample Wilcoxon test, i.e., a particular student's improvement compared to all improvements, and extract a one-sided probability, or a $z$- or $W$-statistic, from that. The probability would then say whether the student's GPA from each semester (i.e., not the overall, cumulative GPA) improved significantly between semesters, compared to other individual-semester GPA improvements.

What that probability means in more general terms is, well, not much. I would not want, in a job interview, to be known as the one who ranked first, second, or third for GPA improvement, or as someone whose GPA improved significantly, because it speaks to inconsistent performance and poor initial performance.
First, it is unlikely that several different tests of inhibition would be totally uncorrelated across the population of your subjects, and since we are dealing with (roughly normal) Z-scores, correlated scores are effectively not independent. So if Evans (1996) says you'd need to know the correlations to get a meaningful composite Z-score, that is correct.
Second, as far as I can see, the link is assuming that the four z-scores are completely independent. Suppose we use that independence assumption to get a combined score that weights each of the tests equally. Then we have four independent random variables $Z_1, Z_2, Z_3, Z_4$ each with $E(Z_i) = 0,$ and $Var(Z_i) = SD(Z_i) = 1.$ Let $A = \frac 14\sum_i Z_i.$
Then $$E(A) = E\left(\frac 14 \sum Z_i\right) = \frac 14 E\left(\sum Z_i\right)\\ = \frac 14 \sum E(Z_i) = \frac 14(0+0+0+0) = 0.$$ And $$V(A) = V\left(\frac 14 \sum Z_i\right) = \frac{1}{16} V\left(\sum Z_i\right)\\ = \frac{1}{16}\sum V(Z_i) = \frac{1}{16}(1+1+1+1) = \frac{4}{16} = \frac 14.$$
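A simulation confirms the derivation (a Python sketch; four independent standard-normal draws per observation stand in for the four tests):

```python
import random
from statistics import fmean, variance

random.seed(1)
n = 200_000
# A = mean of four independent standard-normal z-scores
A = [fmean(random.gauss(0, 1) for _ in range(4)) for _ in range(n)]
print(fmean(A), variance(A))   # ≈ 0 and ≈ 0.25
```

So the equally weighted composite has standard deviation $1/2$, not $1$; dividing $A$ by $1/2$ would restore a proper Z-scale, exactly as in the weighted case above.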
Addendum: Here are data simulated in R for four positively correlated tests administered to 50 subjects.
Means and standard deviations of the scores were computed for the 50 subjects, and from them the z-scores for each subject relative to the rest of the group of 50.
Subject #27 had the lowest such z-score ($-2.06$), which (not surprisingly) puts that subject at about the 2nd percentile. That subject's scores on the four tests, the corresponding individual z-scores relative to the group of 50, and the percentages of a normal population below those z-scores, were also tabulated (the R output is not reproduced here).
Thus, in effect, one way to derive the z-score $-2.06$ as a 'combination' of the z-scores $-2.83$, $0.65$, $-4.56$, and $-1.51$ is to use this subject's individual exam scores in the context of the other 49 subjects.
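The addendum's procedure can be sketched in Python (the original simulation was in R; here a one-factor model supplies the positive correlation, and the seed and $\rho = 0.6$ are arbitrary assumptions, so the numbers will not match subject #27's exactly):

```python
import random
from statistics import fmean, stdev

random.seed(2024)
n_subj, n_tests, rho = 50, 4, 0.6        # assumed setup

# One-factor model: a shared 'ability' factor makes the four tests
# positively correlated across subjects.
scores = []
for _ in range(n_subj):
    f = random.gauss(0, 1)
    scores.append([rho * f + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
                   for _ in range(n_tests)])

composite = [fmean(row) for row in scores]       # mean of the four tests

# z-score of each subject's composite relative to the group of 50
m, s = fmean(composite), stdev(composite)
z = [(c - m) / s for c in composite]

worst = min(range(n_subj), key=z.__getitem__)
print(worst, round(z[worst], 2))    # subject with the lowest composite z
```

Standardizing the composite against the observed group in this way sidesteps the unknown inter-test correlations: the group itself supplies the correct scale for the combined score.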