Interpreting a two-sided, two-sample Welch t-test

Tags: distributions, javascript, t-test

I’m writing some code (JavaScript) to compare benchmark results. I’m using the Welch t-test because the variance and/or sample size most likely differs between benchmarks. The critical value is pulled from a t-distribution table at 95% confidence (two-sided).

The Welch formula is pretty straightforward, but I am fuzzy on interpreting a significant result. I am not sure whether the critical value should be divided by 2 or not; help clearing that up is appreciated. Also, should I round the degrees of freedom, df, to look up the critical value, or would Math.ceil or Math.floor be more appropriate?

  /**
   * Determines if the benchmark's hertz is higher than another.
   * @member Benchmark
   * @param {Object} other The benchmark to compare.
   * @returns {Number} Returns `1` if higher, `-1` if lower, and `0` if indeterminate.
   */
  function compare(other) {
    // use welch t-test
    // http://frank.mtsu.edu/~dkfuller/notes302/welcht.pdf
    // http://www.public.iastate.edu/~alicia/stat328/Regression%20inference-part2.pdf
    var a = this.stats,
        b = other.stats,
        pow = Math.pow,
        // variance of each sample mean
        bitA = a.variance / a.size,
        bitB = b.variance / b.size,
        // Welch-Satterthwaite approximation of the degrees of freedom
        df = pow(bitA + bitB, 2) / (pow(bitA, 2) / (a.size - 1) + pow(bitB, 2) / (b.size - 1)),
        t = (a.mean - b.mean) / Math.sqrt(bitA + bitB),
        c = getCriticalValue(Math.round(df));

    // check if t-statistic is significant
    return Math.abs(t) > c / 2 ? (t > 0 ? 1 : -1) : 0;
  }

Update: Thanks for all the replies so far! My colleague posted some more info here, in case that affects the advice.

Best Answer

(1a) You don't need the Welch test to cope with different sample sizes. That's automatically handled by the Student t-test.
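
For contrast with the Welch statistic in your compare function, here's a minimal sketch of the pooled (Student) t-statistic. It assumes the same `{ mean, variance, size }` stats shape your code already uses (the function name is just illustrative); note how unequal sample sizes are handled by the weighting alone, with no Welch machinery needed.

    // Pooled (Student) two-sample t-statistic. It assumes equal population
    // variances; unequal sample sizes are handled automatically by the
    // df weights below. `a` and `b` are the same { mean, variance, size }
    // stats objects used in the question's code.
    function studentT(a, b) {
      var dfA = a.size - 1,
          dfB = b.size - 1,
          // pooled variance: df-weighted average of the two sample variances
          pooled = (dfA * a.variance + dfB * b.variance) / (dfA + dfB),
          se = Math.sqrt(pooled * (1 / a.size + 1 / b.size));
      return {
        t: (a.mean - b.mean) / se,
        df: dfA + dfB // exact integer df, so the rounding question never arises
      };
    }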

(1b) If you think there's a real chance the variances in the two populations are strongly different, then you are assuming a priori that the two populations differ. It might not be a difference of location--that's what a t-test evaluates--but it's an important difference nonetheless. Don't paper it over by adopting a test that ignores it! (Differences in variance often arise where one sample is "contaminated" with a few extreme results, simultaneously shifting the location and increasing the variance. Because of the inflated variance it can be difficult to detect the shift in location, no matter how large it is, in a small to medium sample, since the increase in variance is roughly proportional to the squared change in location. This form of "contamination" occurs, for instance, when only a fraction of an experimental group responds to the treatment.) Therefore you should consider a more appropriate test, such as a slippage test. Even better would be a less automated graphical approach using exploratory data analysis techniques.
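
To make the variance point concrete, here is a small illustrative calculation (my own numbers, not anything from your benchmarks). It assumes the responders keep the same baseline spread and only a fraction `p` of the group shifts by `delta`:

    // Illustrative only: contamination shifts the mean linearly but inflates
    // the variance quadratically in the size of the shift. If a fraction `p`
    // of a group with baseline mean `mu` and variance `sigma2` shifts by
    // `delta` (same spread), the mixture has:
    //   mean     = mu + p * delta
    //   variance = sigma2 + p * (1 - p) * delta * delta
    function contaminated(mu, sigma2, p, delta) {
      return {
        mean: mu + p * delta,
        variance: sigma2 + p * (1 - p) * delta * delta
      };
    }

    // Only 20% of the group responds, but the response is large: the mean
    // moves by just 0.2 * 10 = 2, while the variance jumps from 1 to
    // 1 + 0.2 * 0.8 * 100 = 17, which swamps the shift in a t-test.
    console.log(contaminated(0, 1, 0.2, 10)); // { mean: 2, variance: 17 }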

(2) Use a two-sided test when a change of average in either direction (greater or lesser) is possible. Otherwise, when you are testing only for an increase or decrease in average, use a one-sided test.
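
Concretely, for the `c / 2` question: you don't halve the critical value; you halve the tail probability you look it up at. A sketch, using a hypothetical `criticalValue(df, upperTailProb)` helper in place of your table-based `getCriticalValue(df)`:

    // `criticalValue(df, upperTailProb)` is a hypothetical helper returning
    // the t value with the given upper-tail probability.
    var ALPHA = 0.05;

    // Two-sided test: a difference in either direction counts, so split
    // alpha across both tails and compare |t| to the upper alpha/2 point.
    function isSignificantTwoSided(t, df) {
      return Math.abs(t) > criticalValue(df, ALPHA / 2);
    }

    // One-sided test (e.g. "is this benchmark faster?"): all of alpha sits
    // in one tail, so compare t itself to the upper alpha point.
    function isSignificantOneSided(t, df) {
      return t > criticalValue(df, ALPHA);
    }

Either way, it is the tail probability that gets split in the two-sided case, never the critical value itself.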

(3) Rounding would be incorrect, and you shouldn't have to do it: most algorithms for computing t distributions don't care whether the DoF is an integer. Rounding is not a big deal, but if you're using a t-test in the first place you're presumably concerned about small sample sizes (otherwise the simpler z-test would work fine), and even small changes in DoF can matter a little.
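
If you are stuck with an integer-indexed table, a common workaround is to interpolate between the two nearest rows instead of rounding. A sketch, with a hypothetical `TABLE` mapping integer df to two-sided 95% critical values:

    // Linear interpolation between table rows for a fractional df, rather
    // than Math.round/Math.ceil/Math.floor. `TABLE` is a hypothetical lookup
    // of two-sided 95% critical values keyed by integer degrees of freedom.
    function interpolatedCritical(df, TABLE) {
      var lo = Math.floor(df),
          hi = Math.ceil(df);
      if (lo === hi) {
        return TABLE[lo];
      }
      // weight the two neighbouring critical values by how close df is to each
      return TABLE[lo] + (TABLE[hi] - TABLE[lo]) * (df - lo);
    }

    // e.g. df = 7.4 blends the df = 7 and df = 8 rows:
    // interpolatedCritical(7.4, { 7: 2.365, 8: 2.306 }) ≈ 2.341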
