The statistical term deviance is thrown around a bit too much. Most of the time, programs return the deviance
$$ D(y) = -2 \log{\{p(y | \hat{\theta})\}},$$
where $\hat{\theta}$ is the parameter (or parameter vector) estimated from model fitting and $y$ is a possible realization of the random quantity in question.
The more common deviance, the one you refer to, treats the quantity above as a function of two arguments, the data and the fitted parameters: $$ D(y,\hat{\theta}) = -2\log{\{p(y|\hat{\theta})\}}$$
and so if you had one $y$ value but two competing, fitted parameter values, $\hat{\theta}_{1}$ and $\hat{\theta}_{2}$, then you'd get the deviance you mentioned from $$-2(\log{\{p(y|\hat{\theta}_{1})\}} - \log{\{p(y|\hat{\theta}_{2})\}}).$$
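As a minimal numerical sketch of that difference (in Python rather than Matlab, and assuming, purely for illustration, Poisson-distributed counts with two made-up candidate rate estimates):

```python
import numpy as np
from scipy.stats import poisson

# Made-up observed counts and two competing fitted Poisson rates (illustrative only)
y = np.array([2, 0, 3, 1, 4])
theta_hat_1, theta_hat_2 = 1.8, 2.4

# Log-likelihood of the data under each fitted parameter value
ll_1 = poisson.logpmf(y, theta_hat_1).sum()
ll_2 = poisson.logpmf(y, theta_hat_2).sum()

# Deviance difference between the two fits: -2 * (ll_1 - ll_2)
print(-2 * (ll_1 - ll_2))
```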
You can read about the Matlab function that you mentioned, glmfit(), linked here. A more fruitful, though shorter, discussion of the deviance is linked here.
The deviance statistic implicitly assumes two models: the first is your fitted model, returned by glmfit(); call this parameter vector $\hat{\theta}_{1}$. The second is the "full model" (also called the "saturated model"), in which there is a free parameter for every data point; call this parameter vector $\hat{\theta}_{s}$. Having so many free parameters is obviously a stupid thing to do, but it does allow you to fit the data exactly.
So then, the deviance statistic is computed as the difference between the log-likelihoods evaluated at the fitted model and at the saturated model. Let $Y=\{y_{1}, y_{2}, \cdots, y_{N}\}$ be the collection of the $N$ data points. Then:
$$DEV(\hat{\theta}_{1},Y) = -2\biggl[\log{p(Y|\hat{\theta}_{1})} - \log{p(Y|\hat{\theta}_{s})} \biggr]. $$
The terms above expand into sums over the individual data points $y_{i}$ by the independence assumption. If you want to use this computation to recover the log-likelihood of the model, you'll first need the log-likelihood of the saturated model. Here is a link that explains some ideas for computing this... but the catch is that, in any case, you'll need to write down a function that computes the log-likelihood for your type of data, and at that point it's probably better to compute the log-likelihood directly rather than backtracking it out of a deviance calculation.
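Here is a minimal Python sketch of that calculation, assuming (purely for illustration) Poisson data and an intercept-only fitted mean, and using the fact that the saturated Poisson model sets each fitted mean equal to the observed value:

```python
import numpy as np
from scipy.stats import poisson

# Made-up Poisson data and a fitted mean from some model (intercept-only here)
y = np.array([2, 0, 3, 1, 4])
mu_fitted = np.full(len(y), y.mean())

# Saturated model: one free parameter per observation, so each fitted mean equals y_i
ll_fitted = poisson.logpmf(y, mu_fitted).sum()
ll_saturated = poisson.logpmf(y, y).sum()

# DEV(theta_hat_1, Y) = -2 * [log p(Y | theta_hat_1) - log p(Y | theta_hat_s)]
deviance = -2 * (ll_fitted - ll_saturated)
print(deviance, ll_fitted)
```

Since `ll_fitted` is already in hand here, there is nothing to back out of the deviance; that is the sense in which writing the log-likelihood yourself is the simpler route.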
See Chapter 6 of Bayesian Data Analysis for some good discussion of deviance.
As for your second point about the likelihood test statistic, yes, it sounds like you basically know the right thing to do. But in many cases the null hypothesis is something that expert, external knowledge lets you specify ahead of time (such as a particular coefficient being equal to zero); it's not necessarily something that comes out of model fitting.
In general layman's terms (and not just for this problem),
- Null Hypothesis $H_0$: no change or difference (i.e. the classifiers have the same performance, however you define it)
- Alternative Hypothesis $H_1$: there is some sort of difference in performance
For your classifier performance comparison problem, I recommend reading Chapter 6 of Japkowicz & Shah, which goes into detail on how to use significance testing to assess the performance of different classifiers. (Other chapters give more background on classifier comparison - sounds like they might interest you too.)
In your case,
- to compare two classifiers (on a single domain) you may use a matched-pairs $t$-test, $t=\frac{\bar{d}}{\bar{\sigma}_d / \sqrt{n}}$, where $\bar{d} = \bar{\text{pm}}(f_1) - \bar{\text{pm}}(f_2)$ is the difference of the mean performance measures (whatever measure you choose) obtained by applying the two classifiers $f_1$ and $f_2$, $n$ is the number of trials, and $\bar{\sigma}_d$ is the sample standard deviation of the per-trial differences (a short code sketch of both cases follows this list)
- to compare multiple classifiers (on a single domain) you may use one-way ANOVA (i.e. an F-test) to check whether there is any difference among the means (though it cannot tell you which ones differ) and then follow up with post-hoc tests, such as Tukey's Honestly Significant Difference (HSD) test, to identify which pairs of classifiers exhibit significant differences.
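A short Python sketch of both procedures, using SciPy and statsmodels; the per-fold scores below are made-up placeholders for whatever performance measure you pick:

```python
import numpy as np
from scipy.stats import ttest_rel, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Made-up per-trial scores (e.g. accuracy over 10 cross-validation folds)
scores_f1 = rng.normal(0.82, 0.03, size=10)
scores_f2 = rng.normal(0.80, 0.03, size=10)
scores_f3 = rng.normal(0.78, 0.03, size=10)

# Two classifiers, one domain: matched-pairs t-test on the per-trial differences
t_stat, p_pair = ttest_rel(scores_f1, scores_f2)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_pair:.3f}")

# Several classifiers, one domain: one-way ANOVA ...
F_stat, p_anova = f_oneway(scores_f1, scores_f2, scores_f3)
print(f"one-way ANOVA: F = {F_stat:.3f}, p = {p_anova:.3f}")

# ... followed by Tukey's HSD to see which pairs actually differ
all_scores = np.concatenate([scores_f1, scores_f2, scores_f3])
labels = ["f1"] * 10 + ["f2"] * 10 + ["f3"] * 10
print(pairwise_tukeyhsd(all_scores, labels))
```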
The book goes into far more detail, so I do recommend reading that chapter.
And in terms of baselines, the tests I've mentioned don't distinguish between a baseline and a non-baseline. This is a good thing, as it gives you flexibility to decide which comparisons to weight more heavily in your analysis. The number of comparisons you actually make determines whether you should rely on the first or the second approach above.
The $t$-test is not a very good choice for comparing classifiers, since their performance measures are generally not normally distributed. I suggest considering the Wilcoxon signed-ranks test instead.
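A minimal Python sketch, with made-up paired scores standing in for your two classifiers' results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Made-up paired scores for two classifiers (per dataset or per fold)
scores_a = np.array([0.81, 0.74, 0.90, 0.68, 0.77, 0.85, 0.72, 0.88])
scores_b = np.array([0.78, 0.75, 0.86, 0.66, 0.74, 0.84, 0.70, 0.83])

# Wilcoxon signed-ranks test on the paired differences; no normality assumption
stat, p_value = wilcoxon(scores_a, scores_b)
print(f"W = {stat}, p = {p_value:.3f}")
```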
For a thorough read on this topic, please refer to this paper (available here); it lists multiple arguments against using the $t$-test for classifier comparison.