The fiducial argument is to interpret the likelihood as a probability distribution. Although the likelihood measures the plausibility of parameter values, it does not satisfy the axioms of probability measures (in particular, there is no guarantee that it integrates to 1), which is one of the reasons the concept was never very successful.
Let's give an example. Imagine that you want to estimate a parameter, say the decay rate $\lambda$ of a radioactive element (the half-life is $\ln(2)/\lambda$). You take some measurements of decay times, say $(x_1, \ldots, x_n)$, from which you try to infer the value of $\lambda$. In the traditional or frequentist view, $\lambda$ is not a random quantity. It is an unknown constant with likelihood function $\lambda^n \prod_{i=1}^n e^{-\lambda x_i} = \lambda^n e^{-\lambda(x_1+\ldots+x_n)}$.
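For concreteness, the frequentist point estimate comes from maximizing this likelihood; setting the derivative of the log-likelihood to zero gives

$$\frac{d}{d\lambda}\left[\, n \log \lambda - \lambda \sum_{i=1}^n x_i \,\right] = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \quad\Longrightarrow\quad \hat{\lambda} = \frac{n}{x_1 + \ldots + x_n}.$$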
In the Bayesian view, $\lambda$ is a random variable with a prior distribution; the measurements $(x_1, \ldots, x_n)$ are used to deduce the posterior distribution. For instance, if my prior belief about the value of $\lambda$ is well represented by the density $2.3 \cdot e^{-2.3\lambda}$, the joint distribution is the product of the prior and the likelihood, i.e. $2.3 \cdot \lambda^n e^{-\lambda(2.3+x_1+\ldots+x_n)}$. The posterior is the distribution of $\lambda$ given the measurements, computed with Bayes' formula. In this case, $\lambda$ has a Gamma distribution with parameters $n+1$ and $2.3+x_1+\ldots+x_n$.
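Writing out that last step (the normalizing constant is dropped since it does not involve $\lambda$):

$$p(\lambda \mid x_1, \ldots, x_n) \propto \underbrace{2.3\, e^{-2.3\lambda}}_{\text{prior}} \times \underbrace{\lambda^n e^{-\lambda(x_1+\ldots+x_n)}}_{\text{likelihood}} \propto \lambda^{(n+1)-1} e^{-\lambda(2.3+x_1+\ldots+x_n)},$$

which is the kernel of a Gamma density with shape $n+1$ and rate $2.3+x_1+\ldots+x_n$.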
In the fiducial view, $\lambda$ is also a random variable, but it does not have a prior distribution, just a fiducial distribution that depends only on $(x_1, \ldots, x_n)$. To follow up on the example above, the fiducial density is proportional to $\lambda^n e^{-\lambda(x_1+\ldots+x_n)}$. This is the same as the likelihood, except that it is now interpreted as a probability distribution over $\lambda$. After normalization, it is a Gamma distribution with parameters $n+1$ and $x_1+\ldots+x_n$.
These differences have their most noticeable effects in the context of confidence interval estimation. A 95% confidence interval in the classical sense is a procedure that has a 95% chance of covering the target value before any data are collected. For a fiducial statistician, however, a 95% interval is a set that has a 95% chance of containing the target value given the data (which is exactly the misinterpretation that students of the frequentist approach are warned against).
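As a quick illustration (a sketch with simulated data; the sample x and the true rate 0.5 are made up for this example), the fiducial 95% interval for $\lambda$ is just the central 95% region of the Gamma$(n+1,\ x_1+\ldots+x_n)$ fiducial distribution:

set.seed(1839)
x <- rexp(20, rate = 0.5)  # hypothetical decay-time measurements
n <- length(x)
# central 95% fiducial interval from the Gamma(n + 1, sum(x)) distribution
qgamma(c(0.025, 0.975), shape = n + 1, rate = sum(x))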
In plain language, you can interpret it as an "overall test": it is testing a number of things at once. The most frequent way it is used, at least in my area of statistics in the social sciences, is to refer to testing an entire factor rather than the individual levels within it. Consider the following data frame:
set.seed(1839)
# simulate a continuous predictor x, a continuous outcome y, and a
# four-level factor z (levels a, b, c, d, each repeated 25 times)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  z = factor(rep(letters[1:4], 25)))
head(dat)
           x           y z
1  1.0127014 -0.98199201 a
2 -0.6845605  0.37451740 b
3  0.3492607 -0.08189552 c
4 -1.6245010 -0.08237190 d
5 -0.5162476  1.14766587 a
6 -0.7025836 -0.67800240 b
Here y is the dependent variable, x is a continuous independent variable, and z is a categorical independent variable with four levels (a, b, c, or d).
If we run the regression model, we get:
mod1 <- lm(y~x+z, dat)
summary(mod1)
Call:
lm(formula = y ~ x + z, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-2.73332 -0.66347  0.03676  0.58965  2.25179

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01422    0.19244   0.074    0.941
x            0.03245    0.10671   0.304    0.762
zb          -0.15265    0.27293  -0.559    0.577
zc           0.22139    0.27229   0.813    0.418
zd          -0.06219    0.27830  -0.223    0.824

Residual standard error: 0.962 on 95 degrees of freedom
Multiple R-squared:  0.02297,   Adjusted R-squared:  -0.01817
F-statistic: 0.5583 on 4 and 95 DF,  p-value: 0.6935
Notice that the output tests three specific contrasts at the end: a vs. b, a vs. c, and a vs. d. What if we want to know whether the variable z, overall, contributes any explanatory power to predicting y? We can do an omnibus test, which tests all of the levels jointly to see whether there is a significant difference anywhere among them. One way to do this is to compare a model with z in it to one without z in it:
mod1 <- lm(y~x+z, dat)
mod2 <- lm(y~x, dat)
anova(mod2, mod1)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + z
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 89.802
2     95 87.912  3    1.8899 0.6808  0.566
This is an omnibus test: it is not looking at one specific comparison, but asking whether the whole factor z (i.e., all of its levels at once; omnibus is Latin for "for all") is significant.
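As a side note (a base-R alternative, not part of the original comparison), the same omnibus F test can be read off a single fitted model with drop1(), which drops each term in turn and reports an F test for it:

# F test for dropping each term from mod1; the row for z reproduces
# the anova(mod2, mod1) comparison above (F = 0.6808 on 3 and 95 df)
drop1(mod1, test = "F")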
Best Answer
@Sycorax is correct. "Hard examples" refers to the examples in the training set that are misclassified by the current version of the classifier. Often the technique is applied only to the background (negative) class, which is too large a set to mine without some kind of strategy (binary classification on imbalanced sets is hard).
The term was probably coined by Girshick (I think?) in the seminal DPM article, and it is now widely used in the object detection community, for instance in OHEM, where the negative windows used at each step of training are chosen according to their current score.
The latter article is an example of online hard example mining (hence the title), whereas the ICIP article explores different offline hard example mining strategies.
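To make the idea concrete, here is a minimal sketch of the offline flavor (hypothetical function and variable names, written in R to match the rest of this document): score a large pool of negatives with the current model and keep only the top-scoring, i.e. hardest, ones for the next training round.

# scores: the current classifier's scores on a large pool of
# background (negative) windows; higher means more confidently
# (and wrongly) predicted as positive, i.e. "harder"
mine_hard_negatives <- function(scores, k) {
  # indices of the k highest-scoring negatives
  order(scores, decreasing = TRUE)[seq_len(min(k, length(scores)))]
}

set.seed(1839)
scores <- runif(10000)                 # stand-in for model scores
hard_idx <- mine_hard_negatives(scores, k = 256)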