The fiducial argument is to interpret the likelihood as a probability distribution. Although the likelihood measures the plausibility of parameter values, it does not satisfy the axioms of probability measures (in particular, there is no guarantee that it integrates to 1), which is one of the reasons the concept was never very successful.
Let's give an example. Imagine that you want to estimate a parameter, say the decay rate $\lambda$ of a radioactive element (the half-life is $\ln(2)/\lambda$). You take some measurements of decay times, say $(x_1, \ldots, x_n)$, from which you try to infer the value of $\lambda$. In the traditional or frequentist view, $\lambda$ is not a random quantity. It is an unknown constant with likelihood function $\lambda^n \prod_{i=1}^n e^{-\lambda x_i} = \lambda^n e^{-\lambda(x_1+\ldots+x_n)}$.
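For concreteness, the frequentist point estimate comes from maximizing this likelihood; setting the derivative of the log-likelihood to zero gives

$$\frac{d}{d\lambda}\left[\, n \log \lambda - \lambda \sum_{i=1}^n x_i \,\right] = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \quad\Longrightarrow\quad \hat{\lambda} = \frac{n}{x_1 + \ldots + x_n}.$$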
In the Bayesian view, $\lambda$ is a random variable with a prior distribution; the measurements $(x_1, \ldots, x_n)$ are used to deduce the posterior distribution. For instance, if my prior belief about the value of $\lambda$ is well represented by the density $2.3 \cdot e^{-2.3\lambda}$, the joint distribution is the product of the prior and the likelihood, i.e. $2.3 \cdot \lambda^n e^{-\lambda(2.3+x_1+\ldots+x_n)}$. The posterior is the distribution of $\lambda$ given the measurements, computed with Bayes' formula. In this case, $\lambda$ has a Gamma distribution with parameters $n+1$ and $2.3+x_1+\ldots+x_n$.
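Writing out that last step (the normalizing constant is dropped since it does not involve $\lambda$):

$$p(\lambda \mid x_1, \ldots, x_n) \propto \underbrace{2.3\, e^{-2.3\lambda}}_{\text{prior}} \times \underbrace{\lambda^n e^{-\lambda(x_1+\ldots+x_n)}}_{\text{likelihood}} \propto \lambda^{(n+1)-1} e^{-\lambda(2.3+x_1+\ldots+x_n)},$$

which is the kernel of a Gamma density with shape $n+1$ and rate $2.3+x_1+\ldots+x_n$.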
In the fiducial view, $\lambda$ is also a random variable, but it does not have a prior distribution, just a fiducial distribution that depends only on $(x_1, \ldots, x_n)$. To follow up on the example above, the fiducial density is proportional to $\lambda^n e^{-\lambda(x_1+\ldots+x_n)}$. This is the same as the likelihood, except that it is now interpreted as a probability distribution over $\lambda$. After normalization, it is a Gamma distribution with parameters $n+1$ and $x_1+\ldots+x_n$.
These differences have their most noticeable effects in the context of confidence interval estimation. A 95% confidence interval in the classical sense is a procedure that has a 95% chance of covering the target value before any data are collected. For a fiducial statistician, however, a 95% interval is a set that has a 95% chance of containing the target value given the data (which is exactly the misinterpretation that students of the frequentist approach are warned against).
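As a quick illustration (a sketch with simulated data; the sample x and the true rate 0.5 are made up for this example), the fiducial 95% interval for $\lambda$ is just the central 95% region of the Gamma$(n+1,\ x_1+\ldots+x_n)$ fiducial distribution:

set.seed(1839)
x <- rexp(20, rate = 0.5)  # hypothetical decay-time measurements
n <- length(x)
# central 95% fiducial interval from the Gamma(n + 1, sum(x)) distribution
qgamma(c(0.025, 0.975), shape = n + 1, rate = sum(x))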
In plain language, you can interpret it as an "overall test": it is testing a number of things at once. The most frequent way it is used, at least in my area of statistics in the social sciences, is to refer to testing an entire factor rather than the individual levels within it. Consider the following data frame:
set.seed(1839)
# simulate a continuous predictor x, a continuous outcome y, and a
# four-level factor z (levels a, b, c, d, each repeated 25 times)
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  z = factor(rep(letters[1:4], 25)))
head(dat)
           x           y z
1  1.0127014 -0.98199201 a
2 -0.6845605  0.37451740 b
3  0.3492607 -0.08189552 c
4 -1.6245010 -0.08237190 d
5 -0.5162476  1.14766587 a
6 -0.7025836 -0.67800240 b
Here y is the dependent variable, x is a continuous independent variable, and z is a categorical independent variable with four levels (a, b, c, or d).
If we run the regression model, we get:
mod1 <- lm(y~x+z, dat)
summary(mod1)
Call:
lm(formula = y ~ x + z, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-2.73332 -0.66347  0.03676  0.58965  2.25179

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01422    0.19244   0.074    0.941
x            0.03245    0.10671   0.304    0.762
zb          -0.15265    0.27293  -0.559    0.577
zc           0.22139    0.27229   0.813    0.418
zd          -0.06219    0.27830  -0.223    0.824

Residual standard error: 0.962 on 95 degrees of freedom
Multiple R-squared:  0.02297,   Adjusted R-squared:  -0.01817
F-statistic: 0.5583 on 4 and 95 DF,  p-value: 0.6935
Notice that the output tests three specific contrasts at the end: a vs. b, a vs. c, and a vs. d. What if we want to know whether the variable z, overall, contributes any explanatory power to predicting y? We can do an omnibus test, which tests all of the levels jointly to see whether there is a significant difference anywhere among them. One way to do this is to compare a model with z in it to one without z in it:
mod1 <- lm(y~x+z, dat)
mod2 <- lm(y~x, dat)
anova(mod2, mod1)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + z
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 89.802
2     95 87.912  3    1.8899 0.6808  0.566
This is an omnibus test: it is not looking at one specific comparison, but asking whether the whole factor z (i.e., all of its levels at once; omnibus is Latin for "for all") is significant.
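As a side note (a base-R alternative, not part of the original comparison), the same omnibus F test can be read off a single fitted model with drop1(), which drops each term in turn and reports an F test for it:

# F test for dropping each term from mod1; the row for z reproduces
# the anova(mod2, mod1) comparison above (F = 0.6808 on 3 and 95 df)
drop1(mod1, test = "F")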
Best Answer
@Sycorax is correct. "Hard examples" refers to the examples in the training set that are misclassified by the current version of the classifier. Often the technique is applied only to the background (negative) class, which is too large a set to mine without some kind of strategy (binary classification on imbalanced sets is hard).
The term was probably coined by Girshick (I think?) in the seminal DPM article, and it is now widely used in the object detection community, for instance in OHEM, where the negative windows used at each step of training are chosen according to their current score.
The latter article is an example of online hard example mining (hence the title), whereas the ICIP article explores different offline hard example mining strategies.
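To make the idea concrete, here is a minimal sketch of the offline flavor (hypothetical function and variable names, written in R to match the rest of this document): score a large pool of negatives with the current model and keep only the top-scoring, i.e. hardest, ones for the next training round.

# scores: the current classifier's scores on a large pool of
# background (negative) windows; higher means more confidently
# (and wrongly) predicted as positive, i.e. "harder"
mine_hard_negatives <- function(scores, k) {
  # indices of the k highest-scoring negatives
  order(scores, decreasing = TRUE)[seq_len(min(k, length(scores)))]
}

set.seed(1839)
scores <- runif(10000)                 # stand-in for model scores
hard_idx <- mine_hard_negatives(scores, k = 256)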