It is advisable to use equal proportions of classes when $n$ (the number of examples) is small, or more generally when your sample is not a good representation of the population. However, this practice often does not generalize well to a different test set. A better approach is to look at the ratio of negative images to other images that generally occurs in reality, because that gives you a hint about the optimal ratio to use. If you are not sure what the optimal ratio should be, you can choose any proportion for the negative class and set the `class_weight` parameter accordingly while training the model.
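As a sketch of the `class_weight` idea (the data here is synthetic and only for illustration; any real-world ratio should come from your domain), one common heuristic is to weight each class inversely to its frequency, which is what scikit-learn's `class_weight="balanced"` computes:

```python
import numpy as np

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
# This is the same formula scikit-learn uses for class_weight="balanced".
y = np.array([0] * 90 + [1] * 10)  # imbalanced labels: 90% negative, 10% positive
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# class_weight -> {0: ~0.56, 1: 5.0}; the rare class gets the larger weight.
# This dict can then be passed to training code that accepts a class_weight
# argument (e.g. many scikit-learn estimators, or Keras model.fit).
```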
Softmax outputs a probability vector. That means that
- the elements are nonnegative and
- the elements sum to 1.
To train a classification model with $m \ge 3$ classes, the standard approach is to use softmax as the final activation with multinomial cross-entropy loss. For a single instance, the loss is
$$
\begin{align}
\mathcal{L}
&= -\sum_{j=1}^m y_j \log(p_j)
\end{align}
$$
where $y$ is a vector with one value of 1 and the rest zero and $p_j$ are our predicted probabilities from the softmax. If the single value of 1 in $y$ is at index $k$, then the loss achieves a minimum value of 0 when $p_k = 1$. When $p_k=1$, this implies that the rest of the $p_{j\neq k}$ are all 0 (because $p$ is a vector of probabilities, so the total is 1).
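The loss above can be checked numerically. This is a minimal sketch (not from the original answer) computing softmax and the cross-entropy for a one-hot label:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, p):
    # y is one-hot, so only the term at the true class survives the sum
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])   # arbitrary logits
p = softmax(z)                   # nonnegative, sums to 1
y = np.array([1.0, 0.0, 0.0])   # true class is index 0

loss = cross_entropy(y, p)       # equals -log(p[0]); goes to 0 as p[0] -> 1
```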
In a comment, OP proposes using ReLU instead of softmax. However, there are some problems with this proposal.
1. You can still encounter $\log(0)$, because ReLU can return zeros. (But this is not fatal, because we can "patch" it: a strictly positive activation like $\text{ReLU}(x)+\epsilon$ for some small $\epsilon>0$ avoids this.)
2. For ReLUs, the sum of $p$ can be any nonnegative value, so $p$ is not a probability vector. Because $-\log(p_k)$ decreases without bound as $p_k$ increases, the model will never stop training. (But this isn't fatal either; penalizing the weights and biases, or otherwise constraining them, will prevent them from drifting away to $\pm\infty$.) For softmax, on the other hand, the largest $p_k$ can ever be is 1, so the minimum loss is 0.
3. ReLU does not force a tradeoff among the units, whereas softmax does. If you use softmax and want to increase the value of $p_k$, you have to decrease $\sum_{i\neq k} p_i$, so the loss is high whenever $p$ and $y$ differ. By contrast, a ReLU model can return the same vector of constants and incur the same loss no matter what the label is.
Consider the three-class case where the correct prediction is the second class and the model outputs the constant vector $p=(c,c,c)$. We have $$\mathcal{L}=-0\times \log(c)-1\times\log(c)-0\times\log(c)=-\log(c).$$ Likewise, this same loss is obtained for the same $p$ and any label vector $y$.
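This constant-output degeneracy is easy to verify numerically; the sketch below (using the $\epsilon$-patched output mentioned earlier) shows the loss is identical for every possible label:

```python
import numpy as np

eps = 1e-6
c = 0.7
p = np.full(3, c) + eps  # constant "patched" ReLU output (c, c, c)

# Cross-entropy against each of the three possible one-hot labels
losses = []
for k in range(3):
    y = np.zeros(3)
    y[k] = 1.0
    losses.append(-np.sum(y * np.log(p)))

# All three losses equal -log(c + eps): the output carries no class information.
```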
Clearly, (3) is fatal because the model has no useful information about which class is the most likely. A model that can always reduce the loss by ignoring the input entirely is a bogus model.
The key detail about softmax is that it forces a tradeoff among the values of $p$, because assigning any probability to the incorrect class is penalized. The only softmax model which has 0 multinomial cross-entropy loss is the model that assigns probability of 1 to the correct class for all instances.
Softmax isn't the only function you could use. A function like
$$
f(z)_i = \frac{\text{softplus}(z_i)}{\sum_j \text{softplus}(z_j)}
$$
where the softplus function is
$$
\text{softplus}(x)=\log(1+\exp(x))
$$
could also work for a multi-class classification model because $f$ is
- positive (avoids divide by zero),
- non-negative and sums to 1 (is a probability), and
- monotonic increasing.
We care about monotonicity because we want the property that larger $z_i$ imply larger probabilities. A non-monotonic function like squaring or absolute value would mean that we predict a certain class for both very large and very small values. See: Why is softmax function used to calculate probabilities although we can divide each value by the sum of the vector?
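The softplus-normalized alternative above can be sketched directly; the check below confirms it is strictly positive, sums to 1, and preserves the ordering of the inputs:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written via logaddexp for numerical stability
    return np.logaddexp(0.0, x)

def softplus_normalize(z):
    s = softplus(z)        # strictly positive for any real input
    return s / s.sum()     # normalize so the outputs sum to 1

z = np.array([3.0, -1.0, 0.5])
p = softplus_normalize(z)
# p is strictly positive, sums to 1, and since softplus is monotonic
# increasing, larger z_i yield larger p_i.
```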
Best Answer
Yes, it can be a problem. A very similar example was used in the Unmasking Clever Hans Predictors and Assessing What Machines Really Learn paper by Lapuschkin et al (see below). They show an example of a neural network that learned to detect "horse" based on the fact that training set images with horses contained a textual tag.
The same might apply to your problem: if some of the classes have pictures with watermarks and some don't, the neural network can learn simply to detect the watermark to perform the classification. It can also be a problem if different classes have different watermarks or different placements of watermarks. And it is not only about watermarks: if you have images from different sources for different classes (or in different proportions across classes), the model can learn to detect features of the source rather than of the subject. An example of this was presented in another paper, where the model learned to detect a "husky" dog based solely on the fact that the pictures of huskies were all taken in winter and contained snow.
Unfortunately, removing the watermarks does not necessarily help. In the examples used by Lapuschkin et al it did, but keep in mind that there is no way to remove a watermark perfectly. You may be able to remove it so that it is not visible to a human, but in most such cases the removed watermark can still be detected, whether with neural networks or with much more primitive algorithms.
So the watermarks may be a problem, and if you use this data you should pay close attention to validating whether and how your model uses the watermarks on the images (removed or not) and how they influence your results.
Finally, the images are most likely watermarked because you didn't pay for a commercial-use license. In that case, it may be illegal for you to use the images to train your model, and you should first consult a lawyer.