Solved – Dealing with sparse categories in binary cross-entropy

keras, neural-networks, sparse

In Keras, I'm using something similar to the Keras IMDB example to build a topic-modelling example. However, unlike that example, which has a single positive/negative classification, I have over a hundred topics which are not mutually exclusive. Every training example has a corresponding output vector of zeros with 3 or 4 ones, e.g.:

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ..., 0]
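(For reference, a multi-hot target like this can be built from per-example lists of topic indices. Below is a minimal sketch assuming 120 topic classes and made-up index lists, not my actual data.)

import numpy as np

num_topics = 120  # assumed number of topic classes
# hypothetical per-example lists of topic indices
topic_indices = [[7, 21, 22], [3, 45, 80, 101]]

y = np.zeros((len(topic_indices), num_topics), dtype='float32')
for row, topics in enumerate(topic_indices):
    y[row, topics] = 1.0  # set a 1 for every topic the example belongs to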

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(120, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Of course the model quickly jumps up to 95-97% accuracy, but when I look at the output it's predicting nothing but zeros. Clearly the class imbalance (every class has far more negative examples than positive ones) is causing the predictions to stay at 0. Is there a way to tweak the model so it handles sparse binary targets?
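(To see why plain accuracy is so flattering here: with roughly 3-4 positives out of 120 labels, a model that predicts all zeros is already right about 97% of the time. A quick check with those assumed numbers:)

num_labels = 120
positives_per_example = 3.5  # roughly 3 or 4 topics per example

# accuracy of a model that always predicts 0 for every label
all_zero_accuracy = 1 - positives_per_example / num_labels
print(all_zero_accuracy)  # ~0.97, matching the scores above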

Best Answer

I think the problem is the sigmoid activation function in your output layer. Binary cross-entropy computes the sigmoid again as part of the loss computation (see the description in TensorFlow: https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits). Just changing the activation function in the output layer to linear worked in our (similarly structured) case.
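As a rough sketch of what that change could look like, using the TensorFlow op linked above as a custom Keras loss (the loss function name below is my own, not something from the original post):

import tensorflow as tf
from keras.layers import Dense

def sigmoid_ce_from_logits(y_true, y_pred):
    # y_pred are raw logits from a linear output layer;
    # the sigmoid is applied inside the loss instead.
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_pred),
        axis=-1)

# output layer with linear activation instead of sigmoid
model.add(Dense(120, activation='linear'))
model.compile(loss=sigmoid_ce_from_logits,
              optimizer='adam',
              metrics=['accuracy'])

Note that the model then outputs logits rather than probabilities, so at inference time you would apply a sigmoid yourself (or threshold at 0 instead of 0.5).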