Cross entropy is defined on probability distributions, not single values. The reason it works for classification is that classifier output is (often) a probability distribution over class labels. For example, the outputs of logistic/softmax functions are interpreted as probabilities. The observed class label is also treated as a probability distribution: the empirical distribution (where the probability is 1 for the observed class and 0 for the others).
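For example (a minimal NumPy sketch with made-up numbers, not taken from any particular library's API), the cross entropy between a one-hot label distribution and a softmax output reduces to the negative log of the probability assigned to the observed class:

```python
import numpy as np

# Made-up numbers: a softmax output over 3 classes and the one-hot
# empirical distribution for an observed label of class 1.
q = np.array([0.2, 0.7, 0.1])   # model's predicted class probabilities
p = np.array([0.0, 1.0, 0.0])   # empirical distribution of the observed label

# General cross entropy between the two discrete distributions ...
cross_entropy = -np.sum(p * np.log(q))

# ... collapses to -log q[observed class] because p is one-hot.
print(cross_entropy, -np.log(q[1]))   # both are about 0.357
```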
The concept of cross entropy applies equally well to continuous distributions, but it can't be used for regression models that output only a point estimate (e.g. the conditional mean) rather than a full probability distribution. If you had a model that gave the full conditional distribution (the probability of the output given the input), you could use cross entropy as a loss function.
For continuous distributions $p$ and $q$, the cross entropy is defined as:
$$H(p, q) = -\int_{Y} p(y) \log q(y) dy$$
Just considering a single observed input/output pair $(x, y)$, $p$ would be the empirical conditional distribution (a delta function over the observed output value), and $q$ would be the modeled conditional distribution (probability of output given input). In this case, the cross entropy reduces to $-\log q(y \mid x)$. Summing over data points, this is just the negative log likelihood!
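To make that reduction explicit, write the empirical conditional distribution as a delta function (point mass) at the observed value:
$$H(p, q) = -\int_{Y} \delta(y' - y) \log q(y' \mid x) \, dy' = -\log q(y \mid x),$$
and summing over a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ gives $\sum_{i=1}^{n} -\log q(y_i \mid x_i)$, the negative log likelihood.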
No, it doesn't make sense to use TensorFlow functions like tf.nn.sigmoid_cross_entropy_with_logits for a regression task. In TensorFlow, “cross-entropy” is shorthand (or jargon) for “categorical cross entropy.” Categorical cross entropy is an operation on probabilities. A regression problem attempts to predict continuous outcomes, rather than classifications.
The jargon "cross-entropy" is a little misleading, because there are any number of cross-entropy loss functions; however, it's a convention in machine learning to refer to this particular loss as "cross-entropy" loss.
If we look beyond the TensorFlow functions that you link to, then of course there are any number of possible cross-entropy functions. This is because the general concept of cross-entropy is about the comparison of two probability distributions. Depending on which two probability distributions you wish to compare, you may arrive at a different loss than the typical categorical cross-entropy loss. For example, the cross-entropy of a Gaussian target with some varying mean but fixed diagonal covariance reduces to mean-squared error. The general concept of cross-entropy is outlined in more detail in other questions on this site.
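As a concrete illustration of that last point, here is a minimal framework-agnostic sketch (plain NumPy; the function name gaussian_nll and the toy targets/predictions are my own choices, not part of any TensorFlow API) of using the Gaussian cross entropy, i.e. the negative log likelihood, as a regression loss:

```python
import numpy as np

def gaussian_nll(y_true, mu_pred, sigma2=1.0):
    """Per-example negative log likelihood of y_true under N(mu_pred, sigma2).

    This is the empirical cross entropy between the observed targets and the
    model's Gaussian predictive distribution with fixed variance sigma2.
    """
    return 0.5 * np.log(2 * np.pi * sigma2) + (y_true - mu_pred) ** 2 / (2 * sigma2)

# Made-up predictions and targets for illustration.
y_true = np.array([1.0, 2.0, 3.0])
mu_pred = np.array([1.1, 1.8, 3.2])

loss = gaussian_nll(y_true, mu_pred).sum()
print(loss)  # has the same minimizer as the squared-error loss, up to constants
```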
Best Answer
In a regression problem you have pairs $(x_i, y_i)$ and some true model $q$ that characterizes $q(y \mid x)$. Let's say you assume that your model density is
$$f_\theta(y \mid x)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y-\mu_\theta(x))^2\right\}$$
and you fix $\sigma^2$ to some value.
The mean $\mu_\theta(x_i)$ is then modelled, e.g., via a neural network (or any other model).
Writing the empirical approximation to the cross entropy, you get:
$$\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$
$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) +\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right]$$
If we e.g. set $\sigma^2 = 1$ (i.e. assume we know the variance; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:
$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi}}\right) +\frac{1}{2}(y_i-\mu_\theta(x_i))^2\right]$$
Minimizing this is equivalent to minimizing the $L2$ loss.
So we have seen that minimizing the cross entropy under the assumption of normality is equivalent to minimizing the $L2$ loss.
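For a quick numerical sanity check of that equivalence (a small NumPy sketch with made-up numbers): with $\sigma^2 = 1$, the empirical cross entropy and the (halved) $L2$ loss differ only by an additive constant that does not depend on the predictions, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)       # observed targets y_i
mu = rng.normal(size=n)      # candidate predictions mu_theta(x_i)

# Empirical cross entropy under a Gaussian model with sigma^2 = 1 ...
ce = np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - mu) ** 2)

# ... is the halved L2 loss plus a constant independent of mu.
l2 = 0.5 * np.sum((y - mu) ** 2)
print(np.isclose(ce, l2 + n * 0.5 * np.log(2 * np.pi)))  # True
```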