The answer is yes, but you have to define it the right way.
Cross entropy is defined on probability distributions, not on single values. For discrete distributions $p$ and $q$, it's:
$$H(p, q) = -\sum_y p(y) \log q(y)$$
When the cross entropy loss is used with 'hard' class labels, what this really amounts to is treating $p$ as the conditional empirical distribution over class labels. This is a distribution where the probability is 1 for the observed class label and 0 for all others. $q$ is the conditional distribution (probability of class label, given input) learned by the classifier. For a single observed data point with input $x_0$ and class $y_0$, we can see that the expression above reduces to the standard log loss (which would be averaged over all data points):
$$-\sum_y I\{y = y_0\} \log q(y \mid x_0) = -\log q(y_0 \mid x_0)$$
Here, $I\{\cdot\}$ is the indicator function, which is 1 when its argument is true and 0 otherwise (this is what the empirical distribution is doing). The sum is taken over the set of possible class labels.
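As a quick numerical check of this reduction (a minimal NumPy sketch; the distribution values are made up for illustration):

```python
import numpy as np

# Classifier's predicted distribution q(y | x0) over 3 classes (made-up values)
q = np.array([0.7, 0.2, 0.1])

# Hard label y0 = 0, expressed as the empirical distribution p (one-hot)
p = np.array([1.0, 0.0, 0.0])

# Full cross entropy: -sum_y p(y) log q(y)
cross_entropy = -np.sum(p * np.log(q))

# The same quantity as the standard log loss: -log q(y0 | x0)
log_loss = -np.log(q[0])

print(np.isclose(cross_entropy, log_loss))  # True
```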
In the case of 'soft' labels like you mention, the labels are no longer class identities themselves, but probabilities over two possible classes. Because of this, you can't use the standard expression for the log loss. But, the concept of cross entropy still applies. In fact, it seems even more natural in this case.
Let's call the class $y$, which can be 0 or 1. And, let's say that the soft label $s(x)$ gives the probability that the class is 1 (given the corresponding input $x$). So, the soft label defines a probability distribution:
$$p(y \mid x) = \left \{
\begin{array}{cl}
s(x) & \text{If } y = 1 \\
1-s(x) & \text{If } y = 0
\end{array}
\right .$$
The classifier also gives a distribution over classes, given the input:
$$
q(y \mid x) = \left \{
\begin{array}{cl}
c(x) & \text{If } y = 1 \\
1-c(x) & \text{If } y = 0
\end{array}
\right .
$$
Here, $c(x)$ is the classifier's estimated probability that the class is 1, given input $x$.
The task is now to determine how different these two distributions are, using the cross entropy. Plug these expressions for $p$ and $q$ into the definition of cross entropy, above. The sum is taken over the set of possible classes $\{0, 1\}$:
$$
\begin{array}{ccl}
H(p, q) & = & - p(y=0 \mid x) \log q(y=0 \mid x) - p(y=1 \mid x) \log q(y=1 \mid x)\\
& = & -(1-s(x)) \log (1-c(x)) - s(x) \log c(x)
\end{array}
$$
That's the expression for a single, observed data point. The loss function would be the mean over all data points. Of course, this can be generalized to multiclass classification as well.
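Written out in code, the soft-label cross entropy averaged over data points looks like this (a NumPy sketch; the arrays `s` and `c` of soft labels and classifier probabilities are invented for illustration, and predictions are clipped to avoid $\log 0$):

```python
import numpy as np

def soft_label_cross_entropy(s, c, eps=1e-12):
    """Mean of -(1 - s) log(1 - c) - s log(c) over all data points."""
    c = np.clip(c, eps, 1 - eps)  # guard against log(0)
    return np.mean(-(1 - s) * np.log(1 - c) - s * np.log(c))

s = np.array([0.9, 0.3, 0.5])   # soft labels p(y=1 | x)
c = np.array([0.8, 0.2, 0.5])   # classifier outputs c(x)
print(soft_label_cross_entropy(s, c))
```

Note that, as expected for a cross entropy, the loss is minimized when the classifier's output matches the soft label exactly.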
Cross entropy is definitely the way to go. I don't know Keras but TF has this: https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
Here is a paper directly implementing this: Fully Convolutional Networks for Semantic Segmentation by Shelhamer et al.
The U-Net paper is also a very successful implementation of the idea, using skip connections to avoid loss of spatial resolution. You can find many implementations of it on the net.
From my personal experience, you might want to start with a simple encoder-decoder network first, but do not use strides (i.e. strides=1), otherwise you lose a lot of resolution because the upsampling is not perfect. Go with small kernel sizes. I don't know your specific application, but even a 2-3 hidden layer network will give very good results. Use 32-64 channels per layer. Start simple: 2 hidden layers, 32 channels each, 3x3 kernels, stride=1, and experiment with the parameters in isolation to see their effect. Keep the spatial dimensions equal to the input dimensions at first to avoid resolution loss. Afterwards you can switch on strides and upsampling and implement ideas like U-Net, which works extremely well for medical image segmentation.
For class imbalance see https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/
Here the idea is to weight the different classes with $\alpha$ and $\beta$ parameters.
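A sketch of that weighting idea (assumptions: binary labels, $\alpha$ multiplies the positive-class term and $\beta$ the negative-class term; the numbers are made up):

```python
import numpy as np

def weighted_cross_entropy(y, c, alpha, beta, eps=1e-12):
    """Class-weighted binary cross entropy: alpha weights class 1, beta class 0."""
    c = np.clip(c, eps, 1 - eps)  # guard against log(0)
    return np.mean(-alpha * y * np.log(c) - beta * (1 - y) * np.log(1 - c))

# Up-weight a rare positive class by choosing alpha > beta (made-up values)
y = np.array([1.0, 0.0, 0.0, 0.0])
c = np.array([0.6, 0.1, 0.2, 0.1])
print(weighted_cross_entropy(y, c, alpha=3.0, beta=1.0))
```

With `alpha = beta = 1` this reduces to the ordinary (unweighted) binary cross entropy.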
The first option is sensible, as it's the usual MAE/MSE, which are used as reconstruction losses in many other situations. You can also use a cross entropy loss on the $w\times h\times 3$ values.
I do not recommend your second option, as treating pixel values as class labels destroys the ordinal relationship between them, i.e. $0<1<\dots<255$.
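To make the first option concrete (a minimal NumPy sketch; the images are random placeholders and pixel values are assumed scaled to $[0, 1]$):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))          # w x h x 3 target image in [0, 1]
reconstruction = rng.random((4, 4, 3))  # model output, same shape

mae = np.mean(np.abs(target - reconstruction))   # L1 reconstruction loss
mse = np.mean((target - reconstruction) ** 2)    # L2 reconstruction loss
print(mae, mse)
```

Both losses respect the ordinal structure of pixel intensities: a prediction of 254 for a true value of 255 is penalized far less than a prediction of 0.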