Yes, they need to compute the weights once, but not apply them to the whole loss. Instead, each pixel in the loss (before summing in both directions) should take a weight, so overall 'weights' is a tensor just like the others.

Say there are only two classes, and the frequency of $ C_1 $ is twice that of $ C_2 $. One pixel is correctly predicted as $ C_2 $ with confidence $ [0.3, 0.7] $. The unweighted loss for that pixel is $ -\mathrm{sum}([0, 1] .* \log[0.3, 0.7]) $. With the weight included, the loss becomes $ -\mathrm{sum}([0, 1] .* \log[0.3, 0.7]) \cdot 2 $, because $ C_2 $ should count twice to restore the balance. So for each pixel the weight is either 1 or 2, depending on which class the pixel belongs to. This constructs a weight matrix. However, it is convenient to think of it as a tensor, because the weight value corresponding to the other class is multiplied by 0 in 'true_dist'. In that case the loss for a single pixel can be written as $ -\mathrm{sum}([0, 1] .* \log[0.3, 0.7] .* [1, 2]) $, where $ .* $ is point-wise multiplication, and the result is unchanged. Written this way, the whole thing is a single point-wise multiplication.
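For concreteness, here is a minimal NumPy sketch of that per-pixel computation (the variable names are my own, not from your code):

```python
import numpy as np

true_dist = np.array([0.0, 1.0])   # one-hot label: the pixel belongs to C_2
pred      = np.array([0.3, 0.7])   # predicted confidences for [C_1, C_2]
w         = np.array([1.0, 2.0])   # inverse-frequency class weights [w_1, w_2]

# Unweighted categorical cross-entropy for this single pixel: -log(0.7)
unweighted = -np.sum(true_dist * np.log(pred))

# Weighted version: the zero entries of true_dist mask out the other
# class' weight, so the point-wise product picks out w_2 only.
weighted = -np.sum(true_dist * np.log(pred) * w)

print(unweighted, weighted)  # weighted == 2 * unweighted
```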
PS: This didn't fit in the comment section.
EDIT: I can't edit your code because the weight calculation section is not included. If you calculated weights for N classes, then it's a 1×N vector. You will construct a 3D array, $ W_{ijk} $, from these weights. The first and second dimensions of this array correspond to 'img_col' and 'img_row' respectively. The third dimension will be a function of 'true_dist', $ T_{ijk} $, at the corresponding pixel. I guess this is where you are confused, so I will try to be more explicit at this point. Say N is 4 and the weight vector you calculated is denoted $ w = [w_1,w_2,w_3,w_4] $; the weight values are inversely correlated with the frequency of each class. If a pixel $ (a,b) $ belongs to class $ C_3 $, then $ T_{ab.} = [0, 0, 1, 0] $ and $ W_{ab.} = T_{ab.} .* [w_1,w_2,w_3,w_4] = [0,0,w_3,0] $, where $ .* $ is point-wise multiplication. So only the 3rd class' weight value affects that individual pixel $ (a,b) $. As you see, $ W $ is a function of $ T $. What you need to do is evaluate $ W $ before passing it into the loss. You can write a function that takes $ T $ as input.
The 'weights' in the code, denoted $ W $ (capital), is a 3D array; $ w $, the vector, holds the inverse-frequency values for each class. A minimal sketch of this construction follows.
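A NumPy sketch of that construction (the shapes and function name are my assumptions, since the original code isn't shown):

```python
import numpy as np

def build_weight_tensor(T, w):
    """Build W_{ijk} = T_{ijk} .* w_k for a one-hot label tensor T
    of shape (img_col, img_row, N) and a class-weight vector w of shape (N,)."""
    return T * w  # broadcasting multiplies each pixel's one-hot row by w

# Toy example: a 2x2 image, N = 4 classes
w = np.array([1.0, 0.5, 2.0, 4.0])
T = np.zeros((2, 2, 4))
T[0, 0, 2] = 1.0  # pixel (0, 0) belongs to C_3
W = build_weight_tensor(T, w)
print(W[0, 0])  # -> [0., 0., 2., 0.]
```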
EDIT2: Sorry for the mess I created here. You don't need the point-wise multiplication to create $ W_{ijk} $, because it is already done inside the loss function. So just replicate $ w $ at each pixel of $ W $:
$ \forall (a,b), W_{ab.} = [w_1,w_2,w_3,w_4]$
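Following EDIT2, the construction reduces to tiling $ w $ over the spatial dimensions. A sketch under the same assumed shapes:

```python
import numpy as np

def replicate_weights(w, img_col, img_row):
    """Replicate the class-weight vector w at every pixel,
    giving W of shape (img_col, img_row, N)."""
    return np.broadcast_to(w, (img_col, img_row, w.shape[0]))

w = np.array([1.0, 0.5, 2.0, 4.0])
W = replicate_weights(w, img_col=2, img_row=2)
print(W[0, 0])  # -> [1., 0.5, 2., 4.] at every pixel

# The loss itself then does the masking via true_dist:
# loss = -sum(T * log(pred) * W), summed over pixels and classes.
```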
Yes, it does make sense. But in order to "mix" features from distant parts of an image and account for context, the network should either be very deep (like ResNet), or have wide convolution kernels (like dilated convolutions), or use some kind of post-processing (like a fully-connected CRF, although that is not very semantic).
The dilated convolutions paper specifically targets increasing the size of the receptive field for the output of a given unit on an intermediate layer. To keep that tractable, the stride is pushed into the convolution kernels. It seems like a powerful idea for accounting for global context, in particular for semantic segmentation.
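As an illustration (my own minimal Keras sketch, not code from the paper): stacking convolutions with growing dilation rates expands the receptive field roughly exponentially with depth, without pooling and without extra parameters.

```python
from tensorflow.keras import layers, models

# Dilation rates 1, 2, 4 give an exponentially growing receptive field
# while preserving the spatial resolution of the feature maps.
inputs = layers.Input(shape=(None, None, 3))  # arbitrary image size
x = layers.Conv2D(32, 3, dilation_rate=1, padding='same', activation='relu')(inputs)
x = layers.Conv2D(32, 3, dilation_rate=2, padding='same', activation='relu')(x)
x = layers.Conv2D(32, 3, dilation_rate=4, padding='same', activation='relu')(x)
outputs = layers.Conv2D(4, 1, activation='softmax')(x)  # per-pixel class scores
model = models.Model(inputs, outputs)
```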
Best Answer
I suggest you have a look at the paper Fully Convolutional Networks for Semantic Segmentation.
This paper proposes an architecture that can take inputs of arbitrary dimensions and produce segmentation heat maps.
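The key mechanism (sketched below under my own assumptions, not the paper's exact architecture) is that a network built purely from convolutional layers imposes no fixed input size:

```python
from tensorflow.keras import layers, models

# A toy fully convolutional network: no Dense/Flatten layers, so the
# input height and width can be left undefined (None), and the output
# heat map scales with the input.
inputs = layers.Input(shape=(None, None, 3))
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D(2)(x)  # downsample
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)  # upsample back
heatmap = layers.Conv2D(4, 1, activation='softmax')(x)  # per-class heat maps
model = models.Model(inputs, heatmap)
# Works for any input whose sides are divisible by the pooling factor.
```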