Basically, YOLO combines detection and classification into one loss function: the green part corresponds to whether or not any object is there, while the red part encourages the network to correctly determine which object is there, if one is present.
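Assuming the colored equation is the one from the YOLO v1 paper (it is not reproduced here, so this mapping is my reading), the two parts are

$$\underbrace{\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{obj}}\bigl(C_i-\hat{C}_i\bigr)^2+\lambda_{\text{noobj}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{noobj}}\bigl(C_i-\hat{C}_i\bigr)^2}_{\text{green: is any object there?}}\;+\;\underbrace{\sum_{i=0}^{S^2}\mathbb{1}_{i}^{\text{obj}}\sum_{c\in\text{classes}}\bigl(p_i(c)-\hat{p}_i(c)\bigr)^2}_{\text{red: which object is it?}}$$

where $S^2$ is the number of grid cells, $B$ the number of boxes per cell, $C_i$ the confidence, and $p_i(c)$ the conditional class probability for cell $i$.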
Since we are training on a labeled dataset, that means $p_i(c)$ should be zero except for one class $c$, right?
Yes. Notice that we only penalize the network when there is indeed an object present. But if your question is whether $p_i(c)\in\{0,1\}$, then usually yes, that is how it is done: the ground-truth class distribution is one-hot.
Why are we interested in the confidence score? At the end of the neural net, do we have some decision algorithm that says: if this bounding box has confidence above a threshold $c_0$, then display it and choose the class with the highest probability?
Usually, yes, a threshold is needed exactly as you describe. Often it is a hyper-parameter that can be chosen by hand or cross-validated over.
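A minimal sketch of that decision rule (the arrays `boxes`, `confidences`, and `class_probs` are hypothetical stand-ins for the network's output; real pipelines typically also apply non-maximum suppression):

```python
import numpy as np

def filter_detections(boxes, confidences, class_probs, c0=0.5):
    """Keep boxes whose confidence exceeds the threshold c0 and
    attach the most probable class to each surviving box."""
    keep = confidences > c0                        # boolean mask over boxes
    best_class = class_probs[keep].argmax(axis=1)  # argmax over classes
    return boxes[keep], confidences[keep], best_class

# Example: 3 candidate boxes, 2 classes; only the confident ones survive.
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 4, 4]])
confidences = np.array([0.9, 0.3, 0.7])
class_probs = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])
print(filter_detections(boxes, confidences, class_probs))
```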
As for your other questions about the "confidence" score, I must agree that the nomenclature is confusing. There are two "viewpoints" one can have about it: (1) a probabilistic confidence measure of whether any object exists in the locale, and (2) a deterministic prediction of the overlap between the locally predicted bounding box $\hat{B}$ and the ground-truth one $B$. Both outlooks are often conflated, and in some sense can be treated as "equivalent", since we can view the intersection-over-union $|B\cap \hat{B}|/|B\cup\hat{B}|\in[0,1]$ as a probability.
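A minimal sketch of that overlap measure, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes,
    each given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```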
As an aside, there are already a couple of other discussions of the YOLO loss:
This might explain why there are not many papers on asymmetric loss functions.
That's not true. Cross-entropy is used as the loss function in most classification problems (and in problems that aren't standard classification, such as autoencoder training and segmentation), and it's not symmetric.
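A quick numerical check of that asymmetry, writing the discrete cross-entropy as $H(p, q) = -\sum_i p_i \log q_i$:

```python
import numpy as np

def cross_entropy(p, q):
    """Discrete cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    return -np.sum(p * np.log(q))

p = np.array([0.9, 0.1])
q = np.array([0.6, 0.4])
print(cross_entropy(p, q))  # ≈ 0.551
print(cross_entropy(q, p))  # ≈ 0.984  ->  H(p, q) != H(q, p)
```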
Best Answer
The loss function is chosen according to the noise process assumed to contaminate the data, not the output layer activation function. The purpose of the output layer activation function is to apply whatever constraints ought to apply to the output of the model. There is a correspondence between loss function and activation function that can simplify the implementation of the model, but that is pretty much the only real benefit (c.f. link functions in Generalised Linear Models), as neural net people generally don't go in much for analysis of parameters etc. Note that the tanh function is a scaled and translated version of the logistic sigmoid function, so a modified logistic loss with recoded targets might be a good match from that perspective.
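For reference, the scaling and translation mentioned at the end is the identity

$$\tanh(x) = 2\,\sigma(2x) - 1, \qquad \sigma(x) = \frac{1}{1+e^{-x}},$$

so logistic-loss targets $t \in \{0, 1\}$ recode to $t' = 2t - 1 \in \{-1, +1\}$ for a tanh output unit.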