For simplicity, bias units are subsumed into the equation by extending the input vector with a component that is always 1. Concretely,
$$
x = (1, x_{1}, \ldots, x_{n})
$$
so that the activation of each unit can be rewritten as
$$
a_{i} = \sum_{j=1}^{n} w_{ij}x_{j} + w_{i0} = \sum_{j=0}^{n} w_{ij}x_{j}
$$
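As a quick check, here is a minimal NumPy sketch of this bias trick; the names `W`, `b`, and `x` are illustrative, not from the text above:

```python
import numpy as np

n, m = 4, 3                         # n inputs, m units (arbitrary sizes)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))     # weights w_ij, j = 1..n
b = rng.standard_normal(m)          # biases w_i0
x = rng.standard_normal(n)

# Explicit bias: a_i = sum_j w_ij x_j + w_i0
a_explicit = W @ x + b

# Bias subsumed: prepend x_0 = 1 and fold the biases into column 0 of the weights
x_aug = np.concatenate(([1.0], x))            # (1, x_1, ..., x_n)
W_aug = np.concatenate((b[:, None], W), axis=1)
a_subsumed = W_aug @ x_aug

assert np.allclose(a_explicit, a_subsumed)
```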
You can see a detailed derivation of the backpropagation rule in the paper "Neural Networks and Their Applications".
Convolution with a kernel is applied to all input maps and the results are summed. In the input layer this is obvious, since there is only one feature map (the input itself). After the first convolution, however, each later convolution is a summation of the kernel operation over all feature maps. Hence, instead of 48 output feature maps at C2, there should be 8 maps. This link explains the network and its back-prop in a clear way.
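Here is a rough NumPy/SciPy sketch of that summation; shapes and names are placeholders, and `correlate2d` stands in for the convolution. With 6 input maps and 8 kernels you get 8 output maps, not 48:

```python
import numpy as np
from scipy.signal import correlate2d

num_in, num_out = 6, 8            # e.g. 6 input maps -> 8 output maps at C2
H, W, k = 14, 14, 5
rng = np.random.default_rng(0)

inputs = rng.standard_normal((num_in, H, W))
kernels = rng.standard_normal((num_out, num_in, k, k))

out_h, out_w = H - k + 1, W - k + 1
outputs = np.zeros((num_out, out_h, out_w))
for o in range(num_out):
    # each output map is the SUM over all input maps, not a separate map per pair
    for i in range(num_in):
        outputs[o] += correlate2d(inputs[i], kernels[o, i], mode="valid")

print(outputs.shape)  # (8, 10, 10): 8 output maps
```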
Use $f(x) = \max(0, x)$ (ReLU) as the activation (transfer) function. After a successful implementation you can try the others too. You should use the same function for both the feedforward pass and back-propagation.
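A minimal sketch of that choice, assuming NumPy; the same pair of functions would be used in the feedforward and back-prop passes:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise (forward pass)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """f'(x): 1 where x > 0, else 0 (used when back-propagating the error)."""
    return (x > 0).astype(x.dtype)

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a))       # [0.  0.  0.  1.5]
print(relu_grad(a))  # [0. 0. 0. 1.]
```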
I haven't read the paper, but breaking symmetry is about initializing the weights from a random distribution. If the weights are the same across feature maps, the back-propagated error will be the same; as a result the network learns identical filters, which is not desirable.
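As a hedged illustration (the specific distribution and scale below are my assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
k, num_filters = 5, 8

# Bad: identical filters -> identical back-propagated errors -> filters never diverge.
filters_symmetric = np.full((num_filters, k, k), 0.01)

# Better: small random values break the symmetry between feature maps.
filters_random = 0.01 * rng.standard_normal((num_filters, k, k))
```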
The rules of connection are already defined as mathematical expressions. The number of kernels, number of layers, kernel size, etc. should be defined symbolically and assigned in the main section of the code.
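For example, a sketch of keeping those quantities in one place; all names and values below are placeholders:

```python
# Architecture parameters defined symbolically in one place (values are examples)
NUM_LAYERS    = 2         # number of convolutional layers
NUM_KERNELS   = [6, 8]    # kernels per layer, e.g. 6 at C1 and 8 at C2
KERNEL_SIZE   = 5         # spatial size of each kernel
LEARNING_RATE = 0.01

def build_network(num_layers=NUM_LAYERS, num_kernels=NUM_KERNELS,
                  kernel_size=KERNEL_SIZE):
    # hypothetical constructor: the real code would create the layers here
    ...
```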
You should add the bias before applying the activation function. A single bias, as most commonly used, is added to each feature map. Summing a scalar with a matrix simply means adding the scalar at every index of the matrix.
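A small sketch of that, assuming NumPy broadcasting; the names are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

feature_map = np.arange(9, dtype=float).reshape(3, 3) - 4.0
bias = 0.5                            # one scalar bias shared by the whole map

pre_activation = feature_map + bias   # the scalar is added at every index
activated = relu(pre_activation)      # bias is applied BEFORE the activation
print(activated)
```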
If you haven't written code for a plain neural network before, it would be better to start with that.
There are two cases in the ResNet paper.
When the summands of a shortcut connection have the same shape, the identity mapping is used, so there is no weight matrix.
When the summands would have different shapes, a weight matrix is used to project the shortcut output to the same shape as the direct output.
(See the ResNet paper: Kaiming He et al., "Deep Residual Learning for Image Recognition".)
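For illustration only, here is a small NumPy sketch of the two cases; `F_x` and `W_s` follow the paper's notation, but the code itself is mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: same shape -> identity shortcut, y = F(x) + x (no weight matrix)
x = rng.standard_normal(64)
F_x = rng.standard_normal(64)        # output of the residual branch F(x)
y_identity = F_x + x

# Case 2: different shapes -> projection shortcut, y = F(x) + W_s x
x_small = rng.standard_normal(64)
F_x_big = rng.standard_normal(128)   # residual branch changed the width
W_s = rng.standard_normal((128, 64)) # projects x to the shape of F(x)
y_projection = F_x_big + W_s @ x_small
```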