That is an interesting problem. I think your conjecture is partially correct. After adding the $H_C=-\sum_pC_p$ term to the Hamiltonian, the ground state degeneracy will be reduced for sure. The remaining degeneracy can be 2-fold in some cases. But for the most general case, the degeneracy can be completely lifted, leaving unique ground state.
I think it would be easier to prove it using the loop algebra method developed in this work [Y.-Z. You and X.-G. Wen, arXiv:1204.0113] (which you may have already known). Admittedly, the constraint-counting method you mentioned is more rigorous, and is successful for the original toric code model. However for your case with $H_C$ term, after thinking for a while, I could not figure out how to count the constraints either.
Here is how the loop algebra works.
First, let us define 4 loop operators according to the above figure (red - $\sigma^x$, green - $\sigma^y$, blue - $\sigma^z$)
$$Q_1=\prod_{l\in\text{x-line}}\sigma_l^x, Q_2=\prod_{l\in\text{y-line}}\sigma_l^x,
P_1=\prod_{l\in\text{x-line}}\sigma_l^z, P_2=\prod_{l\in\text{y-line}}\sigma_l^z.$$
Assuming periodic boundary condition, the loops wind around the torus. Operators $Q_1$ and $Q_2$ measure the $\mathbb{Z}_2$ flux through the two torus holes respectively. While $P_1$ and $P_2$ are responsible to generate or to remove such fluxes (by moving a pair of $m$-excitations around the torus). In the original toric code model, the 4 degenerated ground states on the torus can be labeled by the eigen values of $Q_1$ and $Q_2$ (i.e. the fluxes through the holes), and different ground states are connected by the action of $P_1$ and $P_2$. Therefore the ground state Hilbert space is just the representation space of these loop operators.
It can be easily verified that these loop operators follows the algebra: $Q_1P_2=-P_2Q_1$, $Q_2P_1=-P_1Q_2$, $[Q_1,Q_2]=[P_1,P_2]=[Q_1,P_1]=[Q_2,P_2]=0$ (by checking the bound overlaps). The 4 loop operators can be divided into two anti-commuting pairs. To construct the representation for these operators, we start from $\{Q_1,P_2\}=0$. The anti-commutation relation requires $Q_1$ and $P_2$ to be represented by two Pauli matrices, say $Q_1\bumpeq\sigma_3$ and $P_2\bumpeq\sigma_1$. The same applies for $\{Q_2,P_1\}=0$. However the representation space must be enlarged such that the commutation relations between the anti-commuting pairs are also satisfied:
$$Q_1\bumpeq\sigma_3\otimes\sigma_0, Q_2\bumpeq\sigma_0\otimes\sigma_3, P_1\bumpeq\sigma_0\otimes\sigma_1, P_1\bumpeq\sigma_1\otimes\sigma_0.$$
For the original toric code model, it is easy to show that the Hamiltonian commutes with all the 4 loop operators, therefore it must act trivially in the representation space of loop operators, $H\bumpeq\sigma_0\otimes\sigma_0$. That is the algebraic reason that the ground states (and in fact each energy level) must be (at least) 4-fold degenerated.
Now let us consider the effect of adding $H_C$ term. All the $C_p$ operators can be divided into two classes: those away from $Q_{1,2}$ loops as $C_{p1}$ in the figure, and those adjacent to $Q_{1,2}$ loops as $C_{p2}$ in the figure. The Hamiltonian is the sum of these two classes of $C_p$ operators: $H_C=H_{1}+H_{2}$, with $H_{1,2}=-\sum C_{p1,p2}$. Note that $H_1$ commutes with all the loop operators still. However $H_2$ anti-commutes with $Q_1$ or $Q_2$ while commutes with $P_1$ and $P_2$. Therefore $H_1$ and $H_2$ are represented differently: $H_1\bumpeq\sigma_0\otimes\sigma_0$, $H_2\bumpeq a\sigma_1\otimes\sigma_0+ b\sigma_0\otimes\sigma_1+ c\sigma_1\otimes\sigma_1$ (combined with some coefficients $a,b,c$ to be determined case by case). $H_2$ induces mixing between the degenerated ground states. According to the idea of degenerate perturbation, the degeneracy will be lifted, and in the worst case, completely lifted.
Best Answer
The defining property of the double semion model is the nature of the ground state as a superposition of loop pattern with alternating signs, and not the form of its Hamiltonian. As you noticed, it is not clear how to count loops on a square lattice. As far as I see, this is one reason why the string-net models are defined on honeycomb lattices, since it allows to count loops unambiguously. (In fact, any trivalent graph would do.)
If you want to define a way to count loops on a square lattice, one way is to "decorate" it such that it becomes a trivalent lattice, this is, you replace every four-valent vertex by two three-valent vertices with an edge inbetween. The state of the extra edge is uniquely determined by the state on the surrounding edges, and thus, this gives you a way to count loops on the square lattice. In the same way, you can map the honeycomb Hamiltonian to a new Hamiltonian on the square lattice. Note, however, that this mapping will necessarily break some lattice symmetry.
Your Hamiltonian is rotationally invariant, so I suspect it is not the correct Hamiltonian. I have not analyzed it carefully, but you could try to exactly diagonalize it on a 4x4 lattice and check the ground subspace. Alternatively, you can study different moves to go from one configuration to another and check if they all give the same phase (I suspect that no, and there will be cancellations). For this, of course, you first have to pick a convention of how to count loops.