I would probably go with your original model with your full dataset. I generally think of these things as facilitating sensitivity analyses. That is, they point you towards what to check to ensure that you don't have a given result only because of something stupid. In your case, you have some potentially influential points, but if you rerun the model without them, you get substantively the same answer (at least with respect to the aspects that you presumably care about). In other words, use whichever threshold you like—you are only refitting the model as a check, not as the 'true' version. If you think that other people will be sufficiently concerned about the potential outliers, you could report both model fits. What you would say is along the lines of,
Here are my results. One might be concerned that this picture only emerges due to a couple of unusual, but highly influential, observations. These are the results of the same model fit without those observations. There are no substantive differences.
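If it helps to see the workflow concretely, here is a minimal sketch of that "report both fits" check. It assumes a simple linear model fit with statsmodels and uses a conventional Cook's distance cutoff of 4/n; the data, variable names, and the cutoff are all illustrative, not prescriptive, and the same idea carries over to other model types.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data with a few artificially influential points (illustrative only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
y[:3] += 8

X = sm.add_constant(pd.DataFrame({"x": x}))
full_fit = sm.OLS(y, X).fit()

# Flag potentially influential observations with Cook's distance.
cooks_d, _ = full_fit.get_influence().cooks_distance
keep = cooks_d < 4 / n          # use whichever threshold you prefer

subset_fit = sm.OLS(y[keep], X[keep]).fit()

# Report both fits side by side; the point is the comparison, not a "true" model.
print(pd.DataFrame({"full data": full_fit.params,
                    "flagged points dropped": subset_fit.params}))
```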
It is also possible to remove them and use the second model as your primary result. After all, staying with the original dataset amounts to an assumption about which data belong in the model just as much as going with the subset. But people are likely to be very skeptical of your reported results because psychologically it is too easy for someone to convince themselves, without any actual corrupt intent, to go with the set of post-hoc tweaks (such as dropping some observations) that gives them the result they most expected to see. By always going with the full dataset, you preempt that possibility and assure people (say, reviewers) that that isn't what's going on in your project.
Another issue here is that people end up 'chasing the bubble'. When you drop some potential outliers and rerun your model, new, different observations often show up as potential outliers. How many iterations are you supposed to go through? The standard response is to stay with your original, full dataset and run a robust regression instead. This, again, can be understood as a sensitivity analysis.
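For concreteness, here is a minimal sketch of that robust-regression alternative, continuing with the simulated `y` and `X` from the sketch above. It uses statsmodels' RLM with Huber weights, which is just one common downweighting scheme, not the only option; the point is that you fit the full data once rather than iteratively dropping points.

```python
import statsmodels.api as sm

# Robust M-estimation on the full dataset (Huber weights as one common choice).
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

# Observations are downweighted smoothly rather than dropped outright,
# so there is no second round of "new" outliers to chase.
print(robust_fit.params)
print(robust_fit.weights[:10])   # near 1 = ordinary points, near 0 = heavily downweighted
```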
Best Answer
I don't know if I can give you a complete answer, but I can give you some thoughts that may be helpful. First, all statistical models / tests have assumptions. However, logistic regression very much does not assume that the residuals are normally distributed or that the variance is constant. Rather, it assumes that the data are distributed as a binomial, $\mathcal{B}(n_{x_i}, p_{x_i})$, that is, with the number of Bernoulli trials equal to the number of observations at that exact set of covariate values and with the probability associated with that set of covariate values. Remember that the variance of a binomial is $np(1-p)$. Thus, if the $n$'s vary at different levels of the covariates, the variances will as well. Further, if any of the covariates are at all related to the response variable, then the probabilities will vary, and thus, so will the variances. These are important facts about logistic regression.
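A small numeric illustration (mine, not from the answer above) of why the variance cannot be constant here: $np(1-p)$ changes with both the group size $n$ and the probability $p$.

```python
# Variance of a binomial count, Var = n * p * (1 - p), at a few illustrative
# combinations of group size and probability.
for n_i, p_i in [(10, 0.1), (10, 0.5), (50, 0.5), (50, 0.9)]:
    print(f"n = {n_i:2d}, p = {p_i:.1f}  ->  variance of the count = {n_i * p_i * (1 - p_i):.2f}")
```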
Second, model comparisons are usually performed between models with different specifications (for example, with different sets of covariates included), not over different subsets of the data. To be honest, I am not sure how that would properly be done. With a linear model, you could look at the two $R^2$ values to see how much better the fit is with the aberrant data excluded, but this would only be descriptive, and you should know in advance that $R^2$ would have to go up. With logistic regression, however, the standard $R^2$ cannot be used. There are various 'pseudo-$R^2$s' that have been developed to provide similar information, but they are generally considered flawed and are not widely used. For an overview of the different pseudo-$R^2$s that exist, see here. For some discussion, and criticism, of them, see here. Another possibility might be to jackknife the betas with and without the outliers included to see how excluding them contributes to stabilizing their sampling distributions. Once again, this would only be descriptive (i.e., it wouldn't constitute a test to tell you which model, or rather, which subset of your data, to prefer), and the variance would have to go down. Both of these things are true, for the pseudo-$R^2$s and for the jackknifed distributions, because you selected those data to exclude based on the fact that they appear extreme.
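If it helps, here is a minimal, purely descriptive sketch of those two ideas: McFadden's pseudo-$R^2$ (one of the many pseudo-$R^2$s) and a leave-one-out jackknife of the coefficients, each computed with and without the flagged observations. The data and the flagging rule are made up for illustration, and, as noted above, neither quantity constitutes a test of which subset to prefer.

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary data (illustrative only).
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p)
X = sm.add_constant(x)

def mcfadden_r2(endog, exog):
    # McFadden's pseudo-R^2: 1 - loglik(model) / loglik(intercept-only model).
    res = sm.Logit(endog, exog).fit(disp=0)
    return 1 - res.llf / res.llnull

def jackknife_beta_sd(endog, exog):
    # Leave-one-out refits; the spread of these estimates describes how stable
    # the coefficients are for this particular subset of the data.
    m = len(endog)
    betas = np.array([sm.Logit(np.delete(endog, i), np.delete(exog, i, axis=0))
                      .fit(disp=0).params for i in range(m)])
    return betas.std(axis=0)

# Suppose `flagged` marks the observations identified as extreme (here, made up).
flagged = np.zeros(n, dtype=bool)
flagged[:5] = True

print("pseudo-R^2, full data:    ", mcfadden_r2(y, X))
print("pseudo-R^2, flagged out:  ", mcfadden_r2(y[~flagged], X[~flagged]))
print("jackknife SD of betas, full data:   ", jackknife_beta_sd(y, X))
print("jackknife SD of betas, flagged out: ", jackknife_beta_sd(y[~flagged], X[~flagged]))
```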