A few general points before answering the individual questions.
First, in logistic regression (unlike in linear regression) coefficient estimates will be biased if you omit any predictor associated with the outcome, whether or not it is correlated with the included predictors. This page gives an analytic demonstration for the related probit regression.
Second, it's not necessary (even if it's desirable) to know the mechanism through which a predictor is related to outcome. If it improves outcome prediction (either on its own or as a control for other predictors) it can be useful. "Answer[ing] the question 'does [this] new feature really affect/explain the outcome behavior?'" generally can't be done by statistical modeling; modeling like yours can point the way to the more detailed experimental studies needed to get to the mechanism.
Third, class imbalance problems typically arise from using an improper scoring rule or from just not having enough members of the minority class to get good estimates. See this page among many on this site. Your nicely designed study has over 1500 in the minority class, so the latter is certainly not a problem. Accuracy and F1 score are not strictly proper scoring rules, and the AUC (equivalent to the concordance or C-index) is not very sensitive for detecting differences among models (note that these issues are essentially the same in survival modeling or in logistic regression). So concentrate on using a correct and sensitive measure of model quality.
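To see why accuracy can mislead while a strictly proper scoring rule does not, here is a minimal sketch with made-up outcomes and probabilities (all numbers are hypothetical, not from your data). Both "models" classify every case correctly at a 0.5 cutoff, so accuracy can't tell them apart, but the Brier score and log-loss correctly favor the sharper, better-calibrated probabilities:

```python
import numpy as np

# Hypothetical outcomes and predicted probabilities from two models.
y = np.array([0, 0, 0, 1, 1, 1])
p_sharp = np.array([0.05, 0.10, 0.40, 0.60, 0.90, 0.95])  # confident, calibrated
p_blunt = np.array([0.45, 0.45, 0.45, 0.55, 0.55, 0.55])  # barely past 0.5

def accuracy(y, p, cut=0.5):
    return np.mean((p > cut) == y)

def brier(y, p):
    # Mean squared error of the probabilities: a strictly proper scoring rule.
    return np.mean((p - y) ** 2)

def log_loss(y, p):
    # The criterion logistic regression itself optimizes; also strictly proper.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(accuracy(y, p_sharp), accuracy(y, p_blunt))   # identical: 1.0 and 1.0
print(brier(y, p_sharp), brier(y, p_blunt))         # sharp model scores better
print(log_loss(y, p_sharp), log_loss(y, p_blunt))   # sharp model scores better
```

The proper scores reward the model whose probabilities are closer to the truth, not just whichever side of an arbitrary cutoff they land on.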
Fourth, even with your sample size using a single test/train split instead of modeling-process validation by bootstrapping might be leading you astray. See this page and its links. With bootstrapping you take several hundred samples of the same size as your data set, but with replacement, after you have built your model on the entire data set. You do not set aside separate training, validation, and test sets; you use all of the data for the model building and evaluation process. Bootstrapping mimics the process of taking your original sample from the underlying population. You repeat the entire model building process (including feature selection steps) on each bootstrap sample and test, with appropriate metrics, the performance of each model on the full original data set. Then pool the results over all the models from the bootstraps. You can evaluate bias and optimism/overfitting with this approach, and if you are doing feature selection you can compare among the hundreds of models to see the variability among the selected features.
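The optimism-bootstrap procedure described above can be sketched in a few lines. This is a toy version on synthetic data (the data, sample size, and plain gradient-descent fit are all stand-ins, not your study); the rms package in R automates the real thing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for the real study (hypothetical).
n, p = 300, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

def fit_logistic(X, y, steps=200, lr=0.1):
    """Bare-bones gradient-descent logistic fit (no intercept, for brevity)."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p_hat = 1 / (1 + np.exp(-(X @ b)))
        b += lr * X.T @ (y - p_hat) / len(y)
    return b

def brier(X, y, b):
    p_hat = 1 / (1 + np.exp(-(X @ b)))
    return np.mean((p_hat - y) ** 2)

# Step 1: build the model on the ENTIRE data set and record its apparent score.
b_full = fit_logistic(X, y)
apparent = brier(X, y, b_full)

# Step 2: repeat the whole modeling process on bootstrap resamples, scoring
# each refit model on its own resample and on the full original data. The
# average gap is the optimism (negative here, since Brier is a loss and the
# training score is too good).
optimisms = []
for _ in range(200):
    idx = rng.integers(0, n, n)            # resample with replacement
    b_boot = fit_logistic(X[idx], y[idx])
    optimisms.append(brier(X[idx], y[idx], b_boot) - brier(X, y, b_boot))

# Step 3: correct the apparent score for optimism (the corrected loss is worse).
corrected = apparent - np.mean(optimisms)
print(apparent, corrected)
```

In a real analysis every step, including any feature selection, would go inside the bootstrap loop, which is exactly what makes this approach honest about overfitting.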
Fifth, with respect to feature selection, predictors in clinical data are often highly inter-correlated in practice. In such cases the specific features selected by any method will tend to depend on the particular sample you have in hand. You can check this for yourself with the bootstrapping approach described above. That will be true of any modeling method you choose. That is one of many reasons why you will find little support on this site for automated model selection. In any case, the initial choice of features to evaluate should be based on your knowledge of the subject matter.
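The sample-dependence of feature selection under correlation is easy to demonstrate. In this hypothetical sketch, a naive "pick the strongest predictor" rule is repeated over bootstrap resamples; two highly correlated predictors split the wins, so the "selected" feature depends on which resample you happen to have:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two highly correlated predictors plus two noise predictors (all hypothetical).
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # strongly correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=(n, 2))])
y = rng.binomial(1, 1 / (1 + np.exp(-x1)))  # only x1 truly drives the outcome

# Naive "selection": pick the single predictor most correlated with outcome,
# repeated over bootstrap resamples of the data.
picks = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    cors = [abs(np.corrcoef(X[idx, j], y[idx])[0, 1]) for j in range(4)]
    picks.append(int(np.argmax(cors)))

counts = np.bincount(picks, minlength=4)
print(counts)   # x1 and x2 share the wins; the noise predictors rarely win
```

Counting how often each feature is selected across bootstraps, as here, is exactly the stability check suggested in the text.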
So with respect to the questions:
Congratulations on identifying 2 new risk factors associated with outcome. A predictive model certainly should include them if they are going to be generally available to others in your field. Under the first and second general points above, however, you might want to reconsider removing from your model any predictors that might, based on your knowledge of the subject matter, be associated with outcome. With over 1500 in the minority class you are unlikely to be overfitting with 60 features (if they are all continuous or binary categorical). The usual rule of thumb of 15 minority-class members per evaluated predictor would allow you up to 100 predictors (including levels of categorical variables beyond the second and including interaction terms). If any predictor is going to be available in practice and is expected to be related to outcome based on your knowledge of the subject matter, there's no reason to remove it just because it's not "statistically significant."
The third and fourth general points above might account for this finding. AUC is not a very sensitive measure for comparing models, and using a fixed test/train split could lead to split-dependent imbalances that would be avoided if you did bootstrap-based model validation, as for example with the rms package in R. That leads to:
A logistic regression model optimizes log-loss, a strictly proper scoring rule that is expected to be more sensitive than AUC. Note that the size of your study will make it possible to detect "significance" at p < 0.05 for smaller effects than would be possible with a smaller study. Use your knowledge of the subject matter to decide if these statistically significant findings are likely to be clinically significant.
Avoid accuracy. Avoid F1. Be cautious in using AUC. Use a strictly proper scoring rule.
See the third general point above. If your ultimate goal is to use something like boosted classification trees then there is probably no need to do this preliminary logistic regression. Note, however, that a well calibrated logistic regression model can be much easier to interpret than any but the simplest (and potentially most unreliable) tree models. And make sure that your optimization criterion in a tree model provides a proper scoring rule; once again, avoid accuracy as a criterion.
There really is no problem. Bootstrap-based logistic model validation and calibration instead of the single fixed test/train split could provide a much better sense of how your model will perform on new data. If your model is well calibrated (e.g., linearity assumptions hold) then you could use the logistic regression model directly instead of going on to a tree-based model. If you need to make a yes/no decision based solely on the model, choose a probability cutoff that represents the tradeoff between false-negative and false-positive findings.
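Choosing that cutoff can be done explicitly once you have assigned relative costs to the two kinds of error. A sketch with synthetic, perfectly calibrated probabilities (the 4:1 cost ratio is an illustrative assumption; in practice it is a clinical judgment): with false negatives costing cost_fn and false positives cost_fp, theory puts the optimal cutoff at cost_fp / (cost_fp + cost_fn).

```python
import numpy as np

# Hypothetical predicted probabilities and outcomes from a validated model.
rng = np.random.default_rng(1)
p_hat = rng.uniform(0, 1, 1000)
y = rng.binomial(1, p_hat)     # perfectly calibrated by construction

# Suppose a missed case (false negative) is judged 4x as costly as a
# false alarm (false positive); this ratio is a subject-matter judgment.
cost_fn, cost_fp = 4.0, 1.0

# Total cost at each candidate cutoff; pick the minimizer.
cuts = np.linspace(0.01, 0.99, 99)
costs = [cost_fn * np.sum((p_hat <= c) & (y == 1)) +
         cost_fp * np.sum((p_hat > c) & (y == 0)) for c in cuts]
best = cuts[int(np.argmin(costs))]
print(best)   # theoretical optimum: cost_fp / (cost_fp + cost_fn) = 0.2
```

The empirical minimizer lands near the theoretical 0.2, illustrating that the cutoff follows from the cost tradeoff rather than defaulting to 0.5.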
The answer to your last question depends on your knowledge of the subject matter. Again, this is the issue of statistical significance versus clinical significance. Only you and your colleagues in the field can make that determination.
Modeling outcome-relevant geographic and demographic covariates directly, as @AdamO suggests, is the best way to go. Another answer suggests ways to start getting such information.
The 3-digit ZIP code poses two problems.
First, without care the numbers might be interpreted as a numeric variable, and, as you state, "a one digit increase" in ZIP code would be meaningless. If you do go with ZIP code as a predictor, make sure it is encoded as a multi-level categorical predictor. That would allow you to use ZIP codes as fixed effects, or as cluster() or frailty/random-effect terms to account for correlations within ZIP codes.
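A minimal sketch of the categorical encoding, with made-up ZIP codes: keep the codes as strings (so leading zeros survive) and expand them into indicator columns, one per level, dropping the first level to avoid redundancy with an intercept.

```python
import numpy as np

# Hypothetical 5-digit ZIP codes, stored as strings so leading zeros survive.
zips = np.array(["02139", "02139", "10027", "94305", "10027"])

# Treat ZIP as a multi-level categorical predictor: one indicator column
# per level, dropping the first level to avoid the dummy-variable trap.
levels = np.unique(zips)
onehot = (zips[:, None] == levels[None, 1:]).astype(int)
print(levels)
print(onehot)
```

Any modeling package's factor/categorical handling (e.g., R factors, pandas Categorical) does this for you; the point is simply that ZIP must never enter the model as a number.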
Second, as @AdamO notes in a comment, "The first 3 digits of a zip code doesn't mean anything." The first 3 digits of a ZIP code near where I live includes both wide-open suburban and dense urban localities, areas with some of the highest and lowest average wealth in the state, widely different education levels, and differences in access to transportation and to health-care facilities. If you can't get more detailed geographic and demographic data (e.g., from "ZIP +4" values combined with Census data), at least use the full 5-digit ZIP code.
Best Answer
One of my favorite uses of ZIP code data is to look up demographic variables by ZIP code that may not otherwise be available at the individual level...
For instance, with http://www.city-data.com/ you can look up income distribution, age ranges, etc., which might tell you something about your data. These continuous variables are often far more useful than just going based on binarized zip codes, at least for relatively finite amounts of data.
Also, zip codes are hierarchical... if you take the first two or three digits, and binarize based on those, you have some amount of regional information, which gets you more data than individual zips.
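The hierarchical truncation is just a string prefix (hypothetical ZIP values here):

```python
# Hypothetical ZIP strings; leading digits give progressively coarser regions.
zips = ["02139", "02446", "10027", "10463", "94305"]
region3 = [z[:3] for z in zips]   # 3-digit (sectional center) level
region2 = [z[:2] for z in zips]   # even coarser 2-digit level
print(sorted(set(region3)), sorted(set(region2)))
```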
As Zach said, latitude and longitude can also be useful, especially in a tree-based model. For a regularized linear model, you can use quadtrees: split the United States into four geographic groups, binarize those, then split each of those areas into four more groups, and include those as additional binary variables... so for n total leaf regions you end up with (4n - 4)/3 total variables (n for the smallest regions, n/4 for the next level up, etc.). Of course this is multicollinear, which is why regularization is needed to do this.
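The quadtree encoding above can be sketched as follows, on synthetic points scattered over a rough continental-US bounding box (the box coordinates and depth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
lat = rng.uniform(25.0, 49.0, 500)    # rough continental-US box (hypothetical)
lon = rng.uniform(-124.0, -67.0, 500)

def quadtree_features(lat, lon, bounds, depth):
    """One binary indicator per quadtree cell, at every level up to `depth`."""
    feats = []
    def recurse(lo_lat, hi_lat, lo_lon, hi_lon, d):
        if d == 0:
            return
        mid_lat = (lo_lat + hi_lat) / 2
        mid_lon = (lo_lon + hi_lon) / 2
        # Split the current box into its four quadrants.
        for a, b in [(lo_lat, mid_lat), (mid_lat, hi_lat)]:
            for c, e in [(lo_lon, mid_lon), (mid_lon, hi_lon)]:
                feats.append((lat >= a) & (lat < b) & (lon >= c) & (lon < e))
                recurse(a, b, c, e, d - 1)
    recurse(*bounds, depth)
    return np.column_stack(feats).astype(int)

X = quadtree_features(lat, lon, (25.0, 49.0, -124.0, -67.0), depth=2)
# depth 2 gives 4 coarse cells + 16 fine cells = 20 columns, i.e. (4n - 4)/3
# with n = 16 leaves. The levels are nested, hence collinear, hence the need
# for regularization noted above.
print(X.shape)
```

Each point lights up exactly one cell per level, so the coarse and fine indicators are linearly dependent by construction, which is the multicollinearity the answer warns about.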