An unbalanced data set (as you suggest when you say that households with CHE = 1 make up 3% of your data) will influence the results of a logistic regression. See the answers to this question, for instance.
In short, the class imbalance in your data set affects only the intercept term. If you are using your model to say something about your 8 socioeconomic variables, you should be fine. If it is for prediction, then you could consider lowering your probability threshold below 0.5.
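To make the threshold point concrete, here is a minimal sketch on simulated data; the coefficients, variable counts, and the roughly 3% prevalence are all illustrative assumptions, not taken from your data:

```python
# Sketch: picking a decision threshold below 0.5 for a rare (~3%) outcome.
# All data is simulated; coefficients and prevalence are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 5))
# An intercept near -4 yields a low event rate, mimicking CHE = 1 at ~3%.
logit = -4.0 + X @ np.array([1.0, 0.8, 0.5, 0.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.1, 0.03):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = pred[y_te == 1].mean()  # fraction of true cases caught
    print(f"threshold={threshold:.2f}  flagged={pred.mean():.3f}  recall={recalls[threshold]:.3f}")
```

Lowering the threshold trades more false positives for a much better chance of catching the rare CHE = 1 households; where to strike that trade depends on the costs of each error in your application.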
You mention in the comments that you are computing the AUC using a 75-25 train-test split, and you are puzzled why AUC is maximized when training your model on only 8 of your 30 regressors. From this you have gotten the impression that AUC is somehow penalizing complexity in your model.
In reality there is something penalizing complexity in your model, but it is not the AUC metric. It is the train-test split. Train-test splitting is what makes it possible to use pretty much any metric, even AUC, for model selection, even if they have no inherent penalty on model complexity.
As you probably know, we do not measure performance on the same data that we train our models on, because the training data error rate is generally an overly optimistic measure of performance in practice (see Section 7.4 of the ESL book). But this is not the most important reason to use train-test splits. The most important reason is to avoid overfitting with excessively complex models.
Given two models A and B such that B "contains A" (the parameter set of B contains that of A) the training error is mathematically guaranteed to favor model B, if you are fitting by optimizing some fit criterion and measuring error by that same criterion. That's because B can fit the data in all the ways that A can, plus additional ways that may produce lower error than A's best fit. This is why you were expecting to see lower error as you added more predictors to your model.
However, by splitting your data into two reasonably independent sets for training and testing, you guard yourself against this pitfall. When you fit the training data aggressively, with many predictors and parameters, it doesn't necessarily improve the test data fit. In fact, no matter what the model or fit criterion, we can generally expect that a model which has overfit the training data will not do well on an independent set of test data which it has never seen. As model complexity increases into overfitting territory, test set performance will generally worsen as the model picks up on increasingly spurious training data patterns, taking its predictions farther and farther away from the actual trends in the system it is trying to predict. See for example slide 4 of this presentation, and sections 7.10 and 7.12 of ESL.
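This effect is easy to see in a simulation. The sketch below (entirely simulated data; the predictor counts and the 75-25 split are just illustrative) fits a logistic regression using a handful of real predictors plus a pile of pure-noise predictors:

```python
# Sketch: training AUC keeps improving as noise predictors are added,
# but the held-out test AUC does not. Simulated data, 75-25 split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p_signal, p_total = 400, 5, 65
X = rng.normal(size=(n, p_total))            # only the first 5 columns matter
logit = X[:, :p_signal] @ rng.normal(size=p_signal)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

results = {}
for p in (5, 20, 65):                        # number of predictors used
    m = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr[:, :p], y_tr)
    auc_tr = roc_auc_score(y_tr, m.predict_proba(X_tr[:, :p])[:, 1])
    auc_te = roc_auc_score(y_te, m.predict_proba(X_te[:, :p])[:, 1])
    results[p] = (auc_tr, auc_te)
    print(f"p={p:2d}  train AUC={auc_tr:.3f}  test AUC={auc_te:.3f}")
```

The training AUC rises monotonically with the number of predictors, exactly as the nesting argument predicts, while the gap between training and test AUC widens: the test set is what exposes the overfitting.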
If you still need convincing, a simple thought experiment may help. Imagine you have a dataset of 100 points with a simple linear trend plus gaussian noise, and you want to fit a polynomial model to this data. Now let's say you split the data into training and test sets of size 50 each and you fit a polynomial of degree 50 to the training data. This polynomial will interpolate the data and give zero training set error, but it will exhibit wild oscillatory behavior carrying it far, far away from the simple linear trendline. This will cause extremely large errors on the test set, much larger than you would get using a simple linear model. So the linear model will be favored by CV error. This will also happen if you compare the linear model against a more stable model like smoothing splines, although the effect will be less dramatic.
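The thought experiment takes only a few lines to run. In the sketch below I have scaled it down to 16 training points and a degree-15 polynomial so that the interpolation stays numerically well-behaved, but the phenomenon is the same:

```python
# Sketch of the thought experiment: points on a linear trend with Gaussian
# noise, split in half, then a straight line vs. an interpolating polynomial.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 32)
y = 2 * x + rng.normal(scale=0.3, size=32)
x_tr, y_tr, x_te, y_te = x[:16], y[:16], x[16:], y[16:]

mse_test = {}
for degree in (1, 15):  # degree 15 interpolates all 16 training points
    coef = np.polyfit(x_tr, y_tr, degree)
    mse_test[degree] = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  test MSE={mse_test[degree]:.4g}")
```

The interpolating polynomial achieves essentially zero training error, yet its oscillations between the training points inflate the test-set error far beyond that of the simple linear fit.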
In conclusion, by using train-test splitting techniques such as CV, and measuring performance on the test data, we get an implicit penalization of model complexity, no matter what metric we use, just because the model has to predict on data it hasn't seen. This is why train-test splitting is universally used in the modern approach to evaluating performance in regression and classification.
Best Answer
Yes, you can overfit logistic regression models. But first, I'd like to address the point about the AUC (Area Under the Receiver Operating Characteristic curve): there are no universal rules of thumb with the AUC, ever.
What the AUC is is the probability that a randomly sampled positive (or case) will have a higher marker value than a randomly sampled negative (or control), because the AUC is mathematically equivalent to the rescaled Mann–Whitney U statistic.
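You can check this equivalence numerically; the sketch below compares scikit-learn's AUC against the fraction of case–control pairs in which the case has the higher score:

```python
# Sketch: the AUC equals the probability that a randomly sampled case
# outscores a randomly sampled control (the rescaled Mann-Whitney U).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, 500)
score = y * 0.8 + rng.normal(size=500)                # cases score higher on average

auc = roc_auc_score(y, score)
cases, controls = score[y == 1], score[y == 0]
u_prob = (cases[:, None] > controls[None, :]).mean()  # pairwise win fraction
print(f"AUC = {auc:.4f}, P(case > control) = {u_prob:.4f}")
```

With continuous scores (no ties), the two numbers agree to floating-point precision.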
What the AUC is not is a standardized measure of predictive accuracy. Highly deterministic processes can yield single-predictor AUCs of 95% or higher (as in controlled mechatronics, robotics, or optics), while some complex multivariable logistic risk prediction models, such as those for breast cancer risk, have AUCs of 64% or lower, and those are still respectably high levels of predictive accuracy for those problems.
A sensible AUC value, as with a power analysis, is prespecified by gathering knowledge of the background and aims of a study a priori. The doctor or engineer describes what they want, and you, the statistician, settle on a target AUC value for your predictive model. Then begins the investigation.
It is indeed possible to overfit a logistic regression model. Aside from linear dependence (a model matrix of deficient rank), you can also have perfect separation (complete concordance): the plot of fitted values against $Y$ perfectly discriminates cases from controls. In that case, your parameter estimates have not converged; they diverge toward the boundary of the parameter space, where the likelihood approaches its supremum. Sometimes, however, the AUC is 1 by random chance alone.
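Separation is easy to demonstrate on a toy dataset. In the sketch below, scikit-learn's ridge penalty is the only thing keeping the fit finite; as the regularization weakens, the slope estimate runs off toward the boundary while the log-likelihood creeps up toward its supremum of zero:

```python
# Sketch: under perfect separation the MLE does not exist; as regularization
# weakens, the slope estimate diverges and the log-likelihood approaches 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])   # x >= 3 perfectly discriminates y

coefs, logliks = [], []
for C in (1.0, 100.0, 10000.0):    # larger C = weaker ridge penalty
    m = LogisticRegression(C=C, max_iter=100000).fit(x, y)
    p = m.predict_proba(x)[:, 1]
    coefs.append(float(m.coef_[0, 0]))
    logliks.append(float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))))
    print(f"C={C:8.0f}  slope={coefs[-1]:9.3f}  log-likelihood={logliks[-1]:.5f}")
```

An unpenalized fitter on the same data would simply fail to converge, which is the warning many packages emit in this situation.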
There's another type of bias that arises from adding too many predictors to the model, and that's small-sample bias. In general, the log odds ratio estimates of a logistic regression model are biased in small samples, tending toward $2\beta$, because of non-collapsibility of the odds ratio and zero cell counts. In inference, this is handled with conditional logistic regression to control for confounding and precision variables in stratified analyses. In prediction, however, you're out of luck. There is no generalizable prediction when you have $p \gg n \pi(1-\pi)$ (where $\pi = \mbox{Prob}(Y=1)$), because at that point you're guaranteed to have modeled the "data" and not the "trend". High-dimensional (large $p$) prediction of binary outcomes is better done with machine learning methods. Understanding linear discriminant analysis, partial least squares, nearest-neighbor prediction, boosting, and random forests would be a very good place to start.
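To see the $p \gg n\pi(1-\pi)$ problem concretely, the sketch below fits a near-unpenalized logistic regression to pure noise with far more predictors than observations; it separates the training data perfectly yet has nothing to say about new data:

```python
# Sketch: with p >> n, logistic regression can "predict" pure noise perfectly
# in-sample (training AUC near 1) while generalizing not at all.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n, p = 40, 200
X = rng.normal(size=(n, p))          # predictors are independent of y
y = rng.binomial(1, 0.5, n)

m = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)
auc_train = roc_auc_score(y, m.predict_proba(X)[:, 1])

X_new = rng.normal(size=(n, p))      # fresh, equally unrelated data
y_new = rng.binomial(1, 0.5, n)
auc_new = roc_auc_score(y_new, m.predict_proba(X_new)[:, 1])
print(f"training AUC = {auc_train:.3f}, new-data AUC = {auc_new:.3f}")
```

The model has memorized the noise; on data it has never seen, its discrimination collapses back toward coin-flipping.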