There are several problems in this question. First, there is the question of whether bootstrapped averages will be sensible estimators even when some of the individual bootstrapped estimators are not computable (lack of convergence, non-existence of solutions). Second, given that the bootstrapped estimators are sensible, there is a question of how to obtain confidence intervals or perhaps just standard errors for these estimates.
The idea of averaging bootstrapped estimates is closely related to, if not actually the same as, bootstrap aggregation, or bagging, used in machine learning to improve prediction performance of weak predictors. See ESL, Section 8.7. In certain cases, including the estimation of parameters, averaging bootstrap estimates may reduce the variance of the resulting estimator compared to just using the estimator on the original data set.
The purpose in the question is, however, to produce estimates even in cases where the algorithm for computing the estimates may fail occasionally or where the estimator is occasionally undefined. As a general approach there is a problem:
- Averaging bootstrapped estimates while blindly throwing away the bootstrapped samples for which the estimates are not computable will in general give biased results.
How severe the general problem is depends on several things, for instance how frequently the estimate is not computable, and whether the conditional distribution of the sample given that the estimate is not computable differs from the conditional distribution of the sample given that the estimate is computable. I would not recommend using the method.
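To make the concern concrete, here is a minimal simulation sketch. It assumes a toy estimator, the log-odds of a binary sample, which is undefined whenever a resample contains only zeros or only ones. The function names `log_odds` and `bagged_estimate`, the data, and the number of resamples are all illustrative assumptions, not taken from the question.

```python
import numpy as np

def log_odds(sample):
    """Toy estimator: log-odds of the sample proportion; NaN when undefined."""
    p = sample.mean()
    if p == 0.0 or p == 1.0:
        return np.nan
    return np.log(p / (1.0 - p))

def bagged_estimate(data, n_boot, rng):
    """Average the estimator over bootstrap resamples, dropping NaN results.

    Returns the average and the fraction of resamples that were dropped.
    """
    estimates = np.array([
        log_odds(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    failed = np.isnan(estimates)
    return estimates[~failed].mean(), failed.mean()

rng = np.random.default_rng(0)
data = np.array([1, 0, 0, 0, 0, 0, 0, 0])  # hypothetical sample: 1 success in 8
theta_tilde, frac_failed = bagged_estimate(data, n_boot=2000, rng=rng)

print(f"estimate on original data: {log_odds(data):.3f}")
print(f"bagged estimate          : {theta_tilde:.3f}")
print(f"fraction discarded       : {frac_failed:.3f}")
```

In this toy setup the discarded resamples are not a random subset: they are exactly the ones with the most extreme proportions, so the average over the remaining ones is pulled away from the estimate on the original data.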
For the second part of the question we need a little notation. If $X$ denotes our original data set, $\hat{\theta}$ our estimator (assume for simplicity it is real valued and allowed to take the value NA) such that $\hat{\theta}(X)$ is the estimate for the original data set, and $Y$ denotes a single bootstrapped sample then the bootstrap averaging is effectively computing the estimator
$$\tilde{\theta}(X) = E(\hat{\theta}(Y) \mid X, A(X))$$
where $A(X)$ denotes the event, depending on $X$, on which $\hat{\theta}(Y) \neq \text{NA}$. That is, we compute the conditional expectation of the estimator on a bootstrapped sample, conditioning on the original sample $X$ and on the event $A(X)$ that the estimator is computable for the bootstrapped sample. The actual bootstrap computation is a sampling-based approximation of $\tilde{\theta}(X)$.
The suggestion in the question is to compute the empirical standard deviation of the bootstrapped estimators, which is an estimate of the standard deviation of $\hat{\theta}(Y)$ conditionally on $X$ and $A(X)$. The desired standard deviation, the standard error, is the standard deviation of $\tilde{\theta}(X)$. You can't get the latter from the former. I see no other obvious and general way than to use a second layer of bootstrapping for obtaining a reliable estimate of the standard error.
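A second layer of bootstrapping could be sketched as follows. This is only an illustration under stated assumptions: the estimator (log of the sample mean, undefined when the mean is not positive), the function names `estimator`, `bagged` and `double_bootstrap_se`, the data, and the resample counts are all hypothetical choices.

```python
import numpy as np

def estimator(sample):
    """Hypothetical estimator: log of the sample mean; NaN if the mean is not positive."""
    m = sample.mean()
    return np.log(m) if m > 0 else np.nan

def bagged(sample, n_inner, rng):
    """Bootstrap average of the estimator, ignoring resamples where it is NaN."""
    n = sample.size
    idx = rng.integers(0, n, size=(n_inner, n))  # one row of indices per resample
    vals = np.array([estimator(sample[row]) for row in idx])
    return np.nanmean(vals)

def double_bootstrap_se(data, n_outer, n_inner, rng):
    """Standard error of the bagged estimator from an outer bootstrap layer.

    Each outer resample plays the role of a new data set X; the bagged
    estimator is recomputed on it with its own inner bootstrap.
    """
    n = data.size
    outer = np.array([
        bagged(data[rng.integers(0, n, size=n)], n_inner, rng)
        for _ in range(n_outer)
    ])
    return np.nanstd(outer, ddof=1)

rng = np.random.default_rng(1)
data = rng.normal(loc=0.5, scale=1.0, size=20)  # hypothetical data
se = double_bootstrap_se(data, n_outer=100, n_inner=100, rng=rng)

print(f"bagged estimate                : {bagged(data, 200, rng):.3f}")
print(f"double-bootstrap standard error: {se:.3f}")
```

The cost is `n_outer * n_inner` evaluations of the estimator, which is the main practical drawback of this approach.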
The discussion on the estimation of the standard error is independent of how the conditioning on $A(X)$ affects the bias of the estimator $\tilde{\theta}(X)$. If the effect is severe then even with correct estimates of the standard error, a confidence interval will be misleading.
Edit:
The very nice paper "Estimation and Accuracy After Model Selection" by Efron gives a general method for estimating the standard error of a bagged estimator without using a second layer of bootstrapping. The paper does not deal explicitly with estimators that are occasionally not computable.
@Glen_b is right about the nature of the normality assumption in regression [1].
I think your bigger problem is going to be that you don't have enough data to support 4 to 5 explanatory variables. The standard rule of thumb [2] is that you should have at least 10 observations per explanatory variable, i.e. 40 or 50 observations in your case (and this is for ideal situations where there isn't any question about the assumptions). Because your model would not be completely saturated [3] (you have more observations than parameters to fit), you can get parameter (slope, etc.) estimates, and under ideal circumstances the estimates are asymptotically unbiased. However, it is quite likely that your estimates will be a long way off from the true values and that your SEs and CIs will be very large, so you will have little statistical power. Note that using a nonparametric, or other alternative, regression analysis will not get you out of this problem.
What you will need to do here is either pick a single explanatory variable (before looking at your data!) based on prior theories in your field or your hunches, or combine your explanatory variables. A reasonable strategy for the latter option is to run a principal components analysis (PCA) and use the first principal component as your explanatory variable.
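The PCA option could look roughly like the sketch below. It uses only NumPy; the simulated data, the choice to standardize the columns first, and the function name `first_principal_component` are assumptions made for illustration.

```python
import numpy as np

def first_principal_component(X):
    """Return each observation's score on the first principal component.

    Columns are standardized first (an assumption; this is sensible when
    the variables are measured on different scales).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]
    explained = s[0] ** 2 / np.sum(s ** 2)  # share of variance captured by PC1
    return scores, explained

rng = np.random.default_rng(0)
n = 15  # hypothetical small sample
latent = rng.normal(size=n)
# Four correlated explanatory variables driven by one latent factor.
X = np.column_stack([latent + 0.5 * rng.normal(size=n) for _ in range(4)])
y = 2.0 * latent + rng.normal(size=n)

pc1, explained = first_principal_component(X)
slope, intercept = np.polyfit(pc1, y, deg=1)  # simple regression of y on PC1

print(f"share of variance explained by PC1: {explained:.2f}")
print(f"slope on PC1: {slope:.3f}, intercept: {intercept:.3f}")
```

Note that the sign of a principal component is arbitrary, so the sign of the slope is only meaningful relative to the loadings in `Vt[0]`.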
References:
1. What if residuals are normally distributed but Y is not?
2. Rules of thumb for minimum sample size for multiple regression
3. Maximum number of independent variables that can be entered into a multiple regression equation
Your problem is not new: a whole chapter ("High-Dimensional Problems") of this book is dedicated to cases where the number of variables $p$ is much larger than the number of observations $N$. There are many possible approaches.
In my opinion, the two simplest ways to regularize the problem are the lasso and ridge regression, both of which add a penalty on the size of the coefficients to the standard least-squares criterion.
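As a rough sketch of the ridge case, which has a closed-form solution, the penalized fit can be computed directly with NumPy even when $p > N$. The simulated data, the penalty values, the omission of an intercept, and the function name `ridge` are assumptions for illustration. The lasso has no closed form in general and needs an iterative solver (for example the `Lasso` estimator in scikit-learn), so it is not shown here.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients minimizing ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    p = X.shape[1]
    # Adding lam * I makes the system solvable even when p > N.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
N, p = 20, 100  # hypothetical sizes with many more variables than observations
X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]  # only three variables matter in this simulation
y = X @ beta_true + 0.1 * rng.normal(size=N)

b_small = ridge(X, y, lam=0.1)
b_large = ridge(X, y, lam=10.0)

print(f"coefficient norm, lam=0.1 : {np.linalg.norm(b_small):.3f}")
print(f"coefficient norm, lam=10.0: {np.linalg.norm(b_large):.3f}")
```

A larger penalty shrinks the coefficients further toward zero; in practice the penalty is usually chosen by cross-validation.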