If you're going to use a stepwise procedure, don't resample. Create one random subsample once and for all. Perform your analysis on it. Validate the results against the held-out data. It's likely most of the "significant" variables will turn out not to be significant.
(Edit 12/2015: You can indeed go beyond such a simple approach by resampling, repeating the stepwise procedure, and re-validating: this will lead you into a form of cross-validation. But in such a case more sophisticated methods of variable selection, such as ridge regression, the Lasso, and the Elastic Net are likely preferable to stepwise regression.)
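In R, the glmnet package implements all three of those penalized methods; below is a minimal sketch of a cross-validated lasso fit, assuming a numeric predictor matrix X and a 0/1 outcome vector y (both names are illustrative, not from the question).

library(glmnet)
set.seed(1)
# alpha = 1 is the lasso; alpha values between 0 and 1 give the elastic net
cv.fit = cv.glmnet(X, y, family="binomial", alpha=1)
coef(cv.fit, s="lambda.1se")  # coefficients at the 1-SE lambda; many will be exactly zero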
Focus on the variables that make sense, not those that fit the data a little better. If you have more than a handful of variables for 330 records, you're at great risk of overfitting in the first place. Consider using fairly severe entering and leaving criteria for the stepwise regression, and base them on AIC or $C_p$ rather than on thresholds for $F$ tests or $t$ tests.
(I presume you have already carried out the analysis and exploration to identify appropriate re-expressions of the independent variables, that you have identified likely interactions, and that you have established that there really is an approximately linear relationship between the logit of the dependent variable and the regressors. If not, do this essential preliminary work and only then return to the stepwise regression.)
Be cautious about following generic advice like I just gave, by the way :-). Your approach should depend on the purpose of the analysis (prediction? extrapolation? scientific understanding? decision making?) as well as the nature of the data, the number of variables, etc.
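To make the single-split recommendation above concrete, here is a rough sketch in R, assuming a data frame dat with a 0/1 outcome y and the candidate regressors as its other columns (these names are illustrative, not part of the original question).

set.seed(1)
# one random split, made once and for all
train.rows = sample(nrow(dat), size=floor(nrow(dat)/2))
train = dat[train.rows, ]
test  = dat[-train.rows, ]
# stepwise logistic regression on the training half, using AIC (k=2) as the criterion
null.fit = glm(y ~ 1, data=train, family=binomial)
full.fit = glm(y ~ ., data=train, family=binomial)
step.fit = step(null.fit, scope=formula(full.fit), direction="both", k=2)
# validate against the held-out half
pred = predict(step.fit, newdata=test, type="response")
mean((pred > 0.5) == test$y)  # simple held-out accuracy (y assumed coded 0/1)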
I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.
Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features". Let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say that all your features are binary. You "believe that the class label can be accurately predicted using only a few features", so let's say you will look for sets of at most 9 features. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of 9 features will also fit by chance alone?
Some (overly) simple math and simulations should provide some clues to this question. First, with 9 variables, each of which could be 0 or 1, the number of patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values; there are no replicates. Without replicates that share the same predictor values but differ in the response (some y=0 and some y=1), you will have complete separation, and perfect prediction of every observation will be possible.
Below, I have a simulation (coded in R) to see how often you might have no pattern of x-values that occurs with both y=0 and y=1. The way it works is that I draw 90 numbers from 1 to 512, each representing one of the possible patterns, and check whether any of the patterns in the first 45 (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.
set.seed(7938)  # this makes the simulation exactly reproducible
my.fun = function(){
  # draw 90 observations, each labeled by which of the 512 possible
  # patterns of 9 binary predictors it displays
  x = sample.int(512, size=90, replace=TRUE)
  # TRUE if no pattern among the first 45 (the y=0s) also appears among the
  # second 45 (the y=1s), i.e., complete separation / perfect prediction
  return(sum(x[1:45] %in% x[46:90]) == 0)
}
n.unique = replicate(10000, my.fun())  # repeat the experiment 10,000 times
mean(n.unique)  # [1] 0.0181
The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be $\binom{1991}{9} \approx 1.3\times 10^{24}$ (since we've stipulated that the true 9 deterministic causal variables are in your set). However, many of those sets will be overlapping; there will be $1991 / 9 \approx 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
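These back-of-the-envelope figures are easy to check directly in R:

choose(1991, 9)  # ~1.3e24 possible sets of 9 drawn from the remaining 1,991 variables
1991 / 9         # ~221 non-overlapping sets of 9 within one partition
221 * 0.0181     # ~4 sets expected to separate the data perfectly by chance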
Note that these results only consider cases where you have a relatively larger dataset (at the high end of the "tens"), a relatively smaller number of variables (at the low end of the "thousands"), and where every single observation can be predicted perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.
Tl;dr, the basic point here is that your set of variables is way too large / high dimensional, and your amount of data is way too small, for anything to be possible. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.
Best Answer
What you observed follows from statistical theory and is completely expected. And "best fitting on the first dataset" is a by-product of overfitting and is not really interesting.
Why the need to reduce the number of features to 10-30? Why is parsimony good? Why not just fit an L2 penalized model with all features?
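For instance, a ridge-penalized logistic regression that keeps every feature can be sketched with glmnet (ridge is alpha = 0), again assuming a predictor matrix X and a 0/1 outcome y (illustrative names only):

library(glmnet)
set.seed(1)
ridge.fit = cv.glmnet(X, y, family="binomial", alpha=0)  # ridge: shrinks coefficients but never drops features
coef(ridge.fit, s="lambda.min")  # all features retained, with shrunken coefficients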