You could estimate the mutual information of each feature $i$ with the class label (also known as the expected information gain),
$$I[C, X_i] = H[C] - H[C \mid X_i].$$
The most informative feature is the one which, on average, leaves the least remaining uncertainty about $C$, as measured by the conditional entropy $H[C \mid X_i]$. We can estimate this entropy by averaging over data points:
$$H[C \mid X_i] \approx -\frac{1}{N} \sum_n \sum_c p(c \mid x_{ni}) \log p(c \mid x_{ni}).$$
In your case, presumably,
$$p(c \mid x_{ni}) = \frac{\mathcal{N}(x_{ni}; \mu_{ci}, \sigma_{ci}^2)}{\sum_{c'} \mathcal{N}(x_{ni}; \mu_{c'i}, \sigma_{c'i}^2)}.$$
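Putting the two formulas together, a minimal sketch of the per-feature estimate might look like the following. This assumes a Gaussian class-conditional density for each feature, fitted by per-class sample means and standard deviations; the function name and helper are illustrative, not from the original answer.

```python
import numpy as np

def _gauss_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def feature_mutual_information(X, y, eps=1e-12):
    """Estimate I[C, X_i] = H[C] - H[C | X_i] for each feature i,
    using the Gaussian posterior p(c | x_ni) defined above."""
    classes = np.unique(y)
    N, M = X.shape

    # H[C] from the empirical class frequencies
    priors = np.array([np.mean(y == c) for c in classes])
    H_C = -np.sum(priors * np.log(priors + eps))

    mi = np.empty(M)
    for i in range(M):
        # Class-conditional densities N(x_ni; mu_ci, sigma_ci^2), shape (N, n_classes)
        dens = np.stack([
            _gauss_pdf(X[:, i], X[y == c, i].mean(), X[y == c, i].std() + eps)
            for c in classes
        ], axis=1)
        # Posterior p(c | x_ni) as the normalized ratio of densities
        post = dens / (dens.sum(axis=1, keepdims=True) + eps)
        # H[C | X_i] ~= -(1/N) sum_n sum_c p(c | x_ni) log p(c | x_ni)
        H_C_given_Xi = -np.mean(np.sum(post * np.log(post + eps), axis=1))
        mi[i] = H_C - H_C_given_Xi
    return mi
```

An informative feature (well-separated class means) should score close to $H[C]$, while a pure-noise feature should score near zero.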
Finding the most informative combination of features is a bit trickier. Say you want to find the three most informative features, then you would have to estimate $H[C \mid (X_i, X_j, X_k)]$ for all $\binom{M}{3}$ possible combinations of three features.
When choosing many features, the exhaustive search quickly becomes infeasible, so you could instead try a greedy approach: first pick the most informative single feature $i$. Then choose the second feature based on $H[C \mid (X_i, X_j)]$, fixing $i$ and testing all $M - 1$ remaining choices for $j$, and so on.
Best Answer
It really depends on what your ultimate goal is. If you only care about overall accuracy, and the class priors you observe in your training set are a good estimate of what you are likely to see in the world, then you should not do anything to your data. It is worth noting that you will likely end up with a classifier which overwhelmingly predicts $A$, but this is what you would expect, and it makes sense from a decision-theoretic point of view.
On the other hand, if you care about things like precision and recall for both classes, or you believe the true class priors are not as skewed as those observed in your training set, then you will need to do something to deal with the class imbalance. Rather than repeat it here, I'll point you to this answer I previously posted on methods for dealing with class imbalance.
As to the last part of your question, the answer is yes: this applies to classifiers in general, not just Naive Bayes.