This is common in NLP: manipulating very high dimensional feature vectors. In a very simple case of text classification, each dimension corresponds to a word (or a bigram).
Feature selection, taking again the text classification task, can be done very naturally in logistic regression with an $l_1$-norm penalty. You control the strength of the regularization, and the algorithm automatically prunes away features that don't contribute to the classification task (by setting their coefficients to 0).
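For instance, with scikit-learn (a minimal sketch; the corpus, labels, and the value of `C` are toy choices for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, purely illustrative.
texts = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

# Bag-of-words features: one dimension per word.
X = CountVectorizer().fit_transform(texts)

# The l1 penalty drives uninformative coefficients exactly to 0.
# C controls the regularization strength (smaller C = more pruning).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, labels)

# Surviving features are those with non-zero coefficients.
print((clf.coef_ != 0).sum(), "features kept out of", X.shape[1])
```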
In the Bayesian setting, I believe feature selection is done using sparsity-inducing priors such as the Laplace prior, the Bayesian counterpart of the $l_1$ penalty. Unfortunately I am not very proficient in this field.
Having some custom rules to prune away features is nice and instructive, but for a real problem, let a well-tested learning algorithm do this for you.
If one categorical variable has high cardinality, wouldn't encoding it this way "overpower" other (for example, binary) variables?
It depends on the algorithm.
Algorithms based on column sampling (random forests, extremely randomized trees, gradient boosting, or a bagged classifier...) train many models on subsamples of the data. If 90% of your columns represent a single "dummified" variable, it is likely that a large number of those models are actually working on the same variable, making them more correlated than they should be and thus harming performance.
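To see the imbalance concretely, here is a hedged sketch with pandas (the column names and sizes are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: one high-cardinality nominal column, one binary column.
df = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(100)], size=1000),
    "is_member": rng.integers(0, 2, size=1000),
})

dummies = pd.get_dummies(df, columns=["city"])
# About 100 of the ~101 columns now encode the same underlying variable,
# so uniform column sampling will pick a "city" column almost every time.
print(dummies.shape)  # about (1000, 101): 100 dummy columns + is_member
```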
Linear regression methods will not be affected; they simply assign a weight to each binary column produced by the encoding.
With nearest neighbours and similarity-based methods (such as kernel SVMs), the impact should be limited as well. No matter the number of columns, the only thing that matters in the end is the inner product or the distance between two rows of your data. However many columns stem from a nominal variable, that variable's contribution to the inner product can only be 0 or 1 (the nominal values were equal or not), so its influence does not grow with its cardinality.
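A quick numpy illustration of this point (the one-hot vectors are toy examples for a 4-level variable):

```python
import numpy as np

# One-hot encodings of a hypothetical 4-level nominal variable.
red  = np.array([1, 0, 0, 0])
blue = np.array([0, 1, 0, 0])
red2 = np.array([1, 0, 0, 0])

# The inner product contributed by the variable is 1 iff the values match.
print(red @ red2)  # 1  (same category)
print(red @ blue)  # 0  (different categories)

# Likewise the squared Euclidean contribution is a constant (0 or 2),
# regardless of how many levels the variable has.
print(((red - blue) ** 2).sum())  # 2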
If our classifier model is aware of relationships between variables, wouldn't it unnecessarily attempt to find relationships between introduced binary "components" of the same variable?
How is your classifier "aware" of relationships between variables? I am not sure I can address this question.
And if so, how could this be addressed?
In the case of any method relying on column sampling, prior weights could be given to the columns (so that they are not all selected with the same probability). However, I do not have any implementation in mind that does this. A quick fix could be to repeat the other columns, so that their likelihood of being selected artificially increases.
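A sketch of that quick fix (the helper name, column layout, and repetition factor are all made up for illustration):

```python
import numpy as np

def rebalance_columns(X, dummy_cols, other_cols, factor):
    """Hypothetical quick fix: duplicate the non-dummy columns `factor`
    times so that a uniform column sampler picks them more often."""
    repeated = np.repeat(X[:, other_cols], factor, axis=1)
    return np.hstack([X[:, dummy_cols], repeated])

# Suppose X has 100 dummy columns (0..99) and 2 ordinary columns (100, 101).
X = np.random.rand(500, 102)
X_balanced = rebalance_columns(X, list(range(100)), [100, 101], factor=50)
print(X_balanced.shape)  # (500, 200): dummies and other columns now 50/50
```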
Best Answer
There are possibly many ways to tackle this, depending on your data, feature cardinality, etc.
For an in-depth analysis of different encoding schemes and their performance, please refer to this great article: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931