When you analyse the proof that the Bonferroni method controls the family-wise type I error rate, you see that no assumptions are needed; it essentially uses only Boole's inequality. In particular, Bonferroni does not require e.g. an independence assumption.
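To make this concrete, the whole proof is essentially one line. Write $I$ for the index set of the true null hypotheses, $m_0 = |I| \le m$, and test each of the $m$ hypotheses at level $\alpha/m$:

$$P\Big(\bigcup_{i \in I} \{\text{reject } H_i\}\Big) \;\le\; \sum_{i \in I} P(\text{reject } H_i) \;\le\; m_0 \cdot \frac{\alpha}{m} \;\le\; \alpha.$$

The first inequality is Boole's inequality, and nothing about dependence between the tests is used anywhere.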
However, the proof only shows that the probability of a family-wise type I error is at most $\alpha$; the Bonferroni method can have a type I error probability that is strictly smaller than $\alpha$ (and this results in a loss of power).
The cases where the type I error probability is strictly smaller than $\alpha$ (one says that Bonferroni is conservative in these cases) occur when the tests are dependent or when the p-values of the individual tests are themselves conservative. The latter can happen for discrete random variables: in a univariate test for a binomial variable, for example, the attainable type I error probability may be strictly smaller than $\alpha$.
Note that Holm's method also controls the family-wise type I error probability and its power is at least as good as that of the Bonferroni method; a small comparison is sketched below. For discrete random variables, like the binomial, other multiple-testing correction methods have been shown to be more powerful (e.g. minP).
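As a quick illustration (a minimal sketch using statsmodels; the p-values are made up for the example):

```python
# Compare Bonferroni and Holm adjustments on the same set of p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.020, 0.035, 0.120]  # hypothetical p-values

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject, p_adj.round(3))
```

Holm rejects every hypothesis that Bonferroni rejects, and possibly more, while still controlling the family-wise error rate.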
So to summarise: the Bonferroni method does not need any additional assumptions to show that the family-wise type I error probability is controlled; however, it can be conservative.
What happens if the residuals are not homoscedastic, i.e. if the residuals show an increasing or decreasing pattern in the Residuals vs. Fitted plot?
If the error term is not homoscedastic (we use the residuals as a proxy for the unobservable error term), the OLS estimator is still consistent and unbiased, but it is no longer the most efficient in the class of linear unbiased estimators. It is the GLS estimator that now enjoys this property.
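In practice, a common alternative to full GLS is to keep the OLS point estimates but base inference on heteroskedasticity-robust (White-type) standard errors. A minimal sketch with statsmodels on simulated heteroscedastic data (all numbers are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, n)
# Error standard deviation grows with x: a heteroscedastic error term.
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x, n)

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()              # assumes homoscedasticity
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-robust SEs

print(classical.bse)  # standard errors under the (wrong) classical assumption
print(robust.bse)     # same coefficients, corrected standard errors
```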
What happens if the residuals are not normally distributed and fail the Shapiro-Wilk test? The Shapiro-Wilk test of normality is a very strict test, and sometimes the data fail it even when the normal Q-Q plot looks somewhat reasonable.
Normality is not required by the Gauss-Markov theorem. The OLS estimator is still BLUE but without normality you will have difficulty doing inference, i.e. hypothesis testing and confidence intervals, at least for finite sample sizes. There is still the bootstrap, however.
Asymptotically this is less of a problem since the OLS estimator has a limiting normal distribution under mild regularity conditions.
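For example, here is a minimal sketch of a pairs (case-resampling) bootstrap percentile interval for a regression slope, with simulated heavy-tailed (clearly non-normal) errors; everything in it is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x + rng.standard_t(df=3, size=n)  # heavy-tailed errors

def ols_slope(x, y):
    # OLS slope from regressing y on [1, x].
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

boot = np.empty(2000)
for b in range(boot.size):
    idx = rng.integers(0, n, n)          # resample (x, y) pairs with replacement
    boot[b] = ols_slope(x[idx], y[idx])

print(np.percentile(boot, [2.5, 97.5]))  # 95% percentile bootstrap CI for the slope
```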
What happens if one or more predictors are not normally distributed, do not look right on the normal Q-Q plot, or if the data fail the Shapiro-Wilk test?
As far as I know the predictors are either considered fixed or the regression is conditional on them. This limits the effect of non-normality.
What does failing normality mean for a model that is a good fit according to the R-squared value? Does it become less reliable, or completely useless?
The R-squared is the proportion of the variance explained by the model. It does not require the normality assumption and it is a measure of goodness of fit regardless. If you want to use it for a partial F-test, though, that is quite another story.
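For reference, $R^2$ is a pure sums-of-squares quantity, with no distributional assumption behind it:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}.$$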
To what extent is the deviation acceptable, or is it acceptable at all?
Deviation from normality you mean, right? It really depends on your purposes because as I said, inference becomes hard in the absence of normality but is not impossible (bootstrap!).
When applying transformations to the data to meet the normality criterion, does the model get better as the data become more normal (higher p-value on the Shapiro-Wilk test, better-looking normal Q-Q plot), or is it useless (equally good or bad compared to the original) until the data pass the normality test?
In short, if you have all the Gauss-Markov assumptions plus normality, then the OLS estimator is Best Unbiased (BUE), i.e. the most efficient among all unbiased estimators: the Cramér-Rao lower bound is attained. This is desirable, of course, but it is not the end of the world if it does not happen. The above remarks apply.
Regarding transformations, bear in mind that while the distribution of the response might be brought closer to normality, interpretation might not be straightforward afterwards.
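For example, the Box-Cox family is a common choice; a minimal sketch with scipy (simulated skewed response, purely illustrative) that also shows why interpretation changes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=0.8, size=200)  # right-skewed, positive response

y_bc, lam = stats.boxcox(y)  # lambda chosen by maximum likelihood
print(lam)
print(stats.shapiro(y).pvalue, stats.shapiro(y_bc).pvalue)

# The transformed response may look much more normal, but the model is now
# for (y**lam - 1) / lam, so coefficients no longer have a direct
# interpretation on the original scale of y.
```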
These are just some short answers to your questions. You seem to be particularly concerned with the implications of non-normality. Overall, I would say that it is not as catastrophic as people (have been made to?) believe and there are workarounds. The two references I have included are a good starting point for further reading, the first one being of theoretical nature.
References:
Hayashi, Fumio. "Econometrics." Princeton University Press, 2000.
Kutner, Michael H., et al. "Applied Linear Statistical Models." McGraw-Hill Irwin, 2005.
Best Answer
Generally speaking, the answer is yes, both type I and type II error rates are impacted by choosing tests on the basis of tests of assumptions.
This is pretty well established for pre-testing equality of variances (several papers point it out) and for pre-testing normality, and it should be expected to be the case in general.
The advice is usually along the lines of "if you can't make the assumption without testing, better to simply act as if the assumption doesn't hold".
So, for example, if you're trying to decide between the equal-variance and Welch-type t-tests, by default use the Welch test (though under equal sample size it is robust to violations of that assumption).
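In scipy this is just a matter of the equal_var flag (made-up samples, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.3, 2.0, 30)   # group b has a larger variance

# Welch's t-test: does not assume equal variances.
print(stats.ttest_ind(a, b, equal_var=False))
```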
Similarly, in moderately small$^*$ samples, you may be better off using a permutation test for location by default than testing for normality and then using a t-test if you fail to reject. (In large samples, the t-test is usually level-robust enough that it's not likely to be that big an issue in most cases, if the sample is also large enough that you're not concerned about the impact on power.) Alternatively, the Wilcoxon-Mann-Whitney test has very good power compared to the t-test at the normal and would often be a very viable alternative; both are sketched below.
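A minimal sketch of both alternatives with scipy (permutation_test requires scipy >= 1.7; the data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.exponential(1.0, 20)   # small, skewed samples
b = rng.exponential(1.5, 20)

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

# Permutation test for a difference in means.
perm = stats.permutation_test((a, b), mean_diff,
                              permutation_type="independent", n_resamples=9999)
print(perm.pvalue)

# Wilcoxon-Mann-Whitney as a rank-based alternative.
print(stats.mannwhitneyu(a, b).pvalue)
```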
[If for some reason you need to test, it would be best to be aware of the extent to which the significance level and power of the tests may be affected under either arm of any resulting choice the test of assumptions leads you to. This will depend on the particular circumstances; for example, simulation can be used to help investigate the behavior in similar situations, as in the sketch below.]

$^*$ (but not very small, since the discreteness of the test statistic will limit the available significance levels too much; specifically, at very small sample sizes the smallest possible significance level may be impractically large)
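As an example of the kind of simulation meant here, the following sketch (assumed setup: normal data with equal means, unequal variances, and unequal sample sizes, pre-testing variance equality with Levene's test at the 5% level) estimates the type I error rate of the two-stage procedure against always using Welch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sim, n1, n2, alpha = 5000, 10, 30, 0.05
rej_two_stage = rej_welch = 0

for _ in range(n_sim):
    # Equal means, so every rejection is a type I error; the smaller
    # group has the larger variance, where the pooled test is liberal.
    a = rng.normal(0.0, 3.0, n1)
    b = rng.normal(0.0, 1.0, n2)

    # Two-stage procedure: pre-test the variances, then pick the t-test.
    if stats.levene(a, b).pvalue < alpha:
        p = stats.ttest_ind(a, b, equal_var=False).pvalue  # Welch
    else:
        p = stats.ttest_ind(a, b, equal_var=True).pvalue   # pooled
    rej_two_stage += p < alpha

    # Always-Welch strategy, no pre-test.
    rej_welch += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha

print("two-stage:", rej_two_stage / n_sim, "always Welch:", rej_welch / n_sim)
```

Comparing both estimated rejection rates with the nominal 5% shows how much the pre-test distorts the level in this particular setup.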
A reference (with a link to more) on testing heteroskedasticity when choosing between equal-variance-t vs Welch-t location tests is here.
I also have one for the case of testing normality before choosing between the t test and the Wilcoxon-Mann-Whitney test (ref [3] here).