IMO (as not-a-logician or formally trained statistician per se), one shouldn't take any of this language too seriously. Even rejecting a null when $p<.001$ doesn't make the null false without a doubt. What's the harm in "accepting" the alternative hypothesis in a similarly provisional sense then? It strikes me as a safer interpretation than "accepting the null" in the opposite scenario (i.e., a large, nonsignificant $p$), because the alternative hypothesis is so much less specific. E.g., given $\alpha=.05$, if $p=.06$, there's still a 94% chance that future studies would find an effect at least as different from the null*, so accepting the null isn't a smart bet even if one cannot reject it. Conversely, if $p=.04$, one can reject the null, which I've always understood to imply favoring the alternative. Why not "accepting" it? The only reason I can see is the fact that one could be wrong, but the same applies when rejecting.
The alternative isn't a particularly strong claim, because as you say, it covers the whole "space". To reject your null, one must find a reliable effect on either side of the null such that the confidence interval doesn't include the null. Given such a confidence interval (CI), the alternative hypothesis is true of it: all values within are unequal to the null. The alternative hypothesis is also true of values outside the CI but more different from the null than the most extremely different value within the CI (e.g., if $\rm CI_{95\%}=[.6,.8]$, it wouldn't even be a problem for the alternative hypothesis if $\mathbb P(\rm head)=.9$). If you can get a CI like that, then again, what's not to accept about it, let alone the alternative hypothesis?
There might be some argument of which I'm unaware, but I doubt I'd be persuaded. Pragmatically, it might be wise not to write that you're accepting the alternative if there are reviewers involved, because success with them (as with people in general) often depends on not defying expectations in unwelcome ways. There's not much at stake anyway if you're not taking "accept" or "reject" too strictly as the final truth of the matter. I think that's the more important mistake to avoid in any case.
It's also important to remember that the null can be useful even if it's probably untrue. In the first example I mentioned, where $p=.06$, failing to reject the null isn't the same as betting that it's true, but it's basically the same as judging it scientifically useful. Rejecting it is basically the same as judging the alternative to be more useful. That seems close enough to "acceptance" to me, especially since the alternative isn't much of a hypothesis to accept.
BTW, this is another argument for focusing on CIs: if you can reject the null using Neyman–Pearson-style reasoning, then it doesn't matter how much smaller than $\alpha$ the $p$ is for the sake of rejecting the null. It may matter by Fisher's reasoning, but if you can reject the null at a level of $\alpha$ that works for you, then it might be more useful to carry that $\alpha$ forward into a CI instead of just rejecting the null more confidently than you need to (a sort of statistical "overkill"). If you have a comfortable error rate $\alpha$ in advance, try using it to describe what you think the effect size could be within a $\rm CI_{(1-\alpha)}$. This is probably more useful than accepting a vaguer alternative hypothesis for most purposes.
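To make that concrete, here's a minimal sketch in Python (assuming SciPy; the coin counts are invented for illustration) of reporting a $\rm CI_{(1-\alpha)}$ at your pre-chosen $\alpha$ rather than just noting how far below $\alpha$ the $p$ fell:

```python
from scipy.stats import binomtest

alpha = 0.05                                   # error rate chosen in advance
result = binomtest(k=70, n=100, p=0.5)         # 70 heads in 100 flips vs. H0: P(head) = .5

print(result.pvalue)                           # far below alpha, but that's not the point
ci = result.proportion_ci(confidence_level=1 - alpha)
print(ci.low, ci.high)                         # roughly [.60, .79]: a range of plausible
                                               # effect sizes, all unequal to .5
```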
* Another important point about the interpretation of this example $p$ value: it represents that chance only under the assumption that the null is true. If the null is untrue, as the evidence here would seem to suggest (albeit not persuasively enough for conventional scientific standards), then that chance is even greater. In other words, even if the null is true (but one doesn't know this), it wouldn't be wise to bet on it in this case, and the bet is even worse if the null is untrue!
What happens if the residuals are not homoscedastic, e.g., if they show an increasing or decreasing pattern in the Residuals vs. Fitted plot?
If the error term is not homoscedastic (we use the residuals as a proxy for the unobservable error term), the OLS estimator is still consistent and unbiased, but it is no longer the most efficient in the class of linear unbiased estimators. It is the GLS estimator that now enjoys this property.
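A minimal simulation of this point, assuming numpy and statsmodels (the data-generating process is invented): OLS stays unbiased under heteroscedasticity, but WLS (GLS with a known diagonal covariance) gets tighter standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
X = sm.add_constant(x)
sigma = 0.5 * x                                # error spread grows with x (heteroscedastic)
y = 2 + 3 * x + rng.normal(0, sigma)

ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1 / sigma**2).fit() # weights = inverse variances

print(ols.params, ols.bse)                     # unbiased, but larger standard errors
print(wls.params, wls.bse)                     # same target, typically smaller standard errors
```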
What happens if the residuals are not normally distributed and fail the Shapiro-Wilk test? The Shapiro-Wilk test of normality is very strict, and sometimes the data fail it even when the normal Q-Q plot looks somewhat reasonable.
Normality is not required by the Gauss-Markov theorem. The OLS estimator is still BLUE but without normality you will have difficulty doing inference, i.e. hypothesis testing and confidence intervals, at least for finite sample sizes. There is still the bootstrap, however.
Asymptotically this is less of a problem since the OLS estimator has a limiting normal distribution under mild regularity conditions.
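As a sketch of the bootstrap route (numpy only; the skewed errors are simulated just to have something non-normal), here is a pairs bootstrap CI for a simple-regression slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + rng.exponential(1, n) - 1      # mean-zero but skewed, non-normal errors

def slope(x, y):
    # OLS slope for simple regression: S_xy / S_xx
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

boot = np.empty(5000)
for b in range(5000):
    idx = rng.integers(0, n, n)                # resample (x, y) pairs with replacement
    boot[b] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])      # 95% percentile bootstrap CI
print(slope(x, y), (lo, hi))
```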
What happens if one or more predictors are not normally distributed, do not look right on the normal Q-Q plot, or fail the Shapiro-Wilk test?
As far as I know, the predictors are either considered fixed or the regression is carried out conditional on them; no normality assumption is placed on them in the first place, which limits the effect of their non-normality.
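A quick simulation illustrating this, assuming numpy and statsmodels (the true slope of 2 is arbitrary): with a badly skewed predictor but well-behaved errors, the slope CI still covers at about its nominal rate:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
covered = 0
for _ in range(1000):
    x = rng.lognormal(0, 1, 100)               # predictor is badly non-normal
    y = 1 + 2 * x + rng.normal(0, 1, 100)      # errors are fine
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    lo, hi = fit.conf_int()[1]                 # 95% CI for the slope
    covered += lo <= 2 <= hi

print(covered / 1000)                          # close to the nominal 0.95
```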
What does failing the normality test mean for a model that is a good fit according to the R-squared value? Does it become less reliable, or completely useless?
The R-squared is the proportion of the variance explained by the model. It does not require the normality assumption, and it's a measure of goodness of fit regardless. If you want to use it for a partial F-test, though, that is quite another story.
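Indeed, R-squared is nothing more than $1-\rm SS_{res}/SS_{tot}$, a descriptive variance ratio; a short numpy sketch makes the absence of any normality assumption plain:

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total variation
    return 1 - ss_res / ss_tot                 # no distributional assumption anywhere
```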
To what extent is the deviation acceptable, or is it acceptable at all?
Deviation from normality you mean, right? It really depends on your purposes because as I said, inference becomes hard in the absence of normality but is not impossible (bootstrap!).
When applying transformations to the data to meet the normality criterion, does the model get better as the data become more normal (higher $p$ value on the Shapiro-Wilk test, better-looking normal Q-Q plot), or is it useless (equally good or bad compared to the original) until the data pass a normality test?
In short, if you have all the Gauss-Markov assumptions plus normality, then the OLS estimator is Best Unbiased (BUE), i.e. the most efficient among all unbiased estimators, linear or not: the Cramér-Rao lower bound is attained. This is desirable, of course, but it's not the end of the world if it does not happen. The above remarks apply.
Regarding transformations, bear in mind that while the distribution of the response might be brought closer to normality, the interpretation of the fitted model might not be straightforward afterwards.
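For instance (a sketch assuming scipy; the skewed response is simulated), a Box-Cox transform can pull the response toward normality, but the model is then fit on the transformed scale, which is exactly what complicates interpretation:

```python
import numpy as np
from scipy.stats import boxcox, shapiro

rng = np.random.default_rng(3)
y = rng.lognormal(0, 0.5, 200)                 # right-skewed, strictly positive response

y_t, lam = boxcox(y)                           # lambda chosen by maximum likelihood;
                                               # lam near 0 here means roughly a log transform
print(shapiro(y).pvalue, shapiro(y_t).pvalue, lam)
```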
These are just some short answers to your questions. You seem particularly concerned with the implications of non-normality. Overall, I would say it is not as catastrophic as people (have been made to?) believe, and there are workarounds. The two references I have included are a good starting point for further reading, the first being the more theoretical of the two.
References:
Hayashi, Fumio. "Econometrics." Princeton University Press, 2000.
Kutner, Michael H., et al. "Applied Linear Statistical Models." McGraw-Hill Irwin, 2005.
Best Answer
There is no such thing as a test that shows your data are normally distributed, only tests that can show your data are not normally distributed. That is, there are tests like the Shapiro-Wilk where $H_0\!: \rm normal$ (there are many others), but no tests where the null is that the population is not normal and the alternative hypothesis is that the population is normal.
All you can do is figure out what kind of deviation from normality you care about (e.g., skewness), and how big that deviation would have to be before it bothered you. Then you could test to see if the deviation from perfect normality in your data was less than the critical amount. For more information on the general idea it might help to read my answer here: Why do statisticians say a non-significant result means “you can't reject the null” as opposed to accepting the null hypothesis?
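As a sketch of that idea (the $\pm$0.5 skewness band is an illustrative choice, and scipy >= 1.7 is assumed for scipy.stats.bootstrap): decide how much skewness you could live with, then check whether a bootstrap CI for the sample skewness stays inside that band:

```python
import numpy as np
from scipy.stats import bootstrap, skew

rng = np.random.default_rng(4)
data = rng.normal(0, 1, 300)                   # stand-in for your sample

res = bootstrap((data,), skew, confidence_level=0.95, random_state=rng)
lo, hi = res.confidence_interval
print(lo, hi)                                  # entirely inside (-0.5, 0.5)? then the
                                               # deviation is below your critical amount
```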