Why does the linear test statistic of a GLM follow an F-distribution?
It doesn't.
> Then, the test statistic will follow an $F$-distribution [...] does this hold for all generalized linear models?
There's no result that establishes it in the general case, and indeed we can show (e.g. by simulation in particular instances) that it's not the case in general.
It holds for the Gaussian case, of course, but the derivation relies on the normality of the data. You can see it's not the case for logistic regression, since the data (and hence $F$-statistics based on the data) are discrete.
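A minimal simulation sketch in Python (assuming numpy, scipy, and statsmodels; the sample size, effect size, and Pearson-based dispersion estimate are illustrative choices, not anything from the original answer): simulate logistic-regression data in which an extra predictor truly has zero coefficient, form the deviance-based $F$-style statistic, and check how often it exceeds $F$ critical values.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, reps, q = 50, 2000, 1          # deliberately smallish sample; q = restrictions tested
fstats = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)                        # true coefficient is 0, so H0 holds
    y = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x1)))
    full = sm.GLM(y, sm.add_constant(np.column_stack([x1, x2])),
                  family=sm.families.Binomial()).fit()
    reduced = sm.GLM(y, sm.add_constant(x1),
                     family=sm.families.Binomial()).fit()
    phi_hat = full.pearson_chi2 / full.df_resid    # Pearson-based dispersion estimate
    fstats.append((reduced.deviance - full.deviance) / q / phi_hat)

fstats = np.array(fstats)
# If the statistic really followed F(q, n - 3), each rejection rate would be ~ alpha.
for alpha in (0.10, 0.05, 0.01):
    crit = stats.f.ppf(1 - alpha, q, n - 3)
    print(f"alpha={alpha}: rejection rate {np.mean(fstats > crit):.3f}")
```

With binary data the statistic takes only finitely many values at this $n$, which is the discreteness point above made visible.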
There is an asymptotic chi-square result: under the null, the deviance difference between nested GLMs is asymptotically chi-squared. This, combined with Slutsky's theorem, should give us that the $F$-statistic will asymptotically be distributed as a scaled chi-square.
However, in sufficiently large samples (where how large "large" is will depend on a number of things), we might anticipate that the $F$ distribution would still be approximately correct, since the $F$ distribution used to compute p-values and the actual distribution of the test statistic converge to the same scaled chi-square distribution asymptotically.
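Writing the argument out (just the standard asymptotic sketch, with $D_0 - D_1$ the deviance drop on $q$ degrees of freedom, $\hat{\phi}$ a consistent estimate of the dispersion $\phi$, and $\nu$ the denominator degrees of freedom):

$$F = \frac{(D_0 - D_1)/q}{\hat{\phi}}, \qquad \frac{D_0 - D_1}{\phi} \xrightarrow{d} \chi^2_q, \quad \hat{\phi} \xrightarrow{p} \phi \;\;\Longrightarrow\;\; F \xrightarrow{d} \chi^2_q/q \ \text{(by Slutsky)},$$

while also $F_{q,\nu} \xrightarrow{d} \chi^2_q/q$ as $\nu \to \infty$, so the reference distribution and the statistic's actual distribution share the same limit.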
We see the same issue with the common use of t-tests for parameter significance in GLMs (which many packages report), even though the statistic is only $t$-distributed in the Gaussian case; for the others we only have an asymptotic normality result (but a similar argument for why the $t$ shouldn't do badly in sufficiently large samples can be made).
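Spelled out the same way (again just the standard sketch, nothing specific to any package):

$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\operatorname{se}}(\hat{\beta}_j)} \xrightarrow{d} N(0, 1), \qquad t_\nu \xrightarrow{d} N(0,1) \ \text{as } \nu \to \infty,$$

so the $t$ reference distribution and the Wald statistic's actual distribution again share a common limit.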
I don't have a good book suggestion. Some books give a handwavy argument for using the $F$ (some akin to mine above), others seem to ignore the need to justify it at all.
It is because GLM parameter estimates are maximum likelihood estimates, and those are asymptotically normal if we assume the observations are independent. See here, for instance. Note that this is only an asymptotic result, so for a particular finite sample your test statistics won't be exactly normal.
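Concretely, the standard result (with $\mathcal{I}_1(\beta)$ the per-observation Fisher information) is:

$$\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow{d} N\!\left(0,\ \mathcal{I}_1(\beta)^{-1}\right),$$

which is what licenses treating Wald statistics built from $\hat{\beta}$ as approximately normal in large samples.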
Best Answer
A GLM is a more general version of a linear model: the linear model is a special case of a Gaussian GLM with the identity link. So the question is then: why do we use other link functions or other mean-variance relationships? We fit GLMs because they answer a specific question that we are interested in.
There is, for instance, nothing inherently wrong with fitting a binary response in a linear regression model if you are interested in the association between these variables. Indeed, if a higher proportion of negative outcomes is observed in the lower half of an exposure's range and a higher proportion of positive outcomes in the upper half, this will yield a positively sloped line which correctly describes a positive association between the two variables, as the sketch below illustrates.
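A toy Python sketch of that point (the data-generating mechanism is invented for illustration, assuming numpy and statsmodels):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=300)                        # a continuous exposure
y = rng.binomial(1, 1 / (1 + np.exp(-x)))       # positive outcomes more common at high x

# OLS on the binary response: the slope is positive, matching the true association.
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(f"OLS slope on a binary response: {ols.params[1]:.3f}")
```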
Alternatively, you might be interested in modeling the aforementioned association using an S-shaped curve. The slope and the intercept of such a curve account for the tendency of risk to flatten out as probabilities approach 0 or 1. Moreover, the slope of a logit curve is interpreted as a log-odds ratio. That motivates use of a logit link function. Similarly, fitted probabilities very close to 1 or 0 may tend to be less variable under replications of the study design, which is captured by the binomial mean-variance relationship $\operatorname{Var}(Y) = \mu(1-\mu)$; together these motivate logistic regression. Along those lines, a more modern approach to this problem would suggest fitting a relative risk model, which utilizes a log link, so that the slope of the exponential trend line is interpreted as a log relative risk, a more practical value than a log-odds ratio.
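A minimal sketch of the two fits (assuming statsmodels; the exposure prevalence and the true relative risk of $e^{0.7}$ are invented illustrations):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x = rng.binomial(1, 0.5, size=n)                 # binary exposure
y = rng.binomial(1, 0.10 * np.exp(0.7 * x))      # true log relative risk = 0.7
X = sm.add_constant(x)

# Logit link: the slope estimates a log-odds ratio.
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
# Log link: the slope estimates a log relative risk directly.
rr_fit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Log())).fit()

print(f"log-odds ratio:    {logit_fit.params[1]:.3f}")   # exceeds 0.7 (odds vs. risk)
print(f"log relative risk: {rr_fit.params[1]:.3f}")      # ~0.7 up to sampling error
```

Because the outcome is fairly rare here, the log-odds ratio is close to the log relative risk; the gap between the two estimates grows as the outcome becomes more common.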