For your second question, you have $\mathbf{y}\sim N(\mathbf{X}\boldsymbol{\beta},\sigma^2 \mathbf{I})$ and suppose you're testing $\mathbf{C}\boldsymbol{\beta}=\mathbf{0}$. So, we have that (the following is all shown through matrix algebra and properties of the normal distribution -- I'm happy to walk through any of these details)
$
\mathbf{C}\hat{\boldsymbol{\beta}}\sim N(\mathbf{0}, \sigma^2 \mathbf{C(X'X)^{-1}C'}).
$
And so,
$
\textrm{Cov}(\mathbf{C}\hat{\boldsymbol{\beta}})=\sigma^2 \mathbf{C(X'X)^{-1}C'},
$
which leads to noting that
$
F_1 = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}})'[\mathbf{C(X'X)^{-1}C'}]^{-1}\mathbf{C}\hat{\boldsymbol{\beta}}}{\sigma^2}\sim \chi^2 \left(q\right).
$
You get the above result because the numerator of $F_1$ is a quadratic form in a normal vector, and we can invoke a standard theorem on such forms. The theorem states that if $\mathbf{x}\sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{x'Ax}\sim \chi^2 (r,p)$ (a noncentral chi-squared with $r=\textrm{rank}(\mathbf{A})$ degrees of freedom and noncentrality parameter $p=\frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$) if and only if $\mathbf{A}\boldsymbol{\Sigma}$ is idempotent. [The proof of this theorem is a bit long and tedious, but it's doable. Hint: use the moment generating function of $\mathbf{x'Ax}$.] Here $q=\textrm{rank}(\mathbf{C})$, and under $H_0:\mathbf{C}\boldsymbol{\beta}=\mathbf{0}$ the noncentrality parameter is zero, so the chi-squared distribution is central.
So, since $\mathbf{C}\hat{\boldsymbol{\beta}}$ is normally distributed, and the numerator of $F_1$ is a quadratic form involving $\mathbf{C}\hat{\boldsymbol{\beta}}$, we can use the above theorem (after proving the idempotent part).
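If it helps to see the theorem in action, here is a minimal simulation sketch (the dimension, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ below are made up, not from anything above): taking $\mathbf{A}=\boldsymbol{\Sigma}^{-1}$ makes $\mathbf{A}\boldsymbol{\Sigma}=\mathbf{I}$ idempotent, so $\mathbf{x'Ax}$ should follow a noncentral chi-squared with $\textrm{rank}(\mathbf{A})$ degrees of freedom and noncentrality $\frac{1}{2}\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$ (note that scipy parameterizes the noncentrality as $\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$, i.e. twice the $p$ above).

```python
# Minimal simulation sketch of the quadratic-form theorem (mu, Sigma made up).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 2.0])
L = np.array([[1.0, 0.0, 0.0],
              [0.3, 1.0, 0.0],
              [-0.2, 0.5, 1.0]])
Sigma = L @ L.T                      # a valid covariance matrix
A = np.linalg.inv(Sigma)             # A @ Sigma = I: idempotent, rank 3

x = rng.multivariate_normal(mu, Sigma, size=20000)
qf = np.einsum('ij,jk,ik->i', x, A, x)        # x'Ax for every draw

r = 3                                # rank(A)
p_nc = 0.5 * mu @ A @ mu             # noncentrality in the theorem's convention
print(np.quantile(qf, [0.5, 0.9, 0.95]))
print(stats.ncx2(r, 2 * p_nc).ppf([0.5, 0.9, 0.95]))   # scipy uses mu'A mu
```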
Then,
$
F_2 = \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X(X'X)^{-1}X'}]\mathbf{y}}{\sigma^2}\sim \chi^2(n-p-1),
$
where $n-p-1=n-\textrm{rank}(\mathbf{X})$ (assuming $\mathbf{X}$ has full column rank, with an intercept plus $p$ predictors). Through some tedious details, you can show that $F_1$ and $F_2$ are independent. And from there you should be able to justify your second $F$ statistic: under $H_0$, $\frac{F_1/q}{F_2/(n-p-1)}\sim F(q,\ n-p-1)$, with the unknown $\sigma^2$ cancelling in the ratio.
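If you want to check that final step numerically, here is a standalone sketch (the design matrix, $\boldsymbol{\beta}$ satisfying $\mathbf{C}\boldsymbol{\beta}=\mathbf{0}$, and $\sigma$ are made up for illustration): the ratio $(F_1/q)/(F_2/(n-p-1))$ should track an $F(q,\,n-p-1)$ distribution.

```python
# Standalone sketch: build F1, F2 from simulated regressions and compare the
# ratio (F1/q)/(F2/(n-p-1)) to the F(q, n-p-1) reference distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, sigma = 200, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
beta = np.array([2.0, 0.0, 0.0, 1.0])                        # satisfies C beta = 0 below
C = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0]])                                  # H0: beta_1 = beta_2 = 0
q = C.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(C @ XtX_inv @ C.T)
M = np.eye(n) - X @ XtX_inv @ X.T                             # residual projection

F_ratio = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    cb = C @ (XtX_inv @ X.T @ y)
    F1 = cb @ middle @ cb            # numerator quadratic form (sigma^2 dropped)
    F2 = y @ M @ y                   # residual sum of squares (sigma^2 dropped)
    F_ratio.append((F1 / q) / (F2 / (n - p - 1)))

print(np.quantile(F_ratio, [0.5, 0.9, 0.95]))
print(stats.f(q, n - p - 1).ppf([0.5, 0.9, 0.95]))            # should be close
```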
Sorry for my painting skills; I will try to give you the following intuition.
Let $f(\beta)$ be the objective function (for example, the MSE in the case of regression). Let's imagine the contour plot of this function in red (of course we draw it in the space of $\beta$; here, for simplicity, just $\beta_1$ and $\beta_2$).
This function has a minimum, at the center of the red circles, and this minimum gives us the non-penalized solution.
Now we add a second objective $g(\beta)$, whose contour plot is drawn in blue: either the LASSO regularizer or the ridge regression regularizer. For LASSO, $g(\beta) = \lambda (|\beta_1| + |\beta_2|)$; for ridge regression, $g(\beta) = \lambda (\beta_1^2 + \beta_2^2)$ ($\lambda$ is the penalization parameter). A contour plot shows the set of points at which the function takes a fixed value, so the larger $\lambda$ is, the faster $g(\beta)$ grows and the "narrower" its contours become.
Now we have to find the minimum of the sum of these two objectives, $f(\beta) + g(\beta)$, and this minimum is attained where the two sets of contours touch each other.
The larger the penalty, the "narrower" the blue contours we get, and the contours then meet at a point closer to zero. And vice versa: the smaller the penalty, the more the blue contours expand, and the intersection of the blue and red contours moves closer to the center of the red circles (the non-penalized solution).
And now comes an interesting point that, to me, best explains the difference between ridge regression and LASSO: in the LASSO case the two sets of contours will most likely meet at a corner of the regularizer (where $\beta_1 = 0$ or $\beta_2 = 0$). In the ridge regression case that is almost never so.
That's why LASSO gives us a sparse solution, making some of the parameters exactly equal to $0$.
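If you want to see that sparsity numerically rather than geometrically, here is a small sketch with made-up data; the scikit-learn estimators and the particular penalty strength are just illustrative choices.

```python
# Small sketch with made-up data: the lasso zeroes some coefficients exactly,
# while ridge only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))   # only two informative features
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)    # alpha plays the role of lambda above
ridge = Ridge(alpha=0.5).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 3))   # several exact zeros
print("ridge coefficients:", np.round(ridge.coef_, 3))   # small but nonzero
```

Increasing alpha pushes more of the lasso coefficients to exactly zero, mirroring the "narrower contours" picture above.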
Hope this explains some of the intuition about how penalized regression works in the space of parameters.
I think that your best bet is the thesis of Dongwen Luo from Massey University, On the geometry of generalized linear models; it is available online here. In particular, you want to focus on Chapter 3, "The Geometry of GLMs" (and more particularly on Section 3.4). He employs two different "geometrical domains": one before and one after the canonical link transformation. Some of the basic theoretical machinery stems from Fienberg's work on The Geometry of an r × c Contingency Table. As advocated in Luo's thesis:
Clearly both $S$ and $A$ need to be at least 2-D and $R^n = S \oplus A$. Under this theoretical framework $\hat{\mu}$ and the data vector $y$ have the same projection onto any direction in the sufficiency space.
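To make that last statement concrete in one special case (my own illustration, not from Luo's thesis): with a canonical link such as the logit, the likelihood equations give $X'(y-\hat{\mu})=0$, so $y$ and $\hat{\mu}$ have the same projection onto the column space of $X$. A short simulated check with statsmodels:

```python
# Simulated check (made-up data): for a logistic GLM with the canonical logit
# link, the fitted mean satisfies X'y = X'mu_hat, i.e. y and mu_hat project
# identically onto the column space of X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
eta = X @ np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
mu_hat = np.asarray(fit.fittedvalues)

print(X.T @ y)        # the sufficient statistic X'y
print(X.T @ mu_hat)   # equal up to numerical error
```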
Assuming you have some knowledge of differential geometry, the book by Kass and Vos, Geometrical Foundations of Asymptotic Inference, should provide a solid foundation on this matter. This paper on The Geometry of Asymptotic Inference is freely available from the author's website.
Finally, to answer your question of whether there is "any geometric interpretation of generalized linear model (logistic regression, Poisson, survival)": yes, there is one, and it depends on the link function used. The observations themselves are viewed as a vector in that link-transformed space. It goes without saying that you will be looking at higher-dimensional manifolds as your sample size and/or the number of columns of your design matrix increases.