The answer to both 1 and 2 is no, but care is needed in interpreting the existence theorem.
Variance of Ridge Estimator
Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard equations 4.2-4.5, the risk (the expected squared $L^2$ norm of the estimation error) is
$$
\begin{align*}
E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\
& = \gamma_1 (k) + \gamma_2(k) \\
& = R(k)
\end{align*}
$$
where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ is the total variance of $\hat{\beta^*}$ (the sum of the variances of its components), while $\gamma_2$ is the squared bias, i.e. the squared distance from $E \hat{\beta^*}$ to $\beta$.
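To make the decomposition concrete, here is a minimal numerical sketch (the design, $\beta$, $\sigma$, and $k$ below are arbitrary choices, not from H&K) that compares the closed-form $R(k) = \gamma_1(k) + \gamma_2(k)$ to a Monte Carlo estimate of $E\left\| \hat{\beta^*} - \beta \right\|^2$; the two numbers should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: any fixed design, true beta, noise level, and penalty work here.
n, p, sigma, k = 200, 4, 1.0, 0.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0])

XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)                  # eigenvalues lambda_1, ..., lambda_p

# Closed-form risk: gamma_1 (total variance) + gamma_2 (squared bias).
A_inv = np.linalg.inv(XtX + k * np.eye(p))
gamma1 = sigma**2 * np.sum(lam / (lam + k)**2)
gamma2 = k**2 * beta @ (A_inv @ A_inv) @ beta
R_closed = gamma1 + gamma2

# Monte Carlo estimate of E||beta_hat* - beta||^2 over fresh noise draws.
errs = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = A_inv @ X.T @ y                 # ridge estimate with penalty k
    errs.append(np.sum((beta_hat - beta)**2))

print(f"closed form R(k) = {R_closed:.4f}, Monte Carlo = {np.mean(errs):.4f}")
```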
If $X^T X = \mathbf{I}_p$, then
$$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$
Let
$$R^\prime (k) = 2\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3}$$ be the derivative of the risk w/r/t $k$.
Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$ (where $R(0) = \sigma^2 \sum_{j=1}^p 1/\lambda_j$), and that as $X^T X$ becomes more ill-conditioned, with its smallest eigenvalue approaching zero, $\lim_{k \rightarrow 0^+} R^\prime (k) = -2\sigma^2 \sum_{j=1}^p \lambda_j^{-2}$ approaches $- \infty$.
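In the orthogonal case, setting $R^\prime(k) = 0$ gives the minimizer explicitly: $k^* = p\sigma^2 / \beta^T \beta$. The sketch below (with arbitrary illustrative values of $p$, $\sigma$, and $\beta^T \beta$) traces $R(k)$ on a grid and checks that the minimum sits at $k^*$ and lies strictly below the OLS risk $R(0) = p\sigma^2$.

```python
import numpy as np

# Orthogonal design: X^T X = I_p, so R(k) = (p*sigma^2 + k^2 * b) / (1 + k)^2
# with b = beta^T beta.  Values below are arbitrary, for illustration only.
p, sigma = 5, 1.0
b = 2.0                                     # beta^T beta

def R(k):
    return (p * sigma**2 + k**2 * b) / (1.0 + k)**2

ks = np.linspace(0.0, 10.0, 100001)
risks = R(ks)

k_grid = ks[np.argmin(risks)]
k_star = p * sigma**2 / b                   # stationary point of R'(k)

print(f"R(0)        = {R(0.0):.4f}")        # OLS risk, p * sigma^2
print(f"grid argmin = {k_grid:.3f}, analytic k* = {k_star:.3f}")
print(f"R(k*)       = {R(k_star):.4f}")     # strictly below R(0)
```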
Comment
There appears to be a paradox here, in that if $p=1$ and $X$ is constant, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that, for each fixed $\beta^T \beta$, there exists a risk-reducing value of $k$. But for any fixed $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show that a fixed-$k$ ridge estimate dominates the unbiased estimate, and hence does not contradict its admissibility.
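This resolution is easy to see numerically in the orthogonal case: fix any $k > 0$ and let $\beta^T \beta$ grow, and the ridge risk eventually exceeds the OLS risk $p\sigma^2$. A minimal sketch with illustrative values:

```python
import numpy as np

# Orthogonal design, fixed penalty k: R(k) = (p*sigma^2 + k^2*b)/(1+k)^2 with b = beta^T beta.
p, sigma, k = 5, 1.0, 1.0
ols_risk = p * sigma**2                     # risk at k = 0, independent of beta

for b in [1.0, 10.0, 100.0, 1000.0]:
    ridge_risk = (p * sigma**2 + k**2 * b) / (1.0 + k)**2
    print(f"beta'beta = {b:7.1f}: ridge risk = {ridge_risk:9.2f}, OLS risk = {ols_risk:.2f}")
```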
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly-singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and that the invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
See if you can derive it from this more general result:
If $\mathbf{y}\sim \text{N}(\mathbf{X}\mathbf\beta,\mathbf{R})$ and $\mathbf\beta \sim \text{N}(\mathbf{a},\mathbf{B})$ then the posterior is $\mathbf\beta|\mathbf{y} \sim \text{N}(\mathbf\mu, \mathbf\Sigma)$ where $$\mathbf\mu = \mathbf\Sigma\left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf{B}^{-1}\mathbf{a}\right)\quad\text{and}\quad\mathbf\Sigma = \left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X} + \mathbf{B}^{-1}\right)^{-1}$$
In order to identify the kernel of the distribution of the posterior, we will only keep track of terms involving $\mathbf\beta$.
$$\begin{align*}
p(\mathbf\beta|\mathbf{y}) &\propto p(\mathbf{y}|\mathbf\beta)p(\mathbf\beta)\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[(\mathbf{y}-\mathbf{X}\mathbf\beta)^\intercal\mathbf{R}^{-1}(\mathbf{y}-\mathbf{X}\mathbf\beta) + (\mathbf\beta-\mathbf{a})^\intercal\mathbf{B}^{-1}(\mathbf\beta-\mathbf{a})\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{y} - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}\right.\right.\\
&\qquad + \left.\vphantom{\dfrac{1}{2}}\left.\mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta + \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta + \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf{a}\right]\right\}
\end{align*}$$ dropping terms not involving $\mathbf\beta$
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta\right.\right. \\
&\qquad+ \left.\vphantom{\dfrac{1}{2}}\left.\mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta \right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\underbrace{(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}+\mathbf{B}^{-1})}_{\mathbf\Sigma^{-1}}\mathbf\beta - \mathbf{\beta}^\intercal(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}) \right.\right.\\
&\qquad - \left.\left.\vphantom{\underbrace{\dfrac{1}{2}}_{\mathbf\Sigma^{-1}}}(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a})^\intercal\mathbf\beta\right]\right\}
\end{align*}$$ Inserting the identity $\mathbf{I} = \mathbf{\Sigma}^{-1}\mathbf{\Sigma}$ in front of $\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}$, so that this vector becomes $\mathbf\Sigma^{-1}\mathbf\mu$
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{\Sigma}^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta\right]\right\}
\end{align*}$$ multiplying by $\exp\left\{-\frac{1}{2}\mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right\}$, which does not involve $\mathbf\beta$ and so only changes the constant of proportionality
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{\Sigma}^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta + \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}(\mathbf\beta - \mathbf\mu)^\intercal\mathbf\Sigma^{-1}(\mathbf\beta - \mathbf\mu)\right\}
\end{align*}$$ Therefore the posterior is normal with mean $\mathbf\mu$ and variance-covariance matrix $\mathbf\Sigma$.
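As a quick check of the result, and of the ridge connection posed above: taking $\mathbf{R} = \sigma^2\mathbf{I}$, $\mathbf{a} = \mathbf{0}$, and $\mathbf{B} = \tau^2\mathbf{I}$, the posterior mean reduces to the ridge estimate with $k = \sigma^2/\tau^2$. The sketch below verifies this numerically with arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data and prior; any choices work for checking the algebra.
n, p, sigma, tau = 50, 3, 0.5, 2.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.0, -1.0])
y = X @ beta_true + sigma * rng.normal(size=n)

R = sigma**2 * np.eye(n)                    # likelihood covariance
a = np.zeros(p)                             # prior mean
B = tau**2 * np.eye(p)                      # prior covariance

# Posterior from the general result.
Sigma = np.linalg.inv(X.T @ np.linalg.inv(R) @ X + np.linalg.inv(B))
mu = Sigma @ (X.T @ np.linalg.inv(R) @ y + np.linalg.inv(B) @ a)

# Ridge estimate with k = sigma^2 / tau^2.
k = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print(np.allclose(mu, beta_ridge))          # True: posterior mean == ridge estimate
```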
Best Answer
What the sklearn documentation describes is a regression model with an additional regularization parameter for the coefficients. The model is
$$\begin{align} y &\sim \mathcal{N}(\mu, \alpha^{-1}) \\ \mu &= X\omega \\ \omega &\sim \mathcal{N}(0, \lambda^{-1}\mathbf{I}_p) \\ \alpha &\sim \mathcal{G}(\alpha_1, \alpha_2) \\ \lambda &\sim \mathcal{G}(\lambda_1, \lambda_2) \end{align}$$
So $y$ follows a normal distribution (the likelihood) parametrized by mean $\mu = X\omega$ and variance $\alpha^{-1}$. We choose Gamma priors for the noise precision $\alpha$ and the regularizing parameter $\lambda$; these priors have hyperparameters $\alpha_1, \alpha_2, \lambda_1, \lambda_2$. The regression parameters $\omega$ have independent Gaussian priors with mean $0$ and variance $\lambda^{-1}$, so $\lambda$ serves as a regularization parameter (it is a precision parameter, so the larger $\lambda$ is, the more concentrated around zero the $\omega$ values are assumed to be a priori).
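For concreteness, here is roughly how those hyperparameters map onto scikit-learn's estimator; the data are synthetic and the Gamma hyperparameter values shown are just the library defaults, kept explicit for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic data for illustration.
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=200)

# alpha_1, alpha_2 are the Gamma hyperparameters for the noise precision alpha;
# lambda_1, lambda_2 are those for the weight precision lambda.
model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)

print("estimated noise precision alpha:", model.alpha_)    # roughly 1 / 0.5**2 = 4 here
print("estimated weight precision lambda:", model.lambda_)
print("coefficients:", model.coef_)
```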