More data is usually better. Overfitting is essentially learning spurious correlations that occur in your training data but not in the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid ($n=4$, even!) within the small group of people working on my floor, but it obviously doesn't hold in general. Increasing the size of your data set (e.g., to the entire building or city) should wash out these spurious correlations and improve the performance of your learner.
That said, one situation where more data does not help---and may even hurt---is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system. I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model in general, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
A surprising situation, called **double descent**, also occurs when the size of the training set is close to the number of model parameters. In these cases, the test risk first decreases as the size of the training set increases, transiently *increases* when a bit more training data is added, and finally begins decreasing again as the training set continues to grow. This phenomenon was reported 25 years ago in the neural network literature (see Opper, 1995), but occurs in modern networks too ([Advani and Saxe, 2017][1]). Interestingly, this happens even for linear regression, albeit one fit by SGD ([Nakkiran, 2019][2]). The phenomenon is not yet fully understood and is largely of theoretical interest: I certainly wouldn't use it as a reason not to collect more data (though I might fiddle with the training set size if n==p and the performance were unexpectedly bad).
[*] A language model is just the probability of seeing a given sequence of words, e.g., $P(w_n = \textrm{'quick'}, w_{n+1} = \textrm{'brown'}, w_{n+2} = \textrm{'fox'})$. They're vital to building halfway decent speech/character recognizers.
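A minimal sketch of the idea, using a bigram model estimated from a toy corpus (the corpus and the helper name `p_next` are my own illustration; a real system would train on millions of sentences and use smoothing):

```python
from collections import Counter

# hypothetical toy corpus; real language models need vastly more data
corpus = "the quick brown fox jumps over the lazy dog the quick brown cat".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(w_n = word | w_{n-1} = prev), maximum likelihood, no smoothing."""
    return bigrams[(prev, word)] / unigrams[prev]

# chain rule approximation: P(quick, brown | the) = P(quick|the) * P(brown|quick)
print(p_next("quick", "the") * p_next("brown", "quick"))
```

Sequences the model has seen often get high probability, which is exactly what lets a recognizer prefer "quick brown fox" over an acoustically similar but nonsensical word string.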
There has been much debate, confusion, and contradiction on this topic, both on stats.stackexchange and in the scientific literature.
A useful reference is the 2004 paper by Bengio & Grandvalet, which shows that the variance of the cross-validation estimator is a linear combination of three moments:
$$\operatorname{Var} = \frac{1}{n^2} \sum_{i,j} \operatorname{Cov}(e_i, e_j) = \frac{1}{n}\sigma^2 + \frac{m-1}{n}\omega + \frac{n-m}{n}\gamma$$
where each term is a particular component of the $n \times n$ covariance matrix $\Sigma$ of cross-validation errors $\mathbf{e} = (e_1, \ldots, e_n)^T$, and $m = n/K$ is the number of observations in each test fold.
As @Amoeba points out in a comment above, this variance is not a straightforward function of $K$. Each data point $x_i$ contributes an error term $e_i$, and these are averaged into the MSE estimate; varying $K$ has no direct, algebraically simple effect on the variance of the CV estimator.
$k$-fold CV with any value of $k$ produces an error for each of the $n$ observations, so the MSE estimate always has denominator $n$. This denominator does not change between LOOCV and, e.g., 10-fold CV. This is your main confusion here.
Now, there is a lot more subtlety in this variance equation than meets the eye. In particular, the terms $\omega$ and $\gamma$ are influenced by correlations between the data sets, training sets, and test sets, and by the instability of the model. Both effects depend on the value of $K$, which explains why different datasets and models lead to different behaviours.
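You can check the decomposition numerically. The sketch below (my own construction, not code from the paper) estimates the covariance matrix of CV errors over many replicate datasets for a one-parameter regression, reads off $\sigma^2$ (diagonal), $\omega$ (same-fold off-diagonal), and $\gamma$ (cross-fold) as the averages of the corresponding entries, and verifies that they recombine into the variance of the CV estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 4                    # n observations, K folds
m = n // K                      # test-fold size
R = 300                         # replicate datasets

errors = np.empty((R, n))       # squared CV error of each observation, per replicate
for r in range(R):
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(scale=0.5, size=n)
    for k in range(K):
        test = slice(k * m, (k + 1) * m)
        train = np.ones(n, bool)
        train[test] = False
        beta = np.sum(x[train] * y[train]) / np.sum(x[train] ** 2)
        errors[r, test] = (y[test] - beta * x[test]) ** 2

Sigma = np.cov(errors, rowvar=False)        # n x n covariance of CV errors
fold_id = np.repeat(np.arange(K), m)
same_fold = (fold_id[:, None] == fold_id[None, :]) & ~np.eye(n, dtype=bool)
cross_fold = fold_id[:, None] != fold_id[None, :]

sigma2 = Sigma.diagonal().mean()            # within-observation variance
omega = Sigma[same_fold].mean()             # same-fold covariance
gamma = Sigma[cross_fold].mean()            # cross-fold covariance

lhs = Sigma.sum() / n**2                    # Var of the CV estimator
rhs = sigma2 / n + (m - 1) / n * omega + (n - m) / n * gamma
print(lhs, rhs)                             # the two sides agree
```

With these averaged definitions of $\sigma^2$, $\omega$, and $\gamma$, the identity holds exactly: the diagonal has $n$ entries, the same-fold off-diagonal $n(m-1)$, and the cross-fold block $n(n-m)$, which is precisely the weighting in the formula.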
You will need to read through the extensive (and technical) literature to really grasp the subtlety and special cases.
You mean a model with prediction errors due to high bias?
Bias is defined as $\operatorname{Bias}[\hat{f}(x)]=\mathrm{E}[\hat{f}(x)]-f(x)$ and is therefore not affected by increasing the training set size. If your model predicts vastly different values when the training set changes, i.e., if the error is largely driven by the variance of the predictions, then you can improve the overall loss with more training data: the model will generalize better, and the variance term will go down. To decrease the bias term, you probably need to choose a different model.
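A small simulation makes the distinction concrete. Below (a sketch of my own, with an arbitrary quadratic truth and a misspecified linear model) the spread of predictions across resampled training sets shrinks as $n$ grows, while the gap between the average prediction and the truth---the bias---stays put:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return x ** 2               # true signal is quadratic

def fit_line(n):
    """Least-squares line: a misspecified, high-bias model."""
    x = rng.uniform(-1, 1, n)
    y = f(x) + 0.1 * rng.normal(size=n)
    return np.polyfit(x, y, deg=1)

x0 = 0.8                        # evaluate predictions at a single point
preds = {n: [np.polyval(fit_line(n), x0) for _ in range(200)] for n in (20, 2000)}

for n, p in preds.items():
    print(n, round(np.mean(p) - f(x0), 2), round(np.var(p), 4))
# variance of the predictions shrinks with n; the bias does not
```

Throwing data at a high-bias model buys you a very precise estimate of the wrong function, which is why switching model class is the right fix.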