I'm going to motivate this intuitively, and indicate how it comes about for the special case of two groups, assuming you're happy to accept the normal approximation to the binomial.
Hopefully that will be enough for you to get a good sense of why it works the way it does.
You're talking about the chi-square goodness of fit test. Let's say there are $k$ groups (you have it as $n$, but there's a reason I tend to prefer to call it $k$).
In the model being applied for this situation, the counts $O_i$, $i=1,2,...,k$ are multinomial.
Let $N=\sum_{i=1}^k O_i$. The counts are conditioned on the sum $N$ (except in some fairly rare situations), and there is some prespecified set of probabilities for each category, $p_i$, $i=1, 2, \ldots, k$, which sum to $1$.
Just as with the binomial, there's an asymptotic normal approximation for multinomials -- indeed, if you consider only the count in a given cell ("in this category" or not), it would then be binomial. Just as with the binomial, the variances of the counts (as well as their covariances in the multinomial) are functions of $N$ and the $p$'s; you don't estimate a variance separately.
That is, if the expected counts are sufficiently large, the vector of counts is approximately normal with mean $E_i=Np_i$. However, because the counts are conditioned on $N$, the distribution is degenerate (it exists in a hyperplane of dimension $k-1$, since specifying $k-1$ of the counts fixes the remaining one). The variance-covariance matrix has diagonal entries $Np_i(1-p_i)$ and off-diagonal entries $-Np_ip_j$, and it is of rank $k-1$ because of the degeneracy.
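That degeneracy is easy to see numerically. Here's a minimal sketch with NumPy (the values of $N$ and the $p_i$ are arbitrary, chosen only for illustration):

```python
import numpy as np

N = 100
p = np.array([0.2, 0.3, 0.5])   # arbitrary illustrative cell probabilities
k = len(p)

# Multinomial covariance matrix: diagonal N*p_i*(1-p_i), off-diagonal -N*p_i*p_j
Sigma = N * (np.diag(p) - np.outer(p, p))

print(Sigma[0, 0])                    # N*p_1*(1-p_1) = 16.0
print(np.linalg.matrix_rank(Sigma))   # k-1 = 2, reflecting the degeneracy
```

The rows of $\Sigma$ sum to zero (each count is determined by the others once $N$ is fixed), which is exactly why the rank is $k-1$ rather than $k$.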
As a result, for an individual cell $\text{Var}(O_i)=Np_i(1-p_i)$, and you could write $z_i = \frac{O_i-E_i}{\sqrt{E_i(1-p_i)}}$. However, the terms are dependent (negatively correlated), so if you sum the squares of those $z_i$ it won't have a $\chi^2_k$ distribution (as it would if they were independent standardized variables). Instead we could construct a set of $k-1$ variables from the original $k$ which are independent and still approximately (asymptotically) normal. If we summed their (standardized) squares, we'd get a $\chi^2_{k-1}$. There are ways to construct such a set of $k-1$ variables explicitly, but fortunately there's a very neat shortcut that avoids what amounts to a substantial amount of effort, and yields the same result (the same value of the statistic) as if we had gone to the trouble.
Consider, for simplicity, a goodness of fit with two categories (which is now binomial). The probability of being in the first cell is $p_1=p$, and in the second cell is $p_2=1-p$. There are $X = O_1$ observations in the first cell, and $N-X=O_2$ in the second cell.
The observed first cell count, $X$ is asymptotically $\text{N}(Np,Np(1-p))$. We can standardize it as $z=\frac{X-Np}{\sqrt{Np(1-p)}}$. Then $z^2 = \frac{(X-Np)^2}{Np(1-p)}$ is approximately $\sim \chi^2_1$ (asymptotically $\sim \chi^2_1$).
Notice that
$\sum_{i=1}^2 \frac{(O_i-E_i)^2}{E_i} = \frac{[X-Np]^2}{Np}+ \frac{[(N-X)-(N-Np)]^2}{N(1-p)}= \frac{[X-Np]^2}{Np}+ \frac{[X-Np]^2}{N(1-p)}=(X-Np)^2[\frac{1}{Np}+ \frac{1}{N(1-p)}]$.
But
$\frac{1}{Np}+ \frac{1}{N(1-p)} =\frac{Np+N(1-p)}{Np\cdot N(1-p)} = \frac{N}{N^2p(1-p)} = \frac{1}{Np(1-p)}$.
So $\sum_{i=1}^2 \frac{(O_i-E_i)^2}{E_i} =\frac{(X-Np)^2}{Np(1-p)}$, which is the $z^2$ we started with, and which is asymptotically a $\chi^2_1$ random variable. The dependence between the two cells is such that by dividing by $E_i$ instead of $E_i(1-p_i)$ we exactly compensate for the dependence between the two, and recover the original square-of-an-approximately-normal random variable.
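The identity is easy to verify numerically. A minimal sketch (the values of $N$, $p$, and $X$ below are arbitrary illustrative choices):

```python
import numpy as np

N, p, X = 50, 0.3, 19            # arbitrary illustrative values
E1, E2 = N * p, N * (1 - p)      # expected counts in the two cells

z2 = (X - N * p) ** 2 / (N * p * (1 - p))                # squared z-score
pearson = (X - E1) ** 2 / E1 + ((N - X) - E2) ** 2 / E2  # sum of (O-E)^2/E

print(z2, pearson)   # identical, as the algebra shows
```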
The same kind of sum-dependence is taken care of by the same approach when there are more than two categories -- by summing the $\frac{(O_i-E_i)^2}{E_i}$ instead of $\frac{(O_i-E_i)^2}{E_i(1-p_i)}$ over all $k$ terms, you exactly compensate for the effect of the dependence, and obtain a sum equivalent to a sum of $k-1$ independent normals.
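A small simulation illustrates this for $k>2$ (a sketch with NumPy; the sample size, probabilities, and replication count are all arbitrary). The mean of the simulated Pearson statistics should be close to $k-1$, the mean of a $\chi^2_{k-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, np.array([0.1, 0.2, 0.3, 0.4])  # arbitrary illustrative values
E = N * p                                   # expected counts, k = 4 cells

O = rng.multinomial(N, p, size=100_000)     # simulated multinomial count vectors
stat = ((O - E) ** 2 / E).sum(axis=1)       # Pearson statistic for each sample

print(stat.mean())  # close to k-1 = 3, the mean of a chi-square with 3 df
```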
There are a variety of ways to show that the statistic has a distribution that is asymptotically $\chi^2_{k-1}$ for larger $k$ (it's covered in some undergraduate statistics courses, and can be found in a number of undergraduate-level texts), but I don't want to lead you too far beyond the level your question suggests. Indeed, derivations are easy to find in notes on the internet; for example, there are two different derivations in the space of about two pages here
The Pearson contingency coefficient is intended to provide a measure of the dependence between data columns (which sometimes, when discussing regression, we call collinearity). $C$ takes values in $[0,1)$, where values near $0$ indicate column independence and values far from $0$ indicate dependence. But how far is far? Can we achieve a tighter upper bound? That's the general idea.
Red part (a): notice the term "degrees of dependence". If there are zero of them, columns are independent.
Red part (c): as the upper bound for $C$ was found, dividing by this bound (equivalently, multiplying by its reciprocal) gives $C_{corr}=\sqrt{\frac{k}{k-1}}C$, which satisfies $0\le C_{corr} \le 1$.
Red part (b): Let $k=\min\{r,s\}$ and assume the extreme condition described (e.g. $(x,y)$ is a pair of a letter and a digit; although there are 10 digits and 26 letters, the only pairs which appear are $(a,0), (s,1), (h,2), (i,3), (r,4), (g,5), (u,6), (n,7), (y,8), (t,9)$). Each of the $k=10$ possible pairs has a relative frequency $f_{xy}$, as do the individual components (i.e. each letter and digit has frequencies $f_x, f_y$).
For letters outside the $k$ used above, $f_x=0$ and obviously $f_{xy}=0$ so they're not summed. Our sum is now:
$$\chi^2=n\sum_{i=1}^k\sum_{j=1}^k\frac{(f_{xy}(v_i, w_j)-f_x(v_i)f_y(w_j))^2}{f_x\left(v_i\right)f_y(w_j)}\\=n\sum_{i=1}^k\sum_{j=1}^k\left[\frac{f^2_{xy}(v_i,w_j)}{f_x(v_i)f_y(w_j)}\right]-2n\sum_{i=1}^k\sum_{j=1}^k\left[f_{xy}(v_i,w_j)\right]+n\sum_{i=1}^k\sum_{j=1}^k\left[f_x(v_i)f_y(w_j)\right]$$
The rightmost sum equals $1$, as does the middle sum (by the second probability axiom), so we only need to evaluate
$$\chi^2=n\sum_{i=1}^k\sum_{j=1}^k\left[\frac{f^2_{xy}(v_i,w_j)}{f_x(v_i)f_y(w_j)}\right]-2n+n=n\sum_{i=1}^k\sum_{j=1}^k\left[\frac{f^2_{xy}(v_i,w_j)}{f_x(v_i)f_y(w_j)}\right]-n$$
If the pair $(v_i,w_j)$ does not appear on our list, then $f_{xy}(v_i,w_j)=0$ and the term contributes nothing to the sum. The pair matching is one-to-one (a digit always appears with the same letter and vice versa), so now suppose a pair $(v_i,w_j)$ does exist; say it appears $a$ times (out of $n$), so its relative pair frequency is $f_{xy}(v_i,w_j)=\frac{a}{n}$. However, these are also the only $a$ occurrences of $v_i$ and of $w_j$, so $f_x(v_i)=f_y(w_j)=f_{xy}(v_i,w_j)=\frac{a}{n}$. This means that when a pair does exist, the fraction equals $1$. We have $k$ such cases, so overall
$$\chi^2=nk-n=n(k-1)$$
Now getting the boundary for $C$ is easy:
$$C=\sqrt{\frac{\chi^2}{\chi^2+n}}=\sqrt{\frac{n(k-1)}{n(k-1)+n}}=\sqrt{\frac{k-1}{k}} \qquad\blacksquare.$$
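This can be checked numerically. Here's a sketch with NumPy using an arbitrary $k\times k$ table in which each row value occurs with exactly one column value (the cell counts are made up):

```python
import numpy as np

k = 10
counts = np.diag(np.arange(1.0, k + 1))   # made-up one-to-one pairing table
n = counts.sum()

# Expected counts under independence: row total * column total / n
E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
chi2 = ((counts - E) ** 2 / E).sum()

C = np.sqrt(chi2 / (chi2 + n))
print(chi2, n * (k - 1))            # chi^2 equals n(k-1)
print(C, np.sqrt((k - 1) / k))      # C attains the bound sqrt((k-1)/k)
print(np.sqrt(k / (k - 1)) * C)     # corrected coefficient equals 1
```

Any one-to-one table gives the same result regardless of the particular cell counts, exactly as the derivation above shows.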
This is more directly and more accurately handled through the use of a binary logistic regression model with an interaction term. The usually-best test is the likelihood ratio $\chi^2$ test from such a model. The regression context also allows one to test continuous variables, adjust for other variables, and a host of other extensions.
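As a sketch of what that looks like in practice (everything here, data, effect sizes, and sample size, is made up for illustration; the model is fit by direct likelihood maximization with NumPy/SciPy rather than a dedicated GLM routine):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 400
g1 = rng.integers(0, 2, n).astype(float)   # made-up binary factor 1
g2 = rng.integers(0, 2, n).astype(float)   # made-up binary factor 2
eta_true = -0.5 + 0.8 * g1 + 0.4 * g2      # simulated truth: no interaction
y = (rng.random(n) < 1 / (1 + np.exp(-eta_true))).astype(float)

def nll(beta, X):
    """Negative log-likelihood of a binary logistic model."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

X_full = np.column_stack([np.ones(n), g1, g2, g1 * g2])  # with interaction
X_red = X_full[:, :3]                                    # without interaction

fit_full = minimize(nll, np.zeros(4), args=(X_full,))
fit_red = minimize(nll, np.zeros(3), args=(X_red,))

lr_stat = 2 * (fit_red.fun - fit_full.fun)   # likelihood-ratio chi-square
p_value = chi2.sf(lr_stat, df=1)             # 1 df: the single interaction term
print(lr_stat, p_value)
```

The likelihood-ratio statistic is twice the drop in log-likelihood between the nested models, referred to a $\chi^2$ with degrees of freedom equal to the number of parameters tested (here, one interaction coefficient).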
General comment: I think we spend too much time teaching special cases and would do well to use general tools so that we have more time to deal with complications such as missing data, high dimensionality, etc.