It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector, derived from the total derivative, but it depends on a metric. That's why it looks a bit funny in different coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course,
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
I.e., you start at the point $x$, walk a teensy bit in the direction of $\vec v$, and take note of the ratio $\Delta f/\Delta t$.
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
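To make this concrete, here is a small numerical sketch (the function $f(x,y,z) = x^2 + 3y + z$ is my own assumed example, not from the question): the finite-difference quotient in the definition converges to the value of the linear map $df$ applied to $\vec v$.

```python
import math

# Assumed example: f(x, y, z) = x^2 + 3y + z, so
# df = 2x dx + 3 dy + dz, hence df_p(v) = 2*p_x*v_x + 3*v_y + v_z.
def f(x, y, z):
    return x**2 + 3*y + z

def df(p, v):
    """Total derivative of f at the point p, applied to the vector v."""
    x, y, z = p
    return 2*x*v[0] + 3*v[1] + v[2]

def directional_derivative(f, p, v, t=1e-6):
    """Finite-difference approximation of lim_{t->0} (f(p + t*v) - f(p)) / t."""
    shifted = [pi + t*vi for pi, vi in zip(p, v)]
    return (f(*shifted) - f(*p)) / t

p, v = (1.0, 2.0, 3.0), (1.0, 1.0, 0.0)
print(df(p, v))                          # analytic value: 5.0
print(directional_derivative(f, p, v))   # ≈ 5.0
```

Note that `df` is linear in `v`, as a total derivative must be: it eats a vector and spits out a number.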
In contrast, the gradient answers the following question: what is the direction of steepest ascent of the function? That is, which vector $\vec v$ of unit length maximizes $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle\cdot,\cdot\rangle$ is used to convert the covector $df$ into the vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than Cartesian ones. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
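Here is a minimal sketch of that dependence, using an assumed example $f(x,y,z) = x+y+z$: solving $\langle \nabla f, \vec v\rangle = df(\vec v)$ for all $\vec v$ with the modified scalar product above divides the $z$-component of the partial derivatives by $4$, so the direction of steepest ascent tilts away from the Euclidean one.

```python
# Partial derivatives of the assumed example f(x, y, z) = x + y + z:
partials = (1.0, 1.0, 1.0)

# With the Euclidean metric diag(1, 1, 1), grad f has the same components as df.
grad_euclidean = partials

# With the modified metric diag(1, 1, 4), solving <grad f, v> = df(v)
# requires dividing the z-component by 4 (the inverse metric entry 1/4).
grad_modified = (partials[0], partials[1], partials[2] / 4)

print(grad_euclidean)  # (1.0, 1.0, 1.0)
print(grad_modified)   # (1.0, 1.0, 0.25) -- a different direction of steepest ascent
```

Check: $\langle \nabla f, \vec v\rangle = 1\cdot v_x + 1\cdot v_y + 4\cdot\tfrac14\, v_z = df(\vec v)$, as required.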
Your calculation is almost right, up to the point where you made the huge mistake of thinking that
\begin{align}
\dfrac{\partial}{\partial \theta} = \mathbf{e}_{\theta}
\end{align}
This is completely wrong, because, by definition, the vector on the RHS is the normalized version of the one on the left.
Let's go through it step by step (even though you got it right for the most part). By definition we have
\begin{align}
\text{grad}(f) := g^{\sharp}(df)
\end{align}
And if we work in a chart $(U,x)$, then
\begin{align}
\text{grad}(f) &:= g^{\sharp}(df) \\
&= g^{\sharp}\left( \dfrac{\partial f}{\partial x^i}\, dx^i\right) \\
&= \dfrac{\partial f}{\partial x^i} \cdot g^{\sharp}\left(dx^i\right) \\
&= \dfrac{\partial f}{\partial x^i} \cdot g^{ij}\dfrac{\partial }{\partial x^j} \tag{$*$}
\end{align}
where I use the notation $g_{ij} := g\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right)$, and $[g^{ij}]$ denotes the inverse matrix of $[g_{ij}]$. For polar coordinates $(r,\theta)$ in the plane (more precisely, on a certain open subset of $\Bbb{R}^2$), we have
\begin{align}
[g_{ij}] =
\begin{pmatrix}
g_{rr} & g_{r\theta}\\
g_{\theta r} & g_{\theta \theta}
\end{pmatrix} =
\begin{pmatrix}
1 & 0 \\
0 & r^2
\end{pmatrix}
\end{align}
where, for convenience, rather than writing $g_{11}, g_{12}$, etc., I used the notation $g_{rr}, g_{r\theta}$. Now, the inverse matrix is easily calculated because it is diagonal:
\begin{align}
[g^{ij}] =
\begin{pmatrix}
g^{rr} & g^{r\theta}\\
g^{\theta r} & g^{\theta \theta}
\end{pmatrix} =
\begin{pmatrix}
1 & 0 \\
0 & 1/r^2
\end{pmatrix}
\end{align}
Now, what you have to do is use the formula $(*)$ exactly as written. If we apply it directly then we find
\begin{align}
\text{grad}(f) &= g^{rr}\dfrac{\partial f}{\partial r}\dfrac{\partial }{\partial r} + g^{\theta \theta}\dfrac{\partial f}{\partial \theta}\dfrac{\partial }{\partial \theta} \\
&= \dfrac{\partial f}{\partial r}\dfrac{\partial }{\partial r} + \dfrac{1}{r^2}\dfrac{\partial f}{\partial \theta}\dfrac{\partial }{\partial \theta} \tag{$**$}
\end{align}
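As a hedged numerical check of $(**)$ (with an assumed example $f = r\cos\theta$, i.e. $f = x$, whose Cartesian gradient is $(1,0)$ everywhere), we can compute the components from $(**)$ and push them forward to Cartesian coordinates:

```python
import math

# Assumed example: f = r*cos(theta), i.e. f = x, so grad(f) = (1, 0) in Cartesian.
r, theta = 2.0, 0.7

df_dr = math.cos(theta)           # ∂f/∂r
df_dtheta = -r * math.sin(theta)  # ∂f/∂θ

# Components of grad(f) in the basis (∂/∂r, ∂/∂θ), per (**):
grad_r = df_dr
grad_theta = df_dtheta / r**2     # the factor g^{θθ} = 1/r²

# Push forward to Cartesian: ∂/∂r = (cosθ, sinθ), ∂/∂θ = (-r sinθ, r cosθ).
gx = grad_r * math.cos(theta) + grad_theta * (-r * math.sin(theta))
gy = grad_r * math.sin(theta) + grad_theta * ( r * math.cos(theta))

print(round(gx, 12), round(gy, 12))  # 1.0 0.0
```

Without the $1/r^2$ the pushed-forward vector would not come out to $(1,0)$, which is one way to see why $(**)$ is the correct formula.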
This formula is $100\%$ correct, and it DOES NOT contradict what you may have seen in standard vector analysis texts. To get the "usual" formula, we have to see how $\mathbf{e}_r, \frac{\partial}{\partial r}, \mathbf{e}_{\theta}, \frac{\partial}{\partial \theta}$ are related to each other. By definition, the $\mathbf{e}$'s are the normalized versions, which means
\begin{align}
\mathbf{e}_r &:= \dfrac{\frac{\partial}{\partial r}}{\lVert \frac{\partial}{\partial r}\rVert} \quad \text{and} \quad \mathbf{e}_{\theta} := \dfrac{\frac{\partial}{\partial \theta}}{\lVert \frac{\partial}{\partial \theta}\rVert}
\end{align}
So, what is the norm of a vector? By definition, it is the square root of the inner product of the vector with itself; i.e. $\lVert v\rVert := \sqrt{\langle v,v \rangle} = \sqrt{g(v,v)}$, where the last equality is simply a notational change (recall that the metric tensor $g$ is precisely an inner product on each tangent space $T_pM$ of your manifold... which in this case is $M = \Bbb{R}^2$). So, we have
\begin{align}
\begin{cases}
\mathbf{e}_r &:= \dfrac{\frac{\partial}{\partial r}}{\sqrt{g\left( \frac{\partial}{\partial r},\frac{\partial}{\partial r}\right)}} = \dfrac{1}{\sqrt{g_{rr}}}\dfrac{\partial}{\partial r} = \dfrac{\partial}{\partial r}\\
\mathbf{e}_{\theta} &:= \dfrac{\frac{\partial}{\partial \theta}}{\sqrt{g\left(\frac{\partial}{\partial \theta},\frac{\partial}{\partial \theta}\right)}} = \dfrac{1}{\sqrt{g_{\theta\theta}}}\dfrac{\partial}{\partial \theta} = \dfrac{1}{r}\dfrac{\partial}{\partial \theta}
\end{cases}
\end{align}
If you now make these substitutions into $(**)$, you find exactly that
\begin{align}
\text{grad}(f) &= \dfrac{\partial f}{\partial r} \mathbf{e}_r + \dfrac{1}{r}\dfrac{\partial f}{\partial \theta} \mathbf{e}_{\theta} \tag{$***$}
\end{align}
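A quick sanity check of $(***)$, again with the assumed example $f = r\cos\theta$ (i.e. $f = x$): since $(\mathbf{e}_r, \mathbf{e}_\theta)$ is an orthonormal basis, the Euclidean norm of the component pair $\left(\frac{\partial f}{\partial r}, \frac{1}{r}\frac{\partial f}{\partial \theta}\right)$ must equal $\lVert\text{grad}(f)\rVert = \lVert(1,0)\rVert = 1$.

```python
import math

# Assumed example f = r*cos(theta); its gradient has norm 1 everywhere.
r, theta = 2.0, 0.7
comp_er = math.cos(theta)                  # ∂f/∂r
comp_etheta = (-r * math.sin(theta)) / r   # (1/r) ∂f/∂θ

# Since (e_r, e_theta) is orthonormal, the norm is the usual Euclidean one.
norm = math.hypot(comp_er, comp_etheta)
print(round(norm, 12))  # 1.0
```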
By the way, when you asked why the norm of $\frac{\partial}{\partial \theta}$ is $r$, it's not clear to me whether your confusion is about why $[g_{ij}] = \text{diag}(1,r^2)$, or simply about the relationship between the norm and the inner product (i.e. the metric tensor field). If you need more clarification, let me know.
Finally, on a more general note, let's go back to $n$ dimensions. We once again define $\mathbf{e}_j$ to be the normalized vector corresponding to $\frac{\partial}{\partial x^j}$, i.e.
\begin{align}
\mathbf{e}_j &= \dfrac{1}{\sqrt{g\left(\frac{\partial}{\partial x^j}, \frac{\partial}{\partial x^j}\right)}} \frac{\partial}{\partial x^j} = \dfrac{1}{\sqrt{g_{jj}}}\frac{\partial}{\partial x^j}
\end{align}
If we now plug this into $(*)$, then we see that the gradient vector field, written in terms of the normalized coordinate vector fields (i.e. the $\mathbf{e}_j$'s), is
\begin{align}
\text{grad}(f) &= \sum_{i,j=1}^n g^{ij}\sqrt{g_{jj}}\dfrac{\partial f}{\partial x^i} \, \mathbf{e}_j\tag{$*'$}
\end{align}
This formula is entirely equivalent to $(*)$. Now let's specialize slightly, just for fun. Suppose the coordinate vector fields are orthogonal (i.e. $g_{ij} = 0$ if $i\neq j$). Then the inverse matrix $[g^{ij}]$ is easily calculated to be $\text{diag}(1/g_{11}, \dots, 1/g_{nn})$, and in this special case the gradient reduces to:
\begin{align}
\text{grad}(f) &= \sum_{i=1}^n \dfrac{1}{\sqrt{g_{ii}}}\dfrac{\partial f}{\partial x^i} \, \mathbf{e}_i
\end{align}
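As a hedged illustration of the orthogonal case beyond polar coordinates, take spherical coordinates $(r,\theta,\varphi)$ on (an open subset of) $\Bbb{R}^3$, where $[g_{ij}] = \text{diag}(1, r^2, r^2\sin^2\theta)$; the scale factors $1/\sqrt{g_{ii}}$ then reproduce the textbook spherical gradient coefficients $\left(1, \frac{1}{r}, \frac{1}{r\sin\theta}\right)$.

```python
import math

# Spherical coordinates (r, theta, phi): diagonal metric diag(1, r², r² sin²θ).
def scale_factors(r, theta):
    """Return the coefficients 1/sqrt(g_ii) appearing in the orthogonal
    gradient formula, for the spherical metric."""
    g = (1.0, r**2, (r * math.sin(theta))**2)
    return tuple(1.0 / math.sqrt(gii) for gii in g)

print(scale_factors(2.0, math.pi / 2))  # (1.0, 0.5, 0.5), i.e. (1, 1/r, 1/(r sinθ))
```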
Now, once again, as a sanity check, try applying this to the polar coordinate case, and you should recover $(***)$.
The answer to the question in the title is "yes," and here's why.
As the Wikipedia article on the musical isomorphism explains, the so-called "musical isomorphism" is equivalent to raising and lowering indices (sometimes called "index juggling") in component form. The process of raising and lowering indices is allowed due to the metric ($g_{ij}$) and its inverse ($g^{ij}$) imposed on the manifold. Specifically, a vector $\vec X=X^i\vec e_i$ is related to the $1$-form $\tilde X=X_i\tilde\omega^i$ by the equation $X^i=g^{ij}X_j$ under the standard coordinate bases $\vec e_i=\frac{\partial}{\partial q^i}$ and $\tilde\omega^i={\bf d} q^i$ for some set of coordinates $q^i$ if the usual $1$-form and metric requirements are imposed (viz. ${\bf d}q^i(\frac{\partial}{\partial q^j})=\delta^{i}_{\ \ j}$ and $g^{ik}g_{kj}=\delta^i_{\ \ j}$).
Using this relationship, ${\bf d}f=\frac{\partial f}{\partial q^i}{\bf d}q^i$ is related to $\vec\nabla f=(\partial^if)\frac{\partial}{\partial q^i}=(g^{ij}\frac{\partial f}{\partial q^j})\frac{\partial}{\partial q^i} $ through an obvious application of index juggling (which, remember, is the same as the musical isomorphism).
And you are correct that the raised index (which is hiding that inverse metric tensor) is important in other coordinate systems. Look at the definition of the gradient in other coordinates and you will notice that additional factors are present that resemble the elements of the inverse metric. If you look carefully, they appear (at least for cylindrical and spherical coordinates) to be the square root of those elements. The reason for this is that the definitions given on that Wikipedia page use normalized unit vectors instead of the standard basis of partial derivatives. The normalization factor is exactly the square root of those terms, so the unit vectors are actually hiding a portion of the inverse metric's contribution to the final answer.
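To see that numerically, here is a small sketch (with made-up stand-in values for the components of ${\bf d}f$, not from any particular $f$): raising the index with the inverse polar metric $g^{ij} = \text{diag}(1, 1/r^2)$ multiplies the $\theta$-component by $1/r^2$, and normalizing $\frac{\partial}{\partial\theta}$ to the unit vector $\mathbf{e}_\theta = \frac{1}{r}\frac{\partial}{\partial\theta}$ absorbs one factor of $1/r$, leaving the familiar single $1/r$.

```python
# Stand-in values for (∂f/∂r, ∂f/∂θ); any numbers would do.
r = 2.0
df_components = (5.0, 6.0)

# Diagonal of the inverse polar metric [g^{ij}] = diag(1, 1/r²).
g_inv = (1.0, 1.0 / r**2)

# Index juggling: X^i = g^{ij} X_j (diagonal case, so just componentwise).
raised = tuple(gii * c for gii, c in zip(g_inv, df_components))
print(raised)            # components of grad f in the (∂/∂r, ∂/∂θ) basis

# Component along the unit vector e_θ = (1/r) ∂/∂θ: one factor of r moves
# into the basis vector, leaving ∂f/∂θ divided by r rather than r².
print(raised[1] * r)     # 6.0 / r
```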