Yeah, this is a little intricate and it pays to look at the basic picture very clearly, from first principles.
What is the mathematics we're using here?
So in the real mathematical world, the symbols do not mean anything intrinsically and some function $f$ just maps a tuple of $N$ real numbers, called arguments, to some other real number. The notion of partial derivative would be a function $f_{(i)}$ which would be "the derivative of $f$ with respect to its $i^\text{th}$ argument holding all of its other arguments constant,"$$f_{(i)}\big(a_1,~\dots~ a_N\big)=\lim_{h\to 0}\frac{f\big(a_1,~\dots~ a_{i-1},~ a_{i} + h,~ a_{i+1},~ \dots~ a_N\big) - f\big(a_1,~ \dots~ a_N\big)}h.$$These are partial derivatives. To find a total derivative we need to impose an extra structure that we call a path, which consists of an interval of real numbers $R$ over which some parameter varies to tell you how far you've gone along the path, and then $N$ functions which we will call $a^i(r)$ with upper indexes, which map these abstract path positions in $R$ to real numbers that we can feed to the function. So on this path $P$ we can then say that $f$ becomes $F$ (or you could call it $f_P$ or whatever),$$F(r) = f\big(a^1(r),~a^2(r),~\dots~a^N(r)\big).$$
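To make the distinction concrete, here is a small numerical sketch in plain Python (the particular $f$ and path functions are made up for illustration): a partial derivative nudges one argument slot of $f$, while $F$ is $f$ pinned to a path.

```python
import math

# A concrete f with N = 2 arguments: f(a1, a2) = a1^2 * a2.
def f(a1, a2):
    return a1**2 * a2

# Partial derivative f_(i): nudge only the i-th argument slot.
def partial(f, i, args, h=1e-6):
    bumped = list(args)
    bumped[i] += h
    return (f(*bumped) - f(*args)) / h

# A path: functions a^1(r), a^2(r) feeding real numbers into f.
def a1(r): return math.cos(r)
def a2(r): return math.sin(r)

# F is f restricted to the path.
def F(r):
    return f(a1(r), a2(r))

print(partial(f, 0, (1.0, 2.0)))  # ~ 4, since f_(1) = 2*a1*a2
print(F(0.0))                     # f(1, 0) = 0
```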
What are its fundamental assumptions here?
Now let's review the fundamental assumption of single-variable calculus: it is that if you "zoom in" enough on "the right sort of" curve $y = f(x)$, you will find that it becomes a straight line. So if you zoom in enough at one point of a parabola it looks like a straight line, and the same goes for a circle. The "wrong sorts" of curve include the boundary of the Mandelbrot set, the Koch curve, or even the $x = 0$ "kink" point of $y = |x|$. The term for "the right sort of" curve is differentiable. Technically the straight line also needs to not be 100% vertical.
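You can see the "right sort" versus "wrong sort" distinction numerically (a toy check in plain Python): secant slopes of a parabola settle down to a single number as you zoom in, but at the kink of $|x|$ the slope depends on which side you approach from.

```python
# Secant slope of f over a window of width h starting at x.
def slope(f, x, h):
    return (f(x + h) - f(x)) / h

parabola = lambda x: x * x
for h in (1e-1, 1e-3, 1e-6):
    print(slope(parabola, 1.0, h))   # settles toward 2.0 as we zoom in

kink = abs
print(slope(kink, 0.0, 1e-6))    # +1.0 approaching from the right
print(slope(kink, 0.0, -1e-6))   # -1.0 from the left: no single tangent line
```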
Now here is the fundamental assumption of multivariable calculus: it is that when you zoom in enough on $y = f(a_1,~\dots~a_N)$, the graph becomes a hyperplane in the $N+1$-dimensional space of points $(y,~ a_1,~ \dots,~ a_N)$, and again it needs to not be "vertical" in a sense. Now a hyperplane looks like $$f(a_1,\dots a_N) = c + m_1a_1 + m_2a_2 + \dots + m_Na_N,$$ and you can see that these slopes $m_i$ are just those partial derivatives $f_{(i)}$ that we saw before. So it should not surprise you that this assumption then implies that for very small displacements $\delta a_i$ we will see $$f(a_1 + \delta a_1,~\dots~ a_N+\delta a_N)\approx f(a_1,\dots a_N) + f_{(1)}(a_1,\dots a_N)~\delta a_1 + \dots + f_{(N)}(a_1,\dots a_N)~\delta a_N.$$
How might I write this nicer?
In a more compact notation we would combine the partial derivatives into the gradient, a vector field $\nabla f(\mathbf a) = \big(f_{(1)}(\mathbf a),~\dots~ f_{(N)}(\mathbf a)\big)$, and then write this as $$f(\mathbf a + \delta \mathbf a) \approx f(\mathbf a) + \nabla f(\mathbf a)\cdot \delta\mathbf a.$$ However you write it, the point is the same: this fundamental assumption, when it holds, allows you to treat little differences as falling on the tangent hyperplane, just as in single-variable calculus little differences fell on the tangent line. The hyperplane just adds up all of the little displacements, each multiplied by the slope in its direction.
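As a sanity check of this linear approximation, here is a sketch in plain Python with an arbitrary made-up $f$: the gap between the true value $f(\mathbf a + \delta\mathbf a)$ and the tangent-hyperplane prediction is second order in $\delta\mathbf a$.

```python
import math

# An arbitrary smooth f of two variables.
def f(a1, a2):
    return math.exp(a1) * math.sin(a2)

# Finite-difference gradient: the vector of partial slopes m_i.
def grad(f, args, h=1e-6):
    slopes = []
    for i in range(len(args)):
        bumped = list(args)
        bumped[i] += h
        slopes.append((f(*bumped) - f(*args)) / h)
    return slopes

a  = (0.3, 0.7)
da = (1e-3, -2e-3)

exact  = f(a[0] + da[0], a[1] + da[1])
linear = f(*a) + sum(m * d for m, d in zip(grad(f, a), da))
print(abs(exact - linear))   # tiny: the error is second order in the displacement
```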
Now put these two together: when you evaluate a multivariable function on a path you have $\delta \mathbf a$ given by some $\frac{d a^i}{dr}~\delta r$ for the path parameter $r$. You therefore have $$F(r + \delta r) \approx F(r) + \sum_{i=1}^N f_{(i)}\big(a^1(r),\dots a^N(r)\big) ~ \frac{da^i}{dr}~\delta r.$$
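Here is a quick numerical check of this formula (plain Python, with the same sort of made-up $f$ and path as before): differentiating $F$ directly agrees with summing partial slopes times path velocities.

```python
import math

def f(a1, a2): return a1**2 * a2              # the multivariable function
def a(r): return (math.cos(r), math.sin(r))   # the path r -> (a^1(r), a^2(r))
def F(r): return f(*a(r))                     # f restricted to the path

def partial(f, i, args, h=1e-6):              # f_(i) by finite differences
    bumped = list(args)
    bumped[i] += h
    return (f(*bumped) - f(*args)) / h

def d(g, r, h=1e-6):                          # ordinary derivative dg/dr
    return (g(r + h) - g(r)) / h

r = 0.5
lhs  = d(F, r)                                # dF/dr computed directly
args = a(r)
vel  = [d(lambda s, i=i: a(s)[i], r) for i in range(2)]   # the da^i/dr
rhs  = sum(partial(f, i, args) * vel[i] for i in range(2))
print(lhs, rhs)   # the two agree to finite-difference accuracy
```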
How does that mathematics apply to the physics here?
Note that there is absolutely no physics in any of this, this is just how the mathematics works. Physics uses math to understand the world, but before we do that we have to understand the math: and this is how the math of multivariable calculus works.
So bringing the physics into it, the convenient parameter to use for a path is the time coordinate $t$, so that becomes our $r$. When you are dealing with Lagrangians you need to be especially careful. The claim is that of all paths $\mathbf x_P(t)$ a particle could take between two fixed endpoints from time $0$ to time $T$, the path it does take is singled out by the action integral $$S[P] = \int_0^T dt~L\big(\mathbf x_P(t),~\dot{\mathbf x}_P(t),~t\big)$$ via the stationary action principle: to first order, all nearby paths with the same endpoints have the same action, $S[P + \delta P]\approx S[P].$
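You can watch stationarity happen numerically. The sketch below (plain Python, a free particle with $L = \tfrac12 v^2$; the endpoints and perturbation shape are made up for illustration) compares the action of the straight-line path with nearby paths sharing its endpoints: the change in action shrinks like $\varepsilon^2$, i.e. it vanishes to first order.

```python
import math

T, n = 1.0, 2000
dt = T / n

# S = integral of L dt with L = (1/2) v^2 (free particle), via a Riemann sum.
def action(path):
    S = 0.0
    for k in range(n):
        v = (path(dt * (k + 1)) - path(dt * k)) / dt
        S += 0.5 * v * v * dt
    return S

def x_true(t):                 # straight line from x(0)=0 to x(T)=1
    return t / T

def x_pert(t, eps):            # nearby path with the same endpoints
    return x_true(t) + eps * math.sin(math.pi * t / T)

S0 = action(x_true)
for eps in (1e-2, 1e-3):
    dS = action(lambda t: x_pert(t, eps)) - S0
    print(eps, dS)   # dS shrinks like eps**2: no first-order change
```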
Now what is the Lagrangian $L$? It is a function of $2N+1$ variables, $L(\mathbf x,~ \mathbf v,~ t)$. The Lagrangian itself has no idea that $\mathbf x$ is related to $\mathbf v$: they are just independent arguments as far as the function is concerned. The Lagrangian knows nothing about the path; it only assigns a number to each point of this $(2N+1)$-dimensional space.
That is the basis for this method's great power, which goes by the name of "generalized coordinates." Because the action integral just sums up a bunch of numbers $L\big(p(t),~t\big)$ for the configuration points $p(t)$ along the path, and does not care how those configurations are actually represented, you are free to change how you describe the configuration of the system, as long as there is an analogous $L$ which returns the exact same numbers along the path.
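A minimal illustration of that freedom, assuming a free particle in the plane: the Cartesian Lagrangian $\tfrac12(\dot x^2 + \dot y^2)$ and its polar counterpart $\tfrac12(\dot\rho^2 + \rho^2\dot\theta^2)$ return the exact same numbers along any one path, so the action cannot tell the two descriptions apart (the particular path below is made up).

```python
import math

# Cartesian Lagrangian of a free particle in the plane...
def L_cart(x, y, vx, vy):
    return 0.5 * (vx**2 + vy**2)

# ...and the same physics written in polar generalized coordinates.
def L_polar(rho, theta, vrho, vtheta):
    return 0.5 * (vrho**2 + rho**2 * vtheta**2)

def d(g, t, h=1e-6):           # central-difference time derivative
    return (g(t + h) - g(t - h)) / (2 * h)

# One concrete path, described both ways.
def rho(t):   return 1.0 + t
def theta(t): return 0.3 * t
def x(t):     return rho(t) * math.cos(theta(t))
def y(t):     return rho(t) * math.sin(theta(t))

for t in (0.0, 0.5, 1.0):
    a = L_cart(x(t), y(t), d(x, t), d(y, t))
    b = L_polar(rho(t), theta(t), d(rho, t), d(theta, t))
    print(t, a, b)   # identical numbers, hence identical action
```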
How this all answers your question
What you are calling the "chain rule" is actually just the fundamental notion of "differentiability" for multivariable functions.
The only thing you need to add is that when you want to come back from these generalized coordinates $q_i$ to your real coordinates $r_i$, there are some functions $r_1(q_1,\dots q_N)$ through $r_N(q_1, \dots q_N)$. Putting the actual path $q^k(t)$ in there, the time derivatives must therefore obey, assuming that all of these functions are nice, $$r_i(t) + \frac{dr_i}{dt}~\delta t \approx r_i\left(q^1(t) + \frac{dq^1}{dt}~\delta t,~ \dots~q^N(t) + \frac{dq^N}{dt}~\delta t\right),$$ so that $$
\frac{dr_i}{dt} = \sum_k \frac{\partial r_i}{\partial q_k}~\frac{dq^k}{dt}.$$ Notice the difference here: the $q_k$ in the denominator marks which argument we are taking the partial derivative with respect to, while the $q^k$ in the numerator is the function, coming from the path, which tells you what value that argument takes at various times.
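As a numerical check of this chain rule (plain Python; the coordinate map and path are made-up examples, here $r = q_1\cos q_2$, essentially a polar-to-Cartesian map):

```python
import math

# Coordinate map r(q_1, q_2) = q_1 cos(q_2), e.g. x in polar coordinates.
def r(q1, q2):
    return q1 * math.cos(q2)

# The actual path q^k(t).
def q1(t): return 1.0 + t
def q2(t): return 0.3 * t

def d(g, t, h=1e-6):                      # central-difference d/dt
    return (g(t + h) - g(t - h)) / (2 * h)

def partial(f, i, args, h=1e-6):          # central-difference partials
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

t = 0.7
lhs = d(lambda s: r(q1(s), q2(s)), t)     # dr/dt along the path
q = (q1(t), q2(t))
rhs = (partial(r, 0, q) * d(q1, t)        # (dr/dq_1) dq^1/dt
     + partial(r, 1, q) * d(q2, t))       # (dr/dq_2) dq^2/dt
print(lhs, rhs)   # equal to finite-difference accuracy
```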
The extra $\partial r/\partial t$ term works the exact same way -- you can think of $t$ as just another path coordinate $q^0$ which happens to be synced to time, so that $\frac{dq^0}{dt} = 1.$
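And a check of that last point (plain Python, made-up example): give $r$ an explicit $t$-dependence and the full time derivative picks up exactly the extra $\partial r/\partial t$ term with coefficient $1$.

```python
import math

# r now depends explicitly on time: r(q, t) = q^2 + sin(t).
def r(q, t):
    return q**2 + math.sin(t)

def q(t):                  # the path
    return math.cos(t)

def d(g, t, h=1e-6):       # central-difference d/dt
    return (g(t + h) - g(t - h)) / (2 * h)

t, h = 0.4, 1e-6
lhs = d(lambda s: r(q(s), s), t)                      # full dr/dt on the path
dr_dq = (r(q(t) + h, t) - r(q(t) - h, t)) / (2 * h)   # partial in q, t held fixed
dr_dt = (r(q(t), t + h) - r(q(t), t - h)) / (2 * h)   # partial in t, q held fixed
rhs = dr_dq * d(q, t) + dr_dt * 1.0                   # the q^0 slot: dq^0/dt = 1
print(lhs, rhs)
```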
Best Answer
He is using the standard chain rule for partial differentiation from calculus. The partial derivative and the total derivative are not the same thing. See any calculus text, such as those by Kaplan, Thomas, or Taylor.
$$\displaystyle{\sum_{k}\frac{\partial \vec{r}_i}{\partial q_k}~\dot{q}_k}$$
does not vanish; here $\dot{q}_k$ is ${dq_k \over dt}$.
Suggest you look at an example application of the Lagrangian approach to a real problem to understand this more clearly.