First, note that the asymptotic complexity of arithmetic operations stated in the common literature concerns operations on numbers with arbitrary precision, and the running time is expressed as a function of the desired number of digits. From the standpoint of asymptotic complexity it makes no sense to ask for operations with constant precision (e.g., single floats, as you mentioned in the comments): there are only $O(1)$ such numbers, hence the operation can be evaluated in time $O(1)$ (e.g., by a look-up table).
Let me thus denote the desired precision as $m$ (since you use the customary $n$ for something else). The result you quote is that $\sqrt a$ can be computed in time $O(M(m))$, where $M(m)$ is any function (satisfying some mild regularity conditions) such that multiplication of two $m$-bit integers can be performed in time $M(m)$. (The currently known asymptotically fastest multiplication algorithm has $M(m)=m\log m\,2^{O(\log^*m)}$.) The algorithm uses the Newton iteration $x\mapsto x-\frac{x^2-a}{2x}$. This iteration has a quadratic rate of convergence, hence $O(\log m)$ iterations suffice, and each step takes $O(1)$ multiplications and divisions, leading to the estimate $O(M(m)\log m)$ on the total running time.

The extra factor of $\log m$ can be removed by the following observation: since the number of correct digits roughly doubles with each iteration, we do not have to perform all operations with precision $m$; it suffices to use precision sufficient to accommodate the correct digits. Thus only the last iteration is performed with precision $m$, the last but one with precision $m/2$, the one before that with $m/4$, and so on. Then the running time is $O(M(m)+M(m/2)+M(m/4)+\cdots)$. Since $M$ is essentially linear, this can be bounded by a geometric series, whose sum is $O(M(m))$. (Note by the way that the fact that division can be done in time $O(M(m))$ also relies on a similar Newton iteration argument.)
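To make the precision-doubling argument concrete, here is a minimal Python sketch for the integer square root. Exact integer arithmetic stands in for $m$-bit fixed-point numbers, and the function name and the small correction loop are my own choices rather than anything from a particular reference:

```python
def isqrt_newton(n: int) -> int:
    """floor(sqrt(n)) via Newton iteration with doubling working precision.

    A recursive call computes the root of the top ~half of the bits of n,
    giving an initial guess accurate to roughly half the final precision;
    one Newton step for f(x) = x^2 - n then roughly doubles the number of
    correct digits, so the cost is dominated by the top level -- the
    geometric-series argument from the text, illustrated on integers.
    """
    if n < 4:
        return 0 if n == 0 else 1
    # Drop the low 2*k bits and recurse on a number of about half the length.
    k = max(1, n.bit_length() // 4)
    x = isqrt_newton(n >> (2 * k)) << k   # guess with ~half the precision
    x = (x + n // x) // 2                 # one Newton step at full precision
    # The step may overshoot floor(sqrt(n)) by a small constant; fix it up.
    while x * x > n:
        x -= 1
    return x

assert all(isqrt_newton(t) ** 2 <= t < (isqrt_newton(t) + 1) ** 2
           for t in list(range(10 ** 4)) + [10 ** 40 + 12345])
```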
Now, what about $n$th roots in general? You can use Newton iteration again, as Denis suggests. The analysis is similar to the square root case, but since each step takes $O(\log n)$ multiplications, you get a bound $O(M(m)\log n)$. Note that if $n$ is given in binary, $\log n$ is the length of the input, hence this is an algorithm with worse than a quadratic running time. Another approach is to compute $\sqrt[n]a$ as $\exp((\log a)/n)$. Using binary splitting, the Taylor series for $\exp$ and $\log$ can be evaluated in time $O(M(m)(\log m)^2)$; using algorithms based on the arithmetic-geometric mean this can be reduced to $O(M(m)\log m)$, leading to $n$th root computation with the same time bound. This also has an extra $\log$ factor, but it is independent of $n$. I don’t know how to compute $\sqrt[n]a$ in time $O(M(m))$, and I am somewhat skeptical that such a thing is known. It might well be that the comment in Alt’s paper is only intended to cover the case of constant $n$.
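As an illustration of where the $O(\log n)$ multiplications per step come from, here is a rough Python sketch of the Newton iteration for integer $n$th roots. Again exact integers stand in for $m$-bit precision, and this plain version does not implement the precision-doubling refinement, so it only shows the per-step cost:

```python
def iroot(a: int, n: int) -> int:
    """floor(a**(1/n)) by the Newton iteration x -> ((n-1)*x + a/x**(n-1)) / n.

    Evaluating x**(n-1) by repeated squaring costs O(log n) multiplications
    per step, which is the source of the log n factor in O(M(m) log n).
    """
    if n == 1 or a < 2:
        return a
    # Start from a power of two that is guaranteed to exceed a**(1/n).
    x = 1 << -(-a.bit_length() // n)
    while True:
        y = ((n - 1) * x + a // x ** (n - 1)) // n
        if y >= x:          # the sequence decreases until it reaches the root
            return x
        x = y

assert iroot(2 ** 100 + 5, 5) == 2 ** 20
assert iroot(3 ** 33 - 1, 33) == 2
```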
Assuming that efficient means better than the naive $O(n^{2+k})$ multiplication, let us review some possibilities.
Padding. For $k > \omega-2$, just pad $A$ with $n-n^k$ zero or garbage rows, perform square matrix multiplication in $O(n^\omega)$ time, and discard the extra rows from the output. This is already efficient because $\omega < 2+k$. (For $k < \omega-2$ the naive method would be better.)
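A minimal NumPy sketch of the padding reduction (here `np.matmul` merely stands in for an $O(n^\omega)$ square multiplication routine, and the function name is mine):

```python
import numpy as np

def multiply_by_padding(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute A @ B where A is (m x n) with m <= n and B is (n x n),
    by padding A to square shape and reusing square multiplication."""
    m, n = A.shape
    assert B.shape == (n, n) and m <= n
    A_padded = np.zeros((n, n), dtype=A.dtype)
    A_padded[:m, :] = A          # n - m extra zero rows
    C_padded = A_padded @ B      # one square <n, n, n> multiplication
    return C_padded[:m, :]       # discard the rows coming from the padding
```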
Naive splitting. Following Knight (1995), we note that $\langle m,n,p \rangle$ matrix multiplication, with $m$ the smallest of the three dimensions, can be done in $O(m^{\omega-2}np)$ time. This is done by splitting $A_{m \times n}$ into $1 \times (n/m)$ blocks of size $m \times m$, and splitting $B_{n \times p}$ into $(n/m) \times (p/m)$ blocks of the same size. Then we naively perform the block matrix multiplication of size $\langle 1, \, n/m, \, p/m \rangle$ by performing $np/m^2$ square multiplications of size $m$, each in time $O(m^\omega)$, for a total of $O(m^{\omega-2}np)$. Applying this to the current question with input size $\langle n^k, n, n \rangle$ we achieve $O(n^{k(\omega-2)}n^2) = O(n^{2+k(\omega-2)})$. This is efficient and better than padding for all $0<k<1$.
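The block decomposition behind the $O(m^{\omega-2}np)$ bound can be sketched as follows, assuming for simplicity that $m$ divides both $n$ and $p$ (`np.matmul` again stands in for the fast square routine):

```python
import numpy as np

def multiply_by_splitting(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """<m, n, p> product done as a naive <1, n/m, p/m> block product whose
    blocks are m x m: n*p/m^2 square multiplications of size m."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2 and n % m == 0 and p % m == 0
    C = np.zeros((m, p), dtype=np.result_type(A, B))
    for j in range(p // m):                     # block columns of B
        for l in range(n // m):                 # inner block dimension
            C[:, j * m:(j + 1) * m] += (
                A[:, l * m:(l + 1) * m] @ B[l * m:(l + 1) * m, j * m:(j + 1) * m]
            )
    return C
```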
These are just some straightforward transformations that show that "efficient" is possible for the task at hand. I believe newer developments (Le Gall and others) may provide more efficient methods.
Tensor powering. One thing to note is that the splitting method above used fast MM inside the square blocks, but naive MM to combine them. In fact one can use fast methods on both levels, and the blocks need not be square. Coppersmith and Winograd (1982) note that if we have two rectangular matrix multiplication methods,
- $\langle m,n,p \rangle$ in $K$ multiplications and
- $\langle m',n',p' \rangle$ in $K'$ multiplications,
then we can do
- $\langle mm', nn', pp' \rangle$ in $KK'$ multiplications.
This is a powerful tool for combining MM methods of different dimensions. I am not sure, but possibly some of Le Gall & Urrutia's "symmetric" $\langle n, n^k, n \rangle$ rectangular methods that the OP cites could be "tensored up" in this way to solve the "asymmetric" $\langle n^k, n, n\rangle$ case for the current question.
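As a sanity check of the tensoring rule, consider a standard example not tied to this question: Strassen's $\langle 2,2,2 \rangle$ method uses $7$ multiplications, so tensoring it with itself $t$ times gives
$$\langle 2^t, 2^t, 2^t \rangle \text{ in } 7^t \text{ multiplications},$$
which recovers the familiar $O(n^{\log_2 7})$ bound for square matrices; the same mechanism can combine methods of different shapes.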
Almost quadratic. Coppersmith (1982) reports that if $k = 0.17227$, then we can perform the $\langle n^k, n, n\rangle$ multiplication in $O(n^2 \, (\log n)^2)$ time, that is, almost in quadratic time! Note that the naive splitting would give an exponent of $2 + k(\omega-2) \le 2.0643$, using $\omega \le 2.373$.
Bibliography
Knight, Philip A., Fast rectangular matrix multiplication and QR decomposition, Linear Algebra Appl. 221, 69-81 (1995). ZBL0827.65044.
Coppersmith, D., Rapid multiplication of rectangular matrices, SIAM J. Comput. 11, 467-471 (1982). ZBL0486.68031.
Coppersmith, D.; Winograd, S., On the asymptotic complexity of matrix multiplication, SIAM J. Comput. 11, 472-492 (1982). ZBL0486.68030.
Best Answer
Gaussian elimination is a polynomial-time algorithm. While it may not be obvious at first sight, it can be implemented so that the intermediate entries have only polynomial size (bit length), because they happen to be equal to determinants of certain submatrices of the original matrix (or ratios thereof, depending on the version). See e.g. Edmonds and Bareiss.
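For concreteness, here is a minimal Python sketch of fraction-free (Bareiss-style) elimination for the determinant of an integer matrix; every intermediate entry is, up to sign, the determinant of a minor of the input, which is why its bit length stays polynomially bounded. The function name and the zero-pivot handling are my own choices:

```python
def bareiss_determinant(M):
    """Determinant of a square integer matrix by fraction-free elimination.

    The update (A[k][k]*A[i][j] - A[i][k]*A[k][j]) / prev is an exact
    integer division: the new entry equals a (k+1) x (k+1) minor of the
    original matrix, so intermediate entries stay polynomially sized.
    """
    A = [row[:] for row in M]
    n = len(A)
    sign, prev = 1, 1
    for k in range(n - 1):
        if A[k][k] == 0:
            # Swap in a lower row with a nonzero pivot, if any; else det = 0.
            swap = next((i for i in range(k + 1, n) if A[i][k] != 0), None)
            if swap is None:
                return 0
            A[k], A[swap] = A[swap], A[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] = (A[k][k] * A[i][j] - A[i][k] * A[k][j]) // prev
            A[i][k] = 0
        prev = A[k][k]
    return sign * A[-1][-1]

assert bareiss_determinant([[2, 3, 1], [4, 7, 5], [0, 1, 3]]) == 0
assert bareiss_determinant([[1, 2], [3, 4]]) == -2
```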