I don't know if another answer can be more illuminating to you, but let me try saying it a bit differently anyway.
First, consider the inner loop:
    for k = 1 to m
        // do a constant amount of work
Because it's doing a constant amount of work $m$ times (for k = 1 to m), this takes approximately time $m$, whatever the value of $m$ is. (To be precise, it takes $\Theta(m)$ time, which means that there are constants $c_1$ and $c_2$ such that the time it takes is between $c_1m$ and $c_2m$, but when finding the "approximate running time", i.e., the asymptotic complexity, we usually ignore constant factors.)
Now the outer loop looks like
    m = n
    while (m > 0)
        // do 'm' amount of work  (the inner loop from above)
        m = floor(m / 2)
where I've replaced the inner loop with what we know about its time. So the first time the loop is run, $m=n$ and it takes time $n$. By the second time the loop is run, $m$ is halved, so $m = n/2$ and it takes time $n/2$ (I'm ignoring writing $\lfloor n/2 \rfloor$, because that's within a constant factor of $n/2$.) The third time it takes $n/4$, etc. So the total time taken by the code is:
$$\begin{align}&n + \frac{n}{2} + \frac{n}{4} + \frac{n}{8} + \dots \\
&= n\left(1 + \frac12 + \frac14 + \frac18 + \dots\right)\end{align}$$
until the term becomes less than $1$ (so $m$ would have become $0$ and the code would have halted).
Now, the sum $\left(1 + \frac12 + \frac14 + \frac18 + \dots\right)$ is at most $2$, so the above sum for the running time is at most $2n$. Ignoring constant factors, we can say that the code takes approximately time $n$, which is shorthand for saying that the time it takes is linear in $n$. Or, if we did the whole thing formally, we would say it takes time $\Theta(n)$.
(As it happens, we can analyse the number of terms in the sum: if the last term is $\frac{n}{2^k}$ (each term is of this type), then $k$ is such that $2^k \le n < 2^{k+1}$, which means $k = \lfloor \lg n \rfloor$, but all this is irrelevant to the actual problem here.)
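If you want to convince yourself numerically, here is a small Python sketch (the function name and the counters are my own additions) that simulates the two loops, charging one unit of work per inner iteration:

    def total_work(n):
        count = 0    # total units of work done by the inner loop
        rounds = 0   # number of times the outer loop runs
        m = n
        while m > 0:
            count += m   # "for k = 1 to m" does m units of constant work
            rounds += 1
            m = m // 2   # m = floor(m/2)
        return count, rounds

    # total_work(10**6) returns (1999993, 20): just under 2n units of work,
    # and floor(lg n) + 1 = 20 passes through the outer loop.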
I can't read that code, but when you want to raise something to the power $n$, you can write $n$ out in binary; then for every zero in that binary expansion you have to do a squaring, and for every one in the expansion you have to do a squaring and a multiplication. So the number of operations you have to do is at most twice the number of bits in the binary expansion of $n$. But the number of bits in the binary expansion of $n$ is, essentially, $\log_2 n$.
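To make that concrete, here is a minimal Python sketch of left-to-right binary exponentiation (the function name and the operation counters are mine): it does one squaring for every bit and one extra multiplication for every 1-bit, so at most twice as many operations as there are bits.

    def power_by_squaring(a, n):
        result = 1
        squarings = multiplications = 0
        for bit in bin(n)[2:]:       # binary expansion of n, most significant bit first
            result *= result         # one squaring for every bit
            squarings += 1
            if bit == '1':
                result *= a          # one extra multiplication for every 1-bit
                multiplications += 1
        return result, squarings, multiplications

    # power_by_squaring(3, 13) returns (1594323, 4, 3): 13 is 1101 in binary,
    # so 4 squarings and 3 multiplications, i.e. at most 2 * 4 = 8 operations.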
Best Answer
The factor $k$ should be clear.
To compute $a^{n-1}$ one can use repeated squaring and multiplication, which requires one or two modular multiplications per bit of $n$. As the number of bits is $\log n$, we get a factor of $\log n$.
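As a sketch (the function name is mine, and Python's % stands in for whatever reduction the modular multiplication uses), the right-to-left variant of repeated squaring looks like this:

    def powmod(a, e, n):
        # At most one squaring plus one multiplication per bit of e,
        # i.e. one or two modular multiplications per bit.
        result = 1
        a %= n
        while e > 0:
            if e & 1:
                result = (result * a) % n   # multiply when the current bit is 1
            a = (a * a) % n                 # square once per bit
            e >>= 1
        return result

    # powmod(a, n - 1, n) computes a^(n-1) mod n with O(log n) modular multiplications.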
A modular multiplication can be done by adding up to $\log n$ bit-shifted copies of a factor. Each addition involves numbers with at most twice as many bits as $n$ has, so it takes $O(\log n)$ steps; the reduction modulo $n$ might, for example, be done using $\log n$ precomputed values, one for each of the "higher" bits. In any case, this does not hurt the $O(\log n)$ cost of the addition.
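Here is a sketch of such a shift-and-add modular multiplication (the name is mine; the reduction is done by taking the remainder after every step rather than with the precomputed table mentioned above, but the cost per bit is still $O(\log n)$):

    def mulmod_shift_add(a, b, n):
        a %= n
        result = 0
        while b > 0:
            if b & 1:
                result = (result + a) % n   # add one bit-shifted copy of a
            a = (a + a) % n                 # shift a one bit to the left, modulo n
            b >>= 1
        return result

    # One addition (plus a reduction) per bit of b, each on numbers of O(log n) bits.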
With this simple method, we obtain thus $$O(\underbrace{k}_{\text{repeats}}\times\underbrace{\log n}_{\text{expon.}}\times\underbrace{\log n}_{\text{mult.}}\times \underbrace{\log n}_{\text{add}})=O(k\times \log^3n).$$ To improve by a factor of $\frac{\log\log n\times\log\log\log n}{\log n}$, one needs to employ slightly smarter methods (I guess FFT or something like that).
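For completeness, assuming (as the $a^{n-1}$ and the "repeats" factor suggest) that the surrounding question is about a Fermat-style primality test run $k$ times, the pieces combine roughly like this (the function name is mine, and Python's built-in pow stands in for the hand-rolled powmod/mulmod sketched above):

    import random

    def probable_prime(n, k):
        # k repetitions of the check a^(n-1) ≡ 1 (mod n), for n > 3.
        # With the exponentiation done by repeated squaring (O(log n) modular
        # multiplications) and each multiplication done by shift-and-add
        # (O(log n) additions of O(log n)-bit numbers), one repetition costs
        # O(log^3 n) bit operations, hence O(k log^3 n) in total.
        for _ in range(k):
            a = random.randrange(2, n - 1)
            if pow(a, n - 1, n) != 1:
                return False   # witness found: n is definitely composite
        return True            # no witness found in k rounds: n is probably prime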