If you are using IEEE 754 compliant floating-point numbers, square root may be faster than you suppose: the operation is directly supported by the standard, any compliant hardware is guaranteed to round it correctly (unlike sine and cosine), and it often uses a hardware implementation of the Newton-Raphson method Alijah described.
That said, the algorithm you linked to uses only the fractional part of the square root to calculate pixel opacity, and consequently the final opacity value ranges only from 0 to 255. Given such a small output range, floating point may be overkill and a fixed-point integer representation might work better. If the output is truly only a byte and your maximum radius isn't too large, you can use a look-up table indexed by a fixed-point input to skip the expensive square root calculation entirely. A 16-bit fixed-point index gives you a 64 KB look-up table, which isn't too bad.
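A minimal sketch of such a table, assuming an 8.8 fixed-point argument (a hypothetical choice of split; with 8 integer bits the argument must stay below 256, i.e. radii up to about 16, so pick the split to match your maximum radius):

```c
#include <math.h>
#include <stdint.h>

/* 64 KB table: maps a 16-bit fixed-point argument in 8.8 format
 * (values 0 .. 255.996) to the fractional part of its square root,
 * scaled to a byte opacity 0 .. 255. */
static uint8_t frac_sqrt_lut[1 << 16];

static void init_frac_sqrt_lut(void)
{
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        double s = sqrt(i / 256.0);     /* undo the 8.8 scaling        */
        double frac = s - floor(s);     /* fractional part of the root */
        frac_sqrt_lut[i] = (uint8_t)(frac * 255.0 + 0.5);
    }
}

/* In the inner loop, each sqrt call then becomes one indexed load:
 * opacity = frac_sqrt_lut[x_fixed];  with x_fixed in 8.8 format. */
```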
You might also be able to avoid a division and a square root when calculating the $45^\circ$ value (ffd in the algorithm) by using the Fast Inverse Square Root hack.
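For reference, here is a sketch of that hack (the well-known 0x5f3759df bit trick, written with memcpy instead of the original's undefined pointer type-punning); it approximates $1/\sqrt x$ and refines the guess with one Newton-Raphson step:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root: approximates 1/sqrt(x) for x > 0. */
static float fast_rsqrt(float x)
{
    float x2 = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);       /* reinterpret the float's bits  */
    i = 0x5f3759df - (i >> 1);      /* magic initial guess           */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - x2 * y * y);    /* one Newton-Raphson refinement */
    return y;
}
```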
As for whether there is a method to calculate the fractional part of a square root knowing only the integer part: there is one approach that iteratively calculates a square root while minimizing divisions, namely continued fractions. The simplest variant I know of for what you want (fast convergence) is detailed here and works as follows:
$a_0 = x - \lfloor\sqrt x\rfloor^2$
$b_0 = 2\lfloor\sqrt x\rfloor$
$a_n = a_0b_{n-1}$
$b_n = b_0b_{n-1} + a_{n-1}$
which gives you successively better rational approximations of the fractional part of $\sqrt x$; you perform the single division $\frac{a_n}{b_n}$ only once you've done enough iterations to ensure the accuracy you need. If you ultimately need only byte-sized opacity values, 3 or 4 iterations should suffice, and the division is postponed until the end, which, apart from delivering the fractional part directly, is the only significant difference between this and the Newton-Raphson method.
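Here is a sketch of that iteration in C, assuming you already have the integer part `isqrt_x` $= \lfloor\sqrt x\rfloor$ (the function name and the fixed iteration count are my own choices; with radii below roughly 256 the intermediates fit comfortably in 64 bits):

```c
#include <stdint.h>

/* Fractional part of sqrt(x) as a byte (0..255), via the recurrence
 * a_n = a0 * b_{n-1},  b_n = b0 * b_{n-1} + a_{n-1},
 * with a0 = x - floor(sqrt(x))^2 and b0 = 2 * floor(sqrt(x)). */
static uint8_t frac_sqrt_byte(uint32_t x, uint32_t isqrt_x)
{
    uint64_t a0 = x - (uint64_t)isqrt_x * isqrt_x;
    uint64_t b0 = 2 * (uint64_t)isqrt_x;
    if (b0 == 0)
        return 0;                     /* x == 0: no fractional part */
    uint64_t a = a0, b = b0;
    for (int n = 0; n < 4; ++n) {     /* 3-4 iterations suffice for a byte */
        uint64_t a_next = a0 * b;
        uint64_t b_next = b0 * b + a;
        a = a_next;
        b = b_next;
    }
    return (uint8_t)((a * 255) / b);  /* the single division, at the end */
}
```

For example, for $x = 2$ the successive convergents are $1/2$, $2/5$, $5/12$, $12/29$, $29/70 \approx 0.4143$, already within byte precision of $\sqrt 2 - 1 \approx 0.4142$.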
If you really want to pursue incremental calculation as far as it can go, you can use Gosper's continued fraction algorithms (see especially the section on square roots) and calculate all the variables involved as continued fractions, one term at a time. This lets you avoid square root calculations and all divisions other than bit shifts, and lets you abort as soon as you know the correct pixel opacity (or whatever else you're calculating) without wasting time on digits you don't need. But it requires a serious overhaul of the algorithm you linked, and if I went into detail this answer would turn into a book.
So essentially: if you have memory to spare, your maximum radius isn't huge, and your output range is small, go with a fixed-point look-up table. If you want simplicity of implementation, go with floating-point or fixed-point arithmetic. If you want to avoid square root calculation entirely without a look-up table, go with Newton-Raphson or the continued fraction variant on it. If you want to absolutely minimize wasted computation at the expense of some up-front overhead, go with Gosper's continued fraction algorithms.
Best Answer
We are given the nonlinear equation $f(x) = 0$, where $f : [0,\infty) \rightarrow \mathbb{R}$ is given by \begin{equation} f(x) = a x^p + bx^q + cx + d, \quad p = 2.56, \quad q = 1.78. \end{equation} It is known that $a, c, d$ are (strictly) negative and that $b$ is (strictly) positive. We first consider the question of the existence of solutions. Clearly $f$ is continuous, and $f(0) = d < 0$. Since \begin{equation} f(x) \rightarrow -\infty \quad \text{as} \quad x \rightarrow \infty, \end{equation} both ends of the domain give negative values, so we cannot immediately conclude that $f$ has a zero. We therefore seek to determine the range of $f$ in the standard manner. We have \begin{equation} f'(x) = apx^{p-1} + bq x^{q-1} + c = (2.56) a x^{1.56} + (1.78)b x^{0.78} + c. \end{equation} The fact that \begin{equation} 1.56 = 2 \cdot 0.78 \end{equation} is hardly a coincidence, and it is worth investigating which property of the original problem gave rise to it.
It is vital that you check that this equality is exact, i.e. that $p$ and $q$ are exact values and not the result of rounding.
With the substitution \begin{equation} y = x^{0.78} \end{equation} it is clear that $f'(x)=0$ if and only if \begin{equation} (2.56) a y^2 + (1.78)b y + c = 0. \end{equation} Note that only nonnegative roots $y$ correspond to points $x = y^{1/0.78} \ge 0$ of the original domain. Given specific values of $a$, $b$, and $c$ it is (almost) trivial to determine whether $f'$ has any zeros, but I urge you to consult Higham's book "Accuracy and Stability of Numerical Algorithms" if you have never considered catastrophic cancellation in this setting before. The stable solution of quadratic equations is discussed in either Chapter 1 or Chapter 2.
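For illustration, here is a sketch of the standard cancellation-free solve (my own helper, not code from Higham's text), applied to the quadratic in $y$ above with $A = 2.56\,a$, $B = 1.78\,b$, $C = c$:

```c
#include <math.h>

/* Solve A*y^2 + B*y + C = 0 (A != 0) without catastrophic cancellation:
 * compute the larger-magnitude root by adding quantities of equal sign,
 * then recover the other root from the product of the roots, C/A.
 * Returns the number of real roots stored in r[0], r[1]. */
static int solve_quadratic(double A, double B, double C, double r[2])
{
    double disc = B * B - 4.0 * A * C;
    if (disc < 0.0)
        return 0;                            /* no real roots */
    double q = -0.5 * (B + copysign(sqrt(disc), B));
    r[0] = q / A;                            /* larger-magnitude root       */
    r[1] = (q != 0.0) ? C / q : 0.0;         /* other root, no cancellation */
    return (disc == 0.0) ? 1 : 2;
}
```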
Solving the equation $f'(x)=0$ will allow you to break the interval $[0,\infty)$ into subintervals on which $f$ is either strictly increasing or strictly decreasing. By evaluating $f$ at the break points, i.e. the roots of $f'$, you will be able to detect any sign changes. By continuity, this identifies intervals which contain exactly one root each.
For the sake of robustness I would recommend the bisection algorithm. When speed is of the essence, I always recommend a hybrid of the bisection algorithm and the secant method. In this manner you get both the rapid convergence of the secant method and the safety of the bisection algorithm.
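A sketch of such a hybrid (simplified; for production use a library implementation of, say, Brent's method would be more robust), assuming a bracket $[lo, hi]$ with $f(lo)f(hi) < 0$ has already been found as described above:

```c
/* Guarded secant iteration: try a secant step; whenever it leaves the
 * current bracket (or the secant slope degenerates), fall back to a
 * bisection step.  The bracket always contains a root, so the method
 * is as safe as bisection but usually converges much faster. */
static double hybrid_root(double (*f)(double), double lo, double hi, double tol)
{
    double flo = f(lo), fhi = f(hi);
    while (hi - lo > tol) {
        double x = hi - fhi * (hi - lo) / (fhi - flo);  /* secant step    */
        if (!(x > lo && x < hi))
            x = 0.5 * (lo + hi);                        /* bisect instead */
        double fx = f(x);
        if (fx == 0.0)
            return x;
        if (flo * fx < 0.0) { hi = x; fhi = fx; }
        else                { lo = x; flo = fx; }
    }
    return 0.5 * (lo + hi);
}
```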
If this is part of a "serious" code which will run billions of times or more, and you need to ensure that it works subject to the limitations of floating point arithmetic, then do not hesitate to contact me via email. I cannot make any promises, but it could be a fun problem.
I foresee no difficulty in reaching a solution in less than a millisecond; I cannot imagine that we would need more than a few thousand CPU cycles.
It is possible to view $f$ as a polynomial, though not in the variable $x$. Specifically, if $x=y^{50}$, then \begin{equation} f(x) = a x^{2.56} + bx^{1.78} + cx + d = a y^{128} + b y^{89} + c y^{50} + d. \end{equation} This circumvents the need for any approximations, but I see no real advantage to this approach. We still have to determine the range of $f$ as well as the intervals which contain the root(s), and computing real powers requires calls to the exponential and logarithm functions, so we are no better off with this form of $f$ than with the original.