[Math] Gaussian elimination algorithm performance

algorithms, elementary-number-theory, gaussian-elimination, matrices, number-theory

I am developing the quadratic sieve algorithm and I have reached a new bottleneck: the matrix processing step.

I have been reading quite a lot about this topic and found several candidate solutions:

  • Gaussian elimination: This is perhaps the most common approach to this problem. Its running time is $O(N^3)$ over GF(2).
  • Method of Four Russians: It optimizes classic Gaussian elimination to $O(\frac{N^3}{\log{N}})$ by partitioning the matrix into small square blocks of size $\log N$. You can read more about it here (section 3.2).
  • Block Wiedemann algorithm: It is used to distribute the matrix across several machines in order to speed up the process. Here is a good example of such an implementation over GAPP; they use a slightly different algorithm for that.

But none of the above was good enough for me, so I created my own algorithm.

It's based on Gaussian elimination, but instead of searching for a pivot in each column, I represent each column with a hash table plus a linked list (this is how Java's HashMap is implemented), where each node of the list is stored in the hash table and contains the pivot's row index in the original matrix. This lets me iterate over all the pivots of a column.

When I start to perform Gaussian elimination, it only takes $O(1)$ to find the first pivot of a column. Once I find such a pivot I store it in another hash table, so I won't pick it again for the next column I operate on. Next I iterate over the rest of the pivots of the first column and XOR the rows to eliminate them. I do the same trick with the rows, so I have a duplicate representation of the matrix: hash tables for both rows and columns. This speeds up the row-XOR step, since I only iterate over the row's pivots. The XOR operation on each node is $O(1)$, but it involves two hash operations and is by itself quite expensive. This is the part where the algorithm fails, even though I estimate its average performance as sub-quadratic.
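A minimal sketch of the column-index idea described above, with illustrative names (the original implementation is in Java; Python's dicts and sets stand in for its hash tables): for each column we keep the set of row indices holding a 1, so the first pivot of a column is found in $O(1)$ on average.

```python
def build_column_index(rows):
    """rows: list of Python ints; bit j of rows[i] is the matrix entry (i, j).

    Returns a dict mapping each column index to the set of row indices
    that have a 1 in that column.
    """
    cols = {}
    for i, row in enumerate(rows):
        j = 0
        while row:
            if row & 1:
                cols.setdefault(j, set()).add(i)
            row >>= 1
            j += 1
    return cols

rows = [0b101, 0b011, 0b110]        # 3x3 matrix over GF(2)
cols = build_column_index(rows)
print(sorted(cols[0]))              # rows with a 1 in column 0 → [0, 1]
```

Note that building the index is cheap; the expensive part, as the question observes, is keeping these sets consistent after every row XOR, which costs a hash operation per flipped entry.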

I ended up implementing classic Gaussian elimination using bit operations, which lets me process square matrices with up to $10,000$ rows in a few minutes. But that's not fast enough: my goal is factoring a 100-digit number, which requires a square matrix of roughly $10^6$ rows, and that is impossible with my current approach.
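For reference, here is a minimal sketch of what "classic Gaussian elimination using bit operations" can look like over GF(2), assuming each row is packed into a Python int so that eliminating a row is a single big-integer XOR rather than a loop over entries (function and variable names are illustrative):

```python
def gf2_eliminate(rows, ncols):
    """Reduce the bit-packed matrix in place; return the pivot columns."""
    pivots = []
    r = 0                                   # next pivot row
    for c in range(ncols):
        # find a row at or below r with a 1 in column c
        for i in range(r, len(rows)):
            if (rows[i] >> c) & 1:
                rows[r], rows[i] = rows[i], rows[r]
                # clear column c everywhere else: one XOR per affected row
                for k in range(len(rows)):
                    if k != r and (rows[k] >> c) & 1:
                        rows[k] ^= rows[r]
                pivots.append(c)
                r += 1
                break
    return pivots

rows = [0b101, 0b011, 0b110]   # third row is the XOR of the first two
pivots = gf2_eliminate(rows, 3)
print(pivots, rows)            # → [0, 1] [5, 6, 0]  (rank 2, one null row)
```

The zeroed rows after reduction are exactly what the quadratic sieve needs: each corresponds to a left null-space vector, i.e. a subset of relations whose product is a square.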

Are there any other methods that I missed? What is the best non-parallel method to solve this problem?

Best Answer

Your approach to Gaussian elimination is fundamentally at odds with the current state of hardware. You are up against the memory wall, i.e. the fact that floating-point operations can be completed at a rate far higher than the speed at which data can be retrieved from main memory. By processing your matrix one column at a time, you are constantly incurring cache misses, and your CPU is running at a fraction of its peak rate. A truly enormous amount of research, involving many good people working together for many years, has gone into developing blocked algorithms which facilitate reuse of the data stored in the cache hierarchy.

On a "standard" desktop with a Haswell chip running at 3.6 GHz, a random matrix of dimension $N=10,000$ can be factored in around 4.5 seconds using the dense factorization routine in LAPACK.

It is possible to write fast matrix factorization routines based on Strassen's matrix-matrix multiplication with roughly $O(N^{2.6})$ complexity, but I cannot recommend this approach due to unresolved stability issues.

So to answer your question: the best sequential algorithms for Gaussian elimination currently available are implementations of LAPACK's DGETRF/DGETRS, which factor and solve general nonsingular linear systems. You should be able to obtain an implementation through www.netlib.org.
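If you don't want to call LAPACK from Fortran or C directly, SciPy exposes the same routines: `scipy.linalg.lu_factor`/`lu_solve` wrap DGETRF/DGETRS. A small sketch ($N$ is kept modest here; scale it up to see the blocked-factorization speed the answer quotes):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
N = 500
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)

lu, piv = lu_factor(A)        # DGETRF: blocked LU with partial pivoting
x = lu_solve((lu, piv), b)    # DGETRS: forward/backward triangular solves

print(np.allclose(A @ x, b))  # → True
```

Note that this is the dense floating-point path; for the GF(2) matrices arising in the quadratic sieve you would still want a bit-packed or Four-Russians style kernel, but the cache-blocking lesson carries over.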
