I'm going to copy my answer from Stack Overflow, which also shows why 4-component vectors (and hence 4×4 matrices) are used instead of 3-component ones.
In most 3D graphics a point is represented by a 4-component vector (x, y, z, w), where w = 1. The usual operations applied to a point include translation, scaling, rotation, reflection, skewing and combinations of these.
These transformations can be represented by a mathematical object called a "matrix". A matrix acts on a vector like this:
[ a b c tx ] [ x ] [ a*x + b*y + c*z + tx*w ]
| d e f ty | | y | = | d*x + e*y + f*z + ty*w |
| g h i tz | | z | | g*x + h*y + i*z + tz*w |
[ p q r s ] [ w ] [ p*x + q*y + r*z + s*w ]
For example, scaling is represented as
[ 2 . . . ] [ x ] [ 2x ]
| . 2 . . | | y | = | 2y |
| . . 2 . | | z | | 2z |
[ . . . 1 ] [ 1 ] [ 1 ]
and translation as
[ 1 . . dx ] [ x ] [ x + dx ]
| . 1 . dy | | y | = | y + dy |
| . . 1 dz | | z | | z + dz |
[ . . . 1 ] [ 1 ] [ 1 ]
One of the reasons for the 4th component is to make translation representable by a matrix.
The advantage of using a matrix is that multiple transformations can be combined into one via matrix multiplication.
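To make the composition concrete, here is a small numpy sketch (an illustration, not part of the original answer) that builds the scaling and translation matrices above and combines them with a single matrix multiplication:

```python
import numpy as np

# Scale by 2 about the origin, then translate by (dx, dy, dz) = (5, 0, 0).
S = np.diag([2.0, 2.0, 2.0, 1.0])
T = np.eye(4)
T[:3, 3] = [5.0, 0.0, 0.0]

# The combined matrix applies S first, then T (note the order of the product).
M = T @ S

p = np.array([1.0, 2.0, 3.0, 1.0])  # point (1, 2, 3) with w = 1
print(M @ p)  # [7. 4. 6. 1.] -> scaled to (2, 4, 6), then shifted to (7, 4, 6)
```

Once `M` is precomputed, every point costs one matrix-vector product, no matter how many transformations it encodes.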
Now, if the purpose were simply to make translation expressible, then I'd use (x, y, z, 1) instead of (x, y, z, w) and make the last row of the matrix always [0 0 0 1], as is usually done for 2D graphics. In fact, the 4-component vector is mapped back to an ordinary 3-vector via this formula:
[ x(3D) ] [ x / w ]
| y(3D) | = | y / w |
[ z(3D) ] [ z / w ]
These are called homogeneous coordinates. Allowing w ≠ 1 makes the perspective projection expressible with a matrix too, which can again be combined with all the other transformations.
For example, since objects farther away should appear smaller on screen, we transform the 3D coordinates into 2D using the formula
x(2D) = x(3D) / (10 * z(3D))
y(2D) = y(3D) / (10 * z(3D))
Now if we apply the projection matrix
[ 1 . . . ] [ x ] [ x ]
| . 1 . . | | y | = | y |
| . . 1 . | | z | | z |
[ . . 10 . ] [ 1 ] [ 10*z ]
then the real 3D coordinates would become
x(3D) := x/w = x/(10z)
y(3D) := y/w = y/(10z)
z(3D) := z/w = z/(10z) = 0.1
so we just need to chop the z-coordinate out to project to 2D.
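A short numpy sketch (my illustration, with an arbitrary sample point) of the projection matrix above, including the homogeneous divide:

```python
import numpy as np

# Perspective matrix from the text: the last row [0 0 10 0] makes w pick up 10*z.
P = np.array([
    [1., 0., 0., 0.],
    [0., 1., 0., 0.],
    [0., 0., 1., 0.],
    [0., 0., 10., 0.],
])

p = np.array([4.0, 8.0, 2.0, 1.0])   # point (4, 8, 2) with w = 1
q = P @ p                             # [4, 8, 2, 20]
x2d, y2d = q[0] / q[3], q[1] / q[3]   # homogeneous divide: x/(10z), y/(10z)
print(x2d, y2d)  # 0.2 0.4
```

The divide by `q[3]` is exactly the mapping back from homogeneous coordinates shown earlier; the z-coordinate after the divide is a constant and can be dropped.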
After some googling, I think I see what you need. Every row is a 2D point, so the "rotation" and "translation" are applied to EVERY row. So given two sets of points written in the matrices $A$ and $B$, you want to find a rotation matrix $R = \left(\begin{matrix} \cos{T} & -\sin{T} \\ \sin{T} & \cos{T} \end{matrix}\right)$ and a translation vector $tr = (t_x, t_y)^t$ such that for every row $B_i$ of $B$ and every row $A_i$ of $A$ you have $B_i^t = R.A_i^t + tr$.
Now, if $R$ and $tr$ exist and if you have two distinct rows from $B$ and $A$ you will have that
$$ B_1^t = R.A_1^t+tr, \qquad B_2^t = R.A_2^t+tr$$
from where, after subtraction, you obtain $B_2^t-B_1^t = R.(A_2^t-A_1^t)$, which is a $2\times 2$ linear system for the "variables" $\cos{T}$ and $\sin{T}$. If your points are distinct, you will obtain a unique solution for $R$ (and an angle $T$ is determined uniquely by its $\sin$ and $\cos$). Once you have $R$, you obtain $tr$ from the first equation: $tr = B_1^t-R.A_1^t$.
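The derivation above can be turned into a few lines of numpy (a sketch of my own, not code from the answer; the function name is made up):

```python
import numpy as np

def rigid_from_two_points(A1, A2, B1, B2):
    """Recover R and tr such that B_i^t = R @ A_i^t + tr, given two
    point correspondences, following the subtraction trick above."""
    dA = np.subtract(A2, A1)  # A2 - A1
    dB = np.subtract(B2, B1)  # B2 - B1 = R @ (A2 - A1)
    # R @ dA = dB written as a 2x2 linear system for (cos T, sin T):
    #   [ dA_x  -dA_y ] [ cos T ]   [ dB_x ]
    #   [ dA_y   dA_x ] [ sin T ] = [ dB_y ]
    c, s = np.linalg.solve([[dA[0], -dA[1]], [dA[1], dA[0]]], dB)
    R = np.array([[c, -s], [s, c]])
    tr = np.asarray(B1, dtype=float) - R @ np.asarray(A1, dtype=float)
    return R, tr

# Example: rotation by 90 degrees plus translation (1, 2).
R, tr = rigid_from_two_points((1, 0), (0, 1), (1, 3), (0, 2))
print(R)   # [[ 0. -1.]
           #  [ 1.  0.]]
print(tr)  # [1. 2.]
```

Note that `np.linalg.solve` fails (singular matrix) exactly when $A_1 = A_2$, matching the requirement that the points be distinct.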
Best Answer
"Gets them close enough" is the tricky part. How do you measure how far apart they are? If the answer is "I use the usual Euclidean metric," so that the distance from $(a, b, c)$ to $(x, y, z)$ is $\sqrt{(a-x)^2 + (b-y)^2 + (c-z)^2}$, then you can minimize the sum of squared errors by using the Moore-Penrose pseudo-inverse.
Here's (roughly) the trick. You want $$ Mv_i = u_i $$ for some given list of "input" vectors $v_1, v_2, \ldots, v_k$ and "target vectors" $u_1, u_2, \ldots, u_k$, where each of these is in $\Bbb R^3$, and $M$ should be a $3 \times 3$ matrix.
If you put the $v$-vectors into a $ 3 \times k$ matrix (each vector becomes one column of the matrix $V$), and do the same thing for the $u$-vectors, making a matrix $U$, then you're hoping that $$ M V = U $$ where $M$ is $3 \times 3$, and $V$ and $U$ are each $3 \times k$. For $k > 3$, this problem is generally not solvable. But if you look at $$ R = MV - U $$ for a particular $M$, that's a matrix that shows the "errors" -- how far $Mv_i$ is from $u_i$. And if the sum of the squares of the entries of $R$ is small...then you've done well.
Now that is a calculus problem: over all $3 \times 3 $ matrices $M$, consider the function $$ f(M) = \sum_{i,j}(MV - U)^2_{i,j} $$ and minimize it. You take derivatives, set them to zero, and (after a good deal of algebra) you arrive at a formula for $M$.
Let me lead you there in a slightly different way. Let's suppose that there is such a matrix $M$ for your $U$ and $V$ data. Start from $$ MV = U $$ and multiply both sides by $V^t$ to get $$ M (VV^t) = (U V^t) $$ The matrices in parentheses are now $3 \times 3$, and unless your set of $v$-vectors is particularly bad (e.g., they all lie in one plane!), the matrix $V V^t$ is invertible. So you find that $$ M = (UV^t)(V V^t)^{-1} $$ Now that's the answer for $M$ in the case that $MV = U$ does have a solution. The surprising thing (which you get by doing the calculus I mentioned) is that even if $MV = U$ doesn't have an exact solution, the matrix $$ M^{*} = (UV^t)(V V^t)^{-1} $$ is the one for which $f(M)$ is minimized, i.e., for which the transformed $v$-vectors are as close as possible (in the aggregate) to the corresponding $u$-vectors.
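Here is a quick numpy check of the closed form (synthetic data of my own making; `np.linalg.lstsq` solves the same least-squares problem and is used only for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 10))        # k = 10 input vectors as columns
M_true = rng.standard_normal((3, 3))
U = M_true @ V + 0.01 * rng.standard_normal((3, 10))  # noisy targets

# The formula from the answer: M* = (U V^t)(V V^t)^{-1}
M_star = (U @ V.T) @ np.linalg.inv(V @ V.T)

# Cross-check: lstsq minimizes ||V^t X - U^t||_F, so X^t is the same M*.
M_lstsq = np.linalg.lstsq(V.T, U.T, rcond=None)[0].T
print(np.allclose(M_star, M_lstsq))  # True
```

In practice one would prefer `lstsq` (or `np.linalg.solve` on the normal equations) over forming an explicit inverse, but the closed form above is exactly what the calculus yields.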