ad 1) Because any linear function of T can be written in the form ε = N · T for some vector N, there is little hope of improving your method if we are restricted to linear least squares.
EDIT It's actually possible to formulate it as a linear least squares problem with linear constraints, which makes it possible to see that just computing the distance from the ray as you proposed should work fine and is easy to implement; see (*).
So it's better not to implement what I suggested under "ad 2)", because it is more complicated than necessary.
ad 2) Formulate it as a nonlinear least squares problem. Both the Gauss-Newton algorithm (GNA) and the Levenberg–Marquardt algorithm (LMA) are well suited to solve this type of problem. There are many free implementations of LMA available. It may be a good idea to generate a sufficiently large number of random initial values to avoid problems with local minima or non-convergence.
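As a concrete sketch of the multi-start idea, here is plain Gauss-Newton (standing in for a library LMA) with random restarts on a toy one-parameter exponential model, not the bearings problem itself; all names and the model are illustrative.

```python
import numpy as np

# Toy multi-start Gauss-Newton sketch (NOT the bearings problem): fit the
# single parameter k in the model y(t) = exp(-k*t).  All names illustrative.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 20)
k_true = 1.3
y = np.exp(-k_true * t)                  # noise-free data for a clean check

def gauss_newton(k, iters=50):
    for _ in range(iters):
        r = y - np.exp(-k * t)           # residual vector
        J = t * np.exp(-k * t)           # dr/dk (Jacobian, one column)
        k = k - (J @ r) / (J @ J)        # normal-equations step
    return k

# Several random initial values guard against local minima / non-convergence;
# keep the best finite result by sum of squared residuals.
candidates = [gauss_newton(k0) for k0 in rng.uniform(0.3, 4.0, size=8)]
best = min((k for k in candidates if np.isfinite(k)),
           key=lambda k: np.sum((y - np.exp(-k * t)) ** 2))
```

Picking the best restart by residual norm, rather than trusting any single run, is the point of the multi-start strategy.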
I would compute the objective function as follows:
For each bearing measurement, apply a rotation matrix $R$ to the analytic expression for T such that the positive x-axis becomes the bearing ray. Then use ε = atan2(y, x) of the rotated coordinates as the measurement error whose sum of squares should be minimized.
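A minimal sketch of this residual for a single bearing measurement, assuming 2-D vectors A (start point), V (velocity), P (sensor position) and a measured bearing angle theta; the names are my own, not from the original problem statement.

```python
import math

def bearing_residual(A, V, t, P, theta):
    # Predicted target position relative to the sensor, T = A + V*t - P:
    x = A[0] + V[0] * t - P[0]
    y = A[1] + V[1] * t - P[1]
    # Rotate by -theta so the measured bearing ray becomes the positive x-axis.
    c, s = math.cos(theta), math.sin(theta)
    xr = c * x + s * y
    yr = -s * x + c * y
    # Angular deviation of the prediction from the bearing ray (zero on the ray).
    return math.atan2(yr, xr)
```

Summing the squares of this residual over all measurements gives the objective to hand to GNA/LMA.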
A problem with this approach can occur when the target passes closely by the sensor, because the errors introduced by the assumption that the target travels at constant speed are not modeled well (and because atan2 has a singularity at (0,0), which was not there for your linear approach). This can be fixed by treating T for each bearing measurement as unknown, and minimizing its deviation from the analytic expression for T. However, the random initial values should only be generated for A and V, and the initial values for T just computed with the analytic expression. (If this should really lead to convergence problems, then the initial values for T could also be randomly perturbed a bit.)
EDIT(*) When we treat T as unknown, we actually have to decide how to weight the errors of the measurements and the errors of our assumption against each other. If we assume that the errors of the bearing measurements are negligible relative to the assumption of constant velocity, we basically arrive again at the initial linear least squares formulation. It now reads $0$ = cross(B) · T and $0\leq$ B · T together with the fact that |T - (A+Vt-P)| should be minimized. This is identical to the initial formulation in case $0<$ B · T. Otherwise, we have to replace the distance from the line with the distance from the origin. (This is equivalent to your initial idea of using the distance from the ray in the objective function.) The theory of quadratic programming has more to say about how the switching between the different cases has to be done correctly, but I hope it shouldn't be too difficult for this specific case.
EDIT 2 The squared distance from the bearing ray can be written in the form $[cross(B)\cdot T]^2 + [\max(0,-B\cdot T)]^2$. I would probably use this together with my favorite Levenberg–Marquardt implementation, because then I don't have to worry about solving the normal equations myself. If I instead solved the "local" normal equations repeatedly until they no longer changed, this would be equivalent to using the Gauss–Newton method. I'm just not sure whether this is guaranteed to work, which is why I pointed to the quadratic programming solution (which is guaranteed to work).
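The squared ray distance above translates directly to code; here B is assumed to be a unit bearing vector and T the target position relative to the sensor, and the 2-D "cross" is the scalar $B_x T_y - B_y T_x$.

```python
def ray_sqdist(B, T):
    # Squared distance from point T to the ray {s*B : s >= 0} for unit B:
    # [cross(B,T)]^2 catches the perpendicular offset from the line, and
    # [max(0, -B.T)]^2 switches to distance-from-origin behind the sensor.
    cross = B[0] * T[1] - B[1] * T[0]
    dot = B[0] * T[0] + B[1] * T[1]
    return cross ** 2 + max(0.0, -dot) ** 2
```

For example, with B along the positive x-axis, a point at (3, 4) is at squared distance 16 (purely perpendicular), while (-3, 4) is behind the sensor and at squared distance 25, the squared distance from the origin.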
Your first question is hard to answer without going through the others, so I'll come back to it.
Your second question:
The "normal" vectors are vectors perpendicular to the lines in question. These are often used because, in an N-dimensional vector space, an equation of the form $\sum_{i=1}^N c_i x_i = d$ describes an N-1 dimensional object--in 3d, it describes a plane; in 4d, a 3d hyperplane, and so on. Normal vectors--the vectors orthogonal to these objects--generalize well to arbitrary dimensions.
In essence, when you add the equations of two lines (talking about the 2d case), you get another line, whose normal vector is the sum of the two original normal vectors.
Otherwise, I think your understanding is correct.
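A quick numeric check of the adding-equations picture, with two hypothetical lines written in normal-vector form $n \cdot p = d$:

```python
# Two lines n1.p = d1 and n2.p = d2; adding the equations gives the line
# (n1 + n2).p = d1 + d2, whose normal is the sum of the two normals.
n1, d1 = (1.0, 1.0), 2.0     # x + y = 2
n2, d2 = (1.0, -1.0), 0.0    # x - y = 0
n_sum = (n1[0] + n2[0], n1[1] + n2[1])
p = (1.0, 1.0)               # the common point of the two lines
lhs = n_sum[0] * p[0] + n_sum[1] * p[1]   # should equal d1 + d2
```

Any point on both original lines satisfies the summed equation, which is why elimination steps preserve the solution set.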
For your third question: As we've seen, it's not possible to characterize an arbitrary line just from its normal vector. Rather, you must have some third degree of freedom to characterize a line's offset from the origin. This is the justification for homogeneous coordinates and projective geometry, which I'll touch on in just a moment.
For your fourth question: You use an "alternating bilinear form" $\Psi$. This is a natural time to talk about exterior algebra and wedge products. Define the wedge product $e_1 \wedge e_2 = - e_2 \wedge e_1$ for two orthogonal vectors. A linear operator $\underline T$ obeys $\underline T(e_1) \wedge \underline T(e_2) = \lambda e_1 \wedge e_2$ in two dimensions (the left-hand side is also taken as the definition of $\underline T(e_1 \wedge e_2)$, which generalizes to higher dimensions). Anyway, the notation is different, but we're talking about the same math.
One way to interpret Cramer's rule is in the projective geometry I spoke of. As I said, lines with an arbitrary offset from the origin require 3 degrees of freedom to describe: a point that the line goes through (two degrees) and a direction (one degree). The natural way to work with this is in a three dimensional space. In $ax + by = c$, let $c$ go along the third axis, and the vector $a e_1 + b e_2 + c e_3$ goes normal to a plane through the origin. Take another such vector, $u e_1 + v e_2 + w e_3$, and take the cross product.
$$(a e_1 + b e_2 + c e_3) \times (u e_1 + v e_2 + w e_3) = (bw - cv) e_1 + (cu - aw) e_2 + (av - bu) e_3$$
Note that this is a homogeneous representation, so any nonzero overall factor can be rescaled away. Dividing through by the coefficient of $e_3$ generates exactly the terms of Cramer's rule, and the geometric interpretation is simple: we found the common line between the two planes (planes which, in this space, represent lines in the original space), whose direction must of course be perpendicular to the planes' normal vectors. Where the common line in the projective space intersects the projective plane is the ordinary point of intersection.
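A small sketch of this construction for intersecting two lines. Note that I use the homogeneous vector $(a, b, -c)$ for the line $ax + by = c$; the minus sign is my convention choice, made so that the result matches the Cramer's rule quotients exactly rather than up to sign.

```python
def intersect(line1, line2):
    # Lines given as (a, b, c) meaning a*x + b*y = c.
    a, b, c = line1
    u, v, w = line2
    # Cross product of the homogeneous vectors (a, b, -c) and (u, v, -w)
    # gives a homogeneous point (X, Y, W):
    X = b * (-w) - (-c) * v
    Y = (-c) * u - a * (-w)
    W = a * v - b * u      # this is the familiar Cramer denominator av - bu
    return X / W, Y / W    # divide out the e_3 coefficient
```

For instance, the lines $x + y = 2$ and $x - y = 0$ intersect at $(1, 1)$; when `W` is zero the lines are parallel and the homogeneous point lies "at infinity."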
There is a way to extend this to arbitrary dimensions (ones in which the cross product doesn't exist), but you're on the right track, talking about alternating bilinear forms.
I'm afraid I'm not really able to follow from this point (or even to answer your original question), but I think I can probe at your difficulty as follows: the idea of a system of linear equations being a linear map is, indeed, quite arbitrary (you can choose the ordering of the equations as they correspond to components however you like and add and subtract equations at will). It doesn't have a neat geometric interpretation, as far as I can see. Homogeneous coordinates and projective geometry, on the other hand, give very neat and clean geometric interpretations, something you can intuitively understand (or at least that I can).
I won't call this a complete answer, but perhaps delving into projective geometry and homogeneous coordinates will give you further insights into the problem of finding the intersections between lines (or the common lines between planes, etc.). It's the only method I use anymore. In particular, I highly recommend the geometric (Clifford) algebra approach to this stuff. Doran and Lasenby or Dorst, Fontijne, and Mann give excellent descriptions of projective geometry (and conformal geometry) using that formalism.
Best Answer
In order to show that $Ax+By=C$ is the general equation for a straight line, we need a definition of "straight line." You might be tempted to define a straight line as "a curve such that it is the path of shortest distance between any two points on itself," but showing that $Ax+By=C$ gives the shortest distance between any two points is a somewhat long story, so I hope you'll be satisfied with defining a straight line as "a path of constant slope," i.e., for any two points $(x_1,y_1)$ and $(x_2,y_2)$, the slope $\frac{y_2-y_1}{x_2-x_1}$ is constant (it depends on the line but not on the particular points). Now, every line has at least one point and a slope, and if we fix a point $(x_0,y_0)$ and a slope $m$, we must have the equation $$ m = \frac{y-y_0}{x-x_0}$$ which rearranges to $Ax+By=C$ form. The one exceptional case is the "infinite slope" vertical line $x=C$, which is already in the desired form. Conversely, if we have $Ax+By=C$, we can show the slope is constant: taking two points $(x_1,y_1)$ and $(x_2,y_2)$, we get $A(x_2-x_1) + B(y_2-y_1)=0$, which rearranges to $\frac{y_2-y_1}{x_2-x_1} = -\frac{A}{B}$, a constant.
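The constant-slope claim is easy to check numerically; the coefficients below are illustrative, and the check assumes $B \neq 0$ (the non-vertical case).

```python
# For B != 0, Ax + By = C gives y = (C - A*x)/B, so the slope between
# ANY two distinct points on the line is the constant -A/B.
A, B, C = 2.0, 3.0, 6.0      # illustrative coefficients

def y(x):
    return (C - A * x) / B

def slope(x1, x2):
    return (y(x2) - y(x1)) / (x2 - x1)
```

Evaluating `slope` on any two distinct point pairs returns the same value, $-A/B$.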