Cosine distance with latitude and longitude

circular statistics, cosine similarity, spatial

I have several features I'd like to use for computing cosine similarity between rows in a data set. However, two of them are latitude and longitude.

Apart from the fact that it's not the "correct" way to measure the distance between points on the surface of the Earth, is there any pressing reason I can't use them along with other features to compute cosine similarity between two rows of a data set?

Best Answer

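Because latitude and longitude are circular coordinates, some care is needed: longitude wraps around at ±180°, so two nearby points on either side of the antimeridian can have very different coordinate values.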
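To illustrate the problem, here is a minimal Python sketch. The helper `cosine_similarity` and the two sample points are made up for illustration; they are not from the question. The points lie roughly 22 km apart, yet treating raw degrees as features makes them look nearly opposite.

```python
import math

def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length sequences."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

# Hypothetical points on either side of the antimeridian, as (lat, lon) in degrees.
p = (10.0, 179.9)
q = (10.0, -179.9)

# Raw degrees used directly as features: the sign flip in longitude dominates.
raw_similarity = cosine_similarity(p, q)
print(raw_similarity)  # strongly negative, although the points are neighbors
```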

A simple solution is to convert them to geocentric Cartesian coordinates. For most purposes the usual conversion from spherical to Cartesian coordinates works just fine. A highly accurate calculation is included in my post at https://gis.stackexchange.com/a/34534/664; the key code is this:

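(* Angles are in radians; {a, b} are the planet's semi-axes. *)
ellipsoidToCartesian[{lon_, lat_}, {a_, b_}] := 
    {a Cos[lat] Cos[lon], a Cos[lat] Sin[lon], b Sin[lat]};
(* Inverse mapping; the two-argument ArcTan[x, y] takes x first. *)
cartesianToEllipsoid[{x_, y_, z_}, {a_, b_}] := 
    {ArcTan[x, y], ArcTan[Norm[{x, y}]/a, z/b]};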

(This is written in Mathematica. It serves as pseudocode for implementation in other environments, but pay attention to the order of arguments to ArcTan.)
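As one possible translation, the two functions might look like this in Python. This is a sketch rather than a reference implementation: the snake_case names and the `WGS84` constant are my own choices, and angles are assumed to be in radians. Note the argument swap: Mathematica's `ArcTan[x, y]` corresponds to `math.atan2(y, x)`.

```python
import math

# Semi-axes in meters (the WGS84 values quoted in this answer).
WGS84 = (6378137.0, 6356752.314245)

def ellipsoid_to_cartesian(lon, lat, axes=WGS84):
    """Mirror of ellipsoidToCartesian; lon and lat are in radians."""
    a, b = axes
    return (a * math.cos(lat) * math.cos(lon),
            a * math.cos(lat) * math.sin(lon),
            b * math.sin(lat))

def cartesian_to_ellipsoid(x, y, z, axes=WGS84):
    """Mirror of cartesianToEllipsoid; returns (lon, lat) in radians."""
    a, b = axes
    # ArcTan[x, y] in Mathematica is atan2(y, x) in Python.
    lon = math.atan2(y, x)
    # ArcTan[Norm[{x, y}]/a, z/b] becomes atan2(z/b, hypot(x, y)/a).
    lat = math.atan2(z / b, math.hypot(x, y) / a)
    return lon, lat

# Round trip for an arbitrary sample point (values chosen for illustration).
lon0, lat0 = math.radians(-122.3), math.radians(47.6)
lon1, lat1 = cartesian_to_ellipsoid(*ellipsoid_to_cartesian(lon0, lat0))
```

The round trip should return the starting angles up to floating-point error, which is a quick way to check that the `atan2` arguments ended up in the right order.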

The values of a and b are the planet's semi-axes. For modern Earth coordinate systems, such as WGS84, $a = 6\,378\,137.0$ and $b \approx 6\,356\,752.314\,245$ meters. When adopting a spherical approximation, use the authalic radius of $6\,371\,007.2$ meters, but feel free to rescale this radius if you wish to adjust the relative weight of your coordinates within the overall analysis.
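A minimal Python sketch of the spherical approach, with the radius acting as a weight, could look like the following. Everything here is illustrative: the function names, the two made-up feature values, and the default `radius=1.0` are assumptions, and the other features are assumed to be already scaled to a comparable range.

```python
import math

def sphere_to_cartesian(lon_deg, lat_deg, radius=1.0):
    """Spherical approximation; radius doubles as the weight of the location."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return [radius * math.cos(lat) * math.cos(lon),
            radius * math.cos(lat) * math.sin(lon),
            radius * math.sin(lat)]

def cosine_similarity(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) *
                  math.sqrt(sum(q * q for q in v)))

def row_vector(features, lon_deg, lat_deg, radius=1.0):
    """Replace (lon, lat) with three Cartesian features appended to the rest."""
    return list(features) + sphere_to_cartesian(lon_deg, lat_deg, radius)

# Hypothetical rows: identical non-spatial features, locations that are
# close together but sit on opposite sides of the antimeridian.
row_a = row_vector([0.2, 0.7], lon_deg=179.9, lat_deg=10.0)
row_b = row_vector([0.2, 0.7], lon_deg=-179.9, lat_deg=10.0)
sim = cosine_similarity(row_a, row_b)
print(sim)  # close to 1, as expected for neighboring points
```

Raising `radius` makes location count for more in the similarity, and lowering it makes the other features dominate.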

If you also have height or depth coordinates relative to the planet's surface, refer to the linked post for details.