Solved – Encoding Angle Data for Neural Network

circular-statistics, neural-networks

I am training a neural network (details not important) where the target data is a vector of angles (between 0 and 2*pi). I am looking for advice on how to encode this data. Here is what I am currently trying (with limited success):

1) 1-of-C encoding: I bin the set of possible angles into 1000 or so discrete angles and then indicate a particular angle by putting a 1 at the relevant index. The problem with this is that the network simply learns to output all 0's (since this is almost exactly correct).

2) Simple scaling: I scaled the network's output range ([0,1]) to [0, 2*pi]. The problem here is that angles naturally have a circular topology (i.e. 0.0001 and 2*pi are actually right next to each other). With this type of encoding, that information is lost.

Any suggestions would be appreciated!

Best Answer

Introduction

I find this question really interesting. I assume someone has put out a paper on it, but it's my day off, so I don't want to go chasing references.

So we could consider it as a representation/encoding of the output, which is what I do in this answer. I still think there may be a better way, where you just use a slightly different loss function (perhaps sum of squared differences, using subtraction modulo $2\pi$).

But onwards with the actual answer.

Method

I propose that an angle $\theta$ be represented as a pair of values, its sine and its cosine.

So the encoding function is: $\qquad\qquad\quad\theta \mapsto (\sin(\theta), \cos(\theta))$
and the decoding function is: $\qquad(y_1,y_2) \mapsto \operatorname{atan2}(y_1,y_2)$
where atan2 is the two-argument inverse tangent, which preserves the direction (quadrant) of the angle.
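As a minimal sketch of this pair in Julia (the language used for the rest of the experiment; the function names here are mine, not from the notebook):

```julia
# Encode an angle θ ∈ [0, 2π) as its (sin, cos) pair.
encode(θ) = (sin(θ), cos(θ))

# Decode a (possibly noisy) pair back to an angle. atan(y, x) is the
# two-argument arctangent (atan2 in older Julia versions); mod(·, 2π)
# maps its (-π, π] output back onto [0, 2π).
decode(y1, y2) = mod(atan(y1, y2), 2π)
```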

You could, in theory, equivalently work directly with the angles if your tool supported atan2 as a layer function (taking exactly 2 inputs and producing 1 output). TensorFlow does this now, and supports gradient descent through it, though it is not intended for this use. I investigated using out = atan2(sigmoid(ylogit), sigmoid(xlogit)) with a loss function min((pred - out)^2, (pred - out - 2pi)^2). I found that it trained far worse than using outs = tanh(ylogit), outc = tanh(xlogit) with a loss function 0.5*((sin(pred) - outs)^2 + (cos(pred) - outc)^2), which I think can be attributed to the gradient of atan2 being discontinuous.
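To spell out the two loss functions being compared, here is a hedged sketch in plain Julia rather than TensorFlow (variable names are mine; `truth` is the target angle):

```julia
# Loss on the decoded angle itself, as with the atan2 output layer;
# `out` is the network's decoded angle output.
angle_loss(truth, out) = min((truth - out)^2, (truth - out - 2π)^2)

# Loss on the (sin, cos) pair, as with the two tanh outputs
# `outs` and `outc`.
sincos_loss(truth, outs, outc) =
    0.5 * ((sin(truth) - outs)^2 + (cos(truth) - outc)^2)
```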

My testing here runs it as a preprocessing function.

To evaluate this I defined a task:

Given a black and white image representing a single line on a blank background, output the angle that line makes with the "positive x-axis".

I implemented a function to randomly generate these images, with lines at random angles (NB: earlier versions of this post used random slopes, rather than random angles. Thanks to @Ari Herman for pointing it out; it is now fixed). I constructed several neural networks to evaluate their performance on the task. The full details of the implementation are in this Jupyter notebook. The code is all in Julia, and I make use of the Mocha neural network library.
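Roughly, the generator looks like the sketch below (not the notebook's actual code; the 101×101 size and the center-to-edge line are described in the experimental setup further down):

```julia
# Sketch of the image generator: an n×n white image with a single
# black line from the center out to the edge, at a random angle θ.
function random_line_image(; n = 101)
    img = ones(n, n)                   # 1.0 = white background
    θ = 2π * rand()                    # random angle in [0, 2π)
    c = (n + 1) ÷ 2                    # center pixel index
    for r in 0:0.5:n                   # step outwards along the ray
        x = round(Int, c + r * cos(θ))
        y = round(Int, c + r * sin(θ))
        (1 <= x <= n && 1 <= y <= n) || break   # stop at the edge
        img[y, x] = 0.0                # 0.0 = black line pixel
    end
    return img, θ
end
```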

For comparison, I present it against the alternative methods of scaling to [0,1], and of putting the angles into 500 bins and using a soft-label softmax. I am not particularly happy with the last, and feel I need to tweak it, which is why, unlike the others, I only trial it for 1,000 iterations, whereas the other two were run for both 1,000 and 10,000.

Experimental Setup

Images were $101\times101$ pixels, with the line commencing at the center and going to the edge. There was no noise etc. in the image, just a "black" line on a white background.

For each trial, 1,000 training and 1,000 test images were generated randomly.

The evaluation network had a single hidden layer of width 500. Sigmoid neurons were used in the hidden layer.

It was trained by Stochastic Gradient Descent, with a fixed learning rate of 0.01 and a fixed momentum of 0.9.

No regularization or dropout was used, nor any kind of convolution etc. It was a simple network, which I hope suggests that these results will generalize.
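As a rough plain-Julia sketch of that architecture (the notebook itself uses Mocha, so the names and the output activation here are my assumptions, not the original code):

```julia
# Sketch only: 101×101 image in, one sigmoid hidden layer of width 500,
# two outputs for the predicted (sin, cos) pair. tanh on the output is
# an assumption, chosen because sin and cos lie in [-1, 1].
sigmoid(x) = 1 / (1 + exp(-x))

W1, b1 = 0.01 .* randn(500, 101 * 101), zeros(500)
W2, b2 = 0.01 .* randn(2, 500),         zeros(2)

predict(img) = tanh.(W2 * sigmoid.(W1 * vec(img) .+ b1) .+ b2)
```

Training these weights would then use plain SGD with the learning rate and momentum given above.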

It is very easy to tweak these parameters in the test code, and I encourage people to do so (and to look for bugs in the test).

Results

My results are as follows:

| Metric                 | 500 bins (1,000 iter) | Scaled to 0-1 (1,000 iter) | Sin/Cos (1,000 iter) | Scaled to 0-1 (10,000 iter) | Sin/Cos (10,000 iter) |
|------------------------|--------------|----------------|--------------|----------------|--------------|
| mean_error             | 0.4711263342 | 0.2225284486   | 2.099914718  | 0.1085846429   | 2.1036656318 |
| std(errors)            | 1.1881991421 | 0.4878383767   | 1.485967909  | 0.2807570442   | 1.4891605068 |
| minimum(errors)        | 1.83E-006    | 1.82E-005      | 9.66E-007    | 1.92E-006      | 5.82E-006    |
| median(errors)         | 0.0512168533 | 0.1291033982   | 1.8440767072 | 0.0562908143   | 1.8491085947 |
| maximum(errors)        | 6.0749693965 | 4.9283551248   | 6.2593307366 | 3.735884823    | 6.2704853962 |
| accuracy               | 0.00%        | 0.00%          | 0.00%        | 0.00%          | 0.00%        |
| accuracy_to_point001   | 2.10%        | 0.30%          | 3.70%        | 0.80%          | 12.80%       |
| accuracy_to_point01    | 21.90%       | 4.20%          | 37.10%       | 8.20%          | 74.60%       |
| accuracy_to_point1     | 59.60%       | 35.90%         | 98.90%       | 72.50%         | 99.90%       |

Where I refer to error, this is the absolute value of the difference between the angle output by the neural network and the true angle. So the mean error (for example) is the average of this difference over the 1,000 test cases, etc. (I am not sure whether I should be rescaling it, so that an error of, say, $\frac{7\pi}{4}$ counts as an error of $\frac{\pi}{4}$.)
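In code, the raw error used here, and the wrapped variant I am unsure about, would look something like this (function names are mine, not from the notebook):

```julia
# Raw error: plain absolute difference between output and true angle.
raw_error(out, truth) = abs(out - truth)

# Wrapped variant, so that an error of 7π/4 counts as π/4.
wrapped_error(out, truth) = π - abs(π - mod(abs(out - truth), 2π))
```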

I also present the accuracy at various levels of granularity. The accuracy is the portion of test cases the network got correct. So accuracy_to_point01 means a case was counted as correct if the output was within 0.01 of the true angle. None of the representations got any perfect results, but that is not at all surprising given how floating point math works.
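Concretely, each accuracy figure is a proportion of the test set, along the lines of:

```julia
# Portion of test cases whose error is below the tolerance `tol`,
# e.g. tol = 0.01 for accuracy_to_point01.
accuracy_to(errors, tol) = count(e -> e < tol, errors) / length(errors)
```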

If you take a look at the history of this post you will see the results do have a bit of noise to them, coming out slightly different each time I rerun it. But the general order and scale of the values remains the same, allowing us to draw some conclusions.

Discussion

Binning with softmax performs by far the worst; as I said, I am not sure I didn't screw up something in the implementation. It does perform marginally above the guess rate though: if it were just guessing, we would be getting a mean error of $\pi$.

The sin/cos encoding performs significantly better than the scaled 0-1 encoding. The improvement is such that at 1,000 training iterations, sin/cos is performing about 3 times better on most metrics than scaling is at 10,000 iterations.

I think this is, in part, related to improved generalization, as both were getting fairly similar mean squared error on the training set, at least once 10,000 iterations had been run.

There is certainly an upper limit on the best possible performance at this task, given that the angle could be more or less any real number, but not all such angles produce different lines at the resolution of $101\times101$ pixels. So, since for example the angles 45.0 and 45.0000001 both map to the same image at that resolution, no method will ever get both perfectly correct.

It also seems likely that to move beyond this performance on an absolute scale, a better neural network is needed, rather than the very simple one outlined above in the experimental setup.

Conclusion

It seems that the sin/cos representation is by far the best of the representations I investigated here. This does make sense, in that the values vary smoothly as you move around the circle. I also like that the inverse can be done with arctan2, which is elegant.

I believe the task presented is sufficient to pose a reasonable challenge for the network. Though I guess really it is just learning to do curve fitting to $f(x)=\frac{y_1}{y_2} x$, so perhaps it is too easy. And perhaps worse, it may be favouring the paired representation. I don't think it is, but it is getting late here, so I might have missed something. I invite you again to look over my code, and to suggest improvements or alternative tasks.