It looks like you are using an ANN for function approximation where your target is a continuous variable(?). If so, the uniformity of the input data's spread over all the dimensions, and their range and scale, can have a strong impact on how well your ANN works. Even with dense coverage of datapoints over the input space, an ANN can still struggle to make good predictions. (I have actually had to smooth the input data, e.g. with cubic splines, to improve the uniformity of datapoint coverage over the range and scale of the inputs before the MSE stabilized and decreased substantially.) So straightforward use of raw input data may not help the situation.

If you were performing classification, you would need a softmax function on the output side -- but it looks like you are merely performing function approximation. You might also try a linear (identity) activation on the output side, since the distribution of your target $y$-variable $[y=f(x_1,x_2,\ldots,x_p)]$ can affect the MSE as well. You could also consider an RBF network, or SOM-ANN, which will reduce the dimensionality of your inputs.

Lastly, correlation between input features degrades learning speed, since an ANN will waste time learning the correlation between inputs. This is why many use PCA on the inputs first, and then feed in, e.g., the 10 PCs associated with the greatest eigenvalues -- effectively decorrelating the features so they are orthogonal.
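The PCA step above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the dimensions, noise level, and `k=10` are assumptions, not anything from a specific dataset); it keeps the top 10 principal components and checks that their scores are decorrelated:

```python
import numpy as np

# Sketch: decorrelate correlated input features via PCA, keeping the
# top-k principal components (k=10, as suggested above).
# The synthetic data and all sizes here are illustrative assumptions.
rng = np.random.default_rng(0)

n, p, k = 500, 25, 10
latent = rng.normal(size=(n, 5))
mixing = rng.normal(size=(5, p))
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))  # strongly correlated features

Xc = X - X.mean(axis=0)                  # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:k].T                   # project onto the top-k PCs

# The PC scores are (numerically) uncorrelated:
# the off-diagonal covariance entries are ~0.
cov = np.cov(scores, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())  # close to zero
```

Feeding `scores` to the network instead of `X` gives it orthogonal inputs, which is the point being made above.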
Update (4/27/2016):
A way to more evenly distribute randomly sampled input points for a function approximation problem with an ANN is to employ Latin hypercube sampling (LHS) over $\{x,y\}$ in order to predict $\hat{z}$. To begin, split the range of $x$ and the range of $y$ into $M=10$ uniformly spaced, non-overlapping bins each -- the result is a $10\times 10$ square grid with 100 elements (cells) -- call this a ``range grid.'' Next, sample one of the 100 cells, draw a random pair of $\{x,y\}$ values from within that cell's bin walls for $x$ and $y$, and then block that row and column from further selection. Then select another cell from a row and column that haven't been sampled yet, and draw another random $\{x,y\}$ pair from within it. Continue until every row and every column has been selected exactly once. The resulting 10 samples of $\{x,y\}$ provide pairs of points with no row or column overlap, which is a good way to feed $\{x,y\}$ to an ANN for an $\{x,y,z\}$ problem, or a multiple-feature problem $\{x_1,x_2,\ldots,x_p\}$.
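The procedure above can be sketched compactly with permutations (the unit ranges for $x$ and $y$ are an illustrative assumption):

```python
import numpy as np

# Sketch of Latin hypercube sampling over (x, y) as described above:
# M bins per axis, one sample per selected cell, each row and each
# column of the range grid used exactly once.
def latin_hypercube(M, x_range=(0.0, 1.0), y_range=(0.0, 1.0), seed=None):
    rng = np.random.default_rng(seed)
    rows = rng.permutation(M)            # each row index used exactly once
    cols = rng.permutation(M)            # each column index used exactly once
    x_edges = np.linspace(*x_range, M + 1)
    y_edges = np.linspace(*y_range, M + 1)
    pts = []
    for r, c in zip(rows, cols):
        x = rng.uniform(x_edges[c], x_edges[c + 1])  # draw inside the cell walls
        y = rng.uniform(y_edges[r], y_edges[r + 1])
        pts.append((x, y))
    return np.array(pts)

pts = latin_hypercube(10, seed=0)
print(pts.shape)  # (10, 2)
```

Because each row and column index appears once, every one of the 10 bins along each axis contains exactly one sample.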
If you want 100 pairs of $\{x,y\}$, you can start with the sequence $\{1,2,3,4,5,6,7,8,9,10\}$. Next, identify 10 permutations of this sequence to generate a $10 \times 10$ ``row'' matrix $\mathbf{R}$:
$\{3,2,10,4,1,5,7,9,8,6\}$, $\{2,4,3,1,5,6,10,8,9,7\}$,...,$\{9,1,2,10,5,6,7,8,3,4\}$.
which will give 100 integer values for sampling rows.
Next, generate a $10 \times 10$ ``column'' matrix $\mathbf{C}$ using another set of 10 different permutations:
$\{5,8,10,4,9,3,7,1,2,6\}$, $\{3,7,4,1,5,6,10,8,9,2\}$,...,$\{6,9,2,7,5,1,10,8,3,4\}$
which will provide 100 integers for sampling columns.
The first random draw using the above matrices would be from row 3 and column 5 of the original $10\times 10$ ``range grid'' of 100 bins for $x$ and $y$. This is another form of LHS.
If you need more than 100 $\{x,y\}$ pairs, just increase the number of permutations used -- and don't be stingy, as there are $10!$ possible permutations.
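A sketch of this permutation-matrix variant (unit ranges for $x$ and $y$ are again an illustrative assumption): 10 row permutations and 10 column permutations of $\{0,\ldots,9\}$ yield 100 (row, column) cells, and one $\{x,y\}$ pair is drawn inside each selected cell.

```python
import numpy as np

# Build the "row" matrix R and "column" matrix C from 10 random
# permutations each, pair them element-wise to pick 100 cells of the
# 10x10 range grid, and draw one (x, y) pair inside each cell.
rng = np.random.default_rng(0)
M = 10
R = np.array([rng.permutation(M) for _ in range(M)])  # 10x10 "row" matrix
C = np.array([rng.permutation(M) for _ in range(M)])  # 10x10 "column" matrix

edges = np.linspace(0.0, 1.0, M + 1)
pairs = []
for r, c in zip(R.ravel(), C.ravel()):
    x = rng.uniform(edges[c], edges[c + 1])
    y = rng.uniform(edges[r], edges[r + 1])
    pairs.append((x, y))
pairs = np.array(pairs)
print(pairs.shape)  # (100, 2)
```

Since each permutation contains every index once, each of the 10 bins along each axis receives exactly 10 of the 100 samples.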
Best Answer
Introduction
I find this question really interesting. I assume someone has put out a paper on it, but it's my day off, so I don't want to go chasing references.
So we can consider this as a representation/encoding of the output, which is what I do in this answer. I still think there may be a better way, where you just use a slightly different loss function (perhaps a sum of squared differences, with subtraction taken modulo $2\pi$).
But onwards with the actual answer.
Method
I propose that an angle $\theta$ be represented as a pair of values, its sine and its cosine.
So the encoding function is: $\qquad\qquad\quad\theta \mapsto (\sin(\theta), \cos(\theta))$
and the decoding function is: $\qquad(y_1,y_2) \mapsto \operatorname{atan2}(y_1,y_2)$

where $\operatorname{atan2}$ is the two-argument inverse tangent, which preserves the correct quadrant (direction) of the angle.
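The encode/decode pair above is a two-liner; here it is using Python's standard-library `math.atan2`, which returns an angle in $(-\pi, \pi]$ in the correct quadrant:

```python
import math

# Encode an angle as its (sin, cos) pair; decode with atan2(sin, cos).
def encode(theta):
    return (math.sin(theta), math.cos(theta))

def decode(y1, y2):
    return math.atan2(y1, y2)

theta = 2.5
y1, y2 = encode(theta)
print(decode(y1, y2))  # 2.5 (up to floating point)
```

Note the argument order: `atan2` takes the sine-like component first, matching the $(y_1, y_2) = (\sin\theta, \cos\theta)$ encoding.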
You could, in theory, equivalently work directly with the angles if your tool supported `atan2` as a layer function (taking exactly 2 inputs and producing 1 output). TensorFlow does this now, and supports gradient descent on it, though it is not intended for this use. I investigated using `out = atan2(sigmoid(ylogit), sigmoid(xlogit))` with the loss function `min((pred - out)^2, (pred - out - 2pi)^2)`. I found that it trained far worse than using `outs = tanh(ylogit), outc = tanh(xlogit)` with the loss function `0.5*((sin(pred) - outs)^2 + (cos(pred) - outc)^2)`. I think this can be attributed to the gradient of `atan2` being discontinuous. My testing here runs it as a preprocessing function.
To evaluate this I defined a task: given a black-and-white image of a single line, output the angle that line makes.
I implemented a function to randomly generate these images, with lines at random angles (NB: earlier versions of this post used random slopes, rather than random angles. Thanks to @Ari Herman for pointing it out. It is now fixed). I constructed several neural networks to evaluate their performance on the task. The full details of the implementation are in this Jupyter notebook. The code is all in Julia, and I make use of the Mocha neural network library.
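For readers who don't want to open the notebook, image generation of this kind can be sketched as follows. This is my own reconstruction in Python, not the notebook's Julia code, and it uses 1 for the line on a 0 background (the post's black-on-white convention inverted for simplicity):

```python
import numpy as np

# Generate a size x size image with a line commencing at the center and
# going to the edge, at angle theta (a reconstruction of the task setup).
def line_image(theta, size=101):
    img = np.zeros((size, size))
    cx = cy = size // 2
    for t in np.linspace(0.0, size / 2, 4 * size):  # march from center to edge
        x = int(round(cx + t * np.cos(theta)))
        y = int(round(cy + t * np.sin(theta)))
        if 0 <= x < size and 0 <= y < size:
            img[y, x] = 1.0
    return img

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi)  # a random angle (not a random slope)
img = line_image(theta)
print(img.shape)  # (101, 101)
```

Sampling `theta` uniformly on $[0, 2\pi)$, rather than sampling a slope, is the fix noted above.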
For comparison, I present it against two alternative methods: scaling the angle to $[0,1]$, and putting it into 500 bins and using soft-label softmax. I am not particularly happy with the last, and feel I need to tweak it, which is why, unlike the others, I only trialled it for 1,000 iterations, versus the other two, which were run for both 1,000 and 10,000 iterations.
Experimental Setup
Images were $101\times101$ pixels, with the line commencing at the center and going to the edge. There was no noise etc. in the image; just a "black" line on a white background.

For each trial, 1,000 training and 1,000 test images were generated randomly.
The evaluation network had a single hidden layer of width 500. Sigmoid neurons were used in the hidden layer.
It was trained by Stochastic Gradient Descent, with a fixed learning rate of 0.01 and a fixed momentum of 0.9.
No regularization or dropout was used, nor any kind of convolution etc. -- a simple network, which I hope suggests that these results will generalize.
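The update rule this setup implies can be written out in a few lines. The sketch below is my own NumPy reconstruction (the actual experiments used Julia + Mocha): one sigmoid hidden layer of width 500, tanh outputs for the sin/cos pair, MSE loss, and SGD with learning rate 0.01 and momentum 0.9. The random batch and targets are stand-ins so the mechanics are self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 101 * 101, 500, 2     # flattened image -> (sin, cos)
W1 = rng.normal(0, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_hidden, n_out)); b2 = np.zeros(n_out)
velocity = [np.zeros_like(p) for p in (W1, b1, W2, b2)]
lr, momentum = 0.01, 0.9

X = rng.normal(size=(32, n_in))               # stand-in image batch
T = rng.uniform(-1, 1, (32, n_out))           # stand-in (sin, cos) targets

def loss():
    Y = np.tanh(sigmoid(X @ W1 + b1) @ W2 + b2)
    return float(np.mean((Y - T) ** 2))

loss_before = loss()
for _ in range(50):
    H = sigmoid(X @ W1 + b1)                  # hidden layer, sigmoid units
    Y = np.tanh(H @ W2 + b2)                  # tanh outputs for sin/cos
    dY = (Y - T) * (1 - Y ** 2) / len(X)      # MSE grad (constants folded into lr)
    dH = (dY @ W2.T) * H * (1 - H)            # backprop through sigmoid
    grads = [X.T @ dH, dH.sum(0), H.T @ dY, dY.sum(0)]
    for p, g, v in zip([W1, b1, W2, b2], grads, velocity):
        v *= momentum
        v -= lr * g
        p += v                                # momentum SGD update, in place
loss_after = loss()
print(loss_before, loss_after)
```

Even on random stand-in data the loss drops over the 50 updates, which is all this sketch is meant to show.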
It is very easy to tweak these parameters in the test code, and I encourage people to do so. (and look for bugs in the test).
Results
My results are as follows:
Where I refer to error, this is the absolute value of the difference between the angle output by the neural network and the true angle. So the mean error (for example) is the average of this difference over the 1,000 test cases. I am not sure whether I should be rescaling it so that an error of, say, $\frac{7\pi}{4}$ counts the same as an error of $\frac{\pi}{4}$.
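If one did want that rescaling, the usual approach is a "wrapped" angular error: take the difference modulo $2\pi$ and fold it into $[0, \pi]$. A minimal sketch (the function name is my own):

```python
import math

# Wrapped angular error: an error of 7*pi/4 counts the same as pi/4.
def angular_error(pred, true):
    d = abs(pred - true) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

print(angular_error(7 * math.pi / 4, 0.0))  # pi/4, approximately 0.7854
```

Using this metric instead of the raw absolute difference would only shrink the reported errors, so the comparisons between representations below would stand.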
I also present the accuracy at various levels of granularity, the accuracy being the proportion of test cases the network got correct. So `accuracy_to_point01` means that a case was counted as correct if the output was within 0.01 of the true angle. None of the representations got any perfect results, but that is not at all surprising given how floating-point math works. If you take a look at the history of this post you will see the results have a bit of noise to them, coming out slightly different each time I rerun it; but the general order and scale of the values remains the same, allowing us to draw some conclusions.
Discussion
Binning with softmax performs by far the worst; as I said, I am not sure I didn't screw something up in the implementation. It does perform marginally above the guess rate, though -- if it were just guessing, we would expect a mean error of $\pi$.
The sin/cos encoding performs significantly better than the scaled 0-1 encoding -- to the extent that at 1,000 training iterations, sin/cos is performing about 3 times better on most metrics than scaling is at 10,000 iterations.
I think this is partly related to improved generalization, as both were getting fairly similar mean squared error on the training set, at least once 10,000 iterations were run.
There is certainly an upper limit on the best possible performance at this task, since the angle can be more or less any real number, but not all such angles produce different lines at a resolution of $101\times101$ pixels. So, for example, since the angles 45.0 and 45.0000001 both map to the same image at that resolution, no method will ever get both perfectly correct.
It also seems likely that to move beyond this performance on an absolute scale, a better neural network is needed than the very simple one outlined in the experimental setup above.
Conclusion
It seems that the sin/cos representation is by far the best of the representations I investigated here. This makes sense, in that it varies smoothly as you move around the circle. I also like that the inverse can be done with arctan2, which is elegant.
I believe the task presented gives the network a reasonable challenge. Though I guess it is really just learning to do curve fitting to $f(x)=\frac{y_1}{y_2} x$, so perhaps it is too easy. And perhaps worse, it may be favouring the paired representation. I don't think it is, but it is getting late here, so I might have missed something. I invite you again to look over my code, and to suggest improvements or alternative tasks.