Introduction
I find this question really interesting. I assume someone has put out a paper on it, but it's my day off, so I don't want to go chasing references.
So we could consider it as a representation/encoding of the output, which I do in this answer.
I still think there is a better way, where you can just use a slightly different loss function
(perhaps the sum of squared differences, using subtraction modulo $2\pi$).
But onwards with the actual answer.
Method
I propose that an angle $\theta$ be represented as a pair of values, its sine and its cosine.
So the encoding function is: $\qquad\qquad\quad\theta \mapsto (\sin(\theta), \cos(\theta))$
and the decoding function is: $\qquad(y_1,y_2) \mapsto \arctan\!2(y_1,y_2)$
(where $\arctan\!2$ is the two-argument inverse tangent, which preserves direction in all quadrants).
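As a minimal sketch of this pair of functions (in NumPy rather than the Julia of my notebook; function names are my own):

```python
import numpy as np

def encode(theta):
    # Represent an angle by its sine and cosine pair.
    return np.sin(theta), np.cos(theta)

def decode(y1, y2):
    # Recover the angle with the two-argument arctangent; result lies in (-pi, pi].
    return np.arctan2(y1, y2)
```

Note that the decoded angle always lands in $(-\pi, \pi]$, so angles outside that range come back wrapped.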
You could, in theory, equivalently work directly with the angles if your toolkit supported atan2
as a layer function (taking exactly 2 inputs and producing 1 output).
TensorFlow does support this now, including gradient descent through it, though it was not intended for this use.
I investigated using `out = atan2(sigmoid(ylogit), sigmoid(xlogit))`
with the loss function `min((pred - out)^2, (pred - out - 2pi)^2)`.
I found that it trained far worse than
using `outs = tanh(ylogit), outc = tanh(xlogit)`
with the loss function `0.5*((sin(pred) - outs)^2 + (cos(pred) - outc)^2)`,
which I think can be attributed to the gradient of atan2 being discontinuous.
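For concreteness, the two losses above can be sketched as follows (NumPy, scalar case; `pred` is the target angle, and the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def atan2_loss(pred, ylogit, xlogit):
    # Decode via atan2 applied to sigmoid outputs, then take the smaller
    # of the two wrapped squared errors.
    out = np.arctan2(sigmoid(ylogit), sigmoid(xlogit))
    return min((pred - out) ** 2, (pred - out - 2 * np.pi) ** 2)

def sincos_loss(pred, ylogit, xlogit):
    # Match the tanh outputs against the sine and cosine of the target angle.
    outs, outc = np.tanh(ylogit), np.tanh(xlogit)
    return 0.5 * ((np.sin(pred) - outs) ** 2 + (np.cos(pred) - outc) ** 2)
```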
In my testing here, the encoding is applied as a preprocessing step.
To evaluate this I defined a task:
Given a black and white image representing a single line on a blank background,
output the angle that line makes with the positive x-axis.
I implemented a function to randomly generate these images, with lines at random angles (NB: earlier versions of this post used random slopes, rather than random angles. Thanks to @Ari Herman for pointing it out. It is now fixed).
I constructed several neural networks to evaluate their performance on the task. The full details of the implementation are in this Jupyter notebook.
The code is all in Julia, and I make use of the Mocha neural network library.
For comparison, I present it against the alternative methods of scaling the angle to $[0,1]$,
and of putting it into 500 bins and using a soft-label softmax.
I am not particularly happy with the last, and feel I need to tweak it,
which is why, unlike the others, I only trialled it for 1,000 iterations, whereas the other two were run for both 1,000 and 10,000.
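The post does not spell out the soft-label construction, so the following is only one plausible sketch of the binned encoding (the triangular spread over neighbouring bins is my guess, and all names are mine):

```python
import numpy as np

N_BINS = 500

def soft_label(theta, width=2):
    # Spread probability mass over bins neighbouring the one containing theta,
    # with weights decaying linearly away from the centre bin (a guessed scheme).
    centre = int((theta % (2 * np.pi)) / (2 * np.pi) * N_BINS)
    label = np.zeros(N_BINS)
    for offset in range(-width, width + 1):
        label[(centre + offset) % N_BINS] = width + 1 - abs(offset)
    return label / label.sum()

def decode_bins(probs):
    # Decode by taking the centre of the most probable bin.
    return (np.argmax(probs) + 0.5) * 2 * np.pi / N_BINS
```

Decoding this way can never do better than the bin width, $2\pi/500 \approx 0.0126$, which may partly explain the poor `accuracy_to_point001` results for binning.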
Experimental Setup
Images were $101\times101$ pixels, with the line commencing at the center and going to the edge.
There was no noise etc in the image, just a "black" line, on a white background.
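A sketch of how such an image can be generated (NumPy; this is my own reconstruction, not the notebook's code, and it ignores image row-order conventions):

```python
import numpy as np

SIZE = 101

def line_image(theta):
    # Draw a black line from the centre of a white SIZE x SIZE grid
    # out to the edge, at angle theta to the positive x-axis.
    img = np.ones((SIZE, SIZE))  # white background
    centre = SIZE // 2
    # Step along the ray densely enough that every pixel it crosses is hit.
    for r in np.linspace(0.0, centre * np.sqrt(2), 4 * SIZE):
        x = int(round(centre + r * np.cos(theta)))
        y = int(round(centre + r * np.sin(theta)))
        if 0 <= x < SIZE and 0 <= y < SIZE:
            img[y, x] = 0.0  # black line pixel
    return img
```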
For each trial, 1,000 training and 1,000 test images were generated randomly.
The evaluation network had a single hidden layer of width 500.
Sigmoid neurons were used in the hidden layer.
It was trained by Stochastic Gradient Descent, with a fixed learning rate of 0.01 and a fixed momentum of 0.9.
No regularization, or dropout was used. Nor was any kind of convolution etc.
It is a simple network, which I hope suggests that these results will generalize.
It is very easy to tweak these parameters in the test code, and I encourage people to do so. (and look for bugs in the test).
Results
My results are as follows:
| Metric | 500 bins (1,000 iter) | Scaled to 0–1 (1,000 iter) | Sin/Cos (1,000 iter) | Scaled to 0–1 (10,000 iter) | Sin/Cos (10,000 iter) |
|------------------------|--------------|----------------|--------------|----------------|--------------|
| mean_error | 0.4711263342 | 0.2225284486 | 2.099914718 | 0.1085846429 | 2.1036656318 |
| std(errors) | 1.1881991421 | 0.4878383767 | 1.485967909 | 0.2807570442 | 1.4891605068 |
| minimum(errors) | 1.83E-006 | 1.82E-005 | 9.66E-007 | 1.92E-006 | 5.82E-006 |
| median(errors) | 0.0512168533 | 0.1291033982 | 1.8440767072 | 0.0562908143 | 1.8491085947 |
| maximum(errors) | 6.0749693965 | 4.9283551248 | 6.2593307366 | 3.735884823 | 6.2704853962 |
| accuracy | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| accuracy_to_point001 | 2.10% | 0.30% | 3.70% | 0.80% | 12.80% |
| accuracy_to_point01 | 21.90% | 4.20% | 37.10% | 8.20% | 74.60% |
| accuracy_to_point1 | 59.60% | 35.90% | 98.90% | 72.50% | 99.90% |
Where I refer to error, this is the absolute value of the difference between the angle output by the neural network and the true angle. So the mean error (for example) is the average of this difference over the 1,000 test cases.
I am not sure that I should not be rescaling it, by making an error of say $\frac{7\pi}{4}$ equal to an error of $\frac{\pi}{4}$.
I also present the accuracy at various levels of granularity.
The accuracy is the proportion of test cases the network got correct.
So accuracy_to_point01 means that the output was counted as correct if it was within 0.01 of the true angle.
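These metrics can be sketched as follows (NumPy; `wrapped=True` gives the rescaled variant I was unsure about, where $\frac{7\pi}{4}$ counts as $\frac{\pi}{4}$; names are mine):

```python
import numpy as np

def errors(pred, truth, wrapped=False):
    # Absolute angular error; optionally fold errors greater than pi
    # back around the circle (so 7*pi/4 becomes pi/4).
    e = np.abs(pred - truth)
    if wrapped:
        e = np.minimum(e, 2 * np.pi - e)
    return e

def accuracy_to(pred, truth, tol):
    # Proportion of test cases within tol of the true angle.
    return np.mean(errors(pred, truth) <= tol)
```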
None of the representations got any perfect results, but that is not at all surprising given how floating point math works.
If you take a look at the history of this post you will see the results have a bit of noise to them, varying slightly each time I rerun it. But the general order and scale of the values remains the same, allowing us to draw some conclusions.
Discussion
Binning with softmax performs by far the worst; as I said, I am not sure I didn't screw something up in the implementation.
It does perform marginally above the guess rate, though: if it were just guessing, we would expect a mean error of $\pi$.
The sin/cos encoding performs significantly better than the scaled 0-1 encoding.
The improvement is to the extent that at 1,000 training iterations sin/cos is performing about 3 times better on most metrics than scaling is at 10,000 iterations.
I think this is, in part, related to improved generalization,
as both were achieving fairly similar mean squared error on the training set, at least once 10,000 iterations were run.
There is certainly an upper limit on the best possible performance at this task, given that the angle could be more or less any real number, but not all such angles produce different lines at a resolution of $101\times101$ pixels.
So since, for example, the angles 45.0 and 45.0000001 both map to the same image at that resolution, no method will ever get both perfectly correct.
It also seems likely that to move beyond this performance on an absolute scale, a better neural network is needed than the very simple one outlined above in the experimental setup.
Conclusion
It seems that the sin/cos representation is by far the best of the representations I investigated here. This makes sense, in that it has a smooth value as you move around the circle.
I also like that the inverse can be done with arctan2, which is elegant.
I believe the task presented is sufficient to pose a reasonable challenge for the network. Though I guess really it is just learning to do curve fitting to $f(x)=\frac{y_1}{y_2}x$, so perhaps it is too easy.
And perhaps worse, it may be favouring the paired representation.
I don't think it is, but it is getting late here, so I might have missed something.
I invite you again to look over my code.
Suggest improvements, or alternative tasks.
I think a partially linear modeling framework may be suitable for your problem. If you focus on one flower at the time, note that both the flower data and the air temperature data exhibit strong temporal cycles which peak roughly at the same time. So the simplest partially linear model you could consider for one flower would look like this:
FT_h = beta0 + beta1*AT_h + m(h) + epsilon_h,
where FT_h is the flower temperature for the chosen flower at hour h, AT_h is the air temperature at hour h, m() is a smooth, unknown function meant to capture the temporal cycles you see in the temperature data and epsilon_h is an unknown error term. Here, h = 1, 2, 3, ..., H is an index which counts how many hours you have represented in total in your flower data. In other words, this index counts your hours from the first to the last. If you have 9,000 hours represented in your data, for example, then H = 9,000. In this model, beta1 represents the hourly effect of air temperature on flower temperature, after controlling for temporal effects.
The model can be expanded by adding a linear effect for incident solar radiation (ISR):
FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h + epsilon_h.
If you wanted to throw in wind direction as well, you could code this variable as taking the values North, South, East, West (or add variations like North-East, North-West, etc.) and include it in your model using dummy variables. For example, if you only code this variable as taking the values North, South, East or West, the flower-specific model could be expressed as:
FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h +
beta3*NorthDummy_h + beta4*EastDummy_h + beta5*WestDummy_h +
epsilon_h,
where South is treated as the reference direction against which all others will be compared and NorthDummy_h is set to 1 if wind direction was North at hour h and 0 otherwise, EastDummy_h is set to 1 if wind direction was East at hour h and 0 otherwise and WestDummy_h is set to 1 if wind direction was West at hour h and 0 otherwise.
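As a rough illustration of fitting such a model by ordinary least squares, here is a NumPy sketch on simulated data, using a single sine/cosine pair with a daily period as a crude stand-in for the smooth term m() (the basis choice, names, and simulated numbers are all my assumptions, not part of the answer above):

```python
import numpy as np

def design_matrix(AT, ISR, wind, h, period=24.0):
    # Columns: intercept, AT, ISR, North/East/West dummies (South is the
    # reference), and a sine/cosine pair standing in for the smooth m(h).
    wind = np.asarray(wind)
    return np.column_stack([
        np.ones_like(AT), AT, ISR,
        (wind == "North").astype(float),
        (wind == "East").astype(float),
        (wind == "West").astype(float),
        np.sin(2 * np.pi * h / period),
        np.cos(2 * np.pi * h / period),
    ])

# Simulated example: try to recover a known air-temperature effect beta1 = 0.8.
rng = np.random.default_rng(0)
h = np.arange(500, dtype=float)
AT = 10 + 5 * np.sin(2 * np.pi * h / 24) + rng.normal(0, 1, h.size)
ISR = rng.uniform(0, 1, h.size)
wind = rng.choice(["North", "South", "East", "West"], h.size)
FT = 2.0 + 0.8 * AT + 1.5 * np.sin(2 * np.pi * h / 24) + 0.5 * ISR \
     + rng.normal(0, 0.1, h.size)

X = design_matrix(AT, ISR, wind, h)
beta, *_ = np.linalg.lstsq(X, FT, rcond=None)  # beta[1] estimates beta1
```

In practice the smoothness of m() should be estimated from the data (e.g. with penalized splines) rather than fixed in advance, which is exactly the challenging point discussed below.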
The challenging aspects of these models are:
1. The need to estimate the (unknown) degree of smoothness of the (unknown) temporal effect m() carefully, given that this is just a nuisance effect and the real interest is in estimating beta1;
2. The possibility that the error terms epsilon_h might be temporally correlated, which in turn can affect how item 1 above is addressed.
Many years ago, I conducted research on this very topic; see, for example, http://www.ghement.ca/217.pdf. However, I have not stayed current on the topic, so it's possible there have been several advances in ways to handle item 1.
Intuitively, the temporal signal in the data is really strong, while the air temperature signal is likely tiny by comparison. So you need to find the right balance when determining the degree of smoothness of the temporal effect, so as not to throw the baby out with the bathwater.
If you are interested in comparing effects of air temperature across flowers, you can expand the model even further. But I would start small to make sure I get a handle first on the simpler, flower-specific models.
Best Answer
Here, we want to predict a linear dependent variable from circular independent variables. There are several ways to approach this. The main thing to check is whether the relation between your dependent variable (let's say $Y$) and the circular predictor (say $\theta$) has a sinusoidal shape. This is often the case, but not necessarily. Below is an example of data of this shape.
If the data does have this shape, roughly, a good simple model is given by splitting the circular predictor $\theta$ up into a sine and a cosine component and running a regular linear regression on these two components.
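Such a fit can be sketched as follows (NumPy on simulated data, rather than the R of the reference below; the simulated coefficients are my own illustration):

```python
import numpy as np

# Simulate a sinusoidal relationship between a circular predictor and
# a linear outcome: y = 3 + 2*sin(theta) - cos(theta) + noise.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
y = 3.0 + 2.0 * np.sin(theta) - 1.0 * np.cos(theta) \
    + rng.normal(0, 0.2, theta.size)

# Regress y on the sine and cosine components of theta.
X = np.column_stack([np.ones_like(theta), np.sin(theta), np.cos(theta)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef holds the estimated intercept, sine, and cosine coefficients.
```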
Of course, this can be done for multiple predictors as well. A good introduction on this may be found in Pewsey, Neuhauser & Ruxton (2013), Circular Statistics in R.
As mentioned before, we may add higher-order terms as in a Fourier regression, but this can only be recommended if the relationship structurally exhibits very different forms, because higher-order Fourier regression introduces, if I remember correctly, a large number of difficult-to-interpret parameters.