My advice, in short, would be to try a Kalman filter.
The longer version is this. To restate your problem, at every time step $t$ you have some noisy sensory estimates of robot position $(\hat{x}_t,\hat{y}_t)$, and you want to infer the robot's true position $(x_t,y_t)$.
Given only the data from a single time step, I don't think there's much you can do. Unless there is some consistent bias in the sensory estimates, your best guess of the robot's position given the current sensory data is simply $(\hat{x}_t,\hat{y}_t)$. However, your robot's position is presumably highly correlated from one time step to the next, so you can use this information to your advantage. To put it probabilistically, you can make use of the relationship $p(x_t,y_t|x_{t-1},y_{t-1})$ in the following manner:
$$p(x_t,y_t|\hat{x}_{t:t-N},\hat{y}_{t:t-N}) \propto p(\hat{x}_t,\hat{y}_t|x_t,y_t)\int p(x_t,y_t|x_{t-1},y_{t-1})\,p(x_{t-1},y_{t-1}|\hat{x}_{t-1:t-N},\hat{y}_{t-1:t-N})\,dx_{t-1}\,dy_{t-1}$$
Let's break this down. In words, the idea is that your sensory information from all previous time points provides information about the robot's position at time $t-1$. This information is quantified by the distribution $p(x_{t-1}, y_{t-1}|\hat{x}_{t-1:t-N},\hat{y}_{t-1:t-N})$. Now, the robot's position is correlated over time, and this link is described by $p(x_t,y_t|x_{t-1},y_{t-1})$ (i.e. given that the robot was at this location before, where is it likely to have moved to?). The integral on the right-hand side therefore considers every location the robot could have occupied previously, given your history of sensory data, and yields a prediction of all the positions (and their probabilities) the robot could occupy now, based only on that historical data.
In other words, the history of sensory data constrains the range of positions where the robot could be right now. Finally, you update this belief with the information gained from the sensory data at the current time.
Note that this expression can be computed recursively: the distribution based on the sensory history up to $t-1$ can itself be decomposed into a term depending on the data at $t-1$ and a term for the history up to $t-2$, yielding a formula of the same form as the one above. In practice, then, you would start with the first two time points, compute the left-hand side of the equation, and continue with the next time point. The inference at each time point thus depends only on the sensory data at that time and on the running estimate built from the sensory history up to $t-1$. (In other words, the problem can effectively be cast as a Markov chain.)
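To make the recursion concrete, here is a minimal sketch of one predict/update step on a discretized position grid. The `grid`, `transition` matrix, and `likelihood` function are illustrative stand-ins for the quantities in the equation above, not anything from the original question:

```python
import numpy as np

# Sketch of the recursion on a discretized 1-D position grid (a 2-D
# (x, y) grid works the same way, just with more indices). Here
# transition[i, j] stands in for p(x_t = grid[i] | x_{t-1} = grid[j]),
# and likelihood(z, grid) for p(z_t | x_t) evaluated at every grid point.

def bayes_filter_step(belief, z, grid, transition, likelihood):
    """One recursive step: `belief` is p(x_{t-1} | sensory history) on the grid."""
    predicted = transition @ belief               # sum over x_{t-1} (the integral)
    posterior = likelihood(z, grid) * predicted   # multiply by p(z_t | x_t)
    return posterior / posterior.sum()            # normalize (the proportionality)
```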
Where does machine learning come into this? Well, you need to know two things: (1) a transition function that gives you $p(x_t,y_t|x_{t-1},y_{t-1})$, i.e. the way a robot can change its location from one time point to the next, and (2) a generative model $p(\hat{x}_t,\hat{y}_t|x_t,y_t)$, i.e. a function that describes the probability of observing a certain sensory position reading given the robot's true position. Both functions may be known to you by construction (e.g. you may know the transition function if you know how the robot is programmed to behave, and you may know the generative model of your sensory readings from the manufacturer's specifications). If this information is not known a priori, however, you'd have to learn it from a set of training data. This is not necessarily something with a plug-and-play solution; you'd have to look at the data, consider what you know about the problem, and then figure out how best to model it.
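If both functions are linear-Gaussian, the recursion collapses to closed-form updates of a mean and covariance, which is exactly the Kalman filter. Here is a minimal sketch under assumed random-walk dynamics and Gaussian sensor noise; the matrices `Q` and `R` are placeholders you would get from the robot's specs or learn from data:

```python
import numpy as np

# Minimal Kalman filter sketch for the linear-Gaussian special case:
# transition p(x_t | x_{t-1}) = N(x_{t-1}, Q) (a random walk) and sensor
# model p(z_t | x_t) = N(x_t, R). Q and R are assumed known here.

def kalman_step(mu, P, z, Q, R):
    """One predict/update step for a 2-D position state (x, y)."""
    # Predict: push the previous belief through the transition model.
    mu_pred = mu            # identity dynamics (random walk)
    P_pred = P + Q          # uncertainty grows by the transition noise
    # Update: fold in the current noisy reading z = (x_hat, y_hat).
    K = P_pred @ np.linalg.inv(P_pred + R)   # Kalman gain
    mu_new = mu_pred + K @ (z - mu_pred)
    P_new = (np.eye(2) - K) @ P_pred
    return mu_new, P_new
```

In use, you would iterate `kalman_step` over the readings, carrying `(mu, P)` forward as the running estimate based on the sensory history.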
"P.S.": I wrote all this and then found this question which might be more concise and to the point for your needs. But what the heck, I'll just leave this here as an explanation of the assumptions behind a Kalman filter.
You are correct up to the second line of your working in the last part, and then you make an error by dropping the requirement that $j=p$ (which means you retain an additional sum that shouldn't be there). Continuing from your last correct step, you should have (writing $s_p^{(i)}$ for the softmax probability of class $p$ on example $i$, since it depends on the example):
$$\begin{aligned}
\frac{\partial \ell}{\partial \Theta_p}(\Theta)
&= \sum_{i=1}^m \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p) x^{(i)} \\[6pt]
&= \sum_{i=1}^m x^{(i)} \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p) \\[6pt]
&= \sum_{i=1}^m x^{(i)} \bigg[ \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \mathbb{I}(p=j) - s_p \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \Bigg] \\[6pt]
&= \sum_{i=1}^m x^{(i)} \bigg[ \mathbb{I}(y^{(i)}=p) - s_p \Bigg] \\[6pt]
&= \sum_{i=1}^m x^{(i)} \mathbb{I}(y^{(i)}=p) - s_p \sum_{i=1}^m x^{(i)}. \\[6pt]
\end{aligned} $$
(The penultimate step follows from the fact that $\sum_{j=1}^k \mathbb{I}(y^{(i)}=j) = 1$ for all $i = 1, \ldots, m$.)
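If you want to convince yourself of the result, here is a quick numerical check (purely illustrative, with made-up dimensions and random data) comparing the closed-form gradient for a class $p$ against finite differences of the log-likelihood:

```python
import numpy as np

# Sanity check: closed-form softmax gradient vs. finite differences.
rng = np.random.default_rng(0)
m, d, k, p = 5, 3, 4, 2                      # examples, features, classes, class index
X = rng.normal(size=(m, d))
y = rng.integers(k, size=m)
Theta = rng.normal(size=(k, d))

def loglik(Theta):
    logits = X @ Theta.T                     # (m, k)
    logZ = np.log(np.exp(logits).sum(axis=1))
    return (logits[np.arange(m), y] - logZ).sum()

S = np.exp(X @ Theta.T)
S /= S.sum(axis=1, keepdims=True)            # s^{(i)} for each example i

# Closed form: sum_i (I(y_i = p) - s_p^{(i)}) x^{(i)}
grad_p = ((y == p).astype(float) - S[:, p]) @ X

# Finite differences on each component of Theta_p.
eps = 1e-6
num = np.zeros(d)
for j in range(d):
    Tp = Theta.copy(); Tp[p, j] += eps
    Tm = Theta.copy(); Tm[p, j] -= eps
    num[j] = (loglik(Tp) - loglik(Tm)) / (2 * eps)

print(np.allclose(grad_p, num, atol=1e-5))   # True
```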
You have what is called compositional data. There is a sizable literature on how to model it; take a look through the tag, or search for the term.
Typically, one would choose a reference category and work with log ratios, or something similar. One paper I know of on predicting compositional data is Snyder et al. (2017, IJF). They use a state space approach, not an NN, but their transformation may still be useful to you.
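For concreteness, here is a minimal sketch of the reference-category idea via the additive log-ratio (alr) transform; this is just an illustration of the general approach, not the specific transformation used by Snyder et al.:

```python
import numpy as np

# Additive log-ratio transform with the last category as reference.

def alr(p, eps=1e-9):
    """Map compositions (rows summing to 1) to unconstrained log ratios."""
    p = np.clip(p, eps, None)                # guard against zero components
    return np.log(p[:, :-1] / p[:, -1:])     # (n, k) -> (n, k-1)

def alr_inv(z):
    """Map log ratios back to the simplex (softmax-like inverse)."""
    expz = np.exp(np.hstack([z, np.zeros((z.shape[0], 1))]))
    return expz / expz.sum(axis=1, keepdims=True)

# The model (an NN, a state space model, ...) is fit on alr(p); forecasts
# are mapped back with alr_inv, so predictions are positive and sum to 1.
```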