Just to make sure we are on the same page: you have a sequence of 1000 samples with 7 features each. There is a sequential pattern in there, which is why you process them with an RNN; at each timestep the network is fed one of these 7-dimensional samples. To your individual questions:
- It depends. It might get better if you use different normalizations, hard to tell.
- To me it just sounds like classification. I am not sure what you mean by ranking exactly.
- No reason to be skeptical. Normally, training error drops like that: extremely quickly for the first few iterations, very slowly afterwards.
- No, absolutely not. For some tasks, fewer than 100 iterations (= passes over the training set) suffice.
- You are the one who has to say whether the error is small enough. :) We can't tell you without knowing what you are using the network for.
- Hard to tell. You should use early stopping instead: train the network until the error on some held-out validation set starts to rise; that is the point from which on you are only overfitting. Use the weights found at that point to evaluate on a test set. (That makes it three sets: training, validation, and test.) A minimal sketch of such a loop follows this list.
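Here is a minimal sketch of that early-stopping loop, assuming a hypothetical network object whose `train_one_epoch`, `evaluate`, `get_weights`, and `set_weights` methods stand in for whatever your framework actually provides:

```python
import copy

def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=10):
    # `net` is a hypothetical object; its methods stand in for whatever
    # your framework provides for training and evaluation.
    best_val_error = float("inf")
    best_weights = None
    epochs_since_best = 0

    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)      # one pass over the training set
        val_error = net.evaluate(val_set)   # error on the held-out validation set

        if val_error < best_val_error:
            best_val_error = val_error
            best_weights = copy.deepcopy(net.get_weights())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                       # validation error keeps rising: stop

    net.set_weights(best_weights)           # restore the best validation-epoch weights
    return net
```

The `patience` parameter just tolerates a few noisy epochs before declaring that validation error is rising for good; final performance is then measured once, on the untouched test set.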
Here are some tips that I can give:
Linear, single-layer FFNs are non-identified
The question has since been edited to exclude this case; I retain it here because understanding the linear case is a simple example of the phenomenon of interest.
Consider a feedforward neural network with 1 hidden layer and all linear activations. The task is simple OLS regression.
So we have the model $\hat{y}=X A B$ and the objective is
$$
\min_{A,B} \frac{1}{2}\left\| y - XAB \right\|_2^2
$$
for some choice of $A, B$ of appropriate shape. $A$ is the input-to-hidden weights, and $B$ is the hidden-to-output weights.
Clearly the elements of the weight matrices are not identifiable in general: for any invertible matrix $C$ of suitable shape, the pair $(AC, C^{-1}B)$ has the same product $AB$, so infinitely many configurations of $A, B$ achieve exactly the same loss.
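To make this concrete, here is a small numpy check (shapes and data are arbitrary) that the reparameterization $(AC, C^{-1}B)$ leaves predictions, and hence the loss, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, h, k = 50, 7, 4, 1           # arbitrary sample, input, hidden, output sizes
X = rng.normal(size=(n, p))
A = rng.normal(size=(p, h))        # input-to-hidden weights
B = rng.normal(size=(h, k))        # hidden-to-output weights

C = rng.normal(size=(h, h))        # a random square matrix is almost surely invertible
A2 = A @ C                         # reparameterized weights ...
B2 = np.linalg.inv(C) @ B          # ... with the same product: A2 @ B2 == A @ B

print(np.allclose(X @ A @ B, X @ A2 @ B2))   # True: identical predictions and loss
```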
Nonlinear, single-layer FFNs are still non-identified
Building up from the linear, single-layer FFN, we can also observe non-identifiability in the nonlinear, single-layer FFN.
As an example, adding a $\tanh$ nonlinearity to any of the linear activations creates a nonlinear network. This network is still non-identified, because for any loss value, a permutation of the weights of two (or more) neurons at one layer, together with the weights of their corresponding neurons at the next layer, results in the same loss value.
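A quick numerical illustration of this permutation symmetry (again with arbitrary shapes and data): shuffling the hidden units, i.e. the columns of $A$ together with the matching rows of $B$, leaves the network function unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(50, 7))
A = rng.normal(size=(7, 4))        # input-to-hidden weights
B = rng.normal(size=(4, 1))        # hidden-to-output weights

perm = rng.permutation(4)          # shuffle the 4 hidden units
A_perm = A[:, perm]                # permute the hidden columns of A ...
B_perm = B[perm, :]                # ... and the matching rows of B

y1 = np.tanh(X @ A) @ B
y2 = np.tanh(X @ A_perm) @ B_perm
print(np.allclose(y1, y2))         # True: same outputs, hence same loss
```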
In general, neural networks are non-identified
We can use the same reasoning to show that neural networks are non-identified in all but very particular parameterizations.
For example, there is no reason that convolutional filters must occur in any particular order. Nor is it required that convolutional filters have any particular sign, since subsequent weights could take the opposite sign to "reverse" that choice.
Likewise, the units in an RNN can be permuted, together with their incoming and outgoing weights, to obtain the same loss.
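The sign claim is easy to check numerically for an odd activation such as $\tanh$, where $\tanh(-z) = -\tanh(z)$: in the single-hidden-layer network from above, negating one hidden unit's incoming and outgoing weights cancels out. (For a non-odd activation such as ReLU this particular symmetry does not hold, though the permutation symmetry still does.)

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.normal(size=(50, 7))
A = rng.normal(size=(7, 4))        # input-to-hidden weights
B = rng.normal(size=(4, 1))        # hidden-to-output weights

A_flip, B_flip = A.copy(), B.copy()
A_flip[:, 0] *= -1                 # negate hidden unit 0's incoming weights ...
B_flip[0, :] *= -1                 # ... and its outgoing weights

y1 = np.tanh(X @ A) @ B
y2 = np.tanh(X @ A_flip) @ B_flip
print(np.allclose(y1, y2))         # True: tanh is odd, so the sign flips cancel
```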
See also: Can we use MLE to estimate Neural Network weights?
Best Answer
This is an optimization problem rather than an unsupervised learning problem. You're not trying to learn from examples, but to minimize a function of known quantities. Neural nets can be used to solve this type of problem, but it looks different from the supervised/unsupervised learning one typically sees in the machine learning literature (no learning is involved here).
For example, see work using Hopfield nets to solve the traveling salesman problem (Hopfield and Tank 1985, and many others since then). A recurrent network is configured such that it encodes the problem. The network has an energy function that governs its behavior (it tends to move to lower-energy states). Each network state corresponds to a possible solution. The weights are set such that low-cost solutions (those that respect the constraints of the problem) have lower energy. The network is then run from some initial state until it converges to a low-energy (i.e. low-cost) solution. The traveling salesman problem is NP-hard, so it's probably infeasible to compute exact solutions for large instances. The purpose of this method is to compute approximate solutions in a reasonable amount of time (and to do so in a way that mimics biological neural nets; raw performance isn't necessarily the goal here).
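As a rough illustration of the mechanism (not the full Hopfield-Tank TSP encoding, which requires more setup), here is a minimal discrete Hopfield network performing energy descent via asynchronous updates; random symmetric weights stand in for a real problem encoding:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 10
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                  # symmetric weights guarantee energy descent
np.fill_diagonal(W, 0)             # no self-connections

def energy(s):
    # Standard Hopfield energy; asynchronous updates never increase it.
    return -0.5 * s @ W @ s

s = rng.choice([-1, 1], size=n)    # random initial state (a candidate "solution")
for _ in range(200):               # asynchronous threshold updates
    i = rng.integers(n)
    s[i] = 1 if W[i] @ s >= 0 else -1

print("final energy:", energy(s))  # a low-energy (low-cost) state
```

In the TSP application, `W` is instead constructed from the city distances plus penalty terms, so that states decoding to valid, short tours have the lowest energy.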
There's no need to limit yourself to neural nets; this is a discrete optimization problem and most methods look nothing like neural nets. There may well be more efficient heuristic approaches. For reading, the discrete optimization, operations research, and computer science literature will probably be more helpful than the machine learning literature.
References:
Hopfield, J. J. and Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52(3), 141-152.