It is not 100% clear whether what you are running would benefit from, or even work when, framed as a full reinforcement learning problem.
Your RL agent should be doing one or both of:
Predicting total accumulated reward (called "return" or "utility") when completing the task, given a current state or state/action pair (or sometimes in deterministic environments, the next state). These predictions will be used to help decide the next action.
Generating a probability distribution over actions (e.g. a list of 10 means and standard deviations for generating the eventual vectors) from a current state. This probability density function describes a policy that can then be refined based on eventual reward.
You need at least one of the above to be true in order to generate suitable gradients to train your RNN from sparse reward data.
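To make those two outputs concrete, here is a minimal NumPy sketch of a value head that predicts return and a policy head that produces means and standard deviations for a Gaussian action distribution. The feature function, state size and action dimensionality are placeholders for illustration, not details taken from your setup.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 8      # placeholder: size of your state encoding
ACTION_DIM = 10    # placeholder: e.g. 10 components per action vector

def features(state):
    # Stand-in for your RNN's forward pass: anything mapping state -> features
    return np.tanh(state)

def value_head(state, w_v):
    """Predict the return (total accumulated future reward) from this state."""
    return features(state) @ w_v

def policy_head(state, w_mu, w_sigma):
    """Produce a Gaussian policy: a mean and standard deviation per action dimension."""
    phi = features(state)
    means = phi @ w_mu
    stds = np.exp(phi @ w_sigma)   # exp keeps the standard deviations positive
    return means, stds

# Example usage with random placeholder weights
state = rng.normal(size=STATE_DIM)
w_v = rng.normal(size=STATE_DIM)
w_mu = rng.normal(size=(STATE_DIM, ACTION_DIM))
w_sigma = rng.normal(size=(STATE_DIM, ACTION_DIM)) * 0.1

print("predicted return:", value_head(state, w_v))
means, stds = policy_head(state, w_mu, w_sigma)
print("sampled action:", rng.normal(means, stds))   # draw one action from the policy
```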
Assuming you do have your problem set up in a reinforcement learning framework as above:
However, I am not sure how I should define the reward function for my network. Since there is really no reward for the intermediate steps, should I give them a zero?
In reinforcement learning, rewards are only granted when they relate directly to the goals of the task. There is no need for interim rewards, and adding heuristic ones, unless done very carefully, may hinder the agent's learning. So yes, everything other than the end reward should be zero for your problem. Bear in mind, though, that your RNN should not be predicting reward itself, but one or both of state value (expected long-term reward) and a probability distribution over the actions.
It is OK to have just a single reward granted at the end of an episode, with everything else zero, as you have set up with your scoring system for the matrix. For example, this is a fairly common setup for games where the agent either wins, draws or loses.
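In code, such a sparse reward is just a function that returns zero on every intermediate step and only consults the score at the end of the episode. Here `score_matrix` is a hypothetical stand-in for whatever scoring you apply to the final matrix:

```python
def reward(state, done, score_matrix):
    """Sparse reward: zero for every intermediate step, score only at the end."""
    if not done:
        return 0.0                  # no interim rewards
    return score_matrix(state)      # score_matrix is a placeholder for your final scoring
```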
Instead of using the score function, why do you not simply optimize for the highest reward and choose Policy*= Max(All actions with discounted rewards)?
You do not have the information needed to take that maximum at the start of learning. In order to know the expected return, or discounted sum of future rewards, you need to have measured it whilst using an already-optimal policy.
Iterating towards this goal with a policy based on the best estimates so far, refining those estimates under the current policy (by acting according to it and sampling the results), then refining the policy based on the improved estimates, is essentially how action-value methods work, such as Monte Carlo Control, SARSA or Q-learning. These are all RL solvers, but are not always the most efficient choice for a given problem.
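As a sketch of that iterate-and-refine loop, here is bare-bones tabular Q-learning. The `env.reset()`/`env.step()` interface is assumed purely for illustration and is not part of your problem description:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Act from the current action-value estimates, then refine them from samples."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave according to the current (epsilon-greedy) policy
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Refine the estimate towards the sampled one-step target
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```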
The score function helps to calculate a sampled measure of the gradient of the expected return of a parametric policy with respect to its parameters, which means you can use it to perform stochastic gradient ascent directly on the policy, improving its performance (on average) without necessarily needing to know the action values. The REINFORCE algorithm does not use action values at all. However, algorithms which do, such as Actor-Critic, can be better, and still maintain benefits compared to using a pure action-value approach.
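Concretely, for a simple linear-softmax policy (an assumption made purely for illustration), the score function is the gradient of log pi(a|s; theta), and REINFORCE scales it by the sampled return G:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, state, action, G):
    """Sampled policy-gradient estimate: G * grad_theta log pi(action | state).

    theta: (state_dim, n_actions) weights of a linear-softmax policy.
    G:     the return observed after taking `action` in `state`.
    """
    probs = softmax(state @ theta)       # pi(. | state)
    grad_log = -np.outer(state, probs)   # d log pi / d theta, summed over all actions
    grad_log[:, action] += state         # extra term for the action actually taken
    return G * grad_log

# Stochastic gradient ascent on the policy parameters:
#   theta += learning_rate * reinforce_gradient(theta, state, action, G)
```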
Which is better? It depends on the problem. Sometimes it is more efficient to express a policy as a parametric function of the state. A common example is when there are many actions, or the action space is continuous. Getting action-value estimates for a large number of actions, and then finding the maximising action over them, is computationally expensive. In those scenarios it is more efficient to use a policy gradient method, and the score function is needed to estimate the gradient.
Another common scenario where direct policy refinement can be better is when the ideal policy is stochastic, e.g. in a scissors/paper/stone game. Expressing this as maximising over action values is not stable: the agent will pick one action until that is exploited against it, then pick another, and so on. An agent using policy gradient and a softmax action choice, however, can learn the optimal ratios in an environment like scissors/paper/stone; two such agents competing should in theory converge to the Nash equilibrium of equiprobable actions.
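To see the difference in a toy form, a softmax action choice stays stochastic where an argmax over action values collapses to a single, exploitable move (the preference numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["scissors", "paper", "stone"]

def softmax(prefs):
    e = np.exp(prefs - np.max(prefs))
    return e / e.sum()

# Hypothetical learned preferences, nearly equal as at the Nash equilibrium
prefs = np.array([0.01, 0.0, -0.01])

greedy_action = ACTIONS[int(np.argmax(prefs))]    # always "scissors": easy to exploit
probs = softmax(prefs)                            # roughly [0.336, 0.333, 0.330]
sampled_action = ACTIONS[rng.choice(3, p=probs)]  # varies from play to play

print(greedy_action, probs, sampled_action)
```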
Conversely, sometimes action-value methods will be the more efficient choice. There might be a simpler relationship between optimal action value and state than between policy and state. A good example is a maze solver (with reward -1 per time step): the optimal action value at each state is just related to the distance to the exit, whereas the policy has no obvious relation to the state, except when expressed as taking the action that minimises that distance.
There are a few subtleties with the PyBrain library and NFQ. I don't have a lot of experience with NFQ, but it's part of the course I tutor at my university; we use PyBrain because it's a good introduction to a lot of these things. Generally, there are two things that help:
Use exploration. Set learner.epsilon = x for some x in [0, 1], where 0 means rely only on the network's output and 1 means act completely randomly. A value of 0.05-0.2 can help learning enormously on most problems (sketched after this list).
Use more learning episodes and more hidden neurons. NFQ only fits to the episodes you give it, at a complexity limited by the number of hidden units. Running more independent episodes and/or longer episodes gives the network more experience to train on.
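Point 1 above amounts to epsilon-greedy action selection. This is not PyBrain code, just a framework-agnostic sketch of what that epsilon setting does:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon act randomly, otherwise follow the network's output."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# epsilon = 0 -> rely only on the network; epsilon = 1 -> act completely randomly
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1)
```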
These approaches have been used to improve NFQ performance a lot on tasks such as the 2048 game, so I imagine it should be similar for your case. In general though, for grid-world-type problems I find table-based RL to be far superior. RBF neural nets might also be good (disclaimer: I haven't tried this).
Another thing to check: make sure you give your agent enough information that it could reasonably figure out which direction to go at each point. It has no memory, so if it can't "see" any landmarks to point it in the right direction, it will only learn random noise.