Your example is a little weaker than it could be, because there don't really seem to be any actions; still, we can work with it. I'll preserve the $\max_a$ and $R(s,a)$ notation, although $a$ isn't operative in the example. I'm assuming that from state 1 there is a 50-50 chance of transitioning to either state 2 or state 3, and that the discount factor is 0.5, as per your example. I also assume the rewards $R(s,a)$ for being in states 1, 2, and 3 are 0, 2, and 0 respectively.
Value iteration repeats the update until the values converge. At initialization, as you have, $V_0(s) = 0 \;\forall s$.
At iteration 1:
$V_1(s=3) = \max_a\{R(3,a)\} = 0$, since state 3 is terminal and has no outgoing transitions.
$V_1(s=2) = \max_a\{R(2,a)\} = 2$, since state 2 is terminal and has no outgoing transitions.
$V_1(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_0(s=2) + 0.5 \cdot V_0(s=3))\} = 0$
At iteration 2:
$V_2(s=3) = \max_a\{R(3,a)\} = 0$, since state 3 is terminal and has no outgoing transitions.
$V_2(s=2) = \max_a\{R(2,a)\} = 2$, since state 2 is terminal and has no outgoing transitions.
$V_2(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_1(s=2) + 0.5 \cdot V_1(s=3))\} = 0.5$
At iteration 3:
$V_3(s=3) = \max_a\{R(3,a)\} = 0$, since state 3 is terminal and has no outgoing transitions.
$V_3(s=2) = \max_a\{R(2,a)\} = 2$, since state 2 is terminal and has no outgoing transitions.
$V_3(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_2(s=2) + 0.5 \cdot V_2(s=3))\} = 0.5$
And we have convergence: $V_3(s) = V_2(s)$ for all $s$. In a more complex example with actions included, $V_3(s) = V_2(s)$ implies that the actions selected at iterations 2 and 3 are either identical or equivalent in value; if the algorithm breaks ties between equally-valued actions in a fixed way, the implication strengthens to the actions being exactly the same between iterations.
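If it helps to check the arithmetic, here is a minimal Python sketch of the same computation (the dictionary names and layout are mine, not from your post; states 2 and 3 are terminal and $\gamma = 0.5$):

```python
# Value iteration for the 3-state example: states 2 and 3 are terminal,
# from state 1 we move to state 2 or 3 with probability 0.5 each.
rewards = {1: 0.0, 2: 2.0, 3: 0.0}      # R(s); no actions to choose between here
transitions = {1: {2: 0.5, 3: 0.5}}     # P(s' | s); terminal states have no entries
gamma = 0.5

V = {s: 0.0 for s in rewards}           # V_0(s) = 0 for all s
for i in range(1, 10):
    V_new = {}
    for s in rewards:
        expected_next = sum(p * V[s2] for s2, p in transitions.get(s, {}).items())
        V_new[s] = rewards[s] + gamma * expected_next
    if all(abs(V_new[s] - V[s]) < 1e-9 for s in V):
        print(f"Converged at iteration {i}: {V_new}")
        break
    V = V_new
```

Running this prints convergence at iteration 3 with $V(1) = 0.5$, $V(2) = 2$, $V(3) = 0$, matching the hand calculation above.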
There are a few subtleties with the PyBrain library and NFQ. I don't have a lot of experience with NFQ, but it is part of the course I tutor at my university; we use PyBrain because it's a good introduction to a lot of these things. Generally, two things help:
1. Use exploration. Set learner.epsilon = x for some x in [0, 1], where 0 means rely only on the network's output and 1 means act completely randomly. A value of 0.05-0.2 can improve learning enormously on most problems (see the sketch after these two points).
2. Use more learning episodes and more hidden neurons. NFQ only fits to the episodes you give it, with a model capacity set by the number of hidden units. Running more independent episodes and/or longer episodes gives the network more experience to train on.
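To make point 1 concrete, the epsilon setting corresponds to ordinary epsilon-greedy action selection. Roughly (a generic sketch, not PyBrain's internal code):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: list of estimated action values for the current state.
    Generic illustration of what the epsilon setting controls.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit
```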
These approaches have improved NFQ performance considerably on tasks such as the 2048 game, so I imagine the same should hold for your case. In general, though, for grid-world type problems I find table-based RL to be far superior. RBF neural networks might also work well (disclaimer: I haven't tried this).
Another thing to check: make sure you give your agent enough information that it could reasonably figure out which direction to go at each point. It has no memory, so if it can't "see" any landmarks to point it in the right direction, it will only learn random noise.
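As a rough illustration of what I mean by table-based RL with enough information in the state, here is a sketch of tabular Q-learning on a small grid world where the agent observes its full (row, column) position; the grid size, rewards, and hyperparameters are made up for illustration:

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
SIZE, GOAL = 4, (3, 3)
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {((r, c), a): 0.0
     for r in range(SIZE) for c in range(SIZE) for a in range(len(ACTIONS))}

def step(state, a):
    """Deterministic move with walls; reward 1 only on reaching the goal."""
    dr, dc = ACTIONS[a]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1),
           min(max(state[1] + dc, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

for _ in range(500):
    state, done = (0, 0), False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda x: Q[(state, x)])
        nxt, reward, done = step(state, a)
        best_next = max(Q[(nxt, x)] for x in range(len(ACTIONS)))
        # Q-learning update; no bootstrapping from the terminal state
        Q[(state, a)] += alpha * (reward + gamma * best_next * (not done) - Q[(state, a)])
        state = nxt
```

Because the state (the agent's position) uniquely identifies where it is, the table can converge; if the observation were ambiguous, neither a table nor a network could do better than noise.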
According to chapter 9, "Planning and Learning", in the (overall recommended) book Reinforcement Learning: An Introduction by Sutton and Barto, both learning and planning try to estimate a value function and use it to improve the overall policy. The difference is that planning uses simulated experience or knowledge from a model of the environment, whereas learning uses actual (trial-and-error) experience to learn the value function.
Since Value Iteration, as part of Dynamic Programming, requires full knowledge of the environment, it is indeed a planning algorithm. In the Reinforcement Learning context, Value Iteration is presented as a theoretical stepping stone for the case where an environment model is available, before moving on to sample-based methods (Monte Carlo, Temporal-Difference learning) for the case where it is not.
However, as Sutton points out, it is not necessarily helpful to draw such a sharp distinction in practice. For example, one can learn the action-value function $Q(s,a)$ and then use it during action selection to plan (e.g. with heuristic search).
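As a rough sketch of that blending (assuming a tabular $Q$ learned from experience and a hypothetical deterministic one-step model, not code from the book), action selection can back up the model's prediction one step and evaluate the resulting state with the learned values:

```python
# Q has been learned from real experience (e.g. by Q-learning); the model is
# used at decision time. model(state, action) is a hypothetical stand-in that
# returns (next_state, reward).
def plan_action(state, Q, model, actions, gamma=0.9):
    def backed_up_value(a):
        next_state, reward = model(state, a)
        # evaluate the leaf of the one-step lookahead with the learned values
        return reward + gamma * max(Q[(next_state, b)] for b in actions)
    return max(actions, key=backed_up_value)
```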