The answer is tucked in the abstract (emphasis mine): "We prove that, given training under any $\epsilon$-soft policy [...] to the action-value function for an arbitrary target policy."
The authors also write, "We always use $\pi$ to denote the target policy, the policy that we are learning about."
Epsilon-soft is, as defined here, synonymous with epsilon-greedy:
An epsilon-soft policy chooses one action with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ and all other actions with probability $\frac{\epsilon}{|A(s)|}$, where $|A(s)|$ is the number of actions available in state $s$. This is equivalent to saying that the policy chooses a random action with some small probability $\epsilon$, and a single (dominant) action otherwise.
Wiki defines epsilon-greedy similarly:
One such method is $\epsilon$-greedy, when the agent chooses the action that it believes has the best long-term effect with probability $1-\epsilon$, and it chooses an action uniformly at random, otherwise.
That is, there are no restrictions on the target policy $\pi$; $b$ is epsilon-soft.
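For concreteness, here is a minimal sketch (in Python, with hypothetical names) of the action probabilities that such an $\epsilon$-greedy / $\epsilon$-soft behaviour policy $b$ assigns over the actions in a state:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for one state under an epsilon-greedy policy.

    Every action gets epsilon / |A(s)|, and the current greedy action
    gets an extra (1 - epsilon), so it is chosen with probability
    1 - epsilon + epsilon / |A(s)|.
    """
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# Example: 3 actions, epsilon = 0.1 -> greedy action has probability ~0.933
print(epsilon_greedy_probs(np.array([0.2, 1.5, -0.3]), 0.1))
```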
I don't see anything in its update equation that explains why it requires knowledge of the rewards beforehand, or why it cannot be trained online in the same way that Q-learning can.
The usual Value Iteration update rule is:
$$v(s) \leftarrow \max_a\left[\sum_{r,s'} p(r, s'|s,a)\left(r + \gamma v(s')\right)\right]$$
or it might be written as some variation of
$$v(s) \leftarrow \max_a\left[R_s^a + \gamma \sum_{s'} p(s'|s,a)\,v(s')\right]$$
In the first equation, $r$ is a known reward value, where you must already know its distribution. In the second equation $R_s^a$ is the expected reward from a particular state, action pair. It is also possible to use $R_{ss'}^a$ - the expected reward from a particular transition. None of these are about observed reward instances.
You cannot use observed reward instances directly in Value Iteration and trust that this will work in expectation in the long term, as you need to take the max over possible actions, and it is not usually possible to observe the output of taking all possible actions.
You could in theory maintain a separate table of estimated expected rewards, and that would then be a complete RL algorithm, not much different from other single-step TD algorithms. It would be a kind of half-way house to Q-learning as you would still need to know transition probabilities in order to use it.
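As a rough sketch of that half-way house (assuming a small tabular MDP with a known transition model; the array names here are hypothetical), you could keep a running estimate of the expected reward $\hat{R}(s,a)$ and plug it into the value iteration backup:

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
# Assumed known: transition probabilities p(s'|s,a).
p_transition = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # placeholder
# Learned from experience: running mean of the immediate reward for each (s, a).
r_estimate = np.zeros((n_states, n_actions))
r_count = np.zeros((n_states, n_actions))
v = np.zeros(n_states)

def observe_reward(s, a, r):
    """Update the running mean estimate of R(s, a) from one observed reward."""
    r_count[s, a] += 1
    r_estimate[s, a] += (r - r_estimate[s, a]) / r_count[s, a]

def backup(s):
    """One value iteration update for state s, using the estimated rewards
    and the known transition probabilities."""
    v[s] = max(r_estimate[s, a] + gamma * p_transition[s, a] @ v
               for a in range(n_actions))
```

The need for `p_transition` is exactly why this is only a half-way house: Q-learning removes that requirement by sampling the transitions as well.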
The equation you give for policy iteration uses a slightly different view of reward:
$$U(s) = R(s) + \gamma \max_a\sum_{s'} T(s, a, s')\,U(s')$$
Here $R(s)$ appears to stand for "fixed reward from arriving in this state". If you have a full model to go with the state transitions, then it is provided. But could you learn it from observations? Yes, with one important caveat: The formula assumes that each state has one and only one associated reward value.
If you can safely assume that your MDP has fixed rewards associated with landing in specific states (and often you can, if you have constructed the MDP as part of a game or virtual environment), then yes, you should be able to run value iteration online, much like Q-learning, using that update rule. The equation works without you needing to store or estimate anything about immediate rewards, because the observed and expected values are then identical. Note, however, that this is not a completely generic solution that applies to all MDPs.
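A minimal sketch of that special case, assuming every state $s$ pays a fixed reward $R(s)$ on arrival and that the transition function $T$ is known (the names below are hypothetical):

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # assumed known p(s'|s,a)
U = np.zeros(n_states)

def online_backup(s, r):
    """Update U(s) after arriving in state s and observing reward r.

    Because the reward for landing in s is fixed, the observed r equals
    the expected R(s), so it can be used directly in the backup.
    """
    U[s] = r + gamma * max(T[s, a] @ U for a in range(n_actions))
```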
The next thing is that I'm not sure why the policy would ever change in value iteration?
The policy changes implicitly. The deterministic policy from value iteration is the one that takes the action with best expected return:
$$\pi(s) \leftarrow \operatorname{argmax}_a\left[\sum_{r,s'} p(r, s'|s,a)\left(r + \gamma v(s')\right)\right]$$
This clearly will change as your values converge. All optimal control algorithms need to deal with this, which is why many use a rolling average (a learning rate parameter) to estimate values, and not a true mean.
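As a sketch (again with hypothetical array names), extracting that implicit policy is just an argmax over the same backup:

```python
import numpy as np

def greedy_policy(v, p, r, gamma):
    """pi(s) = argmax_a sum_{s'} p(s'|s,a) * (r(s,a,s') + gamma * v(s')).

    p: [S, A, S] transition probabilities p(s'|s,a).
    r: [S, A, S] expected reward for each transition.
    """
    # Expected return of every (s, a) pair under the current value estimates.
    q = np.einsum('sat,sat->sa', p, r + gamma * v[np.newaxis, np.newaxis, :])
    return q.argmax(axis=1)  # one greedy action index per state
```

Calling this again after each sweep shows how the implied policy changes as the values converge.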
Best Answer
Value iteration is an approach used to find an optimal policy. If you want to constrain the solution to stochastic policies, then this may only be possible if you are willing to consider a restricted set of stochastic policies. Importantly, value iteration should work with $\epsilon$-greedy policies, because a variation of the Policy Improvement Theorem still holds. This means that repeatedly choosing the maximising action with $\pi(a|s) = 1 -\epsilon$ and evaluating this should still head towards an optimal solution.
Value iteration using action values will typically sweep through all state, action pairs $(s,a)$ updating the q values like this:
$$q(s,a) = \sum_{s',r} p(s',r|s,a)\left(r + \gamma \max_{a'}[q(s',a')]\right)$$
where $p(s',r|s,a)$ is the probability of observing state $s'$ and reward $r$ given $s,a$.
To adjust the formula above for an $\epsilon$-greedy approach, you change the bootstrap term so that it backs up all possible outcomes of your stochastic policy, weighted by the policy's probability function $\pi(a|s)$. This equals $1-\epsilon$ for the current maximum-value action and $\frac{\epsilon}{|\mathcal{A}(s)|}$ for everything else (including another chance of selecting the current maximising action). The set of actions available in state $s$ is written $\mathcal{A}(s)$ below, following the notation used in Sutton & Barto's book Reinforcement Learning: An Introduction:
$$q(s,a) = \sum_{s',r} p(s',r|s,a)\left(r + \gamma\left( (1-\epsilon)\max_{a'}[q(s',a')] + \frac{\epsilon}{|\mathcal{A}(s')|}\sum_{a'}q(s',a') \right)\right)$$
Use the above formula and perform the same repeated sweep through state, action pairs as before (see the sketch below). The resulting policy may choose different "preferred" actions from a deterministic one, because it will tend to avoid risks caused by poor choices due to occasional random actions.
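Here is a minimal sketch of that sweep (Python, hypothetical names; the model is given as transition probabilities `p` and expected transition rewards `r`, which is equivalent in expectation to summing over $p(s',r|s,a)$):

```python
import numpy as np

def epsilon_greedy_value_iteration(p, r, gamma, epsilon, n_sweeps=100):
    """Value iteration whose target is an epsilon-greedy policy.

    p: [S, A, S] transition probabilities p(s'|s,a).
    r: [S, A, S] expected reward for each (s, a, s') transition.
    """
    n_states, n_actions, _ = p.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        # Bootstrap value of each next state s' under the epsilon-greedy policy:
        # (1 - eps) * max_a' q(s', a')  +  (eps / |A(s')|) * sum_a' q(s', a')
        v_next = (1 - epsilon) * q.max(axis=1) + (epsilon / n_actions) * q.sum(axis=1)
        # Back up every (s, a) pair over all possible transitions.
        q = np.einsum('sat,sat->sa', p, r + gamma * v_next[np.newaxis, np.newaxis, :])
    return q
```

Acting greedily with respect to the resulting q values (while still exploring with probability $\epsilon$) gives the "preferred" actions mentioned above.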
You can make similar update rules for other stochastic policies, but you should bear in mind that they may not have backing theory that guarantees convergence to an optimum variant of your policy. Also, if you end up without some maximisation over the $q(s',a')$ term in the algorithm, then you are likely to be performing policy evaluation of some arbitrary policy, as opposed to policy improvement.