The intuition behind the argument that the optimal policy is independent of the initial state is the following:
The optimal policy is defined by a function that selects an action for every possible state, and the actions selected in different states are independent of one another.
Formally speaking, for an unknown initial distribution, the value function to maximize would be the following (not conditioned on the initial state):
$v^\pi = E[ R(s_0,a_0) + \gamma R(s_1,a_1) + \dots \mid \pi ]$
Thus the optimal policy is the policy that maximizes $v^\pi = x_0^T V^\pi$, where $x_0$ is the vector defined by $x_0(s) = \Pr[s_0 = s]$ and $V^\pi$ is the vector whose entry $V^\pi(s)$ is defined as in your question. The optimal policy $\pi^*$ is therefore a solution of the following optimization:
$\max_{\pi} x_0^T V^\pi = \max_{\pi} \sum_{s} x_0(s) V^\pi(s) = \sum_{s} x_0(s) \max_{\pi} V^\pi(s) = \sum_{s} x_0(s) V^*(s)$
where the maximization inside the sum can be carried out state by state because a policy is just an independent choice of an action $a \in A_s$ (the set of actions available at state $s$) for each state. Note that moving the maximum inside the sum is only valid because $x_0(s) \geq 0$ and the action choices at different states are independent. In other words, if there were constraints, for example on $x_t$ (e.g., if the optimization only searched among policies that guarantee $x_t \leq d$ for $t > 0$), then the argument would no longer be valid and the optimal policy would be a function of the initial distribution.
For more details, you can check the following arXiv report:
"Finite-Horizon Markov Decision Processes with State Constraints", by
Mahmoud El Chamie, Behcet Acikmese
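To make the argument concrete, here is a minimal sketch in Python (with a made-up 2-state, 2-action MDP; all transition probabilities and rewards are invented purely for illustration) showing that value iteration yields a single greedy policy, while only the scalar objective $x_0^T V^*$ changes with the initial distribution $x_0$:

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[a][s, s'] = transition probability for action a; R[s, a] = immediate reward
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],    # action 0
              [[0.3, 0.7],
               [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Value iteration for V*
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
    V = Q.max(axis=1)

pi_star = Q.argmax(axis=1)   # greedy policy: one action per state

# The objective x0^T V* depends on x0, but the maximizing policy does not
for x0 in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])):
    print(x0, x0 @ V, pi_star)
```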
Although you can define the reward as a "reward function", and you may have computer code that calculates the reward from a function call with the current state and action as inputs, the reward is typically not considered a mathematical function; it is a variable that you observe.
So to answer this, assuming you mean "reward" where you say "reward function":
"Is reward function needed to be continuous in deep reinforcement learning? It should be noted that the reward is used for gradient computation"
No, there is no requirement for the reward to be drawn from any continuous function. That is because the value of $R_t$ is produced by the environment, independently of the parameters $\theta$ that the policy gradient is taken with respect to. Changing any part of $\theta$ would not change the value observed in the same context (although it may change whether you ever observe the same value again). In fact this is used when deriving your first equation in the Policy Gradient Theorem (see appendix 1 of this paper): the gradient of $r$ is taken to be zero when expanding terms.
Intuitively, the reward is data that your algorithm learns from. It does not make sense to ask about the gradient of the reward w.r.t. learnable parameters, any more than it makes sense to ask about the gradient of input data in supervised learning w.r.t. learnable parameters*.
* In some contexts, e.g. style transfer for images, we do take gradients of input data in order to modify it (technically gradients of a loss function w.r.t. the input, not of the input w.r.t. learnable parameters, which would still be zero). There are RL contexts where you fit a reward structure to observed behaviour and where this could be useful (e.g. inverse reinforcement learning), but that is not what you do when training an agent to optimise total reward in an environment.
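As a minimal sketch of this point (PyTorch, with a made-up single observation and a hard-coded reward value), note how the observed reward enters a REINFORCE-style loss as a plain number, so no gradient flows "through" it to the policy parameters:

```python
import torch

policy = torch.nn.Linear(4, 2)               # maps a 4-dim state vector to 2 action logits
state = torch.randn(4)                        # made-up observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

reward = 3.7                                  # observed from the environment: just a float,
                                              # it could come from any discontinuous rule

loss = -dist.log_prob(action) * reward        # reward only scales grad log pi(a|s); it is
loss.backward()                               # never differentiated w.r.t. theta
```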
You do not have the information needed to take that maximum at the start of learning. In order to know the expected return, or discounted sum of future rewards, you need to have measured it whilst using an already optimal policy.
Iterating towards this goal with a policy based on the best estimates so far, refining those estimates given the current policy (by acting according to that policy and sampling results), then refining the policy based on the better estimates, is essentially how action-value-based methods such as Monte Carlo Control, SARSA or Q Learning work. These are all RL solvers, but they are not always the most efficient for a given problem.
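For instance, a minimal sketch of tabular Q-learning (assuming an environment with the Gymnasium reset/step API and integer states, e.g. FrozenLake) makes this loop explicit: act from the current estimates, sample a result, refine the estimate, and the greedy policy implied by the estimates improves along with them.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily w.r.t. the current estimates (the current policy)
            a = env.action_space.sample() if np.random.rand() < epsilon else Q[s].argmax()
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # refine the estimate towards the sampled target; the greedy policy
            # implied by Q improves as the estimates improve
            Q[s, a] += alpha * (r + gamma * (0 if terminated else Q[s2].max()) - Q[s, a])
            s = s2
    return Q
```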
The score function helps to calculate a sampled estimate of the gradient of the expected return of a parametric policy with respect to its parameters. This means you can use it to perform stochastic gradient ascent directly on a policy, increasing its performance (on average) without necessarily needing to know the action values. The REINFORCE algorithm does not use action values at all. However, algorithms which do, such as Actor-Critic, can be better, and still maintain benefits compared to using a pure action-value approach.
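As a minimal sketch (NumPy, a made-up 3-armed bandit, so each episode is a single step), the score function $\nabla_\theta \log \pi_\theta(a)$ weighted by the sampled return gives a stochastic estimate of the gradient of the expected return, which REINFORCE ascends directly:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 0.5])   # expected reward of each action (made up)
theta = np.zeros(3)                         # softmax preferences: the policy parameters
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal()      # noisy observed return
    grad_log_pi = -probs
    grad_log_pi[a] += 1                     # d/dtheta log pi(a) for a softmax policy
    theta += alpha * r * grad_log_pi        # stochastic gradient ascent (REINFORCE)

print(softmax(theta))                       # most probability mass on the best action (index 1)
```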
Which is better? It depends on the problem. Sometimes it is more efficient to express a policy as a parametric function of the state. A common example of this is when there are many actions, or the action space is continuous. Getting action-value estimates for a large number of actions, and then finding the maximising action over them, is computationally expensive. In those scenarios, it will be more efficient to use a policy gradient method, and the score function is needed to estimate the gradient.
Another common scenario where a direct policy refinement can be better is when the ideal policy is stochastic, e.g. in a scissors/paper/stone game. Expressing this as maximising over action values is not stable: the agent will pick one action until that is exploited against it, then pick another, and so on. An agent using policy gradient and a softmax action choice, however, can learn the optimal ratios in an environment like scissors/paper/stone; two such agents competing should, in theory, converge to the Nash equilibrium of equiprobable actions.
Conversely, sometimes action-value methods will be the more efficient choice. There might be a simpler relationship between the optimal action value and the state than between the policy and the state. A good example of this might be a maze solver (with reward -1 per time step). The optimal action value in each state is simply related to the distance to the exit, whereas the optimal policy has no obvious direct relation to the state, except when expressed as taking the action that minimises that distance.
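As a minimal sketch (a made-up 4x4 grid maze with reward -1 per step and no discounting), the optimal value of each cell is just minus its shortest-path distance to the exit, which is the simple state-to-value relationship referred to above:

```python
from collections import deque

walls = {(1, 1), (2, 1), (1, 3)}        # made-up wall cells
exit_cell = (3, 3)
size = 4

# Breadth-first search from the exit gives the shortest number of steps from each cell
dist = {exit_cell: 0}
queue = deque([exit_cell])
while queue:
    r, c = queue.popleft()
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (r + dr, c + dc)
        if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                and nxt not in walls and nxt not in dist):
            dist[nxt] = dist[(r, c)] + 1
            queue.append(nxt)

V_star = {cell: -d for cell, d in dist.items()}   # V*(s) = -(steps to exit)
print(V_star)
```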