Intuition needed for this question
Some intuition on the difference between a-priori and a-posteriori:
Let $\{Y_1, ..., Y_4\}$ be i.i.d. uniform random variables over $[0,1]$. So for each $m \in \{1, ..., 4\}$ we have
$$P[Y_m>15/16]=1/16 = 0.0625$$
A-priori:
Let's a-priori pick an index $K \in \{1, 2, 3, 4\}$, independent of the $Y_i$ variables and before observing these variables. We can use any mass function $P[K=m]$ for $m \in \{1, 2, 3, 4\}$. What is $P[Y_K>15/16]$? It is still $1/16$ (regardless of the mass function we use) because, by the law of total probability:
\begin{align}
P[Y_K>15/16] &= \sum_{m=1}^4P[Y_K>15/16|K=m]P[K=m]\\
&=\sum_{m=1}^4\underbrace{P[Y_m>15/16|K=m]}_{P[Y_m>15/16]}P[K=m] \\
&=\sum_{m=1}^4 (1/16)P[K=m]\\
&=1/16
\end{align}
where the equality $P[Y_m>15/16|K=m]=P[Y_m>15/16]$ holds because $Y_m$ is independent of $K$.
A-posteriori:
Now let's a-posteriori pick the index $K$ associated with the largest $Y_m$ value, so $Y_K = \max[Y_1, Y_2, Y_3, Y_4]$. This choice of index $K$ depends on the observed data. Then:
\begin{align}
P[Y_K>15/16] &= 1-P[Y_1\leq 15/16, Y_2\leq 15/16, Y_3\leq 15/16,Y_4\leq 15/16]\\
&=1-(15/16)^4 \\
&\approx 0.2275
\end{align}
and so this probability is much larger than 1/16, even though $Y_K$ is just one of the $Y_m$ values and we know $P[Y_m>15/16]=1/16$ for all $m \in \{1, ..., 4\}$.
On the other hand, we know by the union bound that
$$\{Y_K>15/16\} \subseteq \cup_{m=1}^4 \{Y_m>15/16\} \implies P[Y_K>15/16]\leq \sum_{m=1}^4P[Y_m>15/16]=1/4 $$
and indeed $0.2275 \leq 1/4$.
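The gap between the a-priori and a-posteriori probabilities is easy to verify numerically. Here is a minimal Monte Carlo sketch in Python (the trial count and seed are arbitrary choices, not part of the argument):

```python
import random

random.seed(0)
TRIALS = 200_000
apriori_hits = 0       # index K chosen before observing the data
aposteriori_hits = 0   # index K = argmax chosen after observing the data

for _ in range(TRIALS):
    y = [random.random() for _ in range(4)]  # Y_1, ..., Y_4 i.i.d. Uniform[0,1]
    k = random.randrange(4)                  # a-priori index, independent of y
    if y[k] > 15 / 16:
        apriori_hits += 1
    if max(y) > 15 / 16:                     # a-posteriori index: Y_K = max Y_m
        aposteriori_hits += 1

print(apriori_hits / TRIALS)      # close to 1/16 = 0.0625
print(aposteriori_hits / TRIALS)  # close to 1 - (15/16)^4 ~ 0.2275
```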
Your specific setup
As in my above comment, I believe we need the following extra assumptions: We have a finite set $\mathcal{X}$ and an unknown function $f:\mathcal{X}\rightarrow\mathbb{R}$. We are certain that $f$ is one of the $M$ functions in the known set $\{h_1, ..., h_M\}$. We can evaluate $f$ exactly, one point at a time. However, the set $\mathcal{X}$ is too large to evaluate exhaustively, so we resort to a probabilistic estimate. At every time step $i$ we independently choose $X_i \in \mathcal{X}$, uniformly over all points in $\mathcal{X}$, and then observe the value $f(X_i)$ (with no noise).
So for a given function $h_m$ we define $\phi_m$ by:
$$ \phi_m = P[f(X_i) \neq h_m(X_i)] = \frac{\mbox{number of points $x \in \mathcal{X}$ such that $f(x)\neq h_m(x)$}}{|\mathcal{X}|}$$
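As a toy illustration of this definition (the set $\mathcal{X}$, the "true" $f$, and the single candidate $h$ below are all hypothetical, chosen only for this example), $\phi_m$ is just a disagreement fraction over $\mathcal{X}$:

```python
# Hypothetical tiny example: X = {0, ..., 9}.
X = range(10)

def f(x):   # the unknown "true" function, assumed here for illustration
    return x % 2

def h(x):   # one candidate hypothesis h_m, assumed here for illustration
    return 0

# phi_m = (number of x in X with f(x) != h(x)) / |X|
phi = sum(f(x) != h(x) for x in X) / len(X)
print(phi)  # 0.5: h disagrees with f on the 5 odd points of X
```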
For each $m \in \{1, ..., M\}$ define the sequence of indicator functions $\{I^{(m)}_i\}_{i=1}^{\infty}$ by
$$ I^{(m)}_i = \left\{ \begin{array}{ll}
1 &\mbox{ if $h_m(X_i) \neq f(X_i)$} \\
0 & \mbox{ otherwise}
\end{array}
\right.$$
For any fixed $m \in \{1, ..., M\}$ the $\{I^{(m)}_i\}_{i=1}^{\infty}$ indicators are i.i.d. so we can apply Hoeffding. Note that for each individual $m$, Hoeffding is a bound on the following a-priori probability:
$$P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon\right] \leq 2e^{-2\epsilon^2 N} \quad (Eq. 1)$$
and it is nice that the bound on the right-hand-side does not depend on the index $m$.
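One can check (Eq. 1) empirically for a single fixed $m$ by treating the indicators as i.i.d. Bernoulli($\phi_m$) draws. In the sketch below the parameters ($\phi_m = 0.3$, $N = 100$, $\epsilon = 0.1$) are arbitrary choices for illustration:

```python
import math
import random

random.seed(1)
# Hypothetical parameters: phi is the true disagreement probability for one
# fixed hypothesis h_m; N samples per experiment; eps is the tolerance.
phi, N, eps = 0.3, 100, 0.1
TRIALS = 100_000

deviations = 0
for _ in range(TRIALS):
    errors = sum(random.random() < phi for _ in range(N))  # sum of I_i^(m)
    if abs(errors / N - phi) > eps:
        deviations += 1

empirical = deviations / TRIALS
bound = 2 * math.exp(-2 * eps**2 * N)  # the Hoeffding bound in (Eq. 1)
print(empirical, bound)  # empirical frequency stays below the bound
```

The Hoeffding bound is typically far from tight (here it is roughly an order of magnitude above the empirical frequency), but it holds uniformly for every fixed $m$.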
Your specific questions
1) Your first question asks why Hoeffding requires a fixed function $h$ rather than a random function $H$. It is because Hoeffding applies to i.i.d. random variables. If we fix an index $m \in \{1, ..., M\}$ then the indicators $\{I^{(m)}_1, I^{(m)}_2, I^{(m)}_3, ...\}$ are i.i.d. over the indices $i \in \{1, 2, 3, ...\}$. If instead we have a random index $K$ then the indicators $\{I^{(K)}_1, I^{(K)}_2, I^{(K)}_3, ...\}$ are not necessarily independent, because they share the common random index $K$.
2-4) Your remaining questions boil down to the difference between a-priori probability and a-posteriori probability. The Hoeffding bounds in (Eq. 1) are a-priori bounds that hold for all $m \in \{1, ..., M\}$. They are bounds on the probability the data behaves in a certain way. That probability (and its bound) is determined without observing the actual data outcome (in the same way that the probability of a fair coin flip being heads is 1/2, and this probability is determined without looking at the outcome of the flip).
If we a-priori pick a random index $K \in \{1, ..., M\}$ (before observing the data and independent of the data), then we do not need the union bound:
\begin{align}
P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(K)} - \phi_K\right|>\epsilon\right] &= \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(K)} - \phi_K\right|>\epsilon | K=m\right]P[K=m] \\
&= \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon | K=m \right]P[K=m] \\
&\overset{(a)}{=} \sum_{m=1}^M P\left[\left|\frac{1}{N}\sum_{i=1}^NI_i^{(m)} - \phi_m\right|>\epsilon \right]P[K=m] \\
&\leq \sum_{m=1}^M \left[2e^{-2\epsilon^2 N}\right]P[K=m]\\
&= 2e^{-2\epsilon^2 N}
\end{align}
where equality (a) holds because the random index $K$ is independent of the data $\{I^{(m)}_i\}_{i=1}^{\infty}$.
So, if we were to a-priori pick a guess function $g$, independent of the data, by just picking a random index, the bound would indeed be $2e^{-2\epsilon^2 N}$ rather than $2M e^{-2\epsilon^2 N}$.
However, if we observe the results of the data and then a-posteriori pick a random index $K \in \{1, ..., M\}$, the way we choose might lead us to pick an index that exhibits "atypical" a-posteriori sample paths. So the equality (a) in the above chain of equalities does not necessarily hold for such picks. We need to be more careful by using the union bound.
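This inflation from a-posteriori selection is also easy to see in simulation. In the sketch below (all parameters hypothetical), each of the $M$ hypotheses has the same true error $\phi = 0.5$ with mutually independent indicators, and we compare a fixed index against the a-posteriori index that minimizes the empirical error:

```python
import math
import random

random.seed(2)
# Hypothetical setup: M hypotheses whose indicators are independent
# Bernoulli(phi); we a-posteriori pick the one with the smallest
# empirical error and check how often its sample mean deviates by eps.
M, N, phi, eps = 50, 50, 0.5, 0.15
TRIALS = 5_000

fixed_dev = 0      # deviation events for a fixed index m = 0
selected_dev = 0   # deviation events for the a-posteriori argmin index

for _ in range(TRIALS):
    means = [sum(random.random() < phi for _ in range(N)) / N
             for _ in range(M)]
    if abs(means[0] - phi) > eps:
        fixed_dev += 1
    if abs(min(means) - phi) > eps:  # picking the smallest empirical error
        selected_dev += 1

print(fixed_dev / TRIALS, selected_dev / TRIALS)
```

The fixed-index frequency respects the single-hypothesis Hoeffding bound, while the a-posteriori pick deviates an order of magnitude more often; that excess is exactly what the union bound is there to control.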
Best Answer
Hoeffding's inequalities for absolute values are derived by first bounding the value itself and then doubling the bound to cover the absolute value. But the question here asks for a bound on the maximum of the absolute values, not on the absolute value of the maximum, so a direct examination of the absolute value is needed.
To compact the notation, we define $S_k=\Big | X_1+\cdots+ X_k \Big |$. Then we are examining
$$\mathbb P\Big( \max_{1\le k\le n} \Big | \frac{S_k}{\sqrt{k}} \Big | \ge t\Big) = \mathbb P\Big( \bigcup_k \Big\{ \Big|\frac{S_k}{\sqrt{k}} \Big | \ge t \Big\} \Big)$$
Denote by $I_Z$ the indicator function of the event $Z= \Big( \bigcup_k \Big\{ \Big|\frac{S_k}{\sqrt{k}} \Big |\ge t \Big\} \Big)$, and by $I_1,..., I_n$ the indicator functions of the events $\Big\{ \Big|\frac{S_1}{\sqrt{1}} \Big |\ge t \Big\},...,\Big\{ \Big|\frac{S_n}{\sqrt{n}} \Big |\ge t \Big\} $, respectively. Then by the properties of indicator functions $$I_Z=\max\Big \{I_1,...,I_n\Big \}$$ If $I_Z =0$, then all $I_i,\; i=1,...,n$, are zero. If $I_Z=1$, then at least one of these individual indicator functions equals unity; denote one such indicator by $I_m$, so that $$I_m =1 \Rightarrow \Big|\frac{S_m}{\sqrt{m}} \Big |\ge t $$ Let $v = \text{argmax}_k \Big | \frac{S_k}{\sqrt{k}} \Big |$ and let $\Big | \frac{S_v}{\sqrt{v}} \Big |$ be the corresponding variable. Then in the case where $I_Z =1$ we have
$$\Big | \frac{S_v}{\sqrt{v}} \Big |\ge \Big|\frac{S_m}{\sqrt{m}} \Big |\ge t$$
Then, in all cases, whether $I_Z=0$ or $I_Z=1$, we have, for any $h>0$,
$$I_Z \le \exp \left \{h\left (\Big | \frac{S_v}{\sqrt{v}} \Big |-t\right) \right \}$$
Therefore,
$$\mathbb P\Big( \max_{1\le k\le n} \Big | \frac{S_k}{\sqrt{k}} \Big | \ge t\Big) = \mathbb P\Big( \bigcup_k \Big\{ \Big|\frac{S_k}{\sqrt{k}} \Big | \ge t \Big\} \Big) = EI_Z \le E\exp \left \{h\left (\Big | \frac{S_v}{\sqrt{v}} \Big |-t\right) \right \}$$
$$\Rightarrow \mathbb P\Big( \max_{1\le k\le n} \Big | \frac{S_k}{\sqrt{k}} \Big | \ge t\Big) \le e^{-ht} E\exp \left \{h \Big | \frac{S_v}{\sqrt{v}} \Big | \right \} \qquad [1] $$
By Hoeffding's lemma, for a random variable $Y$, with $EY=0,\;a\le Y \le b$ we have, for any real $\lambda$ $$ E\left (e^{\lambda Y} \right) \le \exp \left(\frac{\lambda ^2 (b-a)^2}{8} \right) \qquad [2]$$ In our case, we have $$\Big | \frac{S_v}{\sqrt{v}} \Big | = \frac{1}{\sqrt{v}}\Big |X_1 +...+ X_v\Big | \le \frac{1}{\sqrt{v}}\Big (\Big |X_1 \Big | +...+ \Big | X_v\Big | \Big) \le \frac{1}{\sqrt{v}}v = \sqrt{v}$$
the last inequality following from the initial assumptions. Since $v\le n$ we have $$0\le \Big | \frac{S_v}{\sqrt{v}} \Big | \le \sqrt{n} \Rightarrow -E\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big) \le \Big | \frac{S_v}{\sqrt{v}} \Big | - E\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big) \le \sqrt{n} -E\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big) $$ We now have a bounded variable with zero expected value. The length of the interval is $b-a = \sqrt{n} -E\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big) + E\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big) = \sqrt{n} $, and $\lambda = h $. Inserting these values into Hoeffding's lemma and simplifying, we get
$$ E\exp \Big \{h\Big | \frac{S_v}{\sqrt{v}} \Big | - hE\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big)\Big \} \le \exp \left(\frac{nh^2}{8} \right) $$ $$\Rightarrow E\exp \Big \{h\Big | \frac{S_v}{\sqrt{v}} \Big | \Big \} \le \exp \left(\frac{nh^2}{8} \right)\exp\Big \{hE\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big)\Big \} \qquad [3]$$ Note that the exponential that moved to the right-hand side does not contain any random quantities, which is why the expectation has disappeared. From before we have $$ \Big | \frac{S_v}{\sqrt{v}} \Big | \le \sqrt{n} \Rightarrow E\Big | \frac{S_v}{\sqrt{v}} \Big | \le \sqrt{n} \Rightarrow \exp\Big \{hE\Big(\Big | \frac{S_v}{\sqrt{v}} \Big |\Big)\Big \} \le \exp\Big \{h\sqrt{n}\Big \} \le \exp\Big \{h^2n\Big \} \; [4] $$ (note that the last step uses $h\sqrt{n} \le h^2 n$, which requires $h \ge 1/\sqrt{n}$).
Inserting the RHS of $[4]$ into $[3]$ and back in $[1]$ we obtain
$$ \mathbb P\Big( \max_{1\le k\le n} \Big | \frac{S_k}{\sqrt{k}} \Big | \ge t\Big )= \mathbb P\Big( \Big | \frac{S_v}{\sqrt{v}} \Big | \ge t\Big) \le \exp \left(-ht +\frac{9nh^2}{8} \right) \qquad [5]$$ Minimizing the RHS over $h$ we obtain $h^* = \frac {4}{9n}t$, and inserting into $[5]$ we finally obtain
$$ \mathbb P\Big( \max_{1\le k\le n} \Big | \frac{S_k}{\sqrt{k}} \Big | \ge t\Big )=\mathbb P\Big( \Big | \frac{S_v}{\sqrt{v}} \Big | \ge t\Big) \le \exp\Big \{-\frac{2}{9n}t^2\Big \} \qquad [6] $$
...which is a bound that depends on the number $n$ of random variables involved.
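A quick numerical sanity check of $[6]$, under the assumption $|X_i| \le 1$, $EX_i = 0$ (taking $X_i \sim \text{Uniform}[-1,1]$; the values of $n$ and $t$ are arbitrary choices for illustration):

```python
import math
import random

random.seed(3)
# Sanity check of bound [6]: X_i ~ Uniform[-1, 1], so |X_i| <= 1, E[X_i] = 0.
n, t = 100, 3.0
TRIALS = 20_000

exceed = 0
for _ in range(TRIALS):
    s, biggest = 0.0, 0.0
    for k in range(1, n + 1):
        s += random.uniform(-1, 1)              # running sum X_1 + ... + X_k
        biggest = max(biggest, abs(s) / math.sqrt(k))
    if biggest >= t:                            # max_k |S_k| / sqrt(k) >= t
        exceed += 1

bound = math.exp(-2 * t**2 / (9 * n))
print(exceed / TRIALS, bound)  # empirical frequency vs the exp(-2t^2/(9n)) bound
```

At this scale the bound is very loose (it only becomes informative for $t$ on the order of $\sqrt{n}$), but the simulation confirms the empirical frequency sits below it.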