I know of at least five ways of initializing the volatility process:
1) Set it equal to $\varepsilon_{t-1}^2$,
2) The sample variance,
3) The unconditional variance of the model, $\omega/(1-\alpha-\beta)$ in the notation used below,
4) Allow it to be a parameter to be estimated,
5) Backcasting with an exponential filter.
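To make the options concrete, here is a minimal Python sketch of 2), 3), and 5); the function name `init_sigma2` and the smoothing parameter `lam = 0.94` are my own illustrative choices, not taken from any particular package:

```python
import numpy as np

def init_sigma2(eps, omega, alpha, beta, method="unconditional", lam=0.94):
    """Return a starting value for sigma_0^2 given residuals eps and
    illustrative GARCH(1,1) parameters (omega, alpha, beta)."""
    if method == "sample":          # option 2: sample variance
        return np.var(eps)
    if method == "unconditional":   # option 3: omega / (1 - alpha - beta)
        return omega / (1.0 - alpha - beta)
    if method == "backcast":        # option 5: exponentially weighted backcast
        w = lam ** np.arange(len(eps))
        w /= w.sum()
        return float(w @ eps**2)
    raise ValueError(f"unknown method: {method}")
```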
The topic is discussed in further detail here.
A conditional volatility model such as the GARCH model is defined by the mean equation
\begin{equation}
r_t = \mu + \sigma_t z_t = \mu + \varepsilon_t
\end{equation}
and the GARCH equation (here for the simple GARCH(1,1))
\begin{equation}
\sigma^2_t = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2
\end{equation}
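Given parameters and residuals, this recursion can be run forward as a simple filter. A minimal sketch (using the unconditional variance as the starting value, one of the choices listed above; the function name is mine):

```python
import numpy as np

def garch_filter(eps, omega, alpha, beta):
    """Run the GARCH(1,1) recursion
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}
    over a series of residuals eps."""
    sigma2 = np.empty(len(eps))
    sigma2[0] = omega / (1.0 - alpha - beta)  # unconditional variance as start
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```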
To perform maximum-likelihood estimation, we must make distributional assumptions on $z_t$. It is typically assumed to be i.i.d. $N(0,1)$.
Conditional on the information set at time $t-1$, we have that
\begin{equation}
r_t \sim N(\mu, \sigma_t^2)
\end{equation}
or
\begin{equation}
\varepsilon_t = r_t - \mu \sim N(0, \sigma_t^2)
\end{equation}
However, when we perform maximum-likelihood estimation, we are interested in the joint density
\begin{equation}
f(\varepsilon_0,\ldots,\varepsilon_T; \theta)
\end{equation}
where $\theta$ is the parameter vector. Iteratively applying the fact that a joint density equals the product of a conditional and a marginal density, we obtain
\begin{eqnarray}
f(\varepsilon_0,\ldots,\varepsilon_T; \theta) &=& f(\varepsilon_0;\theta)\,f(\varepsilon_1,\ldots,\varepsilon_T\vert \varepsilon_0 ;\theta) \\
&=& f(\varepsilon_0;\theta) \prod_{t=1}^T f(\varepsilon_t \vert \varepsilon_{t-1},\ldots,\varepsilon_{0} ;\theta) \\
&=& f(\varepsilon_0;\theta) \prod_{t=1}^T \frac{1}{\sqrt{2\pi \sigma_t^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma_t^2}\right)
\end{eqnarray}
where the last step uses that $\sigma_t^2$ is a known function of $\varepsilon_{t-1},\ldots,\varepsilon_0$ (given an initial value $\sigma_0^2$, which is where the initialization choices above enter), so that $\varepsilon_t \vert \varepsilon_{t-1},\ldots,\varepsilon_0 \sim N(0,\sigma_t^2)$.
Dropping $f(\varepsilon_0;\theta)$ and taking logs, we obtain the (conditional) log-likelihood function
\begin{equation}
L(\theta) = \sum_{t=1}^T \frac{1}{2} \left[-\log2\pi-\log(\sigma_t^2) -\frac{\varepsilon_t^2}{\sigma_t^2}\right]
\end{equation}
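Putting the pieces together, the log-likelihood can be evaluated by running the variance recursion and summing the Gaussian terms, then maximized numerically. A hedged sketch using `scipy.optimize.minimize`; the starting values, bounds, and placeholder data are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, r):
    """Negative Gaussian log-likelihood of a GARCH(1,1), up to the dropped
    f(eps_0) term, for a return series r."""
    mu, omega, alpha, beta = params
    eps = r - mu
    sigma2 = np.empty(len(eps))
    sigma2[0] = np.var(eps)  # sample-variance initialization (option 2 above)
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    ll = 0.5 * (-np.log(2 * np.pi) - np.log(sigma2) - eps**2 / sigma2)
    return -ll.sum()

r = np.random.default_rng(0).normal(scale=0.01, size=1000)  # placeholder data
res = minimize(neg_loglik, x0=[0.0, 1e-5, 0.05, 0.90], args=(r,),
               method="L-BFGS-B",
               bounds=[(None, None), (1e-8, None), (0.0, 1.0), (0.0, 1.0)])
mu, omega, alpha, beta = res.x
```

Note that the box bounds keep each parameter admissible but do not enforce $\alpha + \beta < 1$ jointly; a production implementation would add that constraint.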
To question 1): The exact same steps can be followed for the GJR-GARCH model. The log-likelihood functions are similar but not the same due to the different specification for $\sigma_t^2$.
To question 2): One is free to make whatever assumption about the distribution of the innovations one likes, but the calculations become more tedious. As far as I know, Filtered Historical Simulation is used to produce e.g. VaR forecasts: the fitted innovations are bootstrapped to better reflect the actual empirical distribution, while estimation is still performed using Gaussian (quasi) maximum likelihood.
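A rough sketch of the FHS idea for a one-step-ahead VaR, assuming residuals `eps`, conditional variances `sigma2`, and parameters from a fitted model are already available (the function name and arguments are mine):

```python
import numpy as np

def fhs_var(eps, sigma2, omega, alpha, beta, level=0.01, n_boot=10000, seed=0):
    """One-step-ahead VaR via Filtered Historical Simulation: bootstrap the
    standardized (fitted) innovations instead of assuming normality."""
    rng = np.random.default_rng(seed)
    z = eps / np.sqrt(sigma2)                         # standardized residuals
    sigma2_next = omega + alpha * eps[-1] ** 2 + beta * sigma2[-1]
    draws = rng.choice(z, size=n_boot, replace=True)  # bootstrap innovations
    sim = np.sqrt(sigma2_next) * draws                # simulated next-step eps
    return -np.quantile(sim, level)                   # VaR at the given level
```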
Best Answer
Due to the nature of the autoregressive component in GARCH, shocks dissipate at an exponential rate but never fully.* So to answer your question: for a shock to dissipate fully would take infinitely long, and the long-run variance is never reached, only approached arbitrarily closely.
Therefore, people study the half-life instead: $$ \ell:=\frac{\ln(0.5)}{\ln\left(\sum_{i=1}^s \alpha_i+\sum_{j=1}^r \beta_j\right)} $$ where the $\alpha$s and $\beta$s are the GARCH model coefficients. In the case of a GARCH(1,1), $$ \ell=\frac{\ln(0.5)}{\ln(\alpha_1+\beta_1)}. $$ The half-life tells you after how many periods half of the shock has dissipated.
In your case, $\ell=\frac{\ln(0.5)}{\ln(0.07+0.70)}\approx 2.65$.
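The calculation is easy to script; a one-line sketch for the GARCH(1,1) case:

```python
import numpy as np

def half_life(alpha1, beta1):
    """Periods until half of a shock has dissipated in a GARCH(1,1)."""
    return np.log(0.5) / np.log(alpha1 + beta1)

print(half_life(0.07, 0.70))  # ~2.65, matching the example above
```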
*This is in contrast to an ARCH($s$) model, where shocks dissipate completely after $s$ periods.