A conditional volatility model such as the GARCH model is defined by the mean equation
\begin{equation}
r_t = \mu + \sigma_t z_t = \mu + \varepsilon_t
\end{equation}
and the variance equation (here for the simple GARCH(1,1))
\begin{equation}
\sigma^2_t = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2
\end{equation}
To perform maximum-likelihood estimation, we must make distributional assumptions on $z_t$. It is typically assumed to be i.i.d. $N(0,1)$.
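Under the Gaussian assumption, the two equations above can be simulated directly. A minimal sketch in Python (the parameter values are arbitrary illustrations, not estimates from any data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (assumptions, not estimates)
mu, omega, alpha, beta = 0.0, 0.1, 0.08, 0.9

T = 1000
r = np.empty(T)
sigma2 = np.empty(T)
sigma2[0] = omega / (1.0 - alpha - beta)  # start at the unconditional variance

for t in range(T):
    if t > 0:
        # GARCH equation: variance is known from time t-1 information
        eps_prev = r[t - 1] - mu
        sigma2[t] = omega + alpha * eps_prev**2 + beta * sigma2[t - 1]
    # Mean equation: r_t = mu + sigma_t * z_t with z_t ~ N(0,1)
    r[t] = mu + np.sqrt(sigma2[t]) * rng.standard_normal()
```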
Conditional on the information set at time $t-1$, we have that
\begin{equation}
r_t \sim N(\mu, \sigma_t^2)
\end{equation}
or
\begin{equation}
\varepsilon_t = r_t - \mu \sim N(0, \sigma_t^2)
\end{equation}
However when we perform maximum-likelihood estimation, we are interested in the joint distribution
\begin{equation}
f(\varepsilon_0,...,\varepsilon_T; \theta)
\end{equation}
where $\theta$ is the parameter vector. Using iteratively that the joint distribution is equal to the product of the conditional and the marginal density, we obtain
\begin{eqnarray}
f(\varepsilon_0,...,\varepsilon_T; \theta) &=& f(\varepsilon_0;\theta)f(\varepsilon_1,...,\varepsilon_T\vert \varepsilon_0 ;\theta) \\
&=& f(\varepsilon_0;\theta) \prod_{t=1}^T f(\varepsilon_t \vert \varepsilon_{t-1},...,\varepsilon_{0} ;\theta) \\
&=& f(\varepsilon_0;\theta) \prod_{t=1}^T \frac{1}{\sqrt{2\pi \sigma_t^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma_t^2}\right)
\end{eqnarray}
where the last step uses that $\sigma_t^2$ is a deterministic function of $\varepsilon_{t-1},...,\varepsilon_0$ (given an initial value), so the conditional density of $\varepsilon_t$ given the past is $N(0,\sigma_t^2)$.
Dropping $f(\varepsilon_0;\theta)$ and taking logs, we obtain the (conditional) log-likelihood function
\begin{equation}
L(\theta) = \sum_{t=1}^T \frac{1}{2} \left[-\log2\pi-\log(\sigma_t^2) -\frac{\varepsilon_t^2}{\sigma_t^2}\right]
\end{equation}
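This log-likelihood is straightforward to code: filter $\sigma_t^2$ recursively, then sum the Gaussian log-densities. A sketch in Python (initializing $\sigma_1^2$ at the sample variance is one common convention, an assumption here; the function and variable names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def garch_neg_loglik(theta, r):
    """Negative of the conditional log-likelihood L(theta) above.

    theta = (mu, omega, alpha, beta); sigma_1^2 is initialized at the
    sample variance of the returns (one common convention).
    """
    mu, omega, alpha, beta = theta
    eps = r - mu
    T = len(r)
    sigma2 = np.empty(T)
    sigma2[0] = np.var(r)
    for t in range(1, T):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    ll = 0.5 * np.sum(-np.log(2 * np.pi) - np.log(sigma2) - eps**2 / sigma2)
    return -ll

# Usage (with a return series r), minimizing subject to simple bounds:
# res = minimize(garch_neg_loglik, x0=[0.0, 0.1, 0.05, 0.9], args=(r,),
#                bounds=[(None, None), (1e-8, None), (0, 1), (0, 1)])
```

In practice one would also enforce the stationarity constraint $\alpha + \beta < 1$, e.g. via a penalty or a reparametrization.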
To question 1): The exact same steps can be followed for the GJR-GARCH model. The log-likelihood functions are similar but not identical, because the specification for $\sigma_t^2$ differs.
To question 2): One is free to make any distributional assumption for the innovations, but the calculations may become more tedious. As far as I know, Filtered Historical Simulation is used to produce e.g. VaR forecasts: the "fitted" innovations are bootstrapped to better match the actual empirical distribution, while estimation is still performed by Gaussian (quasi-)maximum likelihood.
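A rough sketch of that filtered-historical-simulation idea for a one-step-ahead VaR forecast (the function name and interface are hypothetical; the residuals and fitted variances are assumed to come from a prior Gaussian QML fit):

```python
import numpy as np

def fhs_var_forecast(eps, sigma2, omega, alpha, beta,
                     n_boot=10000, level=0.01, seed=0):
    """One-step-ahead VaR by filtered historical simulation (a sketch).

    eps, sigma2: in-sample residuals and fitted conditional variances
    from a Gaussian QML fit (assumed given). The standardized residuals
    z_t = eps_t / sigma_t are resampled instead of drawing from N(0,1).
    """
    rng = np.random.default_rng(seed)
    z_hat = eps / np.sqrt(sigma2)                  # fitted innovations
    # Next-period variance is deterministic given time-T information
    sigma2_next = omega + alpha * eps[-1] ** 2 + beta * sigma2[-1]
    z_star = rng.choice(z_hat, size=n_boot, replace=True)  # bootstrap
    r_star = np.sqrt(sigma2_next) * z_star         # simulated next returns
    return -np.quantile(r_star, level)             # VaR at the given level
```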
GARCH was designed to address time-varying volatility in financial and economic return time series. Note that the returns in these series usually have very small or no drift at all. Hence $E[X_t]\approx 0$, and consequently $X_t^2$ is a good estimate of the observed variance $\operatorname{Var}\left[X_t\right]$.
This explains the form of the second GARCH equation. Compare it to typical exponential smoothing:
$$s_t=\lambda x_t+(1-\lambda)s_{t-1}$$
So the idea was to construct a process in which the previous volatility $\sigma_{t-1}^2$ is constantly updated by the newly observed volatility estimate $X_{t-1}^2$, and together they define the volatility $\sigma_t^2$ for the next time step.
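The analogy can be made exact: with $\omega=0$, $\alpha=\lambda$ and $\beta=1-\lambda$, the GARCH variance recursion reduces to exponential smoothing of squared observations (the EWMA/RiskMetrics special case). A tiny numerical check with made-up numbers:

```python
import numpy as np

lam = 0.06  # smoothing weight; made-up value
x2 = np.array([1.0, 4.0, 0.25, 2.25])  # made-up squared returns

# Exponential smoothing: s_t      = lam * x_t      + (1 - lam) * s_{t-1}
# GARCH(1,1) variance:   sigma2_t = omega + alpha * x_{t-1}^2 + beta * sigma2_{t-1}
# With omega = 0, alpha = lam, beta = 1 - lam the recursions coincide
# (up to the timing convention: GARCH uses the *previous* squared return).
s = x2[0]
sigma2 = x2[0]
for t in range(1, len(x2)):
    s = lam * x2[t] + (1 - lam) * s
    sigma2 = 0.0 + lam * x2[t] + (1 - lam) * sigma2
assert abs(s - sigma2) < 1e-12
```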
Note that at time $t-1$ all terms on the right-hand side of your second GARCH equation are already observed, which means that $\sigma_t^2$ is deterministic at time $t-1$. The random draw of the return $X_t$ then happens in the first equation, using this volatility $\sigma_t$, which is known at time $t-1$.
When you understand this background of GARCH, it should be clear that the answer to your question must be $\sigma_t^2$.
Best Answer
Typically, we will assume that $X_t = \sigma_t Z_t$, where $Z_t$ is i.i.d.$(0,1)$.
How to find the log-likelihood is described in Maximum likelihood in the GJR-GARCH(1,1) model,
and how to implement the procedure is described in Fitting a GARCH(1, 1) model.
Regarding the initial value $\sigma_1^2$, I have seen the approaches in Initial value of the conditional variance in the GARCH process.
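For reference, two common initializations are the sample variance of the returns and the unconditional variance $\omega/(1-\alpha-\beta)$. A small sketch (the function name and `method` labels are mine, not a standard API):

```python
import numpy as np

def initial_variance(r, omega=None, alpha=None, beta=None, method="sample"):
    """Two common choices for sigma_1^2 in the GARCH recursion:
    - "sample": the sample variance of the return series
    - "unconditional": omega / (1 - alpha - beta), valid when alpha + beta < 1
    """
    if method == "sample":
        return np.var(r)
    if method == "unconditional":
        return omega / (1.0 - alpha - beta)
    raise ValueError(f"unknown method: {method}")
```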