It is not a strict convention, but quite often $\theta$ stands for the set of parameters of a distribution.

So much for plain English; let's look at some examples instead.
Example 1. You want to study the throw of an old-fashioned thumbtack (the ones with a big circular bottom). You assume that the probability that it falls point down is an unknown value that you call $\theta$. You could define a random variable $X$ and say that $X=1$ when the thumbtack falls point down and $X=0$ when it falls point up. You would write the model
$$P(X = 1) = \theta \\
P(X = 0) = 1-\theta,$$
and you would be interested in estimating $\theta$ (here, the probability that the thumbtack falls point down).
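To see what estimating $\theta$ looks like in practice, here is a minimal sketch in Python. The true value `theta_true`, the number of throws, and the seed are all made up for illustration; the maximum-likelihood estimate of $\theta$ is just the fraction of throws that landed point down.

```python
import random

random.seed(0)

theta_true = 0.3   # hypothetical true probability of landing point down
n = 10_000

# Simulate n throws: X = 1 with probability theta_true, else 0
throws = [1 if random.random() < theta_true else 0 for _ in range(n)]

# The maximum-likelihood estimate of theta is simply the sample mean
theta_hat = sum(throws) / n
print(theta_hat)  # close to 0.3
```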
Example 2. You want to study the disintegration of a radioactive atom. Based on the literature, you know that the amount of radioactivity decreases exponentially, so you decide to model the time to disintegration with an exponential distribution. If $t$ is the time to disintegration, the model is
$$f(t) = \theta e^{-\theta t}.$$
Here $f(t)$ is a probability density, which means that the probability that the atom disintegrates in the time interval $(t, t+dt)$ is $f(t)dt$. Again, you will be interested in estimating $\theta$ (here, the disintegration rate).
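The same idea works for the exponential model. In this sketch the rate `theta_true` is a made-up value; for an exponential distribution the maximum-likelihood estimate of $\theta$ is the reciprocal of the sample mean of the observed times.

```python
import random

random.seed(1)

theta_true = 2.0   # hypothetical disintegration rate (per unit time)
n = 10_000

# Simulate n disintegration times from the density f(t) = theta * exp(-theta * t)
times = [random.expovariate(theta_true) for _ in range(n)]

# The maximum-likelihood estimate of theta is 1 / (sample mean)
theta_hat = n / sum(times)
print(theta_hat)  # close to 2.0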
Example 3. You want to study the precision of a weighing instrument. Based on the literature, you know that the measurements are Gaussian, so you decide to model the weighing of a standard 1 kg object as
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}.$$
Here $x$ is the measurement given by the scale, $f(x)$ is the probability density, and the parameters are $\mu$ and $\sigma$, so $\theta = (\mu, \sigma)$. The parameter $\mu$ is the target weight (the scale is biased if $\mu \neq 1$), and $\sigma$ is the standard deviation of the measurement every time you weigh the object. Again, you will be interested in estimating $\theta$ (here, the bias and the imprecision of the scale).
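Since $\theta = (\mu, \sigma)$ here, estimating $\theta$ means estimating both parameters at once. A minimal sketch, with made-up values for the scale's bias and imprecision: the maximum-likelihood estimates are the sample mean and the (biased) sample standard deviation.

```python
import math
import random

random.seed(2)

mu_true, sigma_true = 1.002, 0.01   # hypothetical bias and imprecision of the scale
n = 10_000

# Simulate n weighings of the 1 kg standard object
weights = [random.gauss(mu_true, sigma_true) for _ in range(n)]

# Maximum-likelihood estimates: sample mean and (biased) sample standard deviation
mu_hat = sum(weights) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in weights) / n)
print(mu_hat, sigma_hat)  # close to (1.002, 0.01)
```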
Vanilla means the standard, usual, or unmodified version of something. Vanilla gradient descent means the basic gradient descent algorithm without any bells or whistles.
There are many variants of gradient descent. In usual gradient descent (also known as batch gradient descent or vanilla gradient descent), the gradient is computed as the average of the gradients of all the datapoints:
$$\nabla f = \frac{1}{n}\sum_i \nabla \text{loss}(x_i)$$
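The averaged-gradient update can be sketched on a toy one-dimensional least-squares problem (the data, learning rate, and iteration count are all made up for illustration):

```python
# Vanilla (batch) gradient descent for a 1-D least-squares fit y ≈ w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x

w = 0.0
lr = 0.01
for _ in range(500):
    # Gradient of the mean squared loss, averaged over ALL datapoints
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(w)  # close to 2
```

The key feature that makes this "vanilla" is that every update uses the full dataset, with no momentum, adaptive learning rates, or mini-batching.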
In stochastic gradient descent with a batch size of one, we might estimate the gradient as
$$\nabla f \approx \nabla \text{loss}(x^*),$$
where $x^*$ is a single datapoint sampled uniformly at random from the entire dataset. This is a variant of plain gradient descent, so it wouldn't be called vanilla gradient descent. However, since even stochastic gradient descent has many variants, you might call this "vanilla stochastic gradient descent" when comparing it to fancier SGD alternatives, for example, SGD with momentum.
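For contrast, here is the same toy least-squares problem solved with stochastic gradient descent at batch size one: each update uses the gradient at a single randomly sampled datapoint rather than the full average. Again, the data, learning rate, and iteration count are made up for illustration.

```python
import random

random.seed(0)

# SGD (batch size 1) for the same toy 1-D least-squares fit y ≈ w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]

w = 0.0
lr = 0.01
for _ in range(2000):
    i = random.randrange(len(xs))            # sample one datapoint x* at random
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # gradient of the loss at that single point
    w -= lr * grad

print(w)  # noisy, but close to 2
```

Each step is cheaper than a batch step but noisier, which is exactly the trade-off that distinguishes SGD from vanilla gradient descent.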