> one way to sample is to apply argmax(softmax($\alpha_j$))
That is hardly "sampling", given that you deterministically pick the largest $\alpha_j$ every time. (Also, you said that $\alpha$ is the unnormalized probability, but that doesn't make sense, seeing as log-probabilities, i.e. logits, go into the softmax.) The correct way to sample would be sample(softmax($x$)), where $x$ is the vector of logits. Indeed, the goal of Gumbel-softmax is not to replace the softmax operation as you've written it, but the sampling operation:
We can replace sample($p$), where $p$ is a vector of probabilities, with argmax($\log p + g$), where $g$ is i.i.d. Gumbel(0, 1) noise. Of course, this is equivalent to argmax($x + g$), where $x$ are again the logits, since $\log p$ and $x$ differ only by a constant that the argmax ignores. To conclude, sample(softmax($x$)) and argmax($x+g$) are equivalent procedures; this is known as the Gumbel-max trick.
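A quick numerical check of this equivalence (a sketch; the logits here are arbitrary):

```python
# Empirical check of the Gumbel-max trick: sampling from softmax(x) and taking
# argmax(x + g) with g ~ Gumbel(0, 1) produce the same categorical distribution.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -1.0, 2.0])     # arbitrary logits
p = np.exp(x) / np.exp(x).sum()         # softmax probabilities

n = 200_000
# Procedure 1: sample directly from the categorical distribution softmax(x).
direct = rng.choice(len(x), size=n, p=p)

# Procedure 2: perturb the logits with Gumbel noise, then take the argmax.
g = rng.gumbel(size=(n, len(x)))
gumbel_max = np.argmax(x + g, axis=1)

freq_direct = np.bincount(direct, minlength=len(x)) / n
freq_gumbel = np.bincount(gumbel_max, minlength=len(x)) / n
print(freq_direct)   # both should be close to p
print(freq_gumbel)
```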
> Then, if the goal was to have the full distribution over possible outcomes for $z_j$, we can use softmax transformation on top of the perturbation with Gumbel noise.
In fact, you already have a distribution over all possible outcomes: softmax($x$) itself.
However, argmax($x+g$) is not differentiable w.r.t. $x$; therefore, to backpropagate, we replace its gradient with the gradient of softmax($(x+g)\tau^{-1}$). As $\tau \rightarrow 0$, this expression approaches the one-hot argmax.
Picking a reasonably small value of $\tau$ ensures a good estimate of the gradient while keeping the gradients numerically well behaved.
> and $\tau=1$ just makes the two equations identical
In fact, there is no special significance to $\tau = 1$. Rather, $\tau \rightarrow 0$ makes the gradient estimate unbiased but high in variance, whereas larger values of $\tau$ add more bias to the gradient estimate but lower the variance.
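To see the role of $\tau$ concretely, here is a small sketch (the logits and temperatures are arbitrary) that draws relaxed samples softmax$((x+g)/\tau)$ at a few temperatures:

```python
# Sketch of how the temperature tau trades off between a soft distribution and
# a hard argmax; the logits x and the tau values below are illustrative.
import numpy as np

def gumbel_softmax_sample(x, tau, rng):
    """One relaxed sample: softmax((x + g) / tau) with g ~ Gumbel(0, 1)."""
    g = rng.gumbel(size=x.shape)
    y = (x + g) / tau
    y = y - y.max()                      # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -1.0, 2.0])      # arbitrary logits

for tau in [10.0, 1.0, 0.1]:
    sample = gumbel_softmax_sample(x, tau, rng)
    print(f"tau={tau:5.1f}  sample={np.round(sample, 3)}")
# As tau -> 0 the sample approaches a one-hot vector (a hard argmax);
# as tau grows it flattens toward the uniform distribution.
```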
Gregory Gundersen wrote a blog post about this in 2018. He explicitly answers the question:
> What does a "random node" mean and what does it mean for backprop to "flow" or not flow through such a node?
The following excerpt should answer your questions:
Undifferentiable expectations
Let’s say we want to take the gradient w.r.t. $\theta$ of the following expectation, $$\mathbb{E}_{p(z)}[f_{\theta}(z)]$$ where $p$ is a density. Provided we can differentiate $f_{\theta}(z)$, we can easily compute the gradient:
$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p(z)}[f_{\theta}(z)]
&= \nabla_{\theta} \Big[ \int_{z} p(z) f_{\theta}(z) dz \Big] \\
&= \int_{z} p(z) \Big[\nabla_{\theta} f_{\theta}(z) \Big] dz \\
&= \mathbb{E}_{p(z)} \Big[\nabla_{\theta} f_{\theta}(z) \Big] \end{align}
$$
In words, the gradient of the expectation is equal to the expectation of the gradient. But what happens if our density $p$ is also parameterized by $\theta$?
$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] &= \nabla_{\theta} \Big[ \int_{z} p_{\theta}(z) f_{\theta}(z) dz \Big] \\ &= \int_{z} \nabla_{\theta} \Big[ p_{\theta}(z) f_{\theta}(z) \Big] dz \\ &= \int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z)dz + \int_{z} p_{\theta}(z) \nabla_{\theta} f_{\theta}(z)dz \\ &= \underbrace{\int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z)dz}_{\text{What about this?}} + \mathbb{E}_{p_{\theta}(z)} \Big[\nabla_{\theta} f_{\theta}(z)\Big] \end{align}$$
The first term of the last equation is not guaranteed to be an expectation. Monte Carlo methods require that we can sample from $p_{\theta}(z)$, but not that we can take its gradient. This is not a problem if we have an analytic solution to $\nabla_{\theta}p_{\theta}(z)$, but that is not available in general.
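To make the derivation concrete, here is a small Monte Carlo sketch with a toy choice (both made up for illustration) of $p_{\theta} = \mathcal{N}(\theta, 1)$ and $f_{\theta}(z) = \theta z$, so that $\mathbb{E}[f_{\theta}] = \theta^2$ and the true gradient is $2\theta$. The first term is estimated via the identity $\nabla_{\theta} p_{\theta} = p_{\theta} \nabla_{\theta} \log p_{\theta}$, which turns it back into an expectation:

```python
# Toy check of the gradient decomposition above, with p_theta = N(theta, 1)
# and f_theta(z) = theta * z, so E[f] = theta^2 and the true gradient is 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
n = 1_000_000
z = rng.normal(theta, 1.0, size=n)       # samples from p_theta

# First term: int f grad_theta(p) dz = E[f(z) * grad_theta log p(z)],
# and for N(theta, 1) the score is grad_theta log p(z) = z - theta.
term1 = np.mean(theta * z * (z - theta))

# Second term: E[grad_theta f(z)] = E[z].
term2 = np.mean(z)

print(term1 + term2)   # should be close to 2 * theta
```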
Best Answer
After $\epsilon$ is sampled, it is completely known; we can treat it the same way as any other data (image, text, feature vector) that's input to a neural network. Just like your input data, $\epsilon$ is known and won't change after you sample it.
This means that the expression $z = \mu + \sigma \odot \epsilon$ has no random components after sampling: you know $\mu, \sigma$ because you obtained them from the encoder, and you know $\epsilon$ because you've sampled it and it is now fixed at a particular value. Therefore you can backpropagate through $\mu + \sigma \odot \epsilon$ with respect to $\mu, \sigma$, because all of its components are known and fixed.
By contrast, the expression $z \sim \mathcal{N}(\mu,\sigma^2)$ is not deterministic in $\mu, \sigma$, so you can't write a backprop expression with respect to $\mu, \sigma$ for it. Even though $\mu, \sigma$ are fixed, you can obtain any real number as an output.
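As a minimal sketch of this point (the numbers are arbitrary): once $\epsilon$ is drawn, $z = \mu + \sigma \epsilon$ is an ordinary deterministic function, and a finite-difference check recovers the gradients $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$:

```python
# After eps is sampled it is a constant, so z = mu + sigma * eps is a plain
# deterministic function with gradients dz/dmu = 1 and dz/dsigma = eps.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
eps = rng.standard_normal()              # sampled once, then held fixed

def z(mu, sigma, eps):
    return mu + sigma * eps

# Finite-difference check of the gradients w.r.t. mu and sigma (eps fixed):
h = 1e-6
fd_mu = (z(mu + h, sigma, eps) - z(mu - h, sigma, eps)) / (2 * h)
fd_sigma = (z(mu, sigma + h, eps) - z(mu, sigma - h, eps)) / (2 * h)
print(fd_mu, fd_sigma)   # approx. 1.0 and approx. eps
```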