The Baum-Welch algorithm and the Viterbi algorithm calculate different things.
If you know the transition probabilities for the hidden part of your model, and the emission probabilities for the visible outputs of your model, then the Viterbi algorithm gives you the most likely complete sequence of hidden states conditional on both your outputs and your model specification.
The Baum-Welch algorithm gives you both the most likely hidden transition probabilities and the most likely set of emission probabilities, given only the observed outputs of the model (and, usually, an upper bound on the number of hidden states). You also get the "pointwise" highest-likelihood hidden states, which is often slightly different from the single hidden sequence that is most likely overall.
If you know your model and just want the latent states, then there is no reason to use the Baum-Welch algorithm. If you don't know your model, then you can't be using the Viterbi algorithm.
Edited to add: See Peter Smit's comment; there's some overlap/vagueness in nomenclature. Some poking around led me to a chapter by Luis Javier Rodríguez and Inés Torres in "Pattern Recognition and Image Analysis" (ISBN 978-3-540-40217-6, pp. 845-857) which discusses the speed versus accuracy trade-offs of the two algorithms.
Briefly, the Baum-Welch algorithm is essentially the Expectation-Maximization (EM) algorithm applied to an HMM; as a strict EM-type algorithm it is guaranteed to converge to at least a local maximum, and so for unimodal problems it finds the MLE. It requires two passes over your data for each step, though, and the complexity grows quickly with the length of the data and the number of training samples. However, you do end up with the full conditional likelihood for your hidden parameters.
The Viterbi training algorithm (as opposed to the "Viterbi algorithm") approximates the MLE to gain speed at the cost of accuracy. It segments the data and then applies the Viterbi algorithm (as I understood it) to get the most likely state sequence in each segment, then uses that state sequence to re-estimate the hidden parameters. Unlike the Baum-Welch algorithm, this doesn't give the full conditional likelihood of the hidden parameters, and so it reduces accuracy while saving significant computational time (the chapter reports 1 to 2 orders of magnitude).
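If it helps to see it concretely, here is one possible sketch of Viterbi training (hard EM) in NumPy. This is my own reading, not the chapter's implementation: the segmentation step is omitted (the whole sequence is re-decoded each iteration), the random initialisation and the smoothing constant `eps` are arbitrary choices of mine, and all names are made up.

```python
import numpy as np

def viterbi_path(pi, A, B, obs):
    """Single most likely state sequence (log-space dynamic programming)."""
    T, S = len(obs), len(pi)
    lp, lA, lB = np.log(pi), np.log(A), np.log(B)
    delta = lp + lB[:, obs[0]]            # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)    # best predecessor for each state
    for t in range(1, T):
        scores = delta[:, None] + lA      # (prev state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + lB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def viterbi_training(obs, S, O, n_iter=10, seed=0):
    """Hard EM: re-estimate parameters from counts along the single best
    path, instead of the full posterior that Baum-Welch would use."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(S))
    A = rng.dirichlet(np.ones(S), size=S)
    B = rng.dirichlet(np.ones(O), size=S)
    eps = 1e-6  # smoothing so states the path never visits survive
    for _ in range(n_iter):
        path = viterbi_path(pi, A, B, obs)
        A_cnt = np.full((S, S), eps)
        B_cnt = np.full((S, O), eps)
        for t in range(len(obs)):
            B_cnt[path[t], obs[t]] += 1
            if t + 1 < len(obs):
                A_cnt[path[t], path[t + 1]] += 1
        pi = np.full(S, eps); pi[path[0]] += 1
        pi /= pi.sum()
        A = A_cnt / A_cnt.sum(axis=1, keepdims=True)
        B = B_cnt / B_cnt.sum(axis=1, keepdims=True)
    return pi, A, B
```

The speed gain comes from the E step: one Viterbi decode replaces the forward and backward passes, and the counts are hard 0/1 assignments rather than posterior expectations.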
Baum-Welch is an example of the expectation-maximization algorithm, which is explained well by the answers to this question. The answer by Berkmeister gives a detailed example (with code and a link to a nice paper) with an HMM.
The short version is that you alternate between two steps:
- estimating the probabilities given $\theta$ (e.g. using the forward-backward algorithm you mentioned)
- treating the estimated probabilities as "data" and finding the values of $\theta$ that are most consistent with that data (e.g. by counting the number of times each transition would have to occur, given the probabilities you estimated).
Each time you repeat those two steps, you get closer to the maximum likelihood estimate for $\theta$.
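As a sketch of how those two steps fit together, here is one possible NumPy implementation of Baum-Welch, following the usual textbook formulation (a scaled forward-backward pass for the E step, expected-count re-estimation for the M step). The random initialisation, iteration count, and all names are arbitrary choices of mine, not a reference implementation.

```python
import numpy as np

def baum_welch(obs, S, O, n_iter=20, seed=0):
    """Fit an HMM with S hidden states and O symbols to an int array obs."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(S))          # initial state distribution
    A = rng.dirichlet(np.ones(S), size=S)   # transition matrix
    B = rng.dirichlet(np.ones(O), size=S)   # emission matrix
    T = len(obs)
    for _ in range(n_iter):
        # E step: scaled forward-backward pass
        alpha = np.zeros((T, S)); beta = np.zeros((T, S)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                        # state posteriors
        gamma /= gamma.sum(axis=1, keepdims=True)
        # pairwise posteriors xi[t, i, j] = P(state i at t, state j at t+1)
        xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M step: treat the posteriors as data and re-estimate parameters
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((S, O))
        for k in range(O):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```

The scaling constants `c[t]` are only there for numerical stability; their running product is the likelihood, which is how you would monitor convergence.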
Best Answer
I have the perfect paper for you: http://www.stat.columbia.edu/~liam/teaching/neurostat-fall17/papers/hmm/rabiner.pdf
Note that, as far as I know, there are newer Baum-Welch variants.
Attempted simple explanation for Viterbi: the goal of the Viterbi algorithm is to find an optimal sequence of states over some discrete period of time.
Start off with the initial probabilities of being in a state $s$ at time $t=0$. Then increase $t$ and for each state at that time ($t=1$), find the best state $s$ at $t-1$ (best as in the sequence which gives the highest probability). Save this information. That means for each state $s$ at $t=1$ save what the best state $s$ at $t=0$ is. Also save the probability for that (optimal) sequence.
Then increase $t$ to $t=2$, and repeat the procedure (with the probabilities for each state $s$ at $t-1$ being the just calculated best probabilities for that state). At some point you'll reach $t=T$ (last point in time). Then you just pick the state with the highest probability, and remember for each state at each time you have saved the best state for $t-1$, so take the best state at $t=T$ and then work your way backwards always picking that saved optimal state.
To elaborate on what it does on a more general level: if you have a set of states $S$ which can all be picked over some duration of time $T$, a naive solution to find the best sequence would be to enumerate all possible sequences and then calculate the probability for each of them (that is, the product of the state and transition probabilities along the sequence). The inefficiency is that many of these calculations are duplicated: for example, two sequences $p_1, p_2$ ($p$ being a sequence of states) could differ only in their final state, yet this approach computes the probabilities of the shared states separately for both $P(p_1)$ and $P(p_2)$.
The dynamic-programming approach used in Viterbi avoids these duplicated calculations entirely, which makes it far more efficient.
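The forward pass and backtracking described above can be sketched as follows; this is a minimal log-space NumPy version (variable names and the toy model at the bottom are my own, made-up illustrations):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state sequence, computed in log space.

    pi  : (S,)   initial state probabilities
    A   : (S, S) transition probabilities, A[i, j] = P(state j | state i)
    B   : (S, O) emission probabilities,   B[i, k] = P(symbol k | state i)
    obs : length-T sequence of observed symbol indices
    """
    T, S = len(obs), len(pi)
    with np.errstate(divide="ignore"):    # log(0) = -inf is fine here
        lp, lA, lB = np.log(pi), np.log(A), np.log(B)
    delta = lp + lB[:, obs[0]]            # best log-prob of ending in each state
    back = np.zeros((T, S), dtype=int)    # saved best predecessor for each state
    for t in range(1, T):
        scores = delta[:, None] + lA      # (prev state, next state)
        back[t] = scores.argmax(axis=0)   # remember the best state at t-1
        delta = scores.max(axis=0) + lB[:, obs[t]]
    # pick the best final state, then walk the saved back-pointers in reverse
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy two-state model (numbers made up for illustration)
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1]))  # [0, 0, 1]
```

Note the shared work: `delta` at time $t-1$ is computed once and reused for every state at time $t$, which is exactly the duplication that naive enumeration pays for repeatedly.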