Let us imagine that you want to infer some parameter $\beta$ from some observed input-output pairs $(x_1,y_1),\dots,(x_N,y_N)$. Let us assume that the outputs are linearly related to the inputs via $\beta$ and that each observation is corrupted by additive noise $\epsilon_n$:
$$y_n = \beta x_n + \epsilon_n,$$
where each $\epsilon_n$ is independent Gaussian noise with mean $0$ and variance $\sigma^2$.
This gives rise to a Gaussian likelihood:
$$\prod_{n=1}^N \mathcal{N}(y_n|\beta x_n,\sigma^2).$$
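To make the setup concrete, here is a minimal Python sketch that simulates data from this model and evaluates the Gaussian log-likelihood. The sample size, true $\beta$, and noise level are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices; none of these values come from the text.
N, beta_true, sigma = 50, 2.0, 0.5

x = rng.uniform(-1.0, 1.0, size=N)
y = beta_true * x + rng.normal(0.0, sigma, size=N)  # y_n = beta * x_n + eps_n

def log_likelihood(beta, x, y, sigma):
    """Sum over n of log N(y_n | beta * x_n, sigma^2)."""
    resid = y - beta * x
    return (-0.5 * y.size * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum(resid**2) / sigma**2)

print(log_likelihood(beta_true, x, y, sigma))
```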
Let us regularise parameter $\beta$ by imposing the Gaussian prior $\mathcal{N}(\beta|0,\lambda^{-1}),$ where $\lambda$ is a strictly positive scalar that quantifies how strongly we believe $\beta$ should be close to zero, i.e. it controls the strength of the regularisation.
Hence, combining the likelihood and the prior, the posterior over $\beta$ is proportional to:
$$\prod_{n=1}^N \mathcal{N}(y_n|\beta x_n,\sigma^2) \mathcal{N}(\beta|0,\lambda^{-1}).$$
Let us take the logarithm of the above expression. Collecting all the terms that do not depend on $\beta$ into a constant, we get:
$$\sum_{n=1}^N -\frac{1}{2\sigma^2}(y_n-\beta x_n)^2 - \frac{\lambda}{2}\beta^2 + \mbox{const}.$$
If we maximise the above expression with respect to $\beta$, we get the so-called maximum a posteriori estimate for $\beta$, or MAP estimate for short. In this expression it becomes apparent why the Gaussian prior can be interpreted as an L2 regularisation term.
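For this one-parameter model, setting the derivative of the log-posterior to zero yields the closed form
$$\hat{\beta}_{\text{MAP}} = \frac{\sum_{n=1}^N x_n y_n}{\sum_{n=1}^N x_n^2 + \sigma^2\lambda},$$
which is the ridge-regression estimator in one dimension. Here is a minimal Python sketch (using the same illustrative data as above; $\lambda$ is likewise an assumed value) that checks the closed form against a numerical maximisation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Same illustrative setup as above; lam is an assumed regularisation strength.
N, beta_true, sigma, lam = 50, 2.0, 0.5, 4.0
x = rng.uniform(-1.0, 1.0, size=N)
y = beta_true * x + rng.normal(0.0, sigma, size=N)

def neg_log_posterior(beta):
    # Negative of the log-posterior above (beta-independent constants dropped).
    return (0.5 * np.sum((y - beta * x) ** 2) / sigma**2
            + 0.5 * lam * beta**2)

# Closed form: beta_MAP = sum(x*y) / (sum(x^2) + sigma^2 * lambda).
beta_closed = np.sum(x * y) / (np.sum(x**2) + sigma**2 * lam)
beta_numeric = minimize_scalar(neg_log_posterior).x

print(beta_closed, beta_numeric)  # the two should agree closely
```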
The relationship between the L1 norm and the Laplace prior can be understood in the same fashion. Instead of a Gaussian prior, multiply your likelihood by a Laplace prior and then take the logarithm.
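As a sketch of that calculation, assuming the prior is $\mathrm{Laplace}(\beta\,|\,0,b) \propto \exp(-|\beta|/b)$ with scale $b>0$, the log-posterior becomes, up to additive constants,
$$\sum_{n=1}^N -\frac{1}{2\sigma^2}(y_n-\beta x_n)^2 - \frac{1}{b}|\beta|,$$
i.e. a squared-error term plus an L1 penalty on $\beta$: the one-parameter lasso objective.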
A good reference (perhaps slightly advanced) detailing both issues is the paper "Adaptive Sparseness for Supervised Learning", which currently does not seem easy to find online. Alternatively, look at "Adaptive Sparseness using Jeffreys Prior". Another good reference is "On Bayesian classification with Laplace priors".
A general piece of advice would be to go to university statistics departments' websites and individual class pages and look at their recommended textbooks.
Here is the graduate-level probability class website from UC Berkeley:
https://www.stat.berkeley.edu/~aldous/205A/index.html
Here is the graduate-level statistics class website from UC Berkeley:
https://www.stat.berkeley.edu/~wfithian/courses/stat210a/
Here are some textbooks for statistical modeling:
• Freedman, D.A., 2009. Statistical Models: Theory and Practice, Cambridge University Press, ISBN 978-0521743853
• Freedman, D.A., 2009. Statistical Models and Causal Inference: A Dialogue with the Social Sciences, Cambridge University Press, ISBN 978-0521123907
For linear algebra theory (assuming you are comfortable with matrix operations and already know some of the definitions):
S. H. Friedberg, A. J. Insel, and L. E. Spence, Linear Algebra, 4th edition, Prentice Hall, 2002
Disclaimer: I just graduated from UC Berkeley and am biased towards our programs.
Best Answer
(Very) short story
Long story short, in some sense, statistics is like any other technical field: There is no fast track.
Long story
Bachelor's degree programs in statistics are relatively rare in the U.S. One reason I believe this is true is that it is quite hard to pack all that is necessary to learn statistics well into an undergraduate curriculum. This holds particularly true at universities that have significant general-education requirements.
Developing the necessary skills (mathematical, computational, and intuitive) takes a lot of effort and time. Statistics can begin to be understood at a fairly decent "operational" level once the student has mastered calculus and a decent amount of linear and matrix algebra. However, any applied statistician knows that it is quite easy to find oneself in territory that doesn't conform to a cookie-cutter or recipe-based approach to statistics. To really understand what is going on beneath the surface requires, as a prerequisite, a mathematical and, in today's world, computational maturity that is only really attainable in the later years of undergraduate training. This is one reason that true statistical training mostly starts at the M.S. level in the U.S. (India, with its dedicated ISI, is a little different story. A similar argument might be made for some Canadian education. I'm not familiar enough with European or Russian undergraduate statistics education to have an informed opinion.)
Nearly any (interesting) job would require an M.S. level education and the really interesting (in my opinion) jobs essentially require a doctorate-level education.
Seeing as you have a doctorate in mathematics, though we don't know in what area, here are my suggestions for something closer to an M.S.-level education. I include some parenthetical remarks to explain the choices.
Complements
Here are some other books, mostly of a somewhat more advanced, theoretical, and/or auxiliary nature, that are helpful.
More Advanced (Doctorate-Level) Texts
Lehmann and Casella, Theory of Point Estimation. (PhD-level treatment of point estimation. Part of the challenge of this book is reading it and figuring out what is a typo and what is not. When you find yourself recognizing them quickly, you'll know you understand. There's plenty of practice of this type in there, especially if you dive into the problems.)
Lehmann and Romano, Testing Statistical Hypotheses. (PhD-level treatment of hypothesis testing. Not as many typos as TPE above.)
A. van der Vaart, Asymptotic Statistics. (A beautiful book on the asymptotic theory of statistics with good hints on application areas. Not an applied book though. My only quibble is that some rather bizarre notation is used and details are at times brushed under the rug.)