[Math] A path to truly understanding probability and statistics

self-learning, soft-question

I'm embarrassed to say that I have a PhD and hold an assistant professorship, but get tripped up when reading statistics research. I am in a field of Business that is similar to IO Psychology or Social Psych. I spend a lot of time reading applied stats books, but I find that even with all the reading I don't have a firm grasp of what I'm actually doing. Everything is very 'seat of the pants.' (As sad as it seems, I think this is not a unique situation among the faculty in the social sciences…) The biggest problem comes when I need to apply a rarely used statistical technique. I can find an article from a mathematical statistics journal with the equations that would solve my problem, but I don't have the math to convert those into code. I am forever relying on other profs' R packages and crossing my fingers, hoping it will work (I can't even verify whether it did or not). It's been over 15 years since I took Calculus and Algebra in undergrad, and I think I want to start at the beginning and truly understand probability and statistics.

I am starting with Gelfand's Algebra and Trigonometry books for a quick refresher on the basics (I know it's hard to believe, but in an applied research field we rarely have use for sin or cos). I'm even trying to finally learn how to correctly do a proof, using the books from Velleman ("How to Prove It") and Houston ("How to Think Like a Mathematician"); I'm serious about doing this right and understanding the subject. From there I want to move on to (correctly) learn the Calculus and Linear Algebra I need to tackle probability and statistics. I was thinking of using Strang's Calculus and Linear Algebra books, but Apostol's Calculus comes highly recommended as well. After that I am completely at a loss. Further, I don't know how far to go into Calculus or Linear Algebra before I reach diminishing returns. (Apostol introduces Probability in the second half of Vol. 2; is it vital that I work through everything preceding it before tackling Probability?)

So my question is: if you had to do it over again with the goal of truly, deeply understanding statistics, where would you start? What books form the modern path to deep understanding? I would like to follow a modern path so that I can understand current research in statistics, including Bayesian approaches, but not in a machine learning context (which seems to be all the rage at the moment); rather a social science / design and analysis of experiments / multilevel modeling context. Perhaps my goal would be the work of Andrew Gelman; his and Hill's book showed me how I should be looking at modeling and statistics (simulation, uncertainty estimates everywhere, Bayesian inference, and so on). How should I go about relearning this material with that end goal in mind?


Update 1: Possible texts, starting from scratch with a focus on proofs and deep understanding. Not necessarily one after another.

Relearn the basics:

Calculus (which one(s), and how deep?):

Linear Algebra (which one(s) and how deep?):

Probability (which one(s)?):

Core Statistics (which one(s)?):

Other suggestions? Again, with the goal of understanding and developing (or at least implementing) new methods in hierarchical modelling (generalized and linear).

Best Answer

As someone who started out their career thinking of statistics as a messy discipline, I'd like to share my epiphany regarding the matter. For me, the insight came from Linear Algebra, so I would urge you to push in that direction.

Specifically, once you realize that the sum of squares, $\sum_i X_i^2$, and sum of products, $\sum_i X_i Y_i$, are both inner products (aka dot products), you realize that nearly all of statistics can be thought of as various operations from linear algebra.
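To make this concrete, here is a minimal R sketch (R since the question mentions relying on R packages; the sample values are made-up numbers for illustration) showing that these sums are literally dot products:

```r
# Two small samples (made-up values for illustration)
X <- c(2, 4, 6, 8)
Y <- c(1, 3, 5, 7)

# Sum of squares and sum of products, written the "statistics" way
sum(X^2)       # 120
sum(X * Y)     # 100

# The same quantities as inner (dot) products
drop(X %*% X)  # 120
drop(X %*% Y)  # 100
```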

If you sample $n$ values from a population, you have an $n$-dimensional vector. The sample mean is the projection of this vector onto the (span of the) all-ones vector. The sample standard deviation is, up to scaling, the length of the projection onto the $(n-1)$-dimensional hyperplane orthogonal to the all-ones vector (finally an intuitive reason for the "$n-1$" in the denominator!). Specifically, for the sample variance $s^2$ of a sample $X$, here is the linear algebra:

First, we work with deviations from the mean. In linear algebra terms, the mean is the projection of $X$ onto the all-ones vector:

$\bar{X}=\frac{\langle X,\mathbf{1}\rangle}{\langle \mathbf{1},\mathbf{1}\rangle} \mathbf{1}$

where $\langle \cdot, \cdot \rangle$ is the inner product and $\mathbf{1}$ is the $n$-dimensional all-ones vector. Since $\langle X, \mathbf{1} \rangle = \sum_i X_i$ and $\langle \mathbf{1}, \mathbf{1} \rangle = n$, this is the vector each of whose entries is the familiar scalar mean. The deviation from the mean is then

$x = X - \bar{X}$

Note that $x$ is orthogonal to $\mathbf{1}$ (its entries sum to zero), so it is constrained to an $(n-1)$-dimensional subspace. The usual equation for the variance is

$s^2 = \dfrac{\sum_i (X_i - \bar{X})^2}{n-1}$

For us, that's

$s^2 = \dfrac{\langle x, x \rangle}{\langle \mathbf{1}, \mathbf{1} \rangle}$

which, without going into too much detail (too late), is a normalized squared length. The trick is that $x$ lives in an $(n-1)$-dimensional subspace, so the appropriate all-ones vector there is the $(n-1)$-dimensional one, giving $\langle \mathbf{1}, \mathbf{1} \rangle = n-1$ in the denominator.
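Here is a quick numeric check of the whole derivation in R (a sketch with made-up data; `ones` and `Xbar` are my names, not standard functions):

```r
X <- c(2, 4, 6, 8, 10)
n <- length(X)
ones <- rep(1, n)

# The mean as a projection of X onto the all-ones vector
Xbar <- drop((X %*% ones) / (ones %*% ones)) * ones

# The deviation vector is orthogonal to ones: its entries sum to zero
x <- X - Xbar
drop(x %*% ones)         # 0 (up to floating-point error)

# Variance as the squared length of x, normalized by n - 1
drop(x %*% x) / (n - 1)  # 10
var(X)                   # 10, the same
```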

The other good example is that the correlation between two samples is related to the angle between their deviation vectors in that $n$-dimensional space. To see this, consider that the angle between two vectors $v$ and $w$ is:

$\theta = \arccos \dfrac{\langle v, w \rangle}{\|v\|\|w\|}$

where $\|\cdot\|$ is vector length. Take $v$ and $w$ to be the mean-centered samples, compare this with one of the standard forms of the Pearson correlation, and you will see that $r = \cos \theta$.
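Checking this in R as well (again a sketch with made-up data): center both samples, compute the cosine of the angle between the deviation vectors, and compare it to `cor()`.

```r
X <- c(2, 4, 6, 8, 10)
Y <- c(1, 2, 2, 5, 9)

# Center both samples; correlation concerns the deviation vectors
x <- X - mean(X)
y <- Y - mean(Y)

# Cosine of the angle between the centered vectors
cos_theta <- drop(x %*% y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cos_theta  # ~0.918
cor(X, Y)  # identical to cos_theta
```

In particular, two samples are uncorrelated exactly when their centered vectors are orthogonal.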

There are many other examples, and these have barely been explained here, but I just hope to give an impression of how you can think in these terms.
