As someone who started out their career thinking of statistics as a messy discipline, I'd like to share my epiphany regarding the matter. For me, the insight came from Linear Algebra, so I would urge you to push in that direction.
Specifically, once you realize that the sum of squares, $\sum_i X_i^2$, and sum of products, $\sum_i X_i Y_i$, are both inner products (aka dot products), you realize that nearly all of statistics can be thought of as various operations from linear algebra.
If you sample $n$ values from a population, you have an $n$-dimensional vector. The sample mean comes from the projection of this vector onto the all-ones vector, and the standard deviation comes from the length of its projection onto the $(n-1)$-dimensional hyperplane normal to the all-ones vector (finally, an intuitive reason for the "$n-1$" in the denominator!). Specifically, here is the linear algebra behind the sample variance $s^2$ of a sample $X$:
First, we work with deviations from the mean. In linear algebra terms, the mean (as a vector, i.e., the projection of $X$ onto the span of $\mathbf{1}$) is
$\bar{X}=\frac{\langle X,\mathbf{1}\rangle}{\langle \mathbf{1},\mathbf{1}\rangle} \mathbf{1}$
where $\langle \cdot, \cdot \rangle$ is the inner product and $\mathbf{1}$ is the $n$-dimensional ones vector. Then the deviation from the mean is
$x = X - \bar{X}$
Note that $x$ is constrained to an $(n-1)$-dimensional subspace. The usual equation for variance is
$s^2 = \dfrac{\sum_i (X_i - \bar{X})^2}{n-1}$
For us, that's
$s^2 = \dfrac{\langle x, x \rangle}{\langle \mathbf{1}, \mathbf{1} \rangle}$
which, without going into too much detail (too late), is a normalized squared deviation. The trick is that the $\mathbf{1}$ in the denominator is taken to have dimension $n-1$, so that $\langle \mathbf{1}, \mathbf{1} \rangle = n-1$; this is justified because $x$ is constrained to an $(n-1)$-dimensional subspace and so carries only $n-1$ degrees of freedom.
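As a quick sanity check, here is a short NumPy sketch of the derivation above (the sample values are made up for illustration):

```python
import numpy as np

# A hypothetical sample of n = 5 values; any numbers work.
X = np.array([2.0, 4.0, 4.0, 4.0, 6.0])
n = len(X)
ones = np.ones(n)

# Sample mean as a projection of X onto the all-ones vector:
# (<X, 1> / <1, 1>) * 1
mean_vec = (X @ ones) / (ones @ ones) * ones

# The deviation vector lies in the hyperplane orthogonal to `ones`.
x = X - mean_vec
assert np.isclose(x @ ones, 0.0)  # orthogonality check

# Sample variance: squared length of the deviation vector over n - 1.
s2 = (x @ x) / (n - 1)
assert np.isclose(s2, np.var(X, ddof=1))
```

The `@` operator is the inner product throughout, so each line maps directly onto one of the formulas above.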
The other good example is that the correlation between two samples is the cosine of the angle between their deviation vectors in that $n$-dimensional space. To see this, recall that the angle between two vectors $v$ and $w$ satisfies:
$\theta = \arccos \dfrac{\langle v, w \rangle}{\|v\|\|w\|}$
where $\|\cdot\|$ is the vector length. Compare this to one of the forms of the Pearson correlation, applied to the mean-centered vectors, and you will see that $r = \cos \theta$.
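This identity is easy to verify numerically; a small NumPy sketch (with made-up sample values):

```python
import numpy as np

# Two hypothetical paired samples; any paired data will do.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Center both samples: deviation vectors from their means.
x = X - X.mean()
y = Y - Y.mean()

# Cosine of the angle between the deviation vectors...
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# ...equals the Pearson correlation coefficient.
r = np.corrcoef(X, Y)[0, 1]
assert np.isclose(cos_theta, r)
```

Note that the centering step matters: the angle between the raw vectors $X$ and $Y$ gives the uncentered cosine similarity, not the Pearson correlation.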
There are many other examples, and the ones above have only been sketched, but I hope they give an impression of how you can think in these terms.
Best Answer
I am an information theorist, and I will try to give you my personal take. About three years ago, a conference on information geometry was held in Germany. As you can see there, there are many applications, but most of them are deeply tied to statistics. However, in my opinion, information geometry has not produced results that influence the related research in probability and statistics, nor have we seen fundamentally original results that excite researchers. Differential geometry is itself an obstruction because of its complexity. Of course, these are highly subjective judgments, and no one can predict the future, but at least currently, researchers are not especially excited about the results.
You can find interesting connections with other discoveries in science, but I still think we have not seen results from information geometry that are widely used, even within statistics.
I believe this is so, partly because the field is still very young and immature. I see great potential in bringing geometric insights to statistics, but I think that is very challenging for a young researcher.
I am looking forward to finding a way to apply it in my own research, though I have not found one yet.
Sections 2 and 3 of this book discuss the relation with statistics.