This won't directly answer the question, but here are some things a mathematician who wants to learn about statistics should learn:
- When is a random variable a statistic and when is it not? (A statistic is an observable random variable. For example, $X - E(X)$ is not a statistic if the "population average" $E(X)$ is not observable.)
- Fisher's concept of sufficiency. Examples, characterizations, theorems. In particular, the Rao--Blackwell theorem and examples of its use (a worked example appears after this list). That's way cool.
- So is the concept of completeness and the Lehmann--Scheffe theorem.
- If you think that linear regression is called linear because you're fitting a line, then you are naive. The model is called linear because it is linear in the unknown parameters: if you're fitting, e.g., a parabola by finding least-squares estimators of three parameters, then you're doing linear regression (see the sketch after this list). There is also such a thing as non-linear regression.
- Learn the Gauss--Markov theorem on Best Linear Unbiased Estimators (BLUEs).
- Look at my recent answer to a question on prediction intervals. Why do you need the (finite-dimensional version of) the spectral theorem to understand linear regression? (Look at the aforementioned answer and consider this question an exercise.)
- As long as we're on linear regression (the topic of the three bullets immediately above this one), look at the Wikipedia article titled "errors and residuals in statistics" (written mostly by me). Learn the difference between an error and a residual. Maybe look at "Studentized residual" as an afterthought.
- ....and then at "lack-of-fit sum of squares".
- If you think linear regression is child's play rather than something to which the most brilliant person could devote a long career in research, grow up.
- Learn the difference between frequentism and Bayesianism. In fact, look at the rant I posted on nLab about this. (The essence of Bayesianism is that probabilities are taken to be epistemic. Bayesianism is not more subjective than frequentism; rather Bayesians and frequentists put their subjectivity in different places. (A really glaring example is the 5% critical value legendarily used in medical journals. Why 5%? Because that's a subjective economic choice.))
- Learn design of experiments. Learn why Latin squares and a myriad of other combinatorial designs are used.
- OK, maybe a small and incomplete but nonetheless direct answer to the original question: perhaps Hocking's book on linear models.
- Learn to use the word "sample" correctly. If you ask the next 100 people you meet whether they intend to vote "yes" or "no", that's not 100 samples; that's one sample.
- Another thing that will give you some idea of the distinct flavor of the subject, and how it differs from probability theory and some other fields, is books on sampling.
- Learn about the Wishart distribution.
- And the multivariate normal distribution.
- Exercise: How do you prove that every non-negative-definite matrix is the variance of some random vector? (One construction is sketched after this list.)
- Learn why the Behrens--Fisher problem cannot be regarded as a math problem. It belongs up there with Hilbert's problems as one of the great challenges, but it's not mathematics for this reason: One can model it as a math problem in any of a variety of different non-equivalent ways. One can solve those math problems. But which one is the "right" model? That's essentially a philosophical question. And that question, not the math problems, is the Behrens--Fisher problem. (The Behrens--Fisher problem is this: how do you draw inferences about the difference between the means of two normally distributed populations which may have different variances? "Inferences" can mean point-estimates or interval estimates or perhaps other things.)
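To make the Rao--Blackwell bullet concrete, here is a standard textbook example (the Poisson case; the notation is introduced only for this illustration). Let $X_1, \dots, X_n$ be i.i.d. Poisson($\lambda$) and suppose we want to estimate $\theta = e^{-\lambda} = P(X_1 = 0)$. The crude estimator $\delta = \mathbf{1}\{X_1 = 0\}$ is unbiased, and $T = X_1 + \cdots + X_n$ is sufficient. Since $X_1 \mid T = t \sim \mathrm{Binomial}(t, 1/n)$, conditioning on the sufficient statistic gives
$$E[\delta \mid T] = P(X_1 = 0 \mid T) = \left(1 - \tfrac{1}{n}\right)^{T},$$
which by the Rao--Blackwell theorem is still unbiased and has variance no larger than that of $\delta$.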
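To illustrate the point about "linear" regression, here is a minimal numpy sketch (the data are made up): fitting a parabola by least squares is linear regression because the model is linear in the three unknown coefficients.

```python
import numpy as np

# Made-up data roughly following a parabola plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with columns 1, x, x^2: the model
#   y = b0 + b1*x + b2*x^2 + error
# is linear in (b0, b1, b2), even though the fitted curve is a parabola.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta_hat)

# Observed minus fitted values: these are residuals, not errors,
# in the sense of the "errors and residuals" bullet below.
residuals = y - X @ beta_hat
print("residual sum of squares:", residuals @ residuals)
```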
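And for the exercise on non-negative-definite matrices, one standard construction (a sketch, not the only route): given a non-negative-definite $n \times n$ matrix $\Sigma$, the spectral theorem gives $\Sigma = Q \Lambda Q^{\mathsf T}$ with $Q$ orthogonal and $\Lambda$ diagonal with non-negative entries, so $\Sigma^{1/2} = Q \Lambda^{1/2} Q^{\mathsf T}$ is well defined. Take any random vector $Z$ with $n$ independent components of mean $0$ and variance $1$ and set $X = \Sigma^{1/2} Z$. Then
$$\operatorname{Var}(X) = \Sigma^{1/2} \operatorname{Var}(Z) \,(\Sigma^{1/2})^{\mathsf T} = \Sigma^{1/2} I \,\Sigma^{1/2} = \Sigma.$$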
This is just a sampling of the first things that come immediately to mind. It leans toward showing you what the subject tastes like rather than what it's important to know to do theoretical or applied research.
Statistics is an immensely broader field than mathematical probability theory.
The inequality $E(S_K) \geq E(S_i)$ holds.
To avoid any doubt, let me be more specific. Let $Y_1, Y_2, ..., Y_N$ be a collection of random variables, and write $X_1 \geq X_2 \geq ... \geq X_N$ for their reordering in non-increasing order.
Suppose $K < N$ is fixed and let $S_K$ be the sum of the $K$ largest of the random variables, that is $S_K=X_1+...+X_K$.
Let $R$ be a random variable taking values in $\{0,1,...,N\}$ which is independent of the random variables $Y_i$. The independence from the $Y_i$ is of course important (this is how I interpret your "based on some criteria"; if $R$ were allowed to depend on the realisation of the $Y_i$, then all sorts of different behaviours would be possible).
Now let $S_R$ be the sum of the $R$ largest of the random variables, that is $S_R=X_1+...+X_R$. (In your notation this is $S_i$).
Suppose that $ER=K$. Then I claim that $E S_R \leq E S_K$, with equality iff $R=K$ with probability 1. (Unless the $Y_i$ are somehow degenerate, in which case equality can occur in other cases as well).
Proof: Write $p_k=P(R\geq k)$ for $k=1,2,...,N$. We have $\sum p_k=ER=K$.
Also
$S_R=\sum_{k=1}^N X_k I(R\geq k)$
so
$ES_R=\sum_{k=1}^N P(R\geq k) E X_k = \sum_{k=1}^N p_k E X_k$.
(Here we used the independence of $R$ from the $X_i$).
Consider maximising this sum subject to the constraints that $\sum p_k=K$ and that
$1\geq p_1 \geq p_2 \geq p_3\geq ...$.
Since the terms $E X_k$ are decreasing in $k$,
the maximum is achieved when $p_k=1$ for $k\leq K$ and $p_k=0$ for $k>K$.
(Provided the $Y_i$ are not degenerate, the terms $E X_k$ are strictly decreasing,
and this is the only way to achieve the maximum. If not, the maximum may be achieved in some other cases too).
That is, the maximum value of $ES_R$ occurs precisely if $R$ is equal to $K$ with probability 1.
It doesn't matter whether the $Y_i$ are identically distributed, and also they don't need to be independent. However, it is important that $R$ is independent of the $Y_i$.
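As a quick sanity check (not part of the proof), here is a small Monte Carlo sketch in Python; the distributions of the $Y_i$ and the law of $R$ below are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 10, 4
trials = 200_000

# Arbitrary Y_i (independent, not identically distributed), just for illustration.
Y = rng.exponential(scale=np.arange(1, N + 1), size=(trials, N))

# X_1 >= X_2 >= ... >= X_N: the Y_i sorted in non-increasing order.
X = -np.sort(-Y, axis=1)

# S_K: sum of the K largest values.
S_K = X[:, :K].sum(axis=1)

# R: independent of the Y_i, with E[R] = K (here uniform on {K-2, ..., K+2}).
R = rng.integers(K - 2, K + 3, size=trials)

# S_R = X_1 + ... + X_R (R >= 2 in this illustration; R = 0 would give S_R = 0).
cum = np.cumsum(X, axis=1)
S_R = cum[np.arange(trials), R - 1]

print("E[S_K] approx:", S_K.mean())
print("E[S_R] approx:", S_R.mean())
```

With these choices $R$ is not almost surely equal to $K$, so the simulated $E S_R$ should come out strictly below $E S_K$.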
Many hold that Bayesian statistics "from a purely mathematical point of view" is entirely coextensive with probability (however you want to define its boundaries as a mathematical discipline). Nonetheless, if I interpret your request as being for a mathematically sophisticated and rigorous exposition of why the Bayesian approach is a worthy one, three books spring to mind.
The first of these is a general graduate text in statistics, but the author gives uncommonly complete coverage of both Bayesian and frequentist methods.
The second is a smaller volume and, as I recall, is devoted to some of the more delicate issues surrounding finite versus countable additivity as they relate to using probability distributions as priors in a Bayesian approach.
The final book is more general, but the style is more formal than the Bernardo and Smith book mentioned by PaPiro. (This is, in my experience, true of the style of French Bayesians :)
As I said, the distinctive elements of the Bayesian perspective are more philosophical than technical, but there are some technical areas that have received attention in the Bayesian community that may be of independent mathematical interest. One would be the role of so-called "improper" priors as mentioned above.
Another is the role of conditional distributions as a primitive rather than derived notion, leading to the idea of disintegration, as in this manuscript of Pollard.
Also, because of a keen interest in the application of Monte Carlo methods, Bayesian statisticians have done a lot of work on computational methods for sampling from various distributions. Christian Robert is a prominent researcher in this area, and he has a blog. The current post happens to be about Bayesian foundations.
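To give a flavour of that computational side, here is a bare-bones random-walk Metropolis sketch in Python; the target `log_post` is a made-up unnormalised log-posterior and the tuning constants are arbitrary, so this is only an illustration of the idea, not anything drawn from Robert's work.

```python
import numpy as np

def log_post(theta):
    # Made-up unnormalised log-posterior: standard normal prior times a
    # normal likelihood for three observations with mean theta.
    data = np.array([0.8, 1.1, 1.4])
    return -0.5 * theta**2 - 0.5 * np.sum((data - theta) ** 2)

def metropolis(log_density, start, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis: propose theta' = theta + N(0, step^2), accept
    with probability min(1, exp(log_density(theta') - log_density(theta)))."""
    rng = np.random.default_rng(seed)
    theta = start
    current = log_density(theta)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        proposal = theta + rng.normal(scale=step)
        cand = log_density(proposal)
        if np.log(rng.uniform()) < cand - current:
            theta, current = proposal, cand
        samples[i] = theta
    return samples

draws = metropolis(log_post, start=0.0)
print("posterior mean estimate:", draws[2000:].mean())  # discard burn-in
```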
Finally, at the heart of many arguments in favor of a Bayesian approach (early chapters of both Bernardo and Smith and of Robert are dedicated to them) are de Finetti-type representation theorems, which sanction prior distributions via appeals to exchangeability. You can start with the Wikipedia entry on de Finetti's theorem and then look at the work of Persi Diaconis on the topic. In this vein see also Lauritzen's monograph, which (for me anyway) is the last word on the matter.
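For orientation, the classical binary case of de Finetti's theorem reads: if $X_1, X_2, \dots$ is an infinite exchangeable sequence of $\{0,1\}$-valued random variables, then there is a unique probability measure $\mu$ on $[0,1]$ such that for every $n$
$$P(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \theta^{\sum_i x_i} (1-\theta)^{\,n - \sum_i x_i} \, d\mu(\theta).$$
In other words, the $X_i$ behave as if they were i.i.d. Bernoulli($\theta$) with $\theta$ drawn from the prior $\mu$, which is exactly the sense in which exchangeability sanctions a prior.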