This is really a comment, but too long for a comment. It is trying to clarify the definition of "tantile" (in the $p=0.5$ case, which is analogous to the median). Let $X$ be a (for simplicity) absolutely continuous random variable with density function $f(x)$. We assume that the expectation $\mu= \mathbb E X$ exists, that is, the integral $\mu=\int_{-\infty}^\infty x f(x)\; dx$ converges. Define, analogously to the cumulative distribution function, a "cumulative expectation function" (I have never seen such a concept; does it have an official name?) by
$$
G(t) = \int_{-\infty}^t x f(x) \; dx
$$
Then the "tantile" is the solution $t^*$ of the equation $G(t^*) = \mu/2$.
Is this interpretation correct? Is this what was intended?
To return to the original question: in the context of an income distribution, the tantile is the level of income such that half of total income goes to people with incomes above that level, and half of total income goes to people with incomes below it.
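For concreteness, here is a small numerical sketch in Python of that definition, using an Exponential(1) "income" distribution purely as a convenient example (the function names `f` and `G` are mine, mirroring the notation above):

```python
# A minimal numerical sketch of the "tantile" for an Exponential(1) distribution.
# Here f(x) = exp(-x) for x >= 0, so mu = 1 and the tantile t* solves G(t*) = mu / 2.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def f(x):
    """Density of the Exponential(1) distribution."""
    return np.exp(-x)

def G(t):
    """Cumulative expectation function G(t) = integral_0^t x f(x) dx."""
    return quad(lambda x: x * f(x), 0.0, t)[0]

mu = quad(lambda x: x * f(x), 0.0, np.inf)[0]   # mean income, = 1 for Exponential(1)

# Solve G(t*) = mu / 2 on a bracketing interval.
t_star = brentq(lambda t: G(t) - mu / 2, 1e-8, 50.0)
print(mu, t_star)   # approximately 1.0 and 1.678
```

So for this example distribution the mean is $1$ and the tantile is about $1.68$: people with income above roughly $1.68$ receive half of total income.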
EDIT
These quantities (the function $G(t)$ above) are related to various risk measures used in the financial literature, such as "expected shortfall".
Have a look at the paper A. J. Ostaszewski & M. B. Gietzmann, "Value Creation with Dye's Disclosure Option: Optimal Risk-Shielding with an Upper Tailed Disclosure Strategy" (May 2006), especially around page 15, where they define something they call the "Hemi-mean", which is related to $G(t)$ above and is also known as the "expected shortfall relative to $t$" or the "first lower partial moment". It would be interesting to look into these connections ...
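To make one such connection explicit (using common textbook definitions; I have not checked that the paper uses exactly these conventions), both the first lower partial moment at a threshold $t$ and the conditional mean below $t$ can be written in terms of $F$ and $G$:
$$
\mathbb E\big[(t-X)^+\big] = \int_{-\infty}^t (t-x) f(x)\; dx = t\,F(t) - G(t),
\qquad
\mathbb E[X \mid X \le t] = \frac{G(t)}{F(t)}.
$$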
Another term used for this idea is "partial expectation". See for instance https://math.stackexchange.com/questions/1080530/the-partial-expectation-mathbbex-xk-for-an-alpha-stable-distributed-r and use google!
Also, the book Kleiber & Kotz, "Statistical Size Distributions in Economics and Actuarial Sciences", gives relevant information; on page 22 they define (here $X>0$)
$$
F_k(x) = \frac1{E X^k} \int_0^x t^k f(t)\; dt
$$
which is "the $k$th-moment distribution", note that $G(t)=\mu F_1(t)$ so is basically the first-moment distribution. They refer to Champernowne (1974) who calls $F_1$ the "income curve", and denotes the underlying cdf $F$ by $F_0$. In terms of the first moment distribution the Lorenz curve can be given as
$$
\{(u, L(u))\} = \{(u,v)\colon u=F(x),v=F_1(x); x\ge 0\}
$$
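As a rough illustration of this parametrisation (the Pareto income distribution and its parameters are my own choice of example, not from the book), each $x$ gives one point $(F(x), F_1(x))$ on the Lorenz curve:

```python
# Sketch: Lorenz curve points (u, L(u)) = (F(x), F_1(x)) for a Pareto income
# distribution with shape alpha = 2 and scale x_m = 1 (an arbitrary example).
import numpy as np
from scipy.integrate import quad

alpha, x_m = 2.0, 1.0

def f(x):
    """Pareto density."""
    return alpha * x_m**alpha / x**(alpha + 1)

def F(x):
    """Pareto cdf (F_0 in Champernowne's notation)."""
    return 1.0 - (x_m / x)**alpha

mu = quad(lambda t: t * f(t), x_m, np.inf)[0]   # mean income, = alpha*x_m/(alpha-1) = 2

def F1(x):
    """First-moment distribution F_1(x) = (1/mu) * integral_{x_m}^x t f(t) dt."""
    return quad(lambda t: t * f(t), x_m, x)[0] / mu

# Each x >= x_m gives one Lorenz curve point: u = share of people, L(u) = share of income.
for x in [1.5, 2.0, 5.0, 20.0]:
    print(round(F(x), 3), round(F1(x), 3))
```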
I think it might help to think of this as a two-level "meta-model". You have some collection of individual models (indexed by $m$), and then you have a meta-model, which is a distribution over the individual models (or equivalently, a distribution over values of $m$).
You can think about the model averaging as working in two steps:
- First, you get the posterior predictive distribution for each model $m$ by integrating out its model-specific parameters $\theta$:
$$ P(y|x, D, m) = \int P(y|x, D, \theta, m)P(\theta| D, m)d\theta $$
- Then you get the posterior predictive distribution for the meta-model, now integrating out the model index $m$ itself:
$$ P(y|x,D) = \int P(y|x, D, m)P(m|x, D)dm $$
Then in the machine learning context you would make predictions about $y$ based on its posterior predictive distribution given the observed covariates $x$.
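As a toy numerical sketch of that second, averaging step (the per-model predictive densities and model weights below are made-up numbers standing in for $P(y|x,D,m)$ and $P(m|x,D)$ from the first step), with a discrete set of models the integral becomes a weighted sum:

```python
# Toy sketch of the model-averaging step with two candidate models.
import numpy as np
from scipy.stats import norm

y_grid = np.linspace(-5, 5, 201)

# Assumed output of step 1: per-model posterior predictive densities over y.
pred_m1 = norm.pdf(y_grid, loc=0.0, scale=1.0)    # stands in for P(y | x, D, m=1)
pred_m2 = norm.pdf(y_grid, loc=1.0, scale=2.0)    # stands in for P(y | x, D, m=2)

# Assumed posterior model weights P(m | x, D); they must sum to one.
w = np.array([0.7, 0.3])

# Step 2: sum out m to get the overall posterior predictive density.
pred = w[0] * pred_m1 + w[1] * pred_m2            # P(y | x, D)

# A point prediction could then be the posterior predictive mean.
dy = y_grid[1] - y_grid[0]
print(np.sum(y_grid * pred) * dy)                 # roughly 0.7*0 + 0.3*1 = 0.3
```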
To answer your question, the second step is where the model averaging happens. When you "integrate out" or "sum out" a parameter (you can think of these as the same operation, for continuous and discrete distributions respectively), that's equivalent to taking the expected value of some quantity (i.e. averaging) over that parameter. In this case, you're taking the expected value of the posterior density of $y$, which is the definition of a posterior predictive distribution.
As for priors, you're going to have two sets of them in this model: a prior on the parameters $\theta$ within each model $m$, and a prior for the meta-model over the different $m$. They factor into the posterior distributions over the quantities we've integrated out (i.e. $P(\theta|D,m)$ and $P(m|x,D)$).
I will point out that in this model the authors have apparently specified that the posterior over $m$ might depend on the test predictors $x$, but the posterior over $\theta$ does not. That is, $x$ might influence how you weight the different models, but not how you weight the parameters of each individual model. I don't think that's a crazy choice, but it's not the only way to do this.
Okay, an example. I can't think of a machine learning example that's simple, but here's an easier textbook statistics example. In this model the individual models are going to be normal distributions with a fixed variance $\sigma^2$ and a random mean $\mu$; the collection of distributions (the meta-model) is indexed by the different values of $\sigma^2$. So here $\theta = \mu$ and $m = \sigma^2$. The standard prior for $\mu|\sigma^2$ is a normal distribution, and the prior over $\sigma^2$ is an inverse-gamma distribution. You can show that the posterior predictive distribution of $y$ given a fixed value of $\sigma^2$ (after integrating out $\mu$) is another normal distribution, with its mean pulled in the direction of the sample mean. Then you integrate out (model average) $\sigma^2$, and the posterior predictive distribution of $y$ becomes a Student-t distribution. Essentially, you get something that looks kind of like a normal distribution, but it has fat tails because you've averaged over different possibilities for the variance.
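Here is a quick Monte Carlo sketch of that last point (the inverse-gamma hyperparameters are arbitrary choices of mine, just to show the effect): draw $\sigma^2$ from an inverse-gamma, then $y$ from a normal with that variance, and the resulting marginal sample has visibly heavier tails than a normal with the same overall variance.

```python
# Monte Carlo sketch: averaging a normal over an inverse-gamma distribution on the
# variance produces a heavy-tailed (Student-t) marginal. Hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 3.0                       # inverse-gamma shape and scale (my choice)
n = 200_000

sigma2 = b / rng.gamma(a, size=n)     # sigma^2 ~ Inverse-Gamma(a, b)
y = rng.normal(0.0, np.sqrt(sigma2))  # y | sigma^2 ~ Normal(0, sigma^2)

# Compare tail mass beyond 4 with a normal of the same overall variance.
y_norm = rng.normal(0.0, np.sqrt(y.var()), size=n)
print((np.abs(y) > 4).mean(), (np.abs(y_norm) > 4).mean())   # the mixture tail is fatter
```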
Mean versus average
Side point
Excel, for example, uses `AVERAGE()` for its arithmetic mean function, where R uses `mean()`.