This got too long for a comment. The problem with your function is that the determinant of K_plus
overflows to infinity or underflows to zero very quickly. I tweaked your function to calculate the log-determinant directly. I then used optim
with different methods, as well as nlminb and nlm,
to search for the maximum-likelihood estimates. The algorithms converged without problems. I also included the code to calculate the standard errors and confidence intervals based on the Hessian. All algorithms give very similar estimates.
The estimates are: $\widehat{\sigma}_{n}=5.30,\hat{l}=5.12,\widehat{\sigma}_{f}=45.01$.
The code is:
load("A.Rdata")
load("y.Rdata")
num_unique <- 786
Calculate_K_plus <- function(vect){
  sn2 <- vect[1]*vect[1]            # noise variance sigma_n^2
  exponent <- 1/(vect[2]*vect[2])   # 1 / l^2
  sf2 <- vect[3]*vect[3]            # signal variance sigma_f^2
  B <- A^exponent                   # element-wise power of the precomputed matrix A
  B <- sf2 * B                      # scale by the signal variance
  B <- B + sn2*diag(num_unique)     # add the noise variance on the diagonal
  B
}
minus_log_likelihood <- function(vect){
  K_plus <- Calculate_K_plus(vect)
  K_plus_inv <- solve(K_plus)
  z <- determinant(K_plus, logarithm=TRUE)
  K_plus_log_det <- as.numeric(z$sign*z$modulus)   # log-determinant of K_plus
  out <- 0.5 * ( t(y) %*% K_plus_inv %*% y ) + 0.5 * K_plus_log_det + (num_unique/2)*log(2*pi)
  out
}
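If the explicit inverse and the separate determinant call ever become a bottleneck for larger matrices, a possible refinement (a sketch of my own, not needed for the results below) is to obtain both the quadratic form and the log-determinant from a single Cholesky factorization:

minus_log_likelihood_chol <- function(vect){
  K_plus <- Calculate_K_plus(vect)
  L <- chol(K_plus)                              # upper-triangular factor: K_plus = t(L) %*% L (requires K_plus positive definite)
  alpha <- backsolve(L, forwardsolve(t(L), y))   # solves K_plus %*% alpha = y
  log_det <- 2 * sum(log(diag(L)))               # log-determinant of K_plus
  0.5 * sum(y * alpha) + 0.5 * log_det + (num_unique/2)*log(2*pi)
}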
#-----------------------------------------------------------------------------
# "Nelder-Mead" algorithm
#-----------------------------------------------------------------------------
res.optim <- optim(par=c(5.3, 5.1, 44.9), fn=minus_log_likelihood, hessian=TRUE, control=list(trace=TRUE, maxit=1000))
res.optim$par
[1] 5.302362 5.123045 45.011507
fisher_info <- solve(res.optim$hessian)
prop_sigma <- sqrt(diag(fisher_info))
upper <- res.optim$par + 1.96*prop_sigma
lower <- res.optim$par - 1.96*prop_sigma
interval <- data.frame(value=res.optim$par, lower=lower, upper=upper)
interval
value lower upper
1 5.302362 5.032848 5.571877
2 5.123045 3.442932 6.803157
3 45.011507 17.952756 72.070257
#-----------------------------------------------------------------------------
# "L-BFGS-B" algorithm
#-----------------------------------------------------------------------------
res.optim2 <- optim(par=c(5.3, 5.1, 44.9), fn=minus_log_likelihood, method="L-BFGS-B", hessian=TRUE, control=list(trace=3, maxit=1000))
res.optim2$par
[1] 5.301418 5.114984 44.901863
fisher_info <- solve(res.optim2$hessian)
prop_sigma <- sqrt(diag(fisher_info))
upper <- res.optim2$par + 1.96*prop_sigma
lower <- res.optim2$par - 1.96*prop_sigma
interval2 <- data.frame(value=res.optim2$par, lower=lower, upper=upper)
interval2
value lower upper
1 5.301418 5.031988 5.570848
2 5.114984 3.437925 6.792043
3 44.901863 17.982520 71.821206
#-----------------------------------------------------------------------------
# With "nlminb"
#-----------------------------------------------------------------------------
res.nlm <- nlminb(objective=minus_log_likelihood, start=c(5.3, 5.1, 44.9), control=list(iter.max=200, trace=1))
res.nlm$par
[1] 5.301542 5.123718 45.072189
#-----------------------------------------------------------------------------
# With "nlm"
#-----------------------------------------------------------------------------
res.nlm2 <- nlm(f=minus_log_likelihood, p=c(5.3, 5.1, 44.9), print.level=2)
res.nlm2$estimate
[1] 5.301534 5.123776 45.072711
According to the question, it is an assumed fact that both populations have a common variance, not something one wishes to test.
Maximum likelihood estimators can be derived as usual, either from the two samples separately or by pooling them; in the latter case we have an independent but non-identically distributed sample and the corresponding log-likelihood, which nevertheless creates no special issues. So, beyond deriving the MLEs (which is straightforward), I would say this is a good example for examining whether pooling the samples ("unite and conquer"?) is more beneficial than keeping them separate ("divide and conquer"?). But "more beneficial" according to which criteria?
We will discuss them as we go along.
Note that we need both sample sizes to be larger than unity, $n_1 >1, n_2 > 1$, otherwise the variance estimator will equal zero.
If we keep the samples separate we will obtain
$$\hat \mu_v = \frac 1{n_1}\sum_{i=1}^{n_1}v_i,\;\;\; \hat \sigma^2_1 = \frac 1{n_1}\sum_{i=1}^{n_1}(v_i-\hat \mu_v)^2$$
and
$$\hat \mu_w = \frac 1{n_2}\sum_{i=1}^{n_2}w_i,\;\;\; \hat \sigma^2_2 = \frac 1{n_2}\sum_{i=1}^{n_2}(w_i-\hat \mu_w)^2$$
The MLEs for the means will be unbiased, efficient, consistent and asymptotically normal.
The variance estimators will be biased, consistent and asymptotically normal (see this post, which holds in general, even for normal samples).
Since bias is present, it is natural to turn to mean squared error. The populations are normal, so we also have a finite-sample result:
$$\frac {n_i\hat \sigma^2_i}{\sigma^2} \sim \chi^2_{n_i-1} \Rightarrow \hat \sigma^2_i \sim \operatorname{Gamma}(k_i,\theta_i),\;\; k_i = \frac {n_i-1}{2},\;\; \theta_i = \frac {2\sigma^2}{n_i},\;\;i=1,2$$
Therefore we can calculate the Mean Squared Error (MSE) as
$$MSE(\hat \sigma^2_i) = \text{Var}(\hat \sigma^2_i)+\left[B(\hat \sigma^2_i)\right]^2 = \frac{2(n_i-1)}{n_i^2} \sigma^4 + \frac 1{n_i^2}\sigma^4 = \frac{2n_i-1}{n_i^2} \sigma^4$$
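As a quick arithmetic check (my own, with an arbitrary $n$ and $\sigma^2$), the Gamma moments $k\theta$ and $k\theta^2$ reproduce the bias and variance terms entering this MSE expression:

n <- 6; sigma2 <- 4
k <- (n - 1)/2; theta <- 2*sigma2/n
c(bias = k*theta - sigma2, var = k*theta^2, mse = k*theta^2 + (k*theta - sigma2)^2)
c(bias = -sigma2/n, var = 2*(n - 1)/n^2 * sigma2^2, mse = (2*n - 1)/n^2 * sigma2^2)   # closed forms above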
We turn now to the pooled-samples case.
It is easy to verify that the MLEs for the two means are identical to those from the separate-samples approach, so for these estimators pooling the two samples or not makes no difference to the functional form of the estimators or to their properties.
But the variance estimator will be different. It is also rather easy to derive that
$$\hat \sigma^2_p = \frac{n_1}{n_1+n_2}\hat \sigma^2_1+\frac{n_2}{n_1+n_2}\hat \sigma^2_2$$
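As a quick numerical illustration (simulated data, with sample sizes and parameters of my own choosing), the pooled MLE computed directly from the combined residual sums of squares coincides with this weighted combination:

set.seed(1)
n1 <- 7; n2 <- 12; sigma <- 2
v <- rnorm(n1, mean = 3, sd = sigma)
w <- rnorm(n2, mean = -1, sd = sigma)
s2_1 <- mean((v - mean(v))^2)                                       # separate MLE, first sample
s2_2 <- mean((w - mean(w))^2)                                       # separate MLE, second sample
s2_p <- (sum((v - mean(v))^2) + sum((w - mean(w))^2)) / (n1 + n2)   # pooled MLE
all.equal(s2_p, n1/(n1 + n2) * s2_1 + n2/(n1 + n2) * s2_2)          # TRUE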
This is also a biased and consistent estimator, and also asymptotically normal, being a convex combination of two asymptotically normal variables.
Turning to the issue of bias and Mean Squared Error, since the two separate-samples estimators are independent we have that
$$\text{Var}(\hat \sigma^2_p) = \frac{n_1^2}{(n_1+n_2)^2}\frac{2(n_1-1)}{n_1^2} \sigma^4+\frac{n_2^2}{(n_1+n_2)^2}\frac{2(n_2-1)}{n_2^2}\sigma^4 = \frac {2n_1+2n_2-4}{(n_1+n_2)^2}\sigma^4$$
and
$$B\left(\hat \sigma^2_p\right) = \frac{n_1}{n_1+n_2}E(\hat \sigma^2_1)+\frac{n_2}{n_1+n_2}E(\hat \sigma^2_2) - \sigma^2 = \frac {-2}{n_1+n_2} \sigma^2$$
So the MSE here is
$$MSE(\hat \sigma^2_p) = \frac {2n_1+2n_2-4}{(n_1+n_2)^2}\sigma^4+\frac {4}{(n_1+n_2)^2} \sigma^4 = \frac {2}{n_1+n_2}\sigma^4$$
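A small Monte Carlo sketch (again with arbitrary sample sizes and variance of my own choosing) confirms that the simulated MSEs match the two closed-form expressions, with the pooled estimator coming out ahead:

set.seed(2)
n1 <- 5; n2 <- 9; sigma <- 1.5; R <- 1e5
mse <- replicate(R, {
  v <- rnorm(n1, 0, sigma); w <- rnorm(n2, 0, sigma)
  s2_1 <- mean((v - mean(v))^2)                                       # separate MLE, first sample
  s2_p <- (sum((v - mean(v))^2) + sum((w - mean(w))^2)) / (n1 + n2)   # pooled MLE
  c((s2_1 - sigma^2)^2, (s2_p - sigma^2)^2)
})
rowMeans(mse)                                   # simulated MSEs: separate, pooled
c((2*n1 - 1)/n1^2, 2/(n1 + n2)) * sigma^4       # theoretical MSEs: separate, pooled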
In order for sample-pooling to be superior in MSE terms we want that
$$MSE(\hat \sigma^2_p) < MSE(\hat \sigma^2_i), i=1,2$$
$$\Rightarrow \frac {2}{n_1+n_2}\sigma^4 < \frac{2n_i-1}{n_i^2} \sigma^4 \Rightarrow 2n_i^2 < 2n_in_1 - n_1 + 2n_in_2 - n_2$$
This reduces to the same condition for either $i=1$ or $i=2$, namely
$$0 < - n_1 + 2n_1n_2 - n_2 \Rightarrow \frac {n_1+n_2}{n_1n_2} < 2 \Rightarrow \frac 1{n_2} + \frac {1}{n_1} < 2$$
which holds, since both sample sizes are strictly higher than unity.
Therefore we conclude that "unite & conquer" is the MSE-efficient approach here.
But we will lose something: if $n_1 \neq n_2$ the pooled-sample variance estimator no longer has a Gamma finite-sample distribution, because it is a linear combination of two Gamma random variables with different scale parameters (different $\theta_i$'s). This does not result in a Gamma but in a rather complicated infinite-sum expression (see this paper), which means that for tests involving the pooled-sample variance estimator we will have to resort to the asymptotic normality result.
Alternatively, if the difference between $n_1$ and $n_2$ is not large, and both samples have respectable sizes, we may even consider dropping observations from the larger sample in order to make $n_1 =n_2$ and preserve the Gamma distribution result.
Best Answer
This answer goes through the minimal amount of analysis to discern the nature of the problem, find a solution method, and identify potential problems or limitations. In the end, you will need to use numerical methods to maximize the likelihood.
In order for any function to be a density, it has to integrate to $1$ and be nonnegative throughout its domain, which here is $[0,1]$. When we add the requirement that it be decreasing, we obtain a set of linear conditions on the parameters $a$ and $b$:
1. $1 = \int_0^1 f(x) dx$ automatically.
2. $1-a-b = f(0) \ge f(1) = 3a + 2b + 1-a-b = 1 + 2a + b \ge 0$.
3. $6a x + 2b = f^\prime(x) \le 0$ for $x\in [0,1]$, so in particular $2b = f^\prime(0) \le 0$ and $6a + 2b = f^\prime(1) \le 0$.
Conditions (2) and (3) determine five half-planes in $(a,b)$ coordinates. Their intersection is a triangle with vertices at $(-1/2,0)$, $(0,0)$, and $(1,-3)$, as shown in the figure.
This is the set of possible values of the parameters. Its boundary consists of portions of the zeros of three functions which are negative on the interior:
$$\cases{g_1(a,b) = 3a + b \\ g_2(a,b) = b \\ g_3(a,b) = -(2a + b + 1).}$$
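As a quick check of the geometry (my own verification, not part of the original argument), the pairwise intersections of these three boundary lines recover the triangle's vertices:

solve(rbind(c(3, 1), c(0, 1)), c(0, 0))     # g1 = g2 = 0  ->  (0, 0)
solve(rbind(c(0, 1), c(2, 1)), c(0, -1))    # g2 = g3 = 0  ->  (-1/2, 0)
solve(rbind(c(3, 1), c(2, 1)), c(0, -1))    # g1 = g3 = 0  ->  (1, -3)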
Given iid data $\mathbf{x}=(x_1, x_2, \ldots, x_n)$ drawn from such a distribution, the log likelihood of $(a,b)$ is (by definition) the logarithm of the probability of $\mathbf{x}$. It depends on the unknown value $(a,b)$:
$$\Lambda(a,b|\mathbf{x}) = \sum_{i=1}^n \log f(x_i|a,b) = \sum_{i=1}^n \log \left(3ax_i^2 + 2b x_i + 1-a-b\right).$$
Critical points $(a,b)$ satisfy the Karush-Kuhn-Tucker (KKT) conditions: there must exist constants $\mu_1, \mu_2, \mu_3$ such that
$$D \Lambda(a,b) = \mu_1 D g_1(a,b) + \mu_2 D g_2(a,b) + \mu_3 D g_3(a,b).\tag{1}$$
(There are additional conditions that amount to asserting that $(a,b)$ lies within the triangle, the $\mu_i$ are non-negative, and each product $\mu_i g_i(a,b)$ is zero.)
The derivatives are readily computed:
$$\eqalign{ D \Lambda(a,b) &= \left(\frac{\partial \Lambda}{\partial a}(a,b), \frac{\partial \Lambda}{\partial b}(a,b)\right)^\prime \\ &= \left(\sum_{i=1}^n \frac{3x_i^2 - 1}{3ax_i^2 + 2b x_i + 1-a-b}, \sum_{i=1}^n \frac{2x_i - 1}{3ax_i^2 + 2b x_i + 1-a-b}\right)^\prime; \\ Dg_1(a,b) &= (3,1)^\prime; \\ Dg_2(a,b) &= (0,1)^\prime;\\ Dg_3(a,b) &= (2,1)^\prime. }$$
When the algebra is performed and a common denominator is found, the numerators in $(1)$ become high-degree polynomials in $a$ and $b$: you need to solve this one numerically. When you do, you will obtain at least one $(a,b)$ for which $\Lambda$ is maximized. If one or more of the $\mu_i$ are nonzero (and any constrained optimizer will tell whether this is so), then the solution is on the boundary and special techniques are needed to develop confidence intervals for the estimates.
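A minimal numerical sketch (my own, under the assumption that x holds the observed sample in [0, 1]): since the three constraints are linear, R's constrOptim() can maximize the likelihood over the triangle directly.

f_ab <- function(x, a, b) 3*a*x^2 + 2*b*x + 1 - a - b

neg_loglik <- function(theta){                  # x is the observed data vector
  dens <- f_ab(x, theta[1], theta[2])
  if (any(dens <= 0)) return(Inf)               # defensive guard for boundary evaluations
  -sum(log(dens))
}

# constrOptim() expects the constraints as ui %*% theta - ci >= 0:
#   -3a -  b >= 0    (g1 = 3a + b <= 0)
#        - b >= 0    (g2 = b <= 0)
#    2a +  b >= -1   (g3 = -(2a + b + 1) <= 0)
ui <- rbind(c(-3, -1), c(0, -1), c(2, 1))
ci <- c(0, 0, -1)

fit <- constrOptim(theta = c(-0.1, -0.1),       # any point strictly inside the triangle
                   f = neg_loglik, grad = NULL, ui = ui, ci = ci)
fit$par                                         # maximum-likelihood estimates of (a, b)

If the reported solution lies on an edge of the triangle (one of the constraints active), that is exactly the boundary case mentioned above, and the usual Hessian-based intervals no longer apply.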