Consider a vector of parameters $(\theta_1, \theta_2)$, with $\theta_1$ the parameter of interest, and $\theta_2$ a nuisance parameter.
If $L(\theta_1, \theta_2 ; x)$ is the likelihood constructed from the data $x$, the profile likelihood for $\theta_1$ is defined as $L_P(\theta_1 ; x) = L(\theta_1, \hat{\theta}_2(\theta_1) ; x)$ where $ \hat{\theta}_2(\theta_1)$ is the MLE of $\theta_2$ for a fixed value of $\theta_1$.
$\bullet$ Maximising the profile likelihood with respect to $\theta_1$ leads to the same estimate $\hat{\theta}_1$ as maximising the likelihood jointly with respect to $(\theta_1, \theta_2)$.
$\bullet$ I think the standard deviation of $\hat{\theta}_1$ may also be estimated from the second derivative of the log profile likelihood (i.e., its curvature at the maximum).
$\bullet$ The likelihood ratio statistic for $H_0: \theta_1 = \theta_0$ can be written in terms of the profile likelihood: $LR = 2 \log\left( \tfrac{L_P(\hat{\theta}_1 ; x)}{L_P(\theta_0 ; x)}\right)$.
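The first and third bullets can be checked numerically. Here is a small sketch with a normal model, taking the mean as the parameter of interest and the variance as the nuisance parameter (the model, seed, and all names are illustrative choices, not from the question; the conditional MLE of the variance at fixed $\mu$ is $\frac{1}{n}\sum_i (x_i - \mu)^2$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)
n = len(x)

def log_profile(mu):
    # For fixed mu, the MLE of the variance is mean((x - mu)^2);
    # plugging it back in gives the profile log-likelihood of mu.
    s2 = np.mean((x - mu) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

# maximise the profile likelihood over mu alone
mu_profile = minimize_scalar(lambda m: -log_profile(m),
                             bounds=(x.min(), x.max()), method="bounded").x

mu_joint = x.mean()  # joint MLE of mu in the normal model, known in closed form

# likelihood-ratio statistic for H0: mu = 2.0, written with the profile likelihood
lr = 2 * (log_profile(mu_profile) - log_profile(2.0))
```

In this run `mu_profile` agrees with `mu_joint` up to optimizer tolerance, and `lr` is non-negative since the profile likelihood is maximised at `mu_profile`.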
So it seems that the profile likelihood can be used exactly as if it were a genuine likelihood. Is that really the case? What are the main drawbacks of this approach? And what about the 'rumor' that the estimator obtained from the profile likelihood is biased (edit: even asymptotically)?
Best Answer
The estimate of $\theta_1$ from the profile likelihood is just the MLE. Maximizing with respect to $\theta_2$ for each possible $\theta_1$ and then maximizing with respect to $\theta_1$ is the same as maximizing with respect to $(\theta_1, \theta_2)$ jointly.
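This equivalence is easy to see numerically in a model where the nuisance parameter has a closed-form conditional MLE. A sketch with a Gamma model (shape as the parameter of interest, scale as the nuisance; the model and all names are illustrative, not from the answer — at fixed shape $a$, the MLE of the scale is $\bar{x}/a$):

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(42)
x = rng.gamma(shape=3.0, scale=1.5, size=300)
n, sx, slx = len(x), x.sum(), np.log(x).sum()

def nll(a, s):
    # negative log-likelihood of a Gamma(shape=a, scale=s) sample
    return n * (a * np.log(s) + gammaln(a)) + sx / s - (a - 1) * slx

def nll_profile(a):
    # profile out the scale: for fixed shape a, the MLE of s is xbar / a
    return nll(a, x.mean() / a)

# 1-D maximisation of the profile likelihood over the shape
a_profile = minimize_scalar(nll_profile, bounds=(0.1, 20), method="bounded").x

# joint 2-D maximisation over (shape, scale)
joint = minimize(lambda p: nll(p[0], p[1]), x0=[1.0, 1.0],
                 bounds=[(0.1, 20), (0.01, 20)], method="L-BFGS-B")
a_joint = joint.x[0]
```

Both routes land on the same shape estimate up to optimizer tolerance, as the answer states.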
The key weakness is that, if you base your estimate of the SE of $\hat{\theta}_1$ on the curvature of the profile likelihood, you are not fully accounting for the uncertainty in $\theta_2$.
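As an illustration of the curvature-based estimate, here is a sketch with a normal model (all names and the model choice are illustrative). In this particular model the mean and variance are orthogonal, so the profile curvature happens to reproduce the usual plug-in SE of the mean; that agreement should not be expected in general:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=500)
n = len(x)
mu_hat = x.mean()

def nll_profile(mu):
    # negative profile log-likelihood of mu (variance profiled out)
    s2 = np.mean((x - mu) ** 2)
    return 0.5 * n * (np.log(2 * np.pi * s2) + 1)

# SE from the curvature of the profile log-likelihood at the MLE
# (central-difference approximation of the second derivative)
h = 1e-3
d2 = (nll_profile(mu_hat + h) - 2 * nll_profile(mu_hat)
      + nll_profile(mu_hat - h)) / h ** 2
se_curvature = 1.0 / np.sqrt(d2)

se_plugin = np.sqrt(np.mean((x - mu_hat) ** 2) / n)  # usual plug-in SE of the mean
```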
McCullagh and Nelder, Generalized Linear Models (2nd ed.), have a short section on profile likelihood (Sec. 7.2.4, pp. 254-255). They say: