[Note: This is my answer to the Dec. 19, 2014, version of the question.]
If you apply the change of variable $y=x^2$ in your density
$$f_X(x|\alpha,\beta,\sigma)=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{x^2}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{x^{2\alpha-1}}{2^{\alpha-1}\sigma^{2\alpha}}\mathbb{I}_{{\mathbb{R}}^{+}}(x)
$$ the Jacobian is given by $\dfrac{\text{d}y}{\text{d}x}= 2x = 2y^{1/2}$ and hence
\begin{align*}
f_Y(y|\alpha,\beta,\sigma)&=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{y}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{y^{\frac{2\alpha-1}{2}}}{2^{\alpha-1}\sigma^{2\alpha}}\frac{1}{2 y^{1/2}}\mathbb{I}_{{\mathbb{R}}^{+}}(y)\\
&=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{y}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{y^{{\alpha-1}}}{2^{\alpha}\sigma^{2\alpha}}\mathbb{I}_{{\mathbb{R}}^{+}}(y)
\end{align*}
This shows that
- This is a standard $\mathcal{G}(\alpha,2\sigma^2\beta)$ model, i.e. you observe $$(x_1^2,\ldots,x_n^2)=(y_1,\ldots,y_n)\stackrel{\text{iid}}{\sim}\mathcal{G}(\alpha,\eta);$$
- the model is over-parametrised since only $\eta=2\sigma^2\beta$ can be identified;
- EM is not necessary to find the MLE of $(\alpha,\eta)$, which is not available in closed form but is the solution of $$\hat\eta=\bar{y}/\hat{\alpha}\qquad\log(\hat{\alpha})-\psi(\hat{\alpha})=\log(\bar{y})-\frac{1}{n}\sum_{i=1}^n\log(y_i)$$ where $\psi(\cdot)$ is the di-gamma function. A paper by Thomas Minka describes fast approximations for solving the second equation; see the numerical sketch below.
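To make this concrete, here is a minimal numerical sketch in Python (my own addition, not part of the original answer): it solves the second equation for $\hat\alpha$ with a generic root finder and then sets $\hat\eta=\bar y/\hat\alpha$. The simulated data and the bracketing interval are assumptions for illustration; Minka's fast approximations would replace the root finder with a few Newton-type updates.

```python
# Minimal sketch (assumptions: simulated Gamma data, generic root finder rather
# than Minka's fast approximation) for solving the Gamma MLE equations.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.5, scale=1.7, size=1_000)   # y_i = x_i^2 ~ Gamma(alpha, eta)

s = np.log(y.mean()) - np.log(y).mean()           # log(ybar) - mean(log y_i) >= 0

# Solve log(alpha) - psi(alpha) = s; the left-hand side decreases from
# +infinity to 0 on (0, infinity), so the root is unique.
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - s, 1e-8, 1e8)
eta_hat = y.mean() / alpha_hat                    # eta_hat = ybar / alpha_hat

print(alpha_hat, eta_hat)                         # close to (2.5, 1.7)
```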
Your reasoning is mostly correct.
The joint density of the sample $(X_1,X_2,\ldots,X_n)$ is
\begin{align}
f_{\theta}(x_1,x_2,\ldots,x_n)&=\frac{\theta^n}{\left(\prod_{i=1}^n (1+x_i)\right)^{1+\theta}}\mathbf1_{x_1,x_2,\ldots,x_n>0},\qquad\theta>0
\\ \implies \ln f_{\theta}(x_1,x_2,\ldots,x_n)&=n\ln(\theta)-(1+\theta)\sum_{i=1}^n\ln(1+x_i)+\ln(\mathbf1_{\min_{1\le i\le n} x_i>0})
\\ \implies\frac{\partial}{\partial \theta}\ln f_{\theta}(x_1,x_2,\ldots,x_n)&=\frac{n}{\theta}-\sum_{i=1}^n\ln(1+x_i)
\\ &=-n\left(\frac{\sum_{i=1}^n\ln(1+x_i)}{n}-\frac{1}{\theta}\right)
\end{align}
Thus we have expressed the score function in the form
$$\frac{\partial}{\partial \theta}\ln f_{\theta}(x_1,x_2,\ldots,x_n)=k(\theta)\left(T(x_1,x_2,\ldots,x_n)-\frac{1}{\theta}\right)\tag{1}$$
which is the equality condition in the Cramér-Rao inequality.
It is not difficult to verify (for instance via the substitution $u=\ln(1+x)$, which shows that $\ln(1+X_i)\sim\mathsf{Exp}(\theta)$) that $$E(T)=\frac{1}{n}\sum_{i=1}^n\underbrace{E(\ln(1+X_i))}_{=1/\theta}=\frac{1}{\theta}\tag{2}$$
From $(1)$ and $(2)$ we can conclude that
- The statistic $T(X_1,X_2,\ldots,X_n)$ is an unbiased estimator of $1/\theta$.
- $T$ satisfies the equality condition of the Cramér-Rao inequality.
These two facts together imply that $T$ is the UMVUE of $1/\theta$.
The second bullet actually tells us that the variance of $T$ attains the Cramér-Rao lower bound for estimating $1/\theta$.
Indeed, as you have shown,
$$E_{\theta}\left[\frac{\partial^2}{\partial\theta^2}\ln f_{\theta}(X_1)\right]=-\frac{1}{\theta^2}$$
This implies that the information function for the whole sample is $$I(\theta)=-nE_{\theta}\left[\frac{\partial^2}{\partial\theta^2}\ln f_{\theta}(X_1)\right]=\frac{n}{\theta^2}$$
So the Cramér-Rao lower bound for $1/\theta$ and hence the variance of the UMVUE is
$$\operatorname{Var}(T)=\frac{\left[\frac{d}{d\theta}\left(\frac{1}{\theta}\right)\right]^2}{I(\theta)}=\frac{1}{n\theta^2}$$
Here we have exploited a corollary of the Cramér-Rao inequality: for a family of distributions $f_\theta$ (assuming the regularity conditions of the CR inequality hold), if a statistic $T$ is unbiased for $g(\theta)$ for some function $g$ and satisfies the equality condition in the CR inequality, namely $$\frac{\partial}{\partial\theta}\ln f_{\theta}(x)=k(\theta)\left(T(x)-g(\theta)\right),$$ then $T$ must be the UMVUE of $g(\theta)$. So this argument does not work in every problem.
Alternatively, using the Lehmann-Scheffé theorem you could say that $T=\frac{1}{n}\sum_{i=1}^{n} \ln(1+X_i)$ is the UMVUE of $1/\theta$, as it is unbiased for $1/\theta$ and is a complete sufficient statistic for the family of distributions. That $T$ is complete sufficient is clear from the structure of the joint density of the sample, which is that of a one-parameter exponential family. The variance of $T$, however, might be a little tricky to find directly.
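It is easy to check by simulation, though. Below is a small Monte Carlo sketch in Python (my own addition, with arbitrary choices of $\theta$, $n$ and the number of replications): it samples from the density by inverting the CDF $F(x)=1-(1+x)^{-\theta}$ and confirms numerically that $T$ is unbiased for $1/\theta$ with variance close to the Cramér-Rao bound $1/(n\theta^2)$.

```python
# Monte Carlo sketch (illustrative parameter values assumed): check that
# T = mean(log(1 + X_i)) is unbiased for 1/theta and attains the CR bound.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 50, 200_000

# Inverse-CDF sampling: F(x) = 1 - (1 + x)^(-theta)  =>  X = U^(-1/theta) - 1.
u = rng.uniform(size=(reps, n))
x = u ** (-1.0 / theta) - 1.0
t = np.log1p(x).mean(axis=1)                     # one value of T per replicate

print(t.mean(), 1 / theta)                       # both ~0.5   (unbiasedness)
print(t.var(), 1 / (n * theta ** 2))             # both ~0.005 (CR lower bound)
```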
The first question is somewhat difficult to answer, because it requires speculation about the author's psychology, but let me at least try to give some intuition that it is a sensible score. Recall that what it means for $\psi(W;\theta,\eta)$ to be a score is that at the true values of $\eta_0 = (\ell_0,m_0)$, we have the moment condition
$$E[\psi(W;\theta,\eta_0)] = 0$$
Let us now write out what that entails given the $\psi$ defined above. We end up with the equation
$$E[(Y - \ell_0(X) - \theta_0(D-m_0(X)))\cdot(D-m_0(X))] = 0$$ Using the linearity of expectations and rearranging, we can get a closed form expression for $\theta_0$ given the above equation: $$\theta_0 = \frac{E[(Y-\ell_0(X))(D-m_0(X))]}{E[(D-m_0(X))^2]} = \frac{E[\mathrm{Cov}(Y,D|X)]}{E[\mathrm{Var}(D|X)]}$$
I like to think about this expression for $\theta_0$ in terms of the Frisch-Waugh-Lovell (FWL) theorem. Recall that this theorem states that in the linear model $Y = \beta D + \gamma X + \varepsilon$, $\beta$ is numerically equal to the coefficient in the regression $r_Y = \beta r_D + \delta$, where $r_Y = Y - \mathrm{L}(Y|X)$ and $r_D = D - \mathrm{L}(D|X)$ are respectively the residuals from predicting $Y$ and $D$ using the $X$'s (here $\mathrm{L}(A|B)$ is defined to be the best linear predictor of $A$ given $B$). Recall additionally that when $D$ is scalar, the OLS coefficient is just the covariance of the outcome and the predictor divided by the variance of the predictor, i.e. $\beta = \frac{\mathrm{Cov}(r_Y,r_D)}{\mathrm{Var}(r_D)}$. The expression for $\theta_0$ can thus be thought of as the "nonparametric" analogue of the FWL theorem: rather than taking the residual from the best linear predictor, we take the residual from the best predictor, i.e. the conditional expectation function. A small numerical sketch of this residual-on-residual idea follows below.
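Here is that sketch in Python (my own illustration, not from the original answer; the data-generating process is made up, and I plug in the *true* $\ell_0$ and $m_0$ for clarity, whereas DML would estimate them by machine learning with cross-fitting): regressing the residual $Y-\ell_0(X)$ on the residual $D-m_0(X)$ recovers $\theta_0$.

```python
# Sketch of theta0 = E[Cov(Y,D|X)] / E[Var(D|X)] via residual-on-residual
# regression. The simulated design below is an assumption for illustration;
# the true nuisance functions l0, m0 stand in for ML estimates.
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 100_000, 1.5

X = rng.uniform(-2, 2, size=n)
m0 = np.sin(X)                        # m0(X) = E[D|X]
g0 = X ** 2                           # nonparametric part of the outcome
D = m0 + rng.normal(size=n)
Y = theta0 * D + g0 + rng.normal(size=n)
l0 = theta0 * m0 + g0                 # l0(X) = E[Y|X]

rY, rD = Y - l0, D - m0               # residuals from the best predictors
theta_hat = (rY * rD).sum() / (rD ** 2).sum()
print(theta_hat)                      # close to 1.5
```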
Now, addressing your second point, let $\delta_\ell(X),\delta_m(X)$ be two test functions, perturbing $\ell$ and $m$ respectively. Neyman orthogonality then states that for any such choice of $\delta_\ell$ and $\delta_m$ we have
$$\frac{\mathrm d E[\psi(W;\theta,\eta_0 + r(\delta_\ell,\delta_m))]}{\mathrm d r} = 0$$ where the derivative is evaluated at $r=0$. To prove this, we can simply expand out the definition of $\psi$ to obtain
$$\begin{aligned}E[\psi(W;\theta,\eta_0 + r(\delta_\ell,\delta_m))] &= E[(Y - \ell_0(X)-r \delta_\ell(X))(D - m_0(X) - r\delta_m(X))]\\ &\quad- \theta E[(D-m_0(X)-r\delta_m(X))^2] \end{aligned}$$ Let us first check that the derivative of the first term is 0. To do so, we note that by differentiating under the expectation sign and evaluating at $r=0$, we have $$\begin{aligned} \frac{\mathrm d}{\mathrm dr} E[(Y - \ell_0(X)-r \delta_\ell(X))(D - m_0(X) - r\delta_m(X))] &= E\left[\frac{\mathrm d}{\mathrm dr} (Y - \ell_0(X)-r \delta_\ell(X))(D - m_0(X) - r\delta_m(X))\right]\\ &= -E[(Y-\ell_0(X))\delta_m(X) + \delta_\ell(X)(D- m_0(X))]\\ &= -E\left[\underbrace{E[Y-\ell_0(X)|X]}_{=0} \delta_m(X)\right] - E\left[\delta_\ell(X)\underbrace{E[D-m_0(X)|X]}_{=0}\right] = 0 \end{aligned}$$ Note that the two terms above vanish because $\ell_0$ and $m_0$ are by definition the conditional expectation functions of $Y$ and $D$ given $X$.
In light of the above, all that remains is to check that the derivative of the second term is also 0 at $r=0$. Specifically, we must show $$\theta\,\frac{\mathrm dE[(D-m_0(X)-r\delta_m(X))^2]}{\mathrm dr} = 0$$ Now, differentiating under the integral sign again and evaluating at $r=0$, we have $$\begin{aligned}\theta\,\frac{\mathrm d E[(D-m_0(X)-r\delta_m(X))^2]}{\mathrm dr} &= \theta E\left[\frac{\mathrm d}{\mathrm dr}(D-m_0(X)-r\delta_m(X))^2\right] \\&= -2 \theta E[(D-m_0(X))\delta_m(X)]\\ &= -2\theta E\big[E[(D-m_0(X))\delta_m(X) \,|\, X]\big]\\ &= -2\theta E\big[\underbrace{E[D-m_0(X)|X]}_{=0}\,\delta_m(X)\big] = 0\end{aligned}$$ So once again, after some manipulation, the vanishing of this term is due to the definition of $m_0$ as the conditional expectation function of $D$.
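To see the orthogonality numerically, the following finite-difference sketch in Python (my own addition, reusing the same simulated design as the earlier block; the test functions $\delta_\ell,\delta_m$ are arbitrary choices) evaluates the empirical moment $E[\psi(W;\theta_0,\eta_0+r(\delta_\ell,\delta_m))]$ under perturbed nuisances: the derivative at $r=0$ is zero up to sampling noise, while the moment itself moves only at order $r^2$ for larger $r$.

```python
# Finite-difference check of Neyman orthogonality (simulated design and test
# functions are assumptions for illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 500_000, 1.5

X = rng.uniform(-2, 2, size=n)
m0, g0 = np.sin(X), X ** 2
D = m0 + rng.normal(size=n)
Y = theta0 * D + g0 + rng.normal(size=n)
l0 = theta0 * m0 + g0

delta_l, delta_m = np.cos(X), X       # arbitrary test functions

def moment(r):
    """Empirical analogue of E[psi(W; theta0, eta0 + r*(delta_l, delta_m))]."""
    resY = Y - (l0 + r * delta_l)
    resD = D - (m0 + r * delta_m)
    return np.mean((resY - theta0 * resD) * resD)

h = 1e-3
print((moment(h) - moment(-h)) / (2 * h))   # ~0: directional derivative vanishes
print(moment(0.5))                          # clearly nonzero: effect is order r^2
```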