Solved – Derivation of Fisher information

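The Fisher information is defined as the second moment of the score, i.e. of the first partial derivative of the log-likelihood with respect to the parameter. I am unable to follow the step below (from Wikipedia) that moves from the second partial derivative of the log-likelihood to the difference between a second-order partial derivative term and the square of the first-order partial derivative. Is this a chain rule result?

$$\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)=\frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)}-\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2$$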

If someone were feeling "talkative" and would be willing to fill in the blanks there – it would be appreciated.

https://en.wikipedia.org/wiki/Fisher_information

Best Answer

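This is a chain rule problem, followed by one application of the product rule.

Given

$$\frac{d}{dx}\log x = \frac{1}{x}$$

the chain rule gives

$$\frac{d}{dx}\log f(x) = \frac{1}{f(x)}\,\frac{d f(x)}{dx}$$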

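Starting with the first-order derivative, the chain rule applied to the log-likelihood gives

$$\frac{\partial}{\partial\theta}\log f(X;\theta) = \frac{\partial f(X;\theta)}{\partial\theta}\cdot\frac{1}{f(X;\theta)}$$

Let us now take the second partial derivative by differentiating that first-order result again. This requires the product rule:

$$\frac{d}{dx}\bigl[a(x)\,b(x)\bigr] = \frac{d a(x)}{dx}\,b(x) + a(x)\,\frac{d b(x)}{dx}$$

where we set

$$a = \frac{\partial f(X;\theta)}{\partial\theta}, \qquad b = \frac{1}{f(X;\theta)} = f(X;\theta)^{-1}$$

and differentiate with respect to $\theta$. Since $\frac{\partial b}{\partial\theta} = -f(X;\theta)^{-2}\,\frac{\partial f(X;\theta)}{\partial\theta}$, the final result then follows:

$$\frac{\partial^2}{\partial\theta^2}\log f(X;\theta) = \frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)} - \left(\frac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right)^2 = \frac{\frac{\partial^2}{\partial\theta^2}f(X;\theta)}{f(X;\theta)} - \left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2$$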
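As a sketch of why this identity matters, take expectations of $\frac{\partial^2}{\partial\theta^2}\log f = \frac{\partial^2 f/\partial\theta^2}{f} - \left(\frac{\partial}{\partial\theta}\log f\right)^2$, writing $f$ for $f(X;\theta)$. The step below assumes the usual regularity conditions (the support of $f$ does not depend on $\theta$, and differentiation may be moved inside the integral); these are assumptions, not something stated in the question.

$$\begin{aligned}
\operatorname{E}_\theta\left[\frac{\partial^2 f/\partial\theta^2}{f}\right] &= \int \frac{\partial^2 f(x;\theta)}{\partial\theta^2}\,\text{d}x = \frac{\partial^2}{\partial\theta^2}\int f(x;\theta)\,\text{d}x = \frac{\partial^2}{\partial\theta^2}\,1 = 0\\
\operatorname{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right] &= 0 - \operatorname{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right]\\
I(\theta) &= \operatorname{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right] = -\operatorname{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]
\end{aligned}$$

Under those conditions the first term of the identity has zero expectation, which is why the second moment of the score and the negative expected second derivative agree.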

Related Solutions

Solved – Cauchy distribution (likelihood and Fisher information)

For the Cauchy location family with density $f(x;\theta)=\frac{1}{\pi\{1+(x-\theta)^2\}}$, the Fisher information for one observation is given by
\begin{align*}
I(\theta) &= -\mathbb{E}_\theta\left[\frac{\partial^2 \log f(X;\theta)}{\partial\theta^2}\right]\\
&=\mathbb{E}_\theta\left[ \frac{\partial^2 \log \{1+(X-\theta)^2\}}{\partial\theta^2}\right]\\
&=2\mathbb{E}_\theta\left[ -\frac{\partial }{\partial\theta}\frac{(X-\theta)}{1+(X-\theta)^2}\right]\\
&=2\mathbb{E}_\theta\left[\frac{1}{1+(X-\theta)^2}-\frac{2(X-\theta)^2}{[1+(X-\theta)^2]^2}\right]\\
&= \frac{2}{\pi}\int_\mathbb{R} \frac{1}{[1+(x-\theta)^2]^2}-\frac{2(x-\theta)^2}{[1+(x-\theta)^2]^3} \text{d}x\\
&= \frac{2}{\pi}\int_\mathbb{R} \frac{1}{[1+x^2]^2}-\frac{2x^2}{[1+x^2]^3} \text{d}x\\
&= \frac{2}{\pi}\int_\mathbb{R} \frac{1}{[1+x^2]^2}-\frac{2}{[1+x^2]^2}+\frac{2}{[1+x^2]^3} \text{d}x\\
&= \frac{2}{\pi}\int_\mathbb{R} \frac{-1}{[1+x^2]^2}+\frac{2}{[1+x^2]^3} \text{d}x
\end{align*}
where $x-\theta$ can be replaced by $x$ because the integral (and hence the information) is translation invariant.

Now it is easy to establish a recurrence relation on
$$I_k=\int_\mathbb{R} \frac{1}{[1+x^2]^k}\text{d}x$$
Indeed
\begin{align*}
I_k &= \int_\mathbb{R} \frac{1+x^2}{[1+x^2]^{k+1}}\text{d}x\\
&= I_{k+1} + \int_\mathbb{R} \frac{2kx}{[1+x^2]^{k+1}}\frac{x}{2k}\text{d}x\\
&= I_{k+1} + \frac{1}{2k} \int_\mathbb{R} \frac{1}{[1+x^2]^{k}}\text{d}x = I_{k+1} + \frac{1}{2k} I_k
\end{align*}
by an integration by parts (the boundary term vanishes). Hence
$$I_1=\pi\quad\text{and}\quad I_{k+1}=\frac{2k-1}{2k}I_k\quad k\ge 1$$
which implies
$$I_1=\pi\quad I_2=\frac{\pi}{2}\quad I_3=\frac{3\pi}{8}$$
and which leads to the Fisher information:
$$I(\theta)=\frac{2}{\pi}\left\{-I_2+2I_3\right\}=\frac{2}{\pi}\left\{\frac{-\pi}{2}+\frac{3\pi}{4}\right\}=\frac{1}{2}$$
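The value $1/2$ can be sanity-checked numerically. The sketch below uses only the Python standard library; the function names and the integration settings (`half_width`, `steps`) are illustrative choices, not part of the original answer. It evaluates the recurrence exactly with fractions, and separately estimates $\mathbb{E}[(\partial_\theta \log f)^2]$ at $\theta=0$ with a plain midpoint rule, using the score $2x/(1+x^2)$ of the standard Cauchy density.

```python
from fractions import Fraction
import math


def cauchy_integral_coefficients(k_max):
    """Return {k: c_k} with I_k = c_k * pi, from I_1 = pi and
    I_{k+1} = (2k - 1) / (2k) * I_k for k >= 1."""
    coeffs = {1: Fraction(1)}
    for k in range(1, k_max):
        coeffs[k + 1] = Fraction(2 * k - 1, 2 * k) * coeffs[k]
    return coeffs


def fisher_information_from_recurrence():
    """I(theta) = (2 / pi) * (-I_2 + 2 * I_3); the factors of pi cancel."""
    c = cauchy_integral_coefficients(3)
    return 2 * (-c[2] + 2 * c[3])


def fisher_information_numeric(half_width=200.0, steps=200_000):
    """Midpoint-rule estimate of E[score^2] for the Cauchy density at theta = 0.

    The integrand decays like 1 / x^4, so truncating the real line at
    +/- half_width discards only a negligible tail.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        density = 1.0 / (math.pi * (1.0 + x * x))
        score = 2.0 * x / (1.0 + x * x)  # d/dtheta log f(x; theta) at theta = 0
        total += score * score * density * h
    return total


if __name__ == "__main__":
    print(fisher_information_from_recurrence())
    print(fisher_information_numeric())
```

The numeric route goes through the second moment of the score rather than the second derivative, so agreement between the two functions also checks the equivalence of the two definitions for this model.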

Solved – Basic Question on Defining the Dimensions and Entries of the Fisher Information Matrix

The Fisher information is a symmetric square matrix with a number of rows/columns equal to the number of parameters you're estimating. Recall that it's the covariance matrix of the scores, and there's a score for each parameter; equivalently, it's the expectation of the negative of the Hessian of the log-likelihood, which has a row and a column for each parameter. When you want to consider different experimental treatments you represent their effects by adding more parameters to the model, i.e. more rows/columns (rather than more dimensions; a matrix has two dimensions by definition). When you're estimating only a single parameter, the Fisher information is just a one-by-one matrix (a scalar): the variance of the score, or equivalently the expected value of the negative of the second derivative of the log-likelihood.
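To make the single-parameter case concrete, here is a small sketch for a Bernoulli($p$) observation. The Bernoulli model and the function name are assumptions chosen for illustration; they do not appear in the answer. With only two outcomes, both expectations can be computed exactly as finite sums.

```python
from fractions import Fraction


def bernoulli_information(p):
    """Return (variance of the score, negative expected second derivative)
    for one Bernoulli(p) observation, computed exactly.

    Uses log f(x; p) = x * log(p) + (1 - x) * log(1 - p).
    """
    p = Fraction(p)
    probs = {1: p, 0: 1 - p}
    # First and second derivatives of log f(x; p) with respect to p.
    score = {x: Fraction(x) / p - Fraction(1 - x) / (1 - p) for x in probs}
    second = {x: -Fraction(x) / p**2 - Fraction(1 - x) / (1 - p) ** 2 for x in probs}

    mean_score = sum(probs[x] * score[x] for x in probs)
    var_score = sum(probs[x] * (score[x] - mean_score) ** 2 for x in probs)
    neg_expected_second = -sum(probs[x] * second[x] for x in probs)
    return var_score, neg_expected_second


if __name__ == "__main__":
    print(bernoulli_information(Fraction(1, 4)))
```

Both returned values equal $1/\{p(1-p)\}$, the scalar Fisher information of this model.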

For a simple linear regression model of $Y$ on $x$ with $n$ observations

$y_i = \beta_0 +\beta_1 x_i + \varepsilon_i$

where the errors $\varepsilon_i \sim \mathrm{N}(0,\sigma^2)$ are independent, there are three parameters to estimate: the intercept $\beta_0$, the slope $\beta_1$, and the error variance $\sigma^2$. The Fisher information is

$$ \begin{align} \mathcal{I}(\beta_0,\beta_1,\sigma^2) =& \operatorname{E} \left[ \begin{matrix} \left(\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0}\right)^2 & \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0} \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1} & \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0}\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2}\\ \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1}\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0} & \left(\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1}\right)^2& \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1}\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2}\\ \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2}\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0} & \tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2}\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1} & \left(\tfrac{\partial \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2}\right)^2\\ \end{matrix} \right] \\ \\ =& -\operatorname{E}\left[ \begin{matrix} \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{(\partial \beta_0)^2} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0 \partial \beta_1} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_0\partial \sigma^2}\\ \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1\partial \beta_0} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{(\partial \beta_1)^2} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \beta_1\partial \sigma^2}\\ \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2\partial \beta_0} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{\partial \sigma^2\partial \beta_1} & \tfrac{\partial^2 \ell(\beta_0,\beta_1,\sigma^2)}{(\partial \sigma^2)^2}\\ \end{matrix} \right]\\ \\ =& \left[ 
\begin{matrix} \tfrac{n}{\sigma^2} & \tfrac{\sum_i^n x_i}{\sigma^2} & 0\\ \tfrac{\sum_i^n x_i}{\sigma^2} & \tfrac{\sum_i^n x_i^2}{\sigma^2} & 0\\ 0 & 0 & \tfrac{n}{2\sigma^4} \end{matrix} \right] \end{align} $$

where $\ell(\cdot)$ is the log-likelihood function of the parameters. (Note that $x$ might be a dummy variable indicating a particular treatment.)
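The closed-form matrix can be compared against the "covariance of the scores" definition by simulation. The following is a rough standard-library sketch, not a definitive implementation: the function names, the example design points, the number of draws and the seed are all arbitrary choices made here. It builds the $3\times 3$ matrix from the formula and then averages outer products of the score vector over simulated data sets.

```python
import random


def fisher_information_linear_regression(xs, sigma2):
    """Closed-form Fisher information for (beta0, beta1, sigma^2)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return [
        [n / sigma2, sx / sigma2, 0.0],
        [sx / sigma2, sxx / sigma2, 0.0],
        [0.0, 0.0, n / (2.0 * sigma2**2)],
    ]


def score(xs, ys, beta0, beta1, sigma2):
    """Gradient of the normal log-likelihood with respect to (beta0, beta1, sigma^2)."""
    n = len(xs)
    resid = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]
    d_beta0 = sum(resid) / sigma2
    d_beta1 = sum(r * x for r, x in zip(resid, xs)) / sigma2
    d_sigma2 = -n / (2.0 * sigma2) + sum(r * r for r in resid) / (2.0 * sigma2**2)
    return [d_beta0, d_beta1, d_sigma2]


def monte_carlo_information(xs, beta0, beta1, sigma2, draws=20000, seed=0):
    """Average of score * score^T over simulated responses at the true parameters."""
    rng = random.Random(seed)
    sd = sigma2**0.5
    acc = [[0.0] * 3 for _ in range(3)]
    for _ in range(draws):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sd) for x in xs]
        s = score(xs, ys, beta0, beta1, sigma2)
        for i in range(3):
            for j in range(3):
                acc[i][j] += s[i] * s[j] / draws
    return acc


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(fisher_information_linear_regression(xs, sigma2=2.0))
    print(monte_carlo_information(xs, beta0=1.0, beta1=0.5, sigma2=2.0))
```

The simulated matrix only approximates the closed form, with Monte Carlo error shrinking as `draws` grows; the zero off-diagonal entries linking $\sigma^2$ to the regression coefficients should come out close to, but not exactly, zero.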
