One does not compare the absolute values of two AICs (which can be like $\sim 100$ but also $\sim 1000000$), but considers their difference:
$$\Delta_i=AIC_i-AIC_{\rm min},$$
where $AIC_i$ is the AIC of the $i$-th model, and $AIC_{\rm min}$ is the lowest AIC obtained among the set of models examined (i.e., the preferred model). The rule of thumb, outlined e.g. in Burnham & Anderson 2004, is:
- if $\Delta_i<2$, then there is substantial support for the $i$-th model (or the evidence against it is worth only a bare mention), and the proposition that it is a proper description is highly probable;
- if $2<\Delta_i<4$, then there is still appreciable (though weaker) support for the $i$-th model;
- if $4<\Delta_i<7$, then there is considerably less support for the $i$-th model;
- models with $\Delta_i>10$ have essentially no support.
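This rule of thumb is easy to apply in code. The sketch below (with AIC values invented purely for illustration) computes the $\Delta_i$ and attaches the verbal labels; the thresholds are the rough guideline above, not exact boundaries.

```python
import numpy as np

# Hypothetical AIC values for three candidate models (numbers invented for illustration).
aics = np.array([100.0, 101.5, 112.0])
deltas = aics - aics.min()          # Delta_i = AIC_i - AIC_min

def support(delta):
    """Rough verbal scale in the spirit of Burnham & Anderson (2004)."""
    if delta < 2:
        return "substantial support"
    if delta < 4:
        return "appreciable support"
    if delta < 7:
        return "considerably less support"
    return "essentially no support"

for aic, d in zip(aics, deltas):
    print(f"AIC = {aic:6.1f}, Delta = {d:5.1f} -> {support(d)}")
```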
Now, regarding the 0.7% mentioned in the question, consider two situations:
- $AIC_1=AIC_{\rm min}=100$ and $AIC_2$ is bigger by 0.7%: $AIC_2=100.7$. Then $\Delta_2=0.7<2$ so there is no substantial difference between the models.
- $AIC_1=AIC_{\rm min}=100000$ and $AIC_2$ is bigger by 0.7%: $AIC_2=100700$. Then $\Delta_2=700\gg 10$, so there is essentially no support for the second model.
Hence, saying that the difference between AICs is 0.7% does not, by itself, provide any information.
The AIC value contains additive constants coming from the log-likelihood
$\mathcal{L}$; these cancel in the difference, so the $\Delta_i$ are free of such constants. One
might consider $\Delta_i = AIC_i - AIC_{\rm min}$ a rescaling transformation that forces the best model to have $\Delta = 0$.
The formulation of AIC penalizes the use of an excessive number of parameters and hence discourages overfitting: it prefers models with fewer parameters, as long as the more complex ones do not provide a substantially better fit. AIC tries to select the model (among those examined) that most adequately describes reality, in the form of the data under examination. Whether any model is a true description of the data is never considered. Note that AIC only tells you which model describes the data better; it does not provide any interpretation.
Personally, I would say that if you have a simple model and a complicated one with a much lower AIC, then the simple model is not good enough. If the more complex model is really much more complicated but the $\Delta_i$ is not large (maybe $\Delta_i<2$, maybe $\Delta_i<5$, depending on the particular situation), I would stick to the simpler model if it is genuinely easier to work with.
Further, you can ascribe a probability to the $i$-th model via
$$p_i=\exp\left(\frac{-\Delta_i}{2}\right),$$
which provides the relative (compared to $AIC_{\rm min}$) probability that the $i$-th model minimizes the AIC. For example, $\Delta_i=1.5$ corresponds to $p_i=0.47$ (quite high), and $\Delta_i=15$ corresponds to $p_i=0.0005$ (quite low). The first case means there is a 47% probability that the $i$-th model might in fact be a better description than the model that yielded $AIC_{\rm min}$; in the second case this probability is only 0.05%.
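The relative likelihood is straightforward to compute; this small sketch reproduces the two numbers quoted above.

```python
import math

def rel_likelihood(delta):
    # exp(-Delta_i / 2): relative likelihood of model i vs. the AIC-minimizing model
    return math.exp(-delta / 2.0)

print(round(rel_likelihood(1.5), 3))   # ~0.472
print(round(rel_likelihood(15.0), 5))  # ~0.00055
```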
Finally, regarding the formula for AIC:
$$AIC=2k-2\mathcal{L},$$
it is important to note that when two models with similar $\mathcal{L}$ are considered, the $\Delta_i$ depends mainly on the number
of parameters through the $2k$ term. Hence, when $\frac{\Delta_i}{2\Delta k} < 1$, the relative improvement is due to an actual improvement of the fit, not only to the increased number of parameters.
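To make the formula concrete, here is a minimal sketch (data and models invented purely for illustration) that computes $AIC = 2k - 2\mathcal{L}$ for two nested normal models: one with the mean fixed at zero ($k=1$) and one with the mean estimated ($k=2$).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=500)   # simulated data with a nonzero mean

def normal_loglik(x, mu, sigma):
    # Gaussian log-likelihood of i.i.d. data
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Model A: mean fixed at 0, only sigma estimated (k = 1)
sigma_a = np.sqrt(np.mean(x**2))               # MLE of sigma when mu is fixed at 0
aic_a = 2 * 1 - 2 * normal_loglik(x, 0.0, sigma_a)

# Model B: mean and sigma both estimated (k = 2)
aic_b = 2 * 2 - 2 * normal_loglik(x, x.mean(), x.std())

print(aic_a, aic_b)
```

Here the extra parameter of model B pays for itself: the gain in $\mathcal{L}$ far exceeds the $2\Delta k = 2$ penalty, so model B has the lower AIC.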
TL;DR
- It's a bad criterion; use the (absolute) difference of the AICs, not their relative change.
- The percentage says nothing.
- Impossible to answer without information on the models, the data, and what the different results mean.
Consider scalar parameters $\theta_0$ and the corresponding scalar estimate $\hat \theta$ for simplicity.
I will answer Q1 and Q3, which essentially ask why the mean of the score function is $\Bbb{E}_{\theta}(s(\theta)) = 0$. This is a widely known result. To put it simply, notice that the score function $s(\theta)$ depends on the random observations $X$. We can take its expectation as follows:
\begin{align}
\Bbb{E}_{\theta}(s) & = \int_x f(x;\theta) \frac{\partial \log f(x;\theta)}{\partial \theta} dx \\
&=\int_x \frac{\partial f(x;\theta)}{\partial \theta} dx = \frac{\partial}{\partial \theta}\int_x f(x;\theta)\, dx = \frac{\partial}{\partial \theta}(1) = 0 \qquad \text{(exchanging integral and derivative)}
\end{align}
Now, notice that $S_n$ is just the average of the score functions of independent observations. Hence, its expectation is also zero.
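A quick simulation sketch illustrates this: for $X \sim N(\theta, 1)$ the score is $s(\theta) = X - \theta$, and its sample average evaluated at the true $\theta_0$ should be close to zero (the distribution, true value, and sample size here are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0

# For X ~ N(theta, 1) the score is s(theta) = d/dtheta log f(X; theta) = X - theta.
x = rng.normal(theta0, 1.0, size=100_000)
scores = x - theta0

print(scores.mean())   # close to 0, since E_theta[s(theta)] = 0
```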
For Q2), the motivation is to study the asymptotic properties of our estimator with respect to the true parameter. Let $\hat{\theta}$ be the maximizer of $L_{n}(\theta)=\frac{1}{n} \sum_{i=1}^{n} \log f\left(X_{i} | \theta\right)$. Now, by the mean value theorem,
\begin{align}
0=L_{n}^{\prime}(\hat{\theta}) & =L_{n}^{\prime}\left(\theta_{0}\right)+L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)\left(\hat{\theta}-\theta_{0}\right) \quad \text{(for some $\hat\theta_1$ between $\hat\theta$ and $\theta_0$)}\\
\implies & \left(\hat{\theta}-\theta_{0}\right) = -\frac{L_{n}^{\prime}\left(\theta_{0}\right)}{L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)}
\end{align}
Consider the numerator:
\begin{align} \sqrt{n}\left(\frac{1}{n} \sum_{i=1}^{n} l^{\prime}\left(X_{i} | \theta_{0}\right)-\mathbb{E}_{\theta_{0}} l^{\prime}\left(X_{1} | \theta_{0}\right)\right) & = \sqrt{n}(S_n - \Bbb{E}(S_n)) \\ & \rightarrow N\left(0, \operatorname{Var}_{\theta_{0}}\left(l^{\prime}\left(X_{1} | \theta_{0}\right)\right)\right) = N(0,V)
\end{align}
Now, the denominator $L^{\prime\prime}_{n}(\hat\theta_1)$ converges by the LLN to $\mathbb{E}_{\theta_0}\, l^{\prime\prime}(X_1|\theta_0) = -J$, the negative of the Fisher information $J$; the sign does not affect the limiting variance.
Therefore, for the scalar parameter case, we can see that $$\sqrt{n}(\hat \theta - \theta_0) \rightarrow N\left(0,\frac{V}{J^2}\right).$$ Under the usual regularity conditions the information equality gives $V = J$, so the limiting variance simplifies to $1/J$.
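This asymptotic result can be checked by simulation. The sketch below uses the mean of a $N(\theta, 1)$ model, for which the MLE is the sample mean and $J = V = 1$, so $\sqrt{n}(\hat\theta - \theta_0)$ should be approximately $N(0, 1)$ (the model, true value, and sample sizes are arbitrary illustrations).

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 1.0, 400, 5000

# MLE of the mean of N(theta, 1) is the sample mean; here J = 1 and V = J.
theta_hat = rng.normal(theta0, 1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)

print(z.mean(), z.var())   # should be near 0 and near V / J^2 = 1
```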
Best Answer
There is no particular meaning to AIC for comparison between different data sets. Yes, the AIC value can change with increased $n$. However, AIC is self-referential: one can only compare different models using the SAME data set, not different data sets. Even that is tricky; for example, it applies most readily to detecting the probably-better model among nested models (models in a set/subset format, i.e., when all of the models tested can be obtained by eliminating parameters from the most inclusive model).
Some experts suggest that AIC also applies to detecting the probably-better model among non-nested models, but there are counterexamples; see this Q/A. Perhaps a more meaningful question, which the OP's question only indirectly implies, is "How well can AIC discriminate between two models as the sample grows?", and the answer is: apparently better for increasing $n$. This is not unexpected, in the sense that AIC is only asymptotically correct; e.g., from Wikipedia, "We ... choose the candidate model that minimized the information loss. We cannot choose with certainty (Sic, italics are mine), because we do not know f (Sic, the unknown data generating process). Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc...)."
Now for some arbitrary examples of how AIC changes. The first examines how AIC varies for the same model fitted to random standard normal data generated with different seeds. Shown is a histogram of 1000 repetitions of (normal distribution) model AIC values, each from 100 random standard normal outcomes.
This shows a distribution for which normality is not excluded, with $\mu \to -497.672,\sigma \to 48.5034$. This illustrates that the mean AIC value over 1000 independent repetitions of $n=100$ is an educated guess for the location of AIC. Next, we apply this "educated guess" and fit it to show the trend:
This plot shows how mean AIC values (from 1000 independent trials) change as the number of random outcomes in each trial ranges over $n=5,10,15,\ldots,95,100$. The trend appears to be approximately cubic, with an SE of 1 AIC unit (R$^2=0.999964$). The meaning of this is like the sound of one hand clapping; all we have done is find a result consistent with AIC being a better discriminator for increasing $n$. Without comparing against a second model for each trial we cannot detect anything. The only question remaining is why the AIC values increase for more data in the OP's question. Some software packages will sometimes show $-$AIC values in tables, so that more is better as opposed to less is better, but use the AIC values themselves for discriminating between models.
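For reference, here is a sketch of the same kind of experiment using the textbook convention $AIC = 2k - 2\mathcal{L}$. The absolute values will not match the plot above (different software uses different additive conventions for the likelihood, which is exactly why only differences on the same data are meaningful), but the monotone change of the mean AIC with $n$ for a fixed model is reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_aic(x):
    # AIC = 2k - 2*loglik for a normal model with both mu and sigma estimated (k = 2)
    mu, sigma = x.mean(), x.std()
    ll = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))
    return 2 * 2 - 2 * ll

mean_aic = {}
for n in (5, 25, 50, 100):
    mean_aic[n] = np.mean([normal_aic(rng.normal(size=n)) for _ in range(1000)])
    print(n, round(mean_aic[n], 1))
```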