You cannot use likelihood-based statistics like AIC to compare models with different likelihood functions, because the underlying formulas differ. In linear regression the likelihood is based on the normal density; in Poisson regression it is the Poisson mass function. That difference will probably account for more of the gap in AIC than any difference in fit.
Before you decide to even use a linear model, you need to make sure that the residuals from the model are normally distributed (you can proxy that by looking at the distribution of the outcome variable, though keep in mind it isn't the same). If they are not normally distributed, or close enough for the eye, then you can't use a normal regression model to do any hypothesis testing.
Assuming that it is approximately normal, I would take two broad approaches to choosing the model to report.
1) Predicted outcomes. Estimate the predicted outcomes of each model and compare. Does the linear model have better predictive ability? You may want to do this in a cross-validation framework, where you "train" your model on part of your data and use the rest for prediction.
2) Intuitive interpretation of coefficients. Poisson coefficients can be complicated to understand - they are not the change in the number of y but rather a proportional change. Depending on your context this may be more or less useful. Sometimes it is worth sacrificing fit if your model can be more easily interpreted by the end user - for example, some researchers are willing to forgo the complexity of logit and probit models for the easier-to-interpret coefficients of a linear probability model, even though the LPM has many drawbacks. Think about who your audience is, what your context is, and what your research question is as you make these decisions.
EDIT: I forgot to add this paper, which gives a good comparison across a range of different count models and may be helpful.
Poisson regression is just a GLM:
People often speak of the parametric rationale for applying Poisson regression. In fact, Poisson regression is just a GLM. That means Poisson regression is justified for any type of data (counts, ratings, exam scores, binary events, etc.) when two assumptions are met: 1) the log of the mean-outcome is a linear combination of the predictors and 2) the variance of the outcome is equal to the mean. These two conditions are respectively referred to as the mean-model and the mean-variance relationship.
The mean-model assumption can be relaxed somewhat by using a complex set of adjustments for predictors. This is nice because the link function affects the interpretation of the parameters; the subtlety of interpretation makes the difference between answering a scientific question and completely eluding the consumers of your statistical analysis. In another SE post I discuss the usefulness of log-transforms for interpretation.
It turns out, however, that the second assumption (the mean-variance relationship) has strong implications for inference. When the mean-variance relationship does not hold, the parameter estimates remain unbiased, but the standard errors, confidence intervals, p-values, and predictions are all miscalibrated. That means you cannot control the Type I error rate and you may have suboptimal power.
What if the mean-variance could be relaxed so that the variance is simply proportional to the mean? Negative binomial regression and Quasipoisson regression do this.
Quasipoisson models
Quasipoisson models are not likelihood based. They maximize a "quasilikelihood" which is a Poisson likelihood up to a proportional constant. That proportional constant happens to be the dispersion. The dispersion is considered a nuisance parameter: while the maximization routine comes up with an estimate of it, that estimate is merely an artifact of the data rather than a value that generalizes to the population. The dispersion only serves to "shrink" or "widen" the SEs of the regression parameters according to whether the variance is proportionally smaller or larger than the mean.

Because the dispersion is treated as a nuisance parameter, quasipoisson models enjoy a host of robust properties: the data can in fact be heteroscedastic (not meeting the proportional mean-variance assumption) and even exhibit small sources of dependence, the mean model need not be exactly correct, and yet the 95% CIs for the regression parameters are asymptotically correct. If the goal of your data analysis is to measure the association between a set of predictors and the outcome, quasipoisson models are usually the way to go. Their limitations are that they cannot yield prediction intervals, that Pearson residuals cannot tell you much about the accuracy of the mean model, and that information criteria like the AIC or BIC cannot effectively compare these models to other types of models.
Negative binomial models
It's most useful to understand negative binomial regression as a 2-parameter Poisson regression. The mean model is the same as in Poisson and quasipoisson models: the log of the mean outcome is a linear combination of predictors. The "scale" parameter models a mean-variance relationship in which the variance is merely proportional to the mean, as before. Unlike quasipoisson models, however, this type of model is an exact likelihood-based procedure: the dispersion is an actual parameter with some extent of generalizability to the population. This introduces a few advantages over quasipoisson but, in my opinion, imposes more (untestable) assumptions. Unlike quasipoisson models, the data must be independent, the mean model must be correct, and the scale parameter must be homoscedastic across the range of fitted values to obtain correct inference. These assumptions can be assessed somewhat by inspecting Pearson residuals, and the model produces viable predictions and prediction intervals and is amenable to comparison with information criteria.
Negative binomial probability models arise from a Poisson-Gamma mixture. That is, there is an unknown fluctuating Gamma random variable "feeding into" the Poisson rate parameter. Since NB GLM fitting is likelihood based, it is usually helpful to state prior beliefs about the data generating mechanism and connect them to the probabilistic rationale for the model at hand. For instance, if I am modeling the number of racers retiring from a 24-hour endurance race, I might consider that the environmental conditions are all stressors I did not measure and thus contribute to the risk of DNF, such as moisture or cold temperature affecting tire traction and thus the risk of a spin-out and wreck.
Models for dependent data: GLMMs vs GEE
Generalized linear mixed models (GLMMs) for Poisson data are not directly comparable with the approaches above. GLMMs answer a different question and are used with different data structures. Here, sources of dependence between observations are modeled explicitly: GLMMs use random intercepts and random slopes to account for individual-level heterogeneity. This changes what we estimate, because the random effects modify the mean that is modeled as well as the variance, rather than just the variance as discussed above.
There are two possible levels of association which can be measured in dependent data: population level (marginal) and individual level (conditional). GLMMs claim to measure individual level (conditional) associations: that is, given the whole host of individual level contributors to the outcome, what is the relative effect of a combination of predictors. As an example, exam prep courses may be of little benefit to children who attend exemplary schools, whereas inner-city children may benefit tremendously. The individual level effect is then substantially higher in this circumstance, since advantaged children are already far above the curve in terms of positive exposures.
If we naively applied quasipoisson or negative binomial models to dependent data, the NB models would be wrong and the quasipoisson models would be inefficient. The GEE, however, extends the quasipoisson model to explicitly model dependence structures like the GLMM does, but the GEE measures a marginal (population level) trend and obtains the correct weights, standard errors, and inference.
Data analysis example:
This post is already too long :) There is a nice illustration of the first two models in this tutorial, along with references to more reading if you are interested. The data in question involve the nesting habits of horseshoe crabs: females sit in nests and males (satellites) attach to them. The investigators wanted to measure the number of males attached to a female as a function of the female's characteristics. I hope I've underscored why mixed models are noncomparable: if you have dependent data, you must use the correct model for the question those dependent data are trying to answer, either a GLMM or a GEE.
Best Answer
There is a maximum possible number of counted answers, related to the number of questions asked. Although one can model this as a counting-type Poisson process, a Poisson process has no theoretical upper limit on the number of counted answers; that is, its support is $[0,\infty)$. Another distribution, i.e., a discrete one with finite support, e.g., the beta binomial, might be more appropriate, as it has a more mutable shape. However, that is just a guess, and, in practice, I would search for an answer to a more general question using brute force...
Rather than check for overdispersion, which has no guarantee of leading to a useful answer (although one can examine indices of dispersion to quantify it), I would suggest searching for a best distribution using a discrete-distribution option of a fit-quality search program, e.g., Mathematica's FindDistribution routine. That type of search does a fairly exhaustive job of guessing which known distribution(s) work best, not only to mitigate overdispersion but also to model many other data characteristics more usefully, e.g., goodness of fit as measured a dozen different ways.
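If Mathematica is not at hand, the same idea can be approximated by fitting a few candidate discrete distributions and comparing AIC. A sketch in Python with scipy (simulated overdispersed counts; the method-of-moments negative binomial fit is a simple stand-in for a full maximum-likelihood search):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.negative_binomial(3, 0.4, size=400)   # hypothetical overdispersed counts

# Poisson: the MLE of the rate is the sample mean (1 parameter)
lam = data.mean()
aic_pois = 2 * 1 - 2 * stats.poisson.logpmf(data, lam).sum()

# negative binomial: method-of-moments estimates (2 parameters); requires var > mean
m, v = data.mean(), data.var(ddof=1)
r = m**2 / (v - m)
p = r / (r + m)
aic_nb = 2 * 2 - 2 * stats.nbinom.logpmf(data, r, p).sum()

print(f"AIC  Poisson: {aic_pois:.1f}  negative binomial: {aic_nb:.1f}")
```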
To further examine my candidate distributions, I would post hoc examine residuals to check for homoscedasticity and/or distribution type, and also consider whether the candidate distributions can be reconciled with a physical explanation of the data. The danger of this procedure is identifying a distribution that is inconsistent with the best modelling of an expanded data set. The danger of not doing a post hoc procedure is to a priori assign an arbitrarily chosen distribution without proper testing (garbage in, garbage out). The superiority of the post hoc approach is that it limits the errors of fitting; that is also its weakness, i.e., it may understate the modelling errors through pure chance, since many distribution fits are attempted. That, then, is the reason for examining residuals and considering physicality. The top-down or a priori approach offers no such post hoc check on reasonableness; the only way to compare the physicality of models with different distributions is to compare them post hoc. This is the nature of physical theory: we test a hypothetical explanation of the data with many experiments before we accept it as having exhausted the alternative explanations.