Imagine (so that the models are comparable) that we fit the same model for the mean using least squares and Poisson regression (or indeed any other model with the same specification for how the mean relates to the predictors).
To simplify things further, suppose the model for the conditional mean is linear in the predictors ($E(Y|x) = x\beta$), so that we're comparing ordinary linear regression with Poisson regression (now with the identity link, because of the previous assumption).
[Further (to simplify ideas), assume the model for the mean is correct, so our estimates are unbiased. This is not necessary, though.]
Then your question amounts to asking "if I use an estimator that minimizes the mean square residual, will that smallest-possible mean square residual be smaller than an estimator that does anything else?"
The answer is hopefully now obvious: when you measure fit by the very criterion that one estimator minimizes, it must win when compared with anything else.
So the question then becomes why would you use RMSE to compare the two models?
Edit: Note that counts whose mean is near 0 will have smaller variance than counts whose mean is larger (indeed, if they're Poisson, the variance will be equal to the mean). So a large value is less "precise" than a value near zero (on average it will be further from the mean of the distribution that generated it). So why would you weight the squared deviation for all points equally?
If you want the most precise estimate (in the MSE sense, say) of the Poisson mean, you want to minimize the MSE of the estimator of that mean (i.e. minimize $E[(\hat{\mu}-\mu)^2]$ ... or equivalently, minimize its square root). That's not the same as minimizing the MSE in the data.
If your estimator is unbiased, then asymptotically (as $n\to\infty$), in most situations the MLE will have the lowest variance and so will also be asymptotically minimum MSE. In small samples (for easy cases) you can often compute the MSE and compare directly.
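To make this concrete, here is a minimal simulation sketch (hypothetical data; it assumes a recent statsmodels where the identity-link class is spelled links.Identity, which varies across versions). It contrasts the two criteria: the in-sample RMSE of the data, which OLS can never lose, and the error in estimating the true mean, where the Poisson MLE typically does better.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = sm.add_constant(rng.uniform(1, 5, size=(200, 1)))
mu = x @ np.array([2.0, 1.5])      # true positive mean, identity link
rmse_data = {"ols": [], "poisson": []}
mse_mu = {"ols": [], "poisson": []}
for _ in range(200):               # repeat to average over sampling noise
    y = rng.poisson(mu)
    ols = sm.OLS(y, x).fit()
    pois = sm.GLM(y, x, family=sm.families.Poisson(link=sm.families.links.Identity())).fit()
    for name, fitted in (("ols", ols.fittedvalues), ("poisson", pois.fittedvalues)):
        rmse_data[name].append(np.sqrt(np.mean((y - fitted) ** 2)))  # fit to the data
        mse_mu[name].append(np.mean((fitted - mu) ** 2))             # error in the mean
print({k: np.mean(v) for k, v in rmse_data.items()})  # OLS never loses this comparison
print({k: np.mean(v) for k, v in mse_mu.items()})     # Poisson is typically smaller here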
Some general answers:
Using the Poisson distribution to estimate the mean parameter requires relatively weak assumptions, essentially only that our model for the mean (expected value) of the response variable y, given the explanatory variables x, is correctly specified. With the default log link this is
$$E(y \mid x) = \exp(x b)$$
We can use this to predict the mean for a new observation given a new set of explanatory variables. We can also get standard errors and confidence intervals for the prediction of the mean.
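For instance, a small sketch of that (hypothetical y, x and x_new arrays; get_prediction is available for GLM results in recent statsmodels versions):

import statsmodels.api as sm

glm_res = sm.GLM(y, x, family=sm.families.Poisson()).fit()
pred = glm_res.get_prediction(x_new)       # new explanatory variables
print(pred.summary_frame(alpha=0.05))      # predicted mean, its SE, and a 95% CI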
However, when we want the entire predictive distribution, we need the assumption that the distribution of the response is correctly specified. statsmodels does not have much built-in support for goodness-of-fit tests for distributions outside the normal case, especially not for discrete distributions like the Bernoulli, Binomial or Poisson.
However, it is easy to get the predictive distribution using the scipy.stats distributions.
For example, a sequence to get the results for the Poisson case could be:
from scipy import stats
import statsmodels.api as sm
results = sm.Poisson(y, x).fit()
mean_predicted = results.predict() # or use a new x
prob_2more = stats.poisson.sf(2 - 1, mean_predicted).mean() # average over x in sample
freq_2more = (y > 2 - 1).mean() # sample average
or, similarly, the probability and frequency of observing exactly y = 2 given the x (or predicted mean) for each observation:
prob_2_obs = stats.poisson.pmf(2, mean_predicted)
Note: 2 - 1 is used in sf because sf(k) = Prob(y > k) is a strict inequality, while we want Prob(y >= 2).
The pmf could be used to create an analog of a classification table comparing predicted versus empirical counts.
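A rough sketch of that idea (hypothetical; it reuses y and mean_predicted from above and averages the per-observation pmf over the sample):

import numpy as np
from scipy import stats

counts = np.arange(11)
pred_freq = stats.poisson.pmf(counts[:, None], mean_predicted).mean(axis=1)
obs_freq = np.array([(y == k).mean() for k in counts])
print(np.column_stack([counts, pred_freq, obs_freq]))  # predicted vs empirical frequencies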
related code:
The first link is essentially doing what I explained for sm.Poisson:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L2691
The second is a Vuong test for comparing models, which is not yet available in statsmodels:
https://gist.github.com/jseabold/6617976#file-zip_model-py-L5
Caveat: The above predictions ignore parameter uncertainty, i.e. they don't take into account that our parameters for the mean of the Poisson distribution are estimated. We can use simulation to get the distribution of the predicted Poisson probabilities, but I didn't manage to come up with an API for it: it has to be simulated for each x in the prediction and produces a large collection of numbers, i.e. a distribution over distributions for each explanatory set x.
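One way such a simulation could look (a sketch only, for a single hypothetical row x_new with the default log link): draw parameters from the estimated asymptotic normal distribution and propagate each draw to the predicted probability.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(results.params, results.cov_params(), size=1000)
mu_draws = np.exp(x_new @ draws.T)                 # mean under each parameter draw
prob_draws = stats.poisson.sf(2 - 1, mu_draws)     # P(y >= 2) under each draw
print(np.percentile(prob_draws, [2.5, 50, 97.5]))  # spread due to parameter uncertainty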
technical aside:
statsmodels has three versions of Poisson that all produce the same results but have different extras: sm.Poisson (from discrete_model), GLM with the Poisson family, and GEE with the Poisson family and independence or singleton clusters, as in your case. I used Poisson above because it is easier to type, i.e. there is no family or extras to include.
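For reference, the three calls would look roughly like this (a sketch; the GEE version uses singleton groups so the independence structure is trivial):

import numpy as np
import statsmodels.api as sm

res1 = sm.Poisson(y, x).fit()
res2 = sm.GLM(y, x, family=sm.families.Poisson()).fit()
res3 = sm.GEE(y, x, groups=np.arange(len(y)),
              family=sm.families.Poisson(),
              cov_struct=sm.cov_struct.Independence()).fit()
# res1.params, res2.params and res3.params should agree up to numerical precision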
Best Answer
Your suggestion about using a null model is similar to $R^2$. $R^2$ is defined as $1-MSE/V$, where $MSE$ is the model's mean squared error and $V$ is the variance of the observed output. You can think of the variance as the mean squared error of a null model that always gives the mean as its predicted output.

Even here, the question is: how much better can you do? This is very hard to answer. The reason is that it's hard to know whether the error reflects variation in the output that's fundamentally unpredictable from the input (e.g. 'noise', but could be something else), or whether additional structure is present that the model has simply failed to capture. Sometimes looking at the residuals can give a hint.

Under some circumstances, it's possible to estimate the 'noise' level. For example, if you have many repeated trials where inputs are identical, you can measure variability of the output for equal inputs. This gives a bound on the maximum possible performance. You would typically encounter this situation in the context of controlled experiments. Or, you may be able to do something similar if you have access to a known 'correct model' (e.g. in a theoretical setting, or if you're modeling a well understood physical system). Otherwise, it's hard to know whether there's a better model out there.
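As a sketch of those two ideas (hypothetical arrays y_true, y_pred, and a list groups_of_repeats of outputs measured at identical inputs):

import numpy as np

mse_model = np.mean((y_true - y_pred) ** 2)
mse_null = np.mean((y_true - y_true.mean()) ** 2)   # null model: always predict the mean
r2 = 1 - mse_model / mse_null

# With repeated trials at identical inputs, the within-group variance bounds
# the best achievable MSE for any model that only sees the inputs.
noise_floor = np.mean([np.var(grp) for grp in groups_of_repeats])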
Looking at the training vs. test error can give you some idea of the extent to which your model is overfitting (the expected training error would be lower than the expected test error). There can be variability here when using a small number of samples and/or few repetitions. A gap between training and test error isn't a problem per se, but a large gap might signal a problem. Even so, one model that overfits might still have better generalization performance than another model that doesn't.
Instead of asking how good your model is, you can also ask how bad it is. You could use a significance testing approach to see whether your prediction is better than 'chance'. For example, you might compare the test error on real data to the test error on permuted data (where relationships between the input/output have been destroyed, and any apparent performance is due to sampling variability or overfitting).
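A sketch of that comparison (fit and score are hypothetical helpers that train a model and return its test error):

import numpy as np

rng = np.random.default_rng(0)
real_err = score(fit(x_train, y_train), x_test, y_test)
perm_errs = [score(fit(x_train, rng.permutation(y_train)), x_test, y_test)
             for _ in range(200)]
p_value = np.mean(np.array(perm_errs) <= real_err)  # fraction of 'chance' runs at least as good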