Yes, there are simple relationships between confidence interval comparisons and hypothesis tests in a wide range of practical settings. However, in addition to verifying that the CI procedures and t-test are appropriate for our data, we must check that the sample sizes are not too different and that the two sets have similar standard deviations. We also should not attempt to derive highly precise p-values from comparing two confidence intervals, but we can develop effective approximations.
In trying to reconcile the two replies already given (by @John and @Brett), it helps to be mathematically explicit. A formula for a symmetric two-sided confidence interval appropriate for the setting of this question is
$$\text{CI} = m \pm \frac{t_\alpha(n) s}{\sqrt{n}}$$
where $m$ is the sample mean of $n$ independent observations, $s$ is the sample standard deviation, $2\alpha$ is the desired test size (maximum false positive rate), and $t_\alpha(n)$ is the upper $1-\alpha$ percentile of the Student t distribution with $n-1$ degrees of freedom. (This slight deviation from conventional notation simplifies the exposition by obviating any need to fuss over the $n$ vs $n-1$ distinction, which will be inconsequential anyway.)
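As a concrete illustration of this convention, here is a short sketch in Python using `scipy` (the summary statistics are made up for illustration):

```python
import math
from scipy import stats

def ci(m, s, n, alpha):
    # Convention used here: t_alpha(n) is the upper 1 - alpha percentile
    # of the Student t distribution with n - 1 degrees of freedom.
    half_width = stats.t.ppf(1 - alpha, df=n - 1) * s / math.sqrt(n)
    return m - half_width, m + half_width

# Hypothetical summary statistics: mean 10, SD 2, n = 25,
# two-sided 95% CI (2*alpha = 0.05).
lo, hi = ci(10.0, 2.0, 25, 0.025)
print(lo, hi)
```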
Using subscripts $1$ and $2$ to distinguish two independent sets of data for comparison, with $1$ corresponding to the larger of the two means, a non-overlap of confidence intervals is expressed by the inequality (lower confidence limit 1) $\gt$ (upper confidence limit 2); viz.,
$$m_1 - \frac{t_\alpha(n_1) s_1}{\sqrt{n_1}} \gt m_2 + \frac{t_\alpha(n_2) s_2}{\sqrt{n_2}}.$$
This can be made to look like the t-statistic of the corresponding hypothesis test (to compare the two means) with simple algebraic manipulations, yielding
$$\frac{m_1-m_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \gt \frac{s_1\sqrt{n_2}t_\alpha(n_1) + s_2\sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 s_2^2 + n_2 s_1^2}}.$$
The left hand side is the statistic used in the hypothesis test; it is usually compared to a percentile of a Student t distribution with $n_1+n_2$ degrees of freedom: that is, to $t_\alpha(n_1+n_2)$. The right hand side is a biased weighted average of the original t distribution percentiles.
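To see that the algebraic rearrangement is faithful, here is a numerical check in Python with `scipy` (the summary statistics are invented for illustration): the direct non-overlap condition and the inequality above always agree.

```python
import math
from scipy import stats

# Invented summary statistics for two independent samples.
m1, s1, n1 = 10.3, 2.0, 12
m2, s2, n2 = 8.1, 2.2, 15
alpha = 0.025  # two-sided 95% confidence intervals

t1 = stats.t.ppf(1 - alpha, df=n1 - 1)
t2 = stats.t.ppf(1 - alpha, df=n2 - 1)

# Direct check of non-overlap: (lower limit 1) > (upper limit 2).
nonoverlap = (m1 - t1 * s1 / math.sqrt(n1)) > (m2 + t2 * s2 / math.sqrt(n2))

# Equivalent form: t statistic vs. the weighted average of percentiles.
lhs = (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
rhs = (s1 * math.sqrt(n2) * t1 + s2 * math.sqrt(n1) * t2) \
      / math.sqrt(n1 * s2**2 + n2 * s1**2)

print(nonoverlap, lhs > rhs)  # the two conditions agree
```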
The analysis so far justifies the reply by @Brett: there appears to be no simple relationship available. However, let's probe further. I am inspired to do so because, intuitively, a non-overlap of confidence intervals ought to say something!
First, notice that this form of the hypothesis test is valid only when we expect $s_1$ and $s_2$ to be at least approximately equal. (Otherwise we face the notorious Behrens-Fisher problem and its complexities.) Upon checking the approximate equality of the $s_i$, we could then create an approximate simplification in the form
$$\frac{m_1-m_2}{s\sqrt{1/n_1 + 1/n_2}} \gt \frac{\sqrt{n_2}t_\alpha(n_1) + \sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 + n_2}}.$$
Here, $s \approx s_1 \approx s_2$. Realistically, we should not expect this informal comparison of confidence limits to have the same size as $\alpha$. Our question then is whether there exists an $\alpha'$ such that the right hand side is (at least approximately) equal to the correct t statistic. Namely, for what $\alpha'$ is it the case that
$$t_{\alpha'}(n_1+n_2) = \frac{\sqrt{n_2}t_\alpha(n_1) + \sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 + n_2}}\text{?}$$
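This equation is easy to solve for $\alpha'$ numerically. A sketch in Python with `scipy`, using the degrees-of-freedom convention adopted above (so $t_\alpha(n)$ uses $n-1$ degrees of freedom):

```python
import math
from scipy import stats

def alpha_prime(alpha, n1, n2):
    # t_alpha(n) is the upper 1 - alpha percentile of Student t
    # with n - 1 degrees of freedom.
    t1 = stats.t.ppf(1 - alpha, df=n1 - 1)
    t2 = stats.t.ppf(1 - alpha, df=n2 - 1)
    # Right-hand side of the display (equal-s approximation).
    rhs = (math.sqrt(n2) * t1 + math.sqrt(n1) * t2) / math.sqrt(n1 + n2)
    # Solve t_{alpha'}(n1 + n2) = rhs for alpha' via the survival function.
    return stats.t.sf(rhs, df=n1 + n2 - 1)

# Two 95% CIs (2*alpha = 0.05) with n1 = n2 = 10: implied two-sided size.
print(2 * alpha_prime(0.025, 10, 10))  # roughly 0.005
```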
It turns out that for equal sample sizes, $\alpha$ and $\alpha'$ are connected (to pretty high accuracy) by a power law. For instance, here is a log-log plot of the two for the cases $n_1=n_2=2$ (lowest blue line), $n_1=n_2=5$ (middle red line), and $n_1=n_2=\infty$ (highest gold line). The middle green dashed line is an approximation described below. The straightness of these curves reveals a power law. The exponent varies with $n=n_1=n_2$, but not much.
The answer does depend on the set $\{n_1, n_2\}$, but it is natural to wonder how much it really varies with changes in the sample sizes. In particular, we could hope that for moderate to large sample sizes (maybe $n_1 \ge 10, n_2 \ge 10$ or thereabouts) the sample size makes little difference. In this case, we could develop a quantitative way to relate $\alpha'$ to $\alpha$.
This approach turns out to work provided the sample sizes are not too different from each other. In the spirit of simplicity, I will report an omnibus formula for computing the test size $\alpha'$ corresponding to the confidence interval size $\alpha$. It is
$$\alpha' \approx e \alpha^{1.91};$$
that is,
$$\alpha' \approx \exp(1 + 1.91\log(\alpha)).$$
This formula works reasonably well in these common situations:
Both sample sizes are close to each other, $n_1 \approx n_2$, and $\alpha$ is not too extreme ($\alpha \gt .001$ or so).
One sample size is within about three times the other and the smallest isn't too small (roughly, greater than $10$) and again $\alpha$ is not too extreme.
One sample size is within three times the other and $\alpha \gt .02$ or so.
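The quality of the power-law fit in these situations can be checked directly. A Python sketch, where `alpha_prime_exact` solves the earlier display numerically and `alpha_prime_approx` is the omnibus formula:

```python
import math
from scipy import stats

def alpha_prime_exact(alpha, n1, n2):
    # Numerically solve t_{alpha'}(n1 + n2) = RHS for alpha'
    # (t_alpha(n) uses n - 1 degrees of freedom throughout).
    t1 = stats.t.ppf(1 - alpha, df=n1 - 1)
    t2 = stats.t.ppf(1 - alpha, df=n2 - 1)
    rhs = (math.sqrt(n2) * t1 + math.sqrt(n1) * t2) / math.sqrt(n1 + n2)
    return stats.t.sf(rhs, df=n1 + n2 - 1)

def alpha_prime_approx(alpha):
    # The omnibus power-law approximation.
    return math.e * alpha**1.91

for alpha in (0.05, 0.025, 0.005):
    exact = alpha_prime_exact(alpha, 20, 20)
    approx = alpha_prime_approx(alpha)
    print(alpha, exact, approx, exact / approx)
```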
The relative error (correct value divided by the approximation) in the first situation is plotted here, with the lower (blue) line showing the case $n_1=n_2=2$, the middle (red) line the case $n_1=n_2=5$, and the upper (gold) line the case $n_1=n_2=\infty$. Interpolating between the latter two, we see that the approximation is excellent for a wide range of practical values of $\alpha$ when sample sizes are moderate (around 5-50) and otherwise is reasonably good.
This is more than good enough for eyeballing a bunch of confidence intervals.
To summarize, the failure of two $2\alpha$-size confidence intervals of means to overlap is significant evidence of a difference in means at a level approximately equal to $2e \alpha^{1.91}$, provided the two samples have approximately equal standard deviations and are approximately the same size.
I'll end with a tabulation of the approximation for common values of $2\alpha$. In the left hand column is the nominal size $2\alpha$ of the original confidence interval; in the right hand column is the actual size $2\alpha^\prime$ of the comparison of two such intervals:
$$\begin{array}{ll}
2\alpha & 2\alpha^\prime \\ \hline
0.1 &0.02\\
0.05 &0.005\\
0.01 &0.0002\\
0.005 &0.00006\\
\end{array}$$
For example, when a pair of two-sided 95% CIs ($2\alpha=.05$) for samples of approximately equal sizes do not overlap, we should take the means to be significantly different, $p \lt .005$. The correct p-value (for equal sample sizes $n$) actually lies between $.0037$ ($n=2$) and $.0056$ ($n=\infty$).
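The tabulated values follow directly from the approximation; a quick check in Python:

```python
import math

# Nominal CI size 2*alpha -> approximate comparison size 2*alpha'.
for two_alpha in (0.1, 0.05, 0.01, 0.005):
    alpha = two_alpha / 2
    two_alpha_prime = 2 * math.e * alpha**1.91
    print(two_alpha, two_alpha_prime)
```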
This result justifies (and I hope improves upon) the reply by @John. Thus, although the previous replies appear to be in conflict, both are (in their own ways) correct.
I think it is at times appropriate to interpret non-statistically-significant results in the spirit of "accept the null hypothesis". In fact, I have even seen statistically significant studies interpreted in that fashion: the study was so precise that its results were consistent with a narrow range of non-null but clinically insignificant effects. Here's a somewhat blistering critique of a study (or rather its press coverage) about chocolate/red wine consumption and its "salubrious" effect on diabetes. The plot of probability curves for insulin resistance distributions by high/low intake is hysterical.
Whether one can interpret findings as "confirming H_0" depends on a great number of factors: the validity of the study, the power, the uncertainty of the estimate, and the prior evidence. Reporting the confidence interval (CI) instead of the p-value is perhaps the most useful contribution you can make as a statistician. I remind researchers and fellow statisticians that statistics do not make decisions, people do; omitting p-values actually encourages a more thoughtful discussion of the findings.
The width of the CI describes a range of effects that may or may not include the null, and may or may not include clinically significant values such as life-saving potential. A narrow CI, however, confirms one type of effect: either a truly "significant" effect, or an effect that is the null or something very close to it.
Perhaps what is needed is a broader sense of what "null results" (and null effects) are. What I find disappointing in research collaborations is when investigators cannot state a priori what range of effects they are targeting: if an intervention is meant to lower blood pressure, by how many mmHg? If a drug is meant to cure cancer, by how many months should it extend survival? Someone who is passionate about research and "plugged in" to their field and science can rattle off the most amazing facts about prior research and what has been done.
In your example, I can't help but notice that the first estimate, with its p-value of 0.82, is likely very close to the null. All I can tell from that is that the CI is centered near a null value. What I do not know is whether it encompasses clinically significant effects. If the CI is very wide, the interpretation the authors give is, in my opinion, correct, but the data do not support it: that would be a minor edit. In contrast, the second p-value of 0.22 is relatively closer to its significance threshold (whatever it may be). The authors correspondingly interpret it as "not giving any evidence of difference", which is consistent with a "do not reject H_0"-type interpretation. As for the relevance of the article, I can say very little. I hope you browse the literature and find more salient discussions of study findings! As for the analyses, just report the CI and be done with it!
Best Answer
The rule for the proper formulation of a hypothesis test is that the alternative (or research) hypothesis is the statement that, if true, is expected to be strongly supported by the evidence furnished by the data.
The null hypothesis is generally the complement of the alternative hypothesis. Frequently, it is (or contains) the assumption that you are making about how the data are distributed in order to calculate the test statistic.
Here are a few examples to help you understand how these are properly chosen.
Suppose I am an epidemiologist in public health, and I'm investigating whether the incidence of smoking among a certain ethnic group is greater than the population as a whole, and therefore there is a need to target anti-smoking campaigns for this sub-population through greater community outreach and education. From previous studies that have been published in the literature, I find that the incidence among the general population is $p_0$. I can then go about collecting sample data (that's actually the hard part!) to test $$H_0 : p = p_0 \quad \mathrm{vs.} \quad H_a : p > p_0.$$ This is a one-sided binomial proportion test. $H_a$ is the statement that, if it were true, would need to be strongly supported by the data we collected. It is the statement that carries the burden of proof. This is because any conclusion we draw from the test is conditional upon assuming that the null is true: either $H_a$ is accepted, or the test is inconclusive and there is insufficient evidence from the data to suggest $H_a$ is true. The choice of $H_0$ reflects the underlying assumption that there is no difference in the smoking rates of the sub-population compared to the whole.
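A sketch of how this one-sided proportion test might be run in Python with `scipy` (the population rate and sample counts here are hypothetical):

```python
from scipy import stats

p0 = 0.20        # assumed population smoking incidence (hypothetical)
n, k = 500, 120  # hypothetical sample: 120 smokers out of 500

# Exact one-sided binomial test of H0: p = p0 vs Ha: p > p0.
result = stats.binomtest(k, n, p0, alternative="greater")
print(result.pvalue)
```

A small p-value here would support $H_a$, i.e., the sub-population's incidence exceeds $p_0$; otherwise the test is inconclusive.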
Now suppose I am a researcher investigating a new drug that I believe to be equally effective to an existing standard of treatment, but with fewer side effects and therefore a more desirable safety profile. I would like to demonstrate the equal efficacy by conducting a bioequivalence test. If $\mu_0$ is the mean existing standard treatment effect, then my hypothesis might look like this: $$H_0 : |\mu - \mu_0| \ge \Delta \quad \mathrm{vs.} \quad H_a : |\mu - \mu_0| < \Delta,$$ for some choice of margin $\Delta$ that I consider to be clinically significant. For example, a clinician might say that two treatments are sufficiently bioequivalent if there is less than a $\Delta = 10\%$ difference in treatment effect. Note again that $H_a$ is the statement that carries the burden of proof: the data we collect must strongly support it, in order for us to accept it; otherwise, it could still be true but we don't have the evidence to support the claim.
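This kind of bioequivalence test is commonly implemented as two one-sided tests (TOST). A hedged sketch in Python with `scipy`, where the data are simulated and $\mu_0$ and $\Delta$ are made-up numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0 = 50.0    # hypothetical mean effect of the standard treatment
delta = 5.0   # hypothetical equivalence margin (10% of mu0)
x = rng.normal(50.5, 4.0, size=40)  # simulated new-drug effects

# TOST: reject H0: |mu - mu0| >= delta only when BOTH one-sided
# tests reject, i.e., mu > mu0 - delta AND mu < mu0 + delta.
_, p_lower = stats.ttest_1samp(x, mu0 - delta, alternative="greater")
_, p_upper = stats.ttest_1samp(x, mu0 + delta, alternative="less")
p_tost = max(p_lower, p_upper)
print(p_tost)  # a small value supports equivalence within the margin
```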
Now suppose I am doing an analysis for a small business owner who sells three products $A$, $B$, $C$. They suspect that customers have a statistically significant preference among the three products. Then my hypothesis is $$H_0 : \mu_A = \mu_B = \mu_C \quad \mathrm{vs.} \quad H_a : \exists i \ne j \text{ such that } \mu_i \ne \mu_j.$$ Really, all that $H_a$ is saying is that some two means are not equal to each other, which would then suggest that some difference in preference exists.
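This comparison would typically be run as a one-way ANOVA. Here is a sketch with simulated preference scores (all numbers invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated preference ratings for products A, B, C (hypothetical).
a = rng.normal(7.0, 1.5, size=30)
b = rng.normal(5.5, 1.5, size=30)
c = rng.normal(7.1, 1.5, size=30)

# One-way ANOVA of H0: mu_A = mu_B = mu_C against the alternative
# that at least two means differ.
f_stat, p_value = stats.f_oneway(a, b, c)
print(f_stat, p_value)
```

A small p-value rejects $H_0$ but does not say *which* means differ; that question is usually pursued with post-hoc pairwise comparisons.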