Let's consider one fundamental concept of (linear) correlation, covariance (which is Pearson's correlation coefficient "un-standardized"). For two discrete random variables $X$ and $Y$ with probability mass functions $p(x)$, $p(y)$ and joint pmf $p(x,y)$ we have
$$\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = \sum_{x,y}p(x,y)xy - \left(\sum_xp(x)x\right)\cdot \left(\sum_yp(y)y\right)$$
$$\Rightarrow \operatorname{Cov}(X,Y) = \sum_{x,y}\left[p(x,y)-p(x)p(y)\right]xy$$
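The identity above is easy to check numerically. Below is a minimal sketch with a made-up joint pmf (the table values are hypothetical, chosen only so that $X$ and $Y$ are dependent); it evaluates $\sum_{x,y}[p(x,y)-p(x)p(y)]xy$ and compares it with $E(XY)-E(X)E(Y)$:

```python
import numpy as np

# Hypothetical joint pmf over X in {0,1,2} (rows) and Y in {0,1} (columns)
p_xy = np.array([[0.20, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.40]])
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])

p_x = p_xy.sum(axis=1)          # marginal pmf of X
p_y = p_xy.sum(axis=0)          # marginal pmf of Y

# Cov(X,Y) = sum_{x,y} [p(x,y) - p(x)p(y)] * x * y
outer_xy = np.outer(x_vals, y_vals)
cov_pointwise = ((p_xy - np.outer(p_x, p_y)) * outer_xy).sum()

# Sanity check against E(XY) - E(X)E(Y)
e_xy = (p_xy * outer_xy).sum()
e_x = (p_x * x_vals).sum()
e_y = (p_y * y_vals).sum()
cov_moments = e_xy - e_x * e_y

print(cov_pointwise, cov_moments)  # the two expressions agree
```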
The Mutual Information between the two is defined as
$$I(X,Y) = E\left (\ln \frac{p(x,y)}{p(x)p(y)}\right)=\sum_{x,y}p(x,y)\left[\ln p(x,y)-\ln \left(p(x)p(y)\right)\right]$$
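This sum is equally simple to compute for a discrete pmf. Here is a small sketch using the same style of hypothetical joint pmf as before (the numbers are illustrative, not from any data); cells with $p(x,y)=0$ are skipped, following the convention $0\ln 0 = 0$:

```python
import numpy as np

# Hypothetical joint pmf for two dependent discrete variables
p_xy = np.array([[0.20, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.40]])
p_x = p_xy.sum(axis=1)          # marginal pmf of X
p_y = p_xy.sum(axis=0)          # marginal pmf of Y

# I(X,Y) = sum_{x,y} p(x,y) * [ln p(x,y) - ln p(x)p(y)], in nats
prod = np.outer(p_x, p_y)
mask = p_xy > 0                 # 0 * ln 0 is taken as 0
mi = (p_xy[mask] * (np.log(p_xy[mask]) - np.log(prod[mask]))).sum()
print(mi)  # non-negative; equals 0 iff X and Y are independent
```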
Compare the two: each contains a point-wise "measure" of the distance of the two rv's from independence, expressed as the distance of the joint pmf from the product of the marginal pmfs: $\operatorname{Cov}(X,Y)$ has it as a difference of levels, while $I(X,Y)$ has it as a difference of logarithms.
And how do these measures enter? In $\operatorname{Cov}(X,Y)$ the level-measure weights the product $xy$ of the two random variables. In $I(X,Y)$ the log-measure is itself averaged, weighted by the joint probabilities $p(x,y)$.
So with $\operatorname{Cov}(X,Y)$ we look at what non-independence does to their product, while in $I(X,Y)$ we look at what non-independence does to their joint probability distribution.
Put the other way around: $I(X,Y)$ is the average value of the logarithmic measure of distance from independence, while $\operatorname{Cov}(X,Y)$ is the weighted value of the level-measure of distance from independence, weighted by the product of the two rv's.
So the two are not antagonistic—they are complementary, describing different aspects of the association between two random variables. One could comment that Mutual Information "is not concerned" with whether the association is linear or not, whereas Covariance may be zero while the variables are still stochastically dependent. On the other hand, Covariance can be calculated directly from a data sample without the need to actually know the probability distributions involved (since it is an expression involving moments of the distribution), while Mutual Information requires knowledge of the distributions, whose estimation, if they are unknown, is a much more delicate and uncertain task than the estimation of Covariance.
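The classic illustration of the "zero covariance but still dependent" point is $Y=X^2$ with $X$ uniform on $\{-1,0,1\}$: the covariance vanishes by symmetry, yet $Y$ is a deterministic function of $X$, and mutual information sees this. A minimal numeric sketch:

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X^2; joint pmf laid out as a table
x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])
p_xy = np.array([[0.0, 1/3],   # X = -1  ->  Y = 1
                 [1/3, 0.0],   # X =  0  ->  Y = 0
                 [0.0, 1/3]])  # X =  1  ->  Y = 1

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Covariance via E(XY) - E(X)E(Y): zero by the symmetry of X
outer_xy = np.outer(x_vals, y_vals)
cov = (p_xy * outer_xy).sum() - (p_x * x_vals).sum() * (p_y * y_vals).sum()

# Mutual information (in nats): strictly positive, Y depends on X
prod = np.outer(p_x, p_y)
mask = p_xy > 0
mi = (p_xy[mask] * np.log(p_xy[mask] / prod[mask])).sum()

print(cov, mi)  # cov is 0, mi is positive
```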
Best Answer
I think you should distinguish between categorical (discrete) data and continuous data.
For continuous data, Pearson correlation measures a linear relationship, rank correlation a monotonic one.
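The distinction is easy to see on monotonic but non-linear data. The sketch below (simulated data, NumPy only; Spearman's rank correlation is computed as the Pearson correlation of the ranks) uses $y=e^x$: rank correlation is exactly 1, while Pearson correlation falls short of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=1000)
y = np.exp(x)                     # monotonic but strongly non-linear in x

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # rank correlation = Pearson correlation of the ranks (no ties here)
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

p, s = pearson(x, y), spearman(x, y)
print(p, s)  # Pearson clearly below 1, Spearman exactly 1
```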
MI, on the other hand, "detects" any relationship. That is normally not what you are interested in, and/or what it picks up is likely to be noise. In particular, you have to estimate the density of the distribution. Since the data are continuous, you would first create a histogram (discrete bins) and then calculate MI. But because MI allows for any relationship, the estimate will change as you use smaller bins (i.e. as you allow more wiggles). So you can see that the estimation of MI will be very unstable, not allowing you to put any confidence intervals on the estimate, etc. (The same goes if you do a continuous density estimate.) Basically, there are too many things to estimate before actually calculating the MI.
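The bin-sensitivity is easy to demonstrate. The sketch below (simulated bivariate data; a naive plug-in MI estimator built on `numpy.histogram2d`) re-estimates MI for the same sample at several bin counts, and the estimate keeps growing as the bins shrink:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)        # genuinely dependent continuous data

def plugin_mi(x, y, bins):
    """Naive plug-in MI estimate (in nats) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return (p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum()

mis = {b: plugin_mi(x, y, b) for b in (5, 20, 80)}
print(mis)
# Finer bins admit more "wiggles" and more upward small-sample bias,
# so the estimate keeps increasing with the bin count.
```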
Categorical data, on the other hand, fits quite nicely into the MI framework (see the G-test), and there is not much to choose between the G-test and chi-squared.