I believe you were on the correct path, but you made a small mistake while calculating the joint entropy. There are 100 unique pairs of symbols, so the joint entropy is $\log 100$, which makes the mutual information equal to zero.
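Spelling that out (assuming, as I read your setup, that $X$ and $Y$ each take 10 equally likely values and all 100 pairs occur equally often):
$$I(X;Y) = H(X) + H(Y) - H(X,Y) = \log 10 + \log 10 - \log 100 = 0.$$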
We know that $H(X)$ quantifies the amount of information that each observation of $X$ provides, or, equivalently, the minimal number of bits we need to encode $X$ ($L_X \to H(X)$, where $L_X$ is the optimal average codeword length; Shannon's source coding theorem).
The mutual information
$$I(X;Y)=H(X) - H(X \mid Y)$$
measures the reduction in uncertainty (or the "information gained") about $X$ when $Y$ is known.
It can be written as $$I(X;Y)=D(p_{X,Y}\mid \mid p_X \,p_Y)=D(p_{X\mid Y} \,p_Y \mid \mid p_X \,p_Y)$$
where $D(\cdot \mid \mid \cdot)$ is the Kullback–Leibler divergence (or distance, or relative entropy... or information gain, though this latter term is not used much in information theory, in my experience).
So, they are the same thing. Granted, $D(\cdot \mid \mid \cdot)$ is not symmetric in its arguments, but don't let that confuse you: we are not computing $D(p_X \mid \mid p_Y)$, but $D(p_{X,Y}\mid \mid p_X \,p_Y)$, and this expression is symmetric in $X,Y$.
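If it helps to see this numerically, here is a minimal sketch (a made-up $2\times 2$ joint table, values in nats) checking that the KL form agrees with $H(X)-H(X\mid Y)$:

```python
import numpy as np

# Made-up joint distribution p(x, y) on a 2x2 alphabet: rows = x, columns = y.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

# I(X;Y) as D(p_{X,Y} || p_X p_Y)
mi_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# I(X;Y) as H(X) - H(X|Y)
h_x = -np.sum(p_x * np.log(p_x))
h_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y[np.newaxis, :]))
print(mi_kl, h_x - h_x_given_y)   # both ≈ 0.086 nats
```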
A slightly different situation (to connect with this) arises when one is interested in the effect of knowing a particular value of $Y=y$. In this case,
because we are not averaging over $y$, the number of bits gained [*] would be $D(p_{X\mid Y=y} \mid \mid p_X )$... which depends on $y$.
[*] To be precise, that's actually the number of bits we waste when coding the conditioned source $X\mid Y=y$ as if we didn't know $Y$ (i.e., using the unconditional distribution of $X$).
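In fact, averaging these per-value divergences with respect to $p_Y$ recovers the mutual information, which ties the two views together:
$$I(X;Y)=\sum_y p_Y(y)\, D(p_{X\mid Y=y} \mid \mid p_X).$$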
Best Answer
Estimating mutual information fast and accurately is non-trivial. I recommend using a library; otherwise you will have to look at the literature (e.g. see Kraskov et al., 2004, Estimating Mutual Information).
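For example, scikit-learn ships a nearest-neighbour estimator in the spirit of Kraskov et al.; a minimal sketch with placeholder data (the sample size and noise level are just for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 0.5 * rng.normal(size=1000)   # y depends on x, so I(X;Y) > 0

# Expects a 2-D feature matrix; returns one MI estimate (in nats) per column.
mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)
print(mi[0])
```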
However, you can get an estimate naively as follows. You talk about two lists of values, $X$ and $Y$; hence, you can estimate the discrete rather than continuous mutual information: $$ I(X,Y) = \sum_{x\in X}\sum_{y\in Y} P(x,y) \log\left( \frac{P(x,y)}{P(x)\,P(y)} \right) $$ where $x\in X$ means $x$ runs over the range of $X$. The formula just requires that you know $P(x)$, $P(y)$, and $P(x,y)$, which you can estimate from your data (e.g. kernel density estimation, or fitting a Gaussian mixture model).
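As a concrete illustration of this naive plug-in route, here is a sketch that estimates all three probabilities by relative frequencies (the example lists are placeholders for your own data):

```python
import numpy as np
from collections import Counter

def discrete_mi(xs, ys):
    """Plug-in estimate of I(X,Y) in nats from two equal-length lists of
    discrete values, with probabilities estimated by relative frequencies."""
    n = len(xs)
    c_xy = Counter(zip(xs, ys))
    c_x = Counter(xs)
    c_y = Counter(ys)
    mi = 0.0
    for (x, y), c in c_xy.items():
        # P(x,y) = c/n, P(x) = c_x[x]/n, P(y) = c_y[y]/n
        mi += (c / n) * np.log(c * n / (c_x[x] * c_y[y]))
    return mi

# y is a noisy copy of x, so the estimate should be clearly positive.
xs = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 50
ys = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0] * 50
print(discrete_mi(xs, ys))
```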
Maybe slightly better (and just as easy with libraries) is to (1) determine the joint and marginal densities $P(x)$, $P(y)$, and $P(x,y)$ via density estimation (e.g. in Python, see scipy or sklearn) and (2) numerically integrate the resulting density functions (e.g. in scipy) using the continuous formula: $$ I(X,Y) = \int_{Y}\int_{X} P(x,y) \log\left( \frac{P(x,y)}{P(x)\,P(y)} \right)dx\,dy $$ so it boils down to running a density estimation algorithm followed by a numerical integration algorithm.
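A minimal sketch of that route, using scipy's Gaussian KDE for the densities and a plain Riemann sum on a grid for the integral (the grid size and bounds are assumptions you may want to tune):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder data: replace with your own samples.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 0.7 * x + 0.7 * rng.normal(size=2000)

kde_xy = gaussian_kde(np.vstack([x, y]))           # joint density estimate
kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)    # marginal density estimates

# Evaluate the densities on a grid and integrate with a Riemann sum.
gx = np.linspace(x.min() - 1, x.max() + 1, 200)
gy = np.linspace(y.min() - 1, y.max() + 1, 200)
dx, dy = gx[1] - gx[0], gy[1] - gy[0]
GX, GY = np.meshgrid(gx, gy)

p_xy = kde_xy(np.vstack([GX.ravel(), GY.ravel()])).reshape(GX.shape)
p_x = kde_x(gx)[np.newaxis, :]   # varies along the x axis of the grid
p_y = kde_y(gy)[:, np.newaxis]   # varies along the y axis of the grid

# Assumes the KDE stays strictly positive on the grid (true for Gaussian kernels).
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y))) * dx * dy
print(mi)   # estimate of I(X,Y) in nats
```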
You can also check out this question, which computes mutual information by density estimation (using histograms) first and then uses the representation of mutual information via Shannon entropy, i.e. $$ I(X,Y) = H(X) + H(Y) - H(X,Y) $$ where $H$ is the Shannon entropy, to finish the calculation instead.
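That entropy route is also easy to sketch with histograms (the bin count below is a guess you would want to tune, and the data are placeholders):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a normalized histogram, ignoring empty bins."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Placeholder data: replace with your own samples.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.7 * x + 0.7 * rng.normal(size=5000)

counts, _, _ = np.histogram2d(x, y, bins=30)
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)   # I(X,Y) = H(X) + H(Y) - H(X,Y)
print(mi)
```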
(1/11/19) Since this has become of greater importance in certain fields (AI, data science, and machine learning, for example), I'll just mention some of the literature in those areas that uses/requires mutual information estimation.
For instance, infoGAN and HFVAE use Monte Carlo estimates of bounds of the information and/or entropy (also called variational information estimates). See also MINE and Deep InfoMax. Note that these methods (and the methods that cite or are cited by them) are often more concerned with optimizing mutual information, not just estimating it, which of course has different requirements.