There is a simpler and better way to deal with this problem. A categorical variable is effectively just a set of indicator variables. It is a basic idea of measurement theory that such a variable is invariant to relabelling of the categories, so it does not make sense to use the numerical labelling of the categories in any measure of its relationship with another variable (e.g., 'correlation'). For this reason, any measure of the relationship between a continuous variable and a categorical variable should be based entirely on the indicator variables derived from the latter.
Given that you want a measure of 'correlation' between the two variables, it makes sense to look at the correlation between a continuous random variable $X$ and an indicator random variable $I$ derived from the categorical variable. Letting $\phi \equiv \mathbb{P}(I=1)$ we have:
$$\mathbb{Cov}(I,X) = \mathbb{E}(IX) - \mathbb{E}(I) \mathbb{E}(X) = \phi \left[ \mathbb{E}(X|I=1) - \mathbb{E}(X) \right] ,$$
which, together with $\mathbb{S}(I) = \sqrt{\phi (1-\phi)}$, gives:
$$\mathbb{Corr}(I,X) = \sqrt{\frac{\phi}{1-\phi}} \cdot \frac{\mathbb{E}(X|I=1) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$
So the correlation between a continuous random variable $X$ and an indicator random variable $I$ is a fairly simple function of the indicator probability $\phi$ and the standardised gain in expected value of $X$ from conditioning on $I=1$. Note that this correlation does not require any discretization of the continuous random variable.
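To make this concrete, here is a minimal numerical sketch (the simulated data and all names are illustrative) that checks the formula against a direct correlation computation:

```python
import numpy as np

# Minimal numerical check of the indicator-correlation formula
# (simulated data; all names here are illustrative).
rng = np.random.default_rng(0)
n = 1_000_000
mask = rng.random(n) < 0.3             # indicator event with phi = 0.3
i = mask.astype(float)                 # indicator variable I
x = rng.normal(0.0, 1.0, n) + 2.0 * i  # X shifts upward when I = 1

phi = i.mean()
formula = np.sqrt(phi / (1 - phi)) * (x[mask].mean() - x.mean()) / x.std()
direct = np.corrcoef(i, x)[0, 1]
print(formula, direct)                 # the two values agree up to sampling error
```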
For a general categorical variable $C$ with range $1, ..., m$ you would then extend this idea to obtain a vector of correlation values, one for each outcome of the categorical variable. For any outcome $C=k$ we can define the corresponding indicator $I_k \equiv \mathbb{I}(C=k)$ and we have:
$$\mathbb{Corr}(I_k,X) = \sqrt{\frac{\phi_k}{1-\phi_k}} \cdot \frac{\mathbb{E}(X|C=k) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$
We can then define $\mathbb{Corr}(C,X) \equiv (\mathbb{Corr}(I_1,X), ..., \mathbb{Corr}(I_m,X))$ as the vector of correlation values, one for each category of the categorical random variable. This is arguably the only meaningful sense in which one can speak of 'correlation' for a categorical random variable.
(Note: Since $\sum_k I_k = 1$, and covariance with a constant is zero, it is trivial to show that $\sum_k \mathbb{Cov}(I_k,X) = 0$, so the correlation vector for a categorical random variable is subject to this constraint. This means that, given knowledge of the probability vector for the categorical random variable and the standard deviation of $X$, you can derive the full vector from any $m-1$ of its elements.)
The above exposition is for the true correlation values, but obviously these must be estimated in a given analysis. Estimating the indicator correlations from sample data is simple, and can be done by substitution of appropriate estimates for each of the parts. (You could use fancier estimation methods if you prefer.) Given sample data $(x_1, c_1), ..., (x_n, c_n)$ we can estimate the parts of the correlation equation as:
$$\hat{\phi}_k \equiv \frac{1}{n} \sum_{i=1}^n \mathbb{I}(c_i=k).$$
$$\hat{\mathbb{E}}(X) \equiv \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i.$$
$$\hat{\mathbb{E}}(X|C=k) \equiv \bar{x}_k \equiv \frac{1}{n} \sum_{i=1}^n x_i \mathbb{I}(c_i=k) \Bigg/ \hat{\phi}_k .$$
$$\hat{\mathbb{S}}(X) \equiv s_X \equiv \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}.$$
Substitution of these estimates would yield a basic estimate of the correlation vector. If you have parametric information on $X$ then you could estimate the correlation vector directly by maximum likelihood or some other technique.
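Putting these pieces together, here is a rough sketch of the plug-in estimator (the function name and interface are illustrative, not a standard library routine):

```python
import numpy as np

def categorical_correlation(x, c):
    """Plug-in estimate of the correlation vector Corr(C, X).

    x : 1-D array of continuous observations
    c : 1-D array of category labels (any comparable values)
    Returns the distinct categories and their indicator correlations.
    """
    x = np.asarray(x, dtype=float)
    c = np.asarray(c)
    x_bar = x.mean()                         # estimate of E(X)
    s_x = x.std(ddof=1)                      # sample standard deviation s_X
    cats = np.unique(c)
    corrs = np.empty(len(cats))
    for j, k in enumerate(cats):
        mask = (c == k)
        phi_k = mask.mean()                  # estimate of P(C = k)
        x_bar_k = x[mask].mean()             # estimate of E(X | C = k)
        corrs[j] = np.sqrt(phi_k / (1 - phi_k)) * (x_bar_k - x_bar) / s_x
    return cats, corrs

# Example usage on simulated data:
rng = np.random.default_rng(1)
c = rng.choice(["a", "b", "c"], size=10_000, p=[0.2, 0.3, 0.5])
x = rng.normal(0.0, 1.0, size=10_000) + (c == "a") * 1.5
print(categorical_correlation(x, c))
```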
Best Answer
The mutual information between two continuous random variables $X$ and $Y$ is defined as the following double integral over the domains $\mathcal{X}$ and $\mathcal{Y}$: $$\operatorname{I}(X;Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p(x,y) \log{\left(\frac{p(x,y)}{p(x)\,p(y)}\right)} \; dx \, dy ,$$
The most straightforward way to estimate the mutual information is to use binning to estimate the integral, which essentially converts continuous variables into discrete variables for which the approach you outlined above can be used.
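For instance, here is a rough sketch of such a binned plug-in estimator (the function name and the bin count are illustrative choices, and the binning introduces its own bias):

```python
import numpy as np

def mutual_info_binned(x, y, bins=20):
    """Plug-in estimate of I(X; Y) in nats from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                         # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y
    nz = pxy > 0                             # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = x + rng.normal(scale=0.5, size=50_000)   # correlated pair
print(mutual_info_binned(x, y))              # MI estimate in nats
```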
Alternatively, k-nearest neighbor distances can be used to estimate the Shannon entropy terms. This is, for instance, the approach used in scikit-learn, and was proposed in A. Kraskov, H. Stögbauer and P. Grassberger, "Estimating mutual information", Phys. Rev. E 69, 2004.
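For example, scikit-learn exposes this through mutual_info_regression, where n_neighbors controls the bias/variance trade-off of the k-NN estimator (the simulated data below is just for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
y = x + rng.normal(scale=0.5, size=5_000)    # correlated pair

# Feature matrix must be 2-D; returns one MI estimate (in nats) per feature.
mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)
print(mi)
```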