As far as I know, and I've researched this issue deeply in the past, there are no predictive modeling techniques (besides tree-based methods such as Random Forest and XGBoost) that are designed to handle both types of input at the same time without simply transforming the feature types.
Note that algorithms like Random Forest and XGBoost accept mixed-feature input, but they apply their own logic to handle the different types when splitting a node.
Make sure you understand the logic "under the hood" and that you're OK with whatever is happening inside the black box.
However, distance/kernel-based models (e.g., k-NN, NN regression, support vector machines) can handle a mixed-type feature space by defining a "special" distance function that applies an appropriate distance metric to each feature (e.g., for a numeric feature we calculate the Euclidean distance between two numbers, while for a categorical feature we simply calculate the overlap distance between two string values).
So, the distance/similarity between users $u_1$ and $u_2$ in feature $f_i$ is defined as follows:

$$d(u_1,u_2)_{f_i} = \begin{cases} d_{\text{categorical}}(u_1,u_2)_{f_i} & \text{if feature } f_i \text{ is categorical} \\ d_{\text{numeric}}(u_1,u_2)_{f_i} & \text{if feature } f_i \text{ is numerical} \\ 1 & \text{if feature } f_i \text{ is not defined in } u_1 \text{ or } u_2 \end{cases}$$
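A minimal sketch of this per-feature distance in Python (the dict-based record layout and the names `mixed_distance` and `feature_types` are illustrative assumptions, not from the original answer; numeric features are assumed to be pre-scaled so their differences are comparable to the 0/1 overlap distance):

```python
import math

def mixed_distance(u1, u2, feature_types):
    """Distance over a mixed numeric/categorical record.

    u1, u2        : dicts mapping feature name -> value (missing = undefined)
    feature_types : dict mapping feature name -> "numeric" or "categorical"
    """
    total = 0.0
    for f, ftype in feature_types.items():
        v1, v2 = u1.get(f), u2.get(f)
        if v1 is None or v2 is None:
            d = 1.0                       # feature not defined in u1 or u2
        elif ftype == "numeric":
            d = abs(v1 - v2)              # Euclidean distance in one dimension
        else:
            d = 0.0 if v1 == v2 else 1.0  # overlap distance for categories
        total += d ** 2
    return math.sqrt(total)

u1 = {"age": 0.5, "city": "NY"}
u2 = {"age": 0.5, "city": "LA"}
ft = {"age": "numeric", "city": "categorical"}
print(mixed_distance(u1, u2, ft))  # ages agree, cities differ -> 1.0
```

Such a function can be plugged directly into a k-NN implementation that accepts a callable metric.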
Some known distance functions for categorical features:
Sounds like a simple case for multiple regression. The comment is correct: the predictors you mention are only categorical if you've discretized them for whatever reason. If you have access to the un-discretized data, you might consider some semi-parametric estimators.
One complication that you might face is the fact that your data are undefined above 1 or below 0, given that the response is a proportion. I know of three ways of dealing with this:
- Just run an OLS regression $y=\alpha+X'\beta + \epsilon$, where $X$ is a matrix of your three variables and any multiplicative interaction terms that you deem important (e.g. pH $\times$ SAR). Check whether any of your predicted values $\hat{y}$ come close to or fall outside 0 or 1, and whether the standard errors of the predictions do the same. If not, you can probably get away with just running OLS, even though you are violating the OLS assumption of a normally distributed error term. Bear in mind, though, that the even more important assumption of a linear relationship might not make physical sense.
- GLM with a logit link function: the coefficients on the variables then estimate the marginal change in the dependent variable on the logistic scale. The advantage here is that the predicted values cannot fall outside the physically possible range, but that in itself does not guarantee this is the best model. The details require care; see http://www.stata-journal.com/sjpdf.html?articlenum=st0147 for a concise introduction.
- Beta regression. The beta distribution is a family of bell-shaped curves, occasionally symmetric but usually skewed, defined between 0 and 1. I haven't used this myself, but it is designed for problems like the one you're describing.
Probably the best way forward is to run all three and check whether the choice of model specification changes your results. If it does, you need to pick the one that makes the most sense. If it doesn't, you're good.
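As a sketch of the diagnostic in the first option above (fit OLS with an interaction term, then check whether any fitted values approach or leave $[0,1]$), here is a plain-NumPy version on simulated data. The variable names `ph`, `sar`, `ec` and every number are made up for illustration, and `numpy.linalg.lstsq` stands in for a regression package:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the three predictors (ranges are illustrative only).
ph  = rng.uniform(5.5, 8.5, n)
sar = rng.uniform(0.0, 10.0, n)
ec  = rng.uniform(0.1, 4.0, n)

# Simulated proportion response in (0, 1); purely for demonstration.
y = 1 / (1 + np.exp(-(-3.0 + 0.4 * ph - 0.05 * sar + 0.1 * ec)))
y = np.clip(y + rng.normal(0, 0.02, n), 0.01, 0.99)

# Design matrix: intercept, main effects, and one interaction (pH x SAR).
X = np.column_stack([np.ones(n), ph, sar, ec, ph * sar])

# OLS via least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# The diagnostic: do any fitted values approach or leave [0, 1]?
print(f"min(y_hat) = {y_hat.min():.3f}, max(y_hat) = {y_hat.max():.3f}")
```

If this range hugs 0 or 1, that is the signal to switch to the logit-link GLM or beta regression instead.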
If you've got the original, non-discretized data, consider semi-parametric estimators, such as those in the mgcv package in R. This could mitigate some of the functional-form worries about running a logistic regression: if a variable causes a linear response, that response will be non-linear on the logit scale. Allowing the functional form to be arbitrary will reduce mis-specification bias.
Best Answer
I would recommend reading about logistic or log-linear models in particular, and about methods of categorical data analysis in general. The notes for the following course are pretty good for a start: Analysis of Discrete Data. The textbook by Agresti is quite good; you might also consider Kleinbaum for a quick start.