Logistic regression with box constraints has served me well in the past for problems similar to yours. You are interested in prediction, not in inference, so, as long as a suitable estimate of generalization error (for example, the 5-fold cross-validation error) is low enough, you should be fine.
Let's consider a logit link function and a linear model:
$$\beta_0+\beta_1 x_1 +\beta_2 x_2 =\log{\frac{\mu}{1-\mu}}$$
where $\mu=\mathbb{E}[y\mid x_1,x_2]$.
Solving for $\mu$ gives $\mu=\frac{1}{1+\exp{(-\beta_0-\beta_1 x_1 -\beta_2 x_2)}}$, and therefore
$$\frac{\partial \mu}{\partial x_1}=\beta_1 \frac{\exp{(-\beta_0-\beta_1 x_1 -\beta_2 x_2)}}{(1+\exp{(-\beta_0-\beta_1 x_1 -\beta_2 x_2)})^2}>0 \iff \beta_1>0 $$
Thus constraints 1 and 2 are satisfied if you just use logistic regression with the constraint $\beta_1>0$. In general, monotonicity constraints with respect to one or more variables are relatively easy to enforce with GLMs (Generalized Linear Models) such as logistic regression, because the monotonicity of the link function, together with the fact that its argument is a linear function of the predictors, implies that $\mu$ is always monotonic with respect to the continuous predictors.
An R package which supports logistic regression with box constraints (constraints of the type $a_i\leq\beta_i\leq b_i$) is glmnet. Its usage is a bit different from other regression functions in R, so have a look at ?glmnet.
Constraint 3 wouldn't need specific attention in most cases, because most R regression functions will automatically encode categorical variables using dummy variables. Unfortunately, glmnet is one of the few functions which doesn't do that. You need to use model.matrix to solve this: if my_data holds your observations $X=\{x_{1i},x_{2i}\}_{i=1}^N$, then

M <- model.matrix(~ x1 + x2, my_data)

will build a design matrix suitable for use with glmnet.
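For illustration, a minimal sketch of such a fit might look as follows, using glmnet's lower.limits argument for the box constraint; my_data, x1, x2 and the binary response y are placeholders for your own data, and the details are an assumption rather than code from your setup:

# Cross-validated logistic regression with the box constraint beta_1 >= 0.
library(glmnet)

M <- model.matrix(~ x1 + x2, my_data)[, -1]  # drop the intercept column; glmnet adds its own
lower <- rep(-Inf, ncol(M))
lower[colnames(M) == "x1"] <- 0              # constraints 1 and 2: beta_1 >= 0
cv_fit <- cv.glmnet(M, my_data$y, family = "binomial",
                    nfolds = 5, lower.limits = lower)
coef(cv_fit, s = "lambda.min")               # coefficients at the lambda with lowest CV error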
The only limitation of this approach is that we have modeled the logit of $\mu$ as a linear function of the predictors. This may not be flexible enough for your problem: in other words, you could get a large cross-validation error. If so, you should look into nonparametric logistic regression - here, however, you need to fit GAMs (Generalized Additive Models), not GLMs, and imposing monotonicity becomes more complicated. The package mgcv and the function mono.con are your friends here - you'll need to read quite a lot of documentation. Gavin Simpson's answer to the question How to smooth data and force monotonicity, which you linked in your question, has a good example.
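As a starting point, a minimal sketch of an unconstrained binomial GAM in mgcv is below (my_data, x1, x2 and y are placeholders); enforcing monotonicity of the smooth of x1 additionally requires mono.con() and pcls(), as shown in Gavin Simpson's answer:

library(mgcv)

# Unconstrained binomial GAM: a smooth effect of x1, a parametric effect of x2.
gam_fit <- gam(y ~ s(x1) + x2, family = binomial, data = my_data)
summary(gam_fit)
plot(gam_fit, pages = 1)   # inspect the fitted smooth of x1 for (non-)monotonicity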
Finally, I reiterate that this approach (as well as all other approaches which rely on logistic regression, whether Bayesian or frequentist) only makes sense because you need a quick tool to approximate, in an automated way, multiple unknown functions inside your reinforcement learning workflow. $y|\mathbf{x}$ doesn't really have a binomial distribution, so you cannot expect to get realistic estimates of standard errors, confidence intervals, etc. If you need a real statistical model, which would give you not only point estimates but also realistic prediction intervals, then you need to take into account the real conditional distribution of your output. This question might help:
Judging the quality of a statistical model for a percentage
The question is not clear to me, but if you want to use a formula instead of matrix input, you can use the caret package, which provides a universal interface to many packages including glmnet. Or you can use model.matrix to convert the formula to matrix form.
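A minimal sketch of the caret route (illustrative only; my_data, x1, x2 and the factor response y are placeholders):

library(caret)

# caret's formula interface handles dummy coding internally and dispatches to glmnet.
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(y ~ x1 + x2, data = my_data,
              method = "glmnet", trControl = ctrl)
fit$bestTune                     # alpha and lambda chosen by cross-validation
predict(fit, newdata = my_data)  # class predictions on the original data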
If you want to fit a logistic regression with $\lambda=0$, you can directly use the traditional glm function. If you want to implement it manually, my answer here has code to calculate the objective function and gradient of the logistic loss and uses BFGS to optimize.
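A minimal sketch of that idea (my own illustration, not the code from the linked answer): minimize the negative log-likelihood with BFGS via optim(), where X is a numeric design matrix with an intercept column and y is a 0/1 response vector.

# Negative log-likelihood of the logistic model.
neg_loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)
  sum(log1p(exp(eta)) - y * eta)
}

# Its gradient: X' (p - y), with p the fitted probabilities.
neg_loglik_grad <- function(beta, X, y) {
  p <- 1 / (1 + exp(-as.vector(X %*% beta)))
  as.vector(t(X) %*% (p - y))
}

fit <- optim(par = rep(0, ncol(X)), fn = neg_loglik, gr = neg_loglik_grad,
             X = X, y = y, method = "BFGS")
fit$par   # should closely match coef(glm(y ~ X - 1, family = binomial))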
For the revised question: projected gradient descent may work well, since projection onto box constraints is easy to compute. My answer here gives some details on projected gradient descent (I believe projected gradient descent is used by glmnet):
Solving constrained optimization problem: projected gradient vs. dual?
An additional reference for projected gradient descent is Statistical Learning with Sparsity, a book by the glmnet authors, page 120.
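A rough sketch of projected gradient descent for logistic regression with box constraints lower <= beta <= upper (illustrative only; this is not the solver glmnet actually uses). X is a design matrix and y a 0/1 response vector.

projected_gd <- function(X, y, lower, upper, step = 1e-4, iters = 10000) {
  beta <- pmin(pmax(rep(0, ncol(X)), lower), upper)     # feasible starting point
  for (i in seq_len(iters)) {
    p    <- 1 / (1 + exp(-as.vector(X %*% beta)))       # predicted probabilities
    g    <- as.vector(t(X) %*% (p - y))                 # gradient of the logistic loss
    beta <- pmin(pmax(beta - step * g, lower), upper)   # gradient step, then project onto the box
  }
  beta
}

# Example call, assuming three columns (intercept, x1, x2) and beta_1 >= 0:
# b_hat <- projected_gd(X, y, lower = c(-Inf, 0, -Inf), upper = rep(Inf, 3))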
Best Answer
Regardless of what your computing platform may be, you can trick any linearly constrained linear model solver into doing what you want without having to modify any code at all. Observe that the model
$$\mathbb{E}(Y) = X \beta$$
is equivalent to the model
$$\mathbb{E}(Y + X\gamma) = X(\beta + \gamma)$$
for any fixed coefficients $\gamma$, because the constant vector $X\gamma$ has been added to both sides. Moreover, fitting the second model will produce the same covariance matrix of estimates, etc., as the first, because no change has been made in the errors whatsoever. Consequently, after estimating $\beta+\gamma$ as $\widehat{\beta+\gamma}$ using any linear procedure (not just nnls), the estimate of $\beta$ is obtained as
$$\hat\beta = \widehat{\beta+\gamma} -\gamma.$$
Exploit this flexibility to make sure that some of the coefficients of $\widehat{\beta+\gamma}$ will naturally be positive (and therefore not constrained by nnls). This can be done by estimating what those coefficients might be, performing the adjusted procedure, and checking that it has not constrained the corresponding estimates. If any have been constrained, increase the amount of adjustment and repeat until success is achieved.
I propose using an initial ordinary least squares fit to start the procedure. To be conservative, change the estimates by some small multiple $\rho$ of their standard errors. If iteration is needed, keep increasing $\rho$; the following code doubles $\rho$ at each iteration. The entire algorithm is contained within the short repeat block in the middle of the code example below.
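A minimal sketch of that adjust-and-refit loop (assuming a design matrix X with an intercept column, a response y, a logical vector free marking the coefficients that should not be constrained, and the nnls package; the names and details are illustrative rather than the answer's original code):

library(nnls)

partial_nnls <- function(X, y, free, rho = 2) {
  ols <- lm(y ~ X - 1)                       # initial OLS fit for starting estimates
  b   <- coef(ols)
  s   <- coef(summary(ols))[, "Std. Error"]
  repeat {
    # Shift the unconstrained coefficients far enough that their shifted
    # estimates should come out positive; constrained ones are left alone.
    gamma   <- ifelse(free, pmax(0, rho * s - b), 0)
    shifted <- nnls(X, as.vector(y + X %*% gamma))$x   # estimates beta + gamma >= 0
    if (all(shifted[free] > 0)) break        # no free coefficient pinned at zero: done
    rho <- 2 * rho                           # otherwise shift more and refit
  }
  shifted - gamma                            # recover the estimate of beta
}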
As an example, I generated a problem with $200$ observations of seven variables (and included a constant term). The following tableau summarizes the results:
The true values of $\beta$ are listed first. The first half are negative, the second half positive. These are followed by their OLS estimates, their NNLS estimates, and the modified NNLS estimates, wherein the first two coefficients (AIntercept and AX1) were not constrained to be positive. The last two lines summarize the constraints, printing "1.0" where positivity constraints were applied. The constraints actually imposed in the solution, using the same indicator method, appear last. In this case the procedure worked well.