Solved – regression with constraints

approximation, constrained regression, domain-adaptation, regression, reinforcement learning

I have some domain knowledge I want to use in a regression problem.

Problem statement

The dependent variable $y$ is continuous.
The independent variables are $x_1$ and $x_2$.

  • Variable $x_1$ is continuous and positive.
  • Variable $x_2$ is categorical, ordered, and takes only a few different values (i.e., fewer than 10).

I want to fit a regression function $f$ so that $y = f(x_1,x_2)$, with the constraints

  1. $f(x_1,x_2)$ is monotonically increasing in $x_1$
  2. $f(x_1,x_2)$ is bounded in $[0,1]$
  3. $f(x_1,x_2)$ is "smooth" in $x_2$

These constraints come from domain knowledge of the problem.

Samples are evenly distributed across the values of $x_2$, but not across $x_1$.

My question: which techniques do you recommend for such a regression problem?


I'm currently ignoring the last two constraints and using the monreg package (I run one regression for each possible value of $x_2$).

I cannot give a formal definition of "smooth" in this context. I can only assume that $f(x_1,x_2)$ does not change much between consecutive values of $x_2$.

I have found some SO questions regarding this issue, but it looks like they have not attracted much attention, or are very focused on specific R packages. Q1 Q2

Problem context: this regression will be used as a function approximation of (a component of) the value function in a reinforcement learning algorithm. Because of that, the constraints have to be enforced by the regression model and cannot be controlled by hand. Moreover, the regression will be run several times with an increasing number of samples.

Best Answer

Logistic regression with box constraints has served me well in the past for problems similar to yours. You are interested in prediction, not in inference, so as long as a suitable estimate of the generalization error (for example, the 5-fold cross-validation error) is low enough, you should be fine.

Let's consider a logit link function and a linear model:

$$\beta_0+\beta_1 x_1 +\beta_2 x_2 =\log{\frac{\mu}{1-\mu}}$$

where $\mu=\mathbb{E}[y|x_1,x_2]$.

Then

$$\frac{\partial \mu}{\partial x_1}=\beta_1 \frac{\exp{(-\beta_0-\beta_1 x_1 -\beta_2 x_2)}}{(1+\exp{(-\beta_0-\beta_1 x_1 -\beta_2 x_2)})^2}>0 \iff \beta_1>0 $$

Thus constraints 1 and 2 are satisfied if you just use logistic regression with the constraint that $\beta_1>0$. In general, monotonicity constraints with respect to one or more variables are relatively easy to enforce with GLMs (Generalized Linear Models) such as logistic regression, because the link function is monotonic and the mean is expressed through a linear function of the predictors, which implies that $\mu$ is always monotonic with respect to the continuous predictors.

An R package which supports logistic regression with box constraints (constraints of the type $a_i\leq\beta_i\leq b_i$) is glmnet. Its usage is a bit different from that of other regression functions in R, so have a look at ?glmnet. Constraint 3 wouldn't need specific attention in most cases, because most R regression functions automatically encode categorical variables using dummy variables. Unfortunately, glmnet is one of the few functions which doesn't do that, so you need to build the design matrix yourself with model.matrix: if my_data holds your observations $X=\{(x_{1i},x_{2i})\}_{i=1}^N$, then

M <- model.matrix(~ x1 + x2, my_data)

will build a design matrix suitable for use with glmnet.
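
To make this concrete, here is a minimal sketch of the whole fit. It assumes my_data has columns x1, x2 and y (with y in [0,1]) and uses a tiny fixed penalty so that glmnet behaves essentially like unpenalized logistic regression; see ?glmnet for the exact meaning of lower.limits and of the two-column response form for family = "binomial".

library(glmnet)

# Design matrix; drop the intercept column because glmnet adds its own intercept.
M <- model.matrix(~ x1 + x2, my_data)[, -1]

# Box constraints: force the coefficient of x1 to be >= 0, leave the others free.
lower <- rep(-Inf, ncol(M))
lower[colnames(M) == "x1"] <- 0

# Logit link via family = "binomial"; in the two-column response the second
# column is treated as the "target" proportion, so mu = E[y | x] stays in [0, 1].
fit <- glmnet(M, cbind(1 - my_data$y, my_data$y),
              family = "binomial",
              lower.limits = lower,
              lambda = 1e-6)      # near-zero penalty: essentially unpenalized

# Predictions on the response scale are bounded in [0, 1] and, because the
# coefficient of x1 is constrained to be >= 0, monotonically increasing in x1.
pred <- predict(fit, newx = M, type = "response")

The near-zero fixed lambda is just a way to get a practically unpenalized fit out of glmnet; you could instead cross-validate lambda with cv.glmnet if a bit of shrinkage helps generalization.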


The only limitation of this approach is that we have modeled the logit as a linear function of the predictors. This may not be flexible enough for your problem: in other words, you could get a large cross-validation error. If that is the case, you should look into nonparametric logistic regression - here, however, you need to fit GAMs (Generalized Additive Models), not GLMs, and imposing monotonicity becomes more complicated. The package mgcv and the function mono.con are your friends here - you'll need to read quite a lot of documentation. Gavin Simpson's answer to the question

How to smooth data and force monotonicity

which you linked in your question, has a good example.
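
For completeness, here is a rough sketch of that workflow, adapted from the example in ?pcls and applied separately to each level of $x_2$. The data frame dat with columns x1 and y, the basis size k = 10, and the reuse of the unconstrained smoothing parameter are all illustrative assumptions.

library(mgcv)

# Unconstrained fit, used only to borrow a reasonable smoothing parameter.
unc <- gam(y ~ s(x1, k = 10, bs = "cr"), data = dat)

# Build the spline basis and the linear inequality constraints that make it
# monotonically increasing (mono.con also has lower/upper arguments that can
# additionally bound the spline, e.g. in [0, 1]).
sm  <- smoothCon(s(x1, k = 10, bs = "cr"), dat, knots = NULL)[[1]]
con <- mono.con(sm$xp, up = TRUE)

G <- list(X = sm$X, C = matrix(0, 0, 0),      # no equality constraints
          sp = unc$sp, p = sm$xp,             # smoothing parameter, feasible start
          y = dat$y, w = rep(1, nrow(dat)),
          Ain = con$A, bin = con$b,           # A %*% p >= b enforces monotonicity
          S = sm$S, off = 0)

p  <- pcls(G)                                 # penalized constrained least squares
fv <- Predict.matrix(sm, data.frame(x1 = dat$x1)) %*% p   # monotone fitted values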


Finally, I reiterate that this approach (as well as all other approaches that rely on logistic regression, whether Bayesian or frequentist) only makes sense because you need a quick tool to approximate multiple unknown functions in an automated way inside your reinforcement learning workflow. $y|\mathbf{x}$ doesn't really have a binomial distribution, so you cannot expect to get realistic estimates of standard errors, confidence intervals, etc. If you need a real statistical model, one that gives you not only point estimates but also realistic prediction intervals, then you need to take into account the real conditional distribution of your output. This question might help:

Judging the quality of a statistical model for a percentage
