GLM – How to Specify and Restrict the Sign of Coefficients in a GLM or Similar Model in R

bayesian, generalized-linear-model, predictive-models, r

The situation: I'm struggling with a predictive analysis of food sales prices using a generalized linear model. My dataset contains different kinds of food (cheeses, vegetables, meats, spices, etc.), and since these kinds are very different by nature, I split the dataset completely by kind and analyse each separately.

The current model: The dataset/model contains both factors, such as "country of production", and numeric variables, such as "transport distance", all of which are used in a gamma-based GLM in R.

The problem: In general the model fits pretty well; however, in rare cases some of the numeric variables get the opposite sign (+/-) from what you would expect, because the model somehow picks up other influences.

An example: Take spices. All spices have a relatively long "transport distance" and a relatively long shelf life, and hence transport has a pretty small impact on the sales price compared to, e.g., meat. So in this case the model might by accident end up giving the "transport distance" variable a small but negative coefficient, which is of course wrong, because it would mean that the longer the food was transported, the lower its price would be.

My question: What kind of model should I use in R if I want something similar to a GLM but with the ability to specify restrictions on some of the variables/coefficients? E.g., what if I want to say that an increased "transport distance" should ALWAYS have a positive impact on the sales price?
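For illustration, here is a minimal sketch of the kind of restriction meant above, using glmnet, which accepts per-coefficient lower.limits and, in recent versions, an arbitrary stats family object; with lambda = 0 the fit is essentially an unpenalized GLM. The data frame spices and the column names are placeholders, not the actual data.

```r
## Sketch only: a gamma GLM in which the "transport_distance" coefficient
## is constrained to be non-negative via glmnet's lower.limits argument.
## `spices`, `price`, `country`, `transport_distance` are placeholder names.
library(glmnet)

# Build the design matrix; drop the intercept column, glmnet adds its own
X <- model.matrix(price ~ country + transport_distance, data = spices)[, -1]
y <- spices$price

# Lower bound of 0 for transport_distance, unrestricted (-Inf) for the rest
lower <- rep(-Inf, ncol(X))
lower[colnames(X) == "transport_distance"] <- 0

fit_constrained <- glmnet(
  X, y,
  family       = Gamma(link = "log"),  # gamma GLM, as in the current model
  lower.limits = lower,
  lambda       = 0                     # (essentially) no penalty -> plain GLM
)

coef(fit_constrained)  # transport_distance coefficient cannot be negative
```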

Ideas: I have heard about both "Bayesian GLM" models and using a so-called "prior distribution", but I have no idea which one, if either, would be best to use?
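A sketch of the Bayesian variant of the same idea, assuming the brms package: put a prior on the "transport distance" coefficient that is truncated at zero (lb = 0), so its posterior cannot go negative. Again the data frame and column names are placeholders, and whether a bound can be attached to a single coefficient (rather than to the whole class of coefficients) depends on the brms version.

```r
## Sketch only: Bayesian gamma GLM with a prior truncated at zero on the
## transport_distance coefficient. Names are placeholders; depending on the
## brms version, the lb bound may only be allowed on class = "b" as a whole.
library(brms)

positive_distance_prior <- prior(normal(0, 1), class = "b",
                                 coef = "transport_distance", lb = 0)

fit_bayes <- brm(
  price ~ country + transport_distance,
  family = Gamma(link = "log"),
  prior  = positive_distance_prior,
  data   = spices,
  chains = 4, cores = 4
)

summary(fit_bayes)  # posterior for transport_distance is restricted to >= 0
```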

UPDATE
The answer below by @ACD is not exactly what I'm looking for. I don't need an explanation of WHY this occurs; I need a way to restrict the coefficient signs 🙂

Best Answer

The negative estimated coefficient on something that you KNOW is positive comes from omitted-variable bias and/or collinearity between your regressors.

For prediction, this isn't so problematic, as long as the new data whose outcome (price?) you want to predict are sampled from the same population as your sample. The negative coefficient arises either because the variable is highly correlated with something else, making the coefficient estimate highly variable, OR because it is correlated with something important that is omitted from your model, so the negative sign is picking up the effect of that omitted factor.

But it sounds like you are also trying to do inference -- how much does an exogenous change in $X$ change $Y$? Causal inferential statistics uses different methods and has different priorities than predictive statistics. It is particularly well developed in econometrics. Basically you need to find strategies such that you can convince yourself that $E(\hat\beta \mid X, \text{whatever}) = \beta$, which generally involves making sure that the regressor of interest is not correlated with the error term, which is generally accomplished by controlling for observables (or unobservables in certain cases). Even if you get to that point, however, collinearity will still give you highly variable coefficients, but negative signs on something that you KNOW is positive will generally come with huge standard errors (assuming no omitted-variable bias).

Edit: if your model is

$$ \text{price} = g^{-1}\!\left(\alpha + \text{country}'\beta + \gamma\,\text{distance} + \text{whatever} + \epsilon\right) $$

then country will be correlated with distance. Hence, if you are in Tajikistan and you are getting a spice from Vanuatu, then the coefficient on Vanuatu will be really high. After controlling for all of these country effects, the additional effect of distance may well not be positive. In this case, if you want to do inference and not prediction (and think that you can specify and estimate a model that gives a causal interpretation), then you may wish to take out the country variables.
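A tiny simulation (a sketch with made-up data, not the asker's dataset) illustrating this point: when country almost determines distance, the country dummies absorb most of the distance effect, and the remaining distance coefficient becomes very noisy and can even come out negative despite a positive true effect.

```r
## Simulated illustration of collinearity between country and distance.
## The true effect of distance on price is positive, but because distance
## is nearly determined by country, the estimated distance coefficient is
## very noisy once the country dummies are included, and can flip sign.
set.seed(1)
n        <- 200
country  <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
# Distance is almost a deterministic function of country
distance <- c(A = 500, B = 4000, C = 9000)[as.character(country)] +
  rnorm(n, sd = 100)
# True model: log(price) increases with distance
mu    <- exp(3 + 0.0002 * distance)
price <- rgamma(n, shape = 20, rate = 20 / mu)

fit <- glm(price ~ country + distance, family = Gamma(link = "log"))
# Distance estimate: large standard error, sign not reliable
summary(fit)$coefficients["distance", ]
```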
