The OLS estimator in the linear regression model is quite rare in having the property that it can be represented in closed form, that is, without needing to be expressed as the optimizer of a function. It is nevertheless the optimizer of a function -- the residual sum of squares function -- and can be computed as such.
The MLE in the logistic regression model is also the optimizer of a
suitably defined log-likelihood function, but since it is not available
in a closed form expression, it must be computed as an optimizer.
Most statistical estimators are only expressible as optimizers of
appropriately constructed functions of the data called criterion functions.
Such optimizers require the use of appropriate numerical optimization
algorithms.
Optimizers of functions can be computed in R using the optim() function, which provides some general-purpose optimization algorithms, or with one of the more specialized packages such as optimx. Knowing which optimization algorithm to use for different types of models and statistical criterion functions is key.
Linear regression residual sum of squares
The OLS estimator is defined as the optimizer of the well-known residual sum of
squares function:
$$
\begin{align}
\hat{\boldsymbol{\beta}} &= \arg\min_{\boldsymbol{\beta}}\left(\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\right)'\left(\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\right) \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{Y}
\end{align}
$$
In the case of a twice differentiable, convex function like the residual sum of squares, most gradient-based optimizers do a good job. Here I will be using the BFGS algorithm.
#================================================
# reading in the data & pre-processing
#================================================
urlSheatherData = "http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv"
dfSheather = as.data.frame(read.csv(urlSheatherData, header = TRUE))
# create the design matrices
vY = as.matrix(dfSheather['InMichelin'])
mX = as.matrix(dfSheather[c('Service','Decor', 'Food', 'Price')])
# add an intercept to the predictor variables
mX = cbind(1, mX)
# the number of variables and observations
iK = ncol(mX)
iN = nrow(mX)
#================================================
# compute the linear regression parameters as
# an optimal value
#================================================
# the residual sum of squares criterion function
fnRSS = function(vBeta, vY, mX) {
return(sum((vY - mX %*% vBeta)^2))
}
# arbitrary starting values
vBeta0 = rep(0, ncol(mX))
# minimise the RSS function to get the parameter estimates
optimLinReg = optim(vBeta0, fnRSS,
mX = mX, vY = vY, method = 'BFGS',
hessian=TRUE)
#================================================
# compare to the LM function
#================================================
linregSheather = lm(InMichelin ~ Service + Decor + Food + Price,
data = dfSheather)
This yields:
> print(cbind(coef(linregSheather), optimLinReg$par))
[,1] [,2]
(Intercept) -1.492092490 -1.492093965
Service -0.011176619 -0.011176583
Decor 0.044193000 0.044193023
Food 0.057733737 0.057733770
Price 0.001797941 0.001797934
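For completeness, the closed-form expression $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{Y}$ given above can also be evaluated directly from the design matrix and response built earlier, and gives the same estimates (a quick sketch):
# the closed-form OLS expression, evaluated directly for comparison
vBetaClosedForm = solve(t(mX) %*% mX, t(mX) %*% vY)
print(vBetaClosedForm)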
Logistic regression log-likelihood
The criterion function corresponding to the MLE in the logistic regression model is
the log-likelihood function.
$$
\begin{align}
\log L_n(\boldsymbol{\beta}) &= \sum_{i=1}^n \left(Y_i \log \Lambda(\boldsymbol{X}_i'\boldsymbol{\beta}) +
(1-Y_i)\log(1 - \Lambda(\boldsymbol{X}_i'\boldsymbol{\beta}))\right)
\end{align}
$$
where $\Lambda(k) = 1/(1+ \exp(-k))$ is the logistic function. The parameter estimates are the optimizers of this function
$$
\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\log L_n(\boldsymbol{\beta})
$$
I show how to construct and optimize the criterion function using the optim() function, once again employing the BFGS algorithm.
#================================================
# compute the logistic regression parameters as
# an optimal value
#================================================
# define the logistic transformation
logit = function(mX, vBeta) {
return(exp(mX %*% vBeta)/(1+ exp(mX %*% vBeta)) )
}
# stable parametrisation of the log-likelihood function
# Note: The negative of the log-likelihood is being returned, since we will be
# /minimising/ the function.
logLikelihoodLogitStable = function(vBeta, mX, vY) {
return(-sum(
vY*(mX %*% vBeta - log(1+exp(mX %*% vBeta)))
+ (1-vY)*(-log(1 + exp(mX %*% vBeta)))
)
)
}
# initial set of parameters
vBeta0 = c(10, -0.1, -0.3, 0.001, 0.01) # arbitrary starting parameters
# minimise the (negative) log-likelihood to get the logit fit
optimLogit = optim(vBeta0, logLikelihoodLogitStable,
mX = mX, vY = vY, method = 'BFGS',
hessian=TRUE)
#================================================
# test against the implementation in R
# NOTE glm uses IRWLS:
# http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
# rather than the BFGS algorithm that we have used above
#================================================
logitSheather = glm(InMichelin ~ Service + Decor + Food + Price,
data = dfSheather,
family = binomial, x = TRUE)
This yields:
> print(cbind(coef(logitSheather), optimLogit$par))
[,1] [,2]
(Intercept) -11.19745057 -11.19661798
Service -0.19242411 -0.19249119
Decor 0.09997273 0.09992445
Food 0.40484706 0.40483753
Price 0.09171953 0.09175369
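As an aside, the hessian=TRUE argument passed to optim() above returns the Hessian of the negative log-likelihood at the optimum. Under the usual asymptotic approximation, its inverse estimates the covariance matrix of the MLE, so approximate standard errors can be compared with glm's as well (a sketch, not a substitute for summary of the glm fit):
# approximate standard errors from the numerically computed Hessian
vSE = sqrt(diag(solve(optimLogit$hessian)))
print(cbind(summary(logitSheather)$coefficients[, "Std. Error"], vSE))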
As a caveat, note that numerical optimization algorithms require careful use, or you can end up with all sorts of pathological solutions. Until you understand them well, it is best to use the available packaged options, which allow you to concentrate on specifying the model rather than worrying about how to numerically compute the estimates.
My question now is: which of the coefficients (1, 2 or 3) gives me the true effect of a one-unit change in a predictor variable?
I've got several parts to my answer. The first is the most straightforward. You said you're using MinMaxScaler, so I'm guessing that you're using scikit-learn and doing this with Python. That package can make a scaler object that already has a method for undoing the scaling, called inverse_transform(). There's more information over at Stack Exchange here, or you can read the documentation here. I believe that your manual attempt at an inverse transform simply didn't perform the right operations, and it is almost always safer to use pre-built, well-tested functions from the kit anyway. It has the added benefit that if you'd like to see exactly how it works, it's open source -- just read the code under the source link on the documentation site.
Part II may contain some things you already know, but here it is anyway. A logistic regression is non-linear, which means that the effect of a one-unit change in a predictor differs depending on the value of that predictor. The reason that we're allowed to make blanket statements when interpreting a linear regression model, such as "for each 1 unit increase in $x$, $y$ tends to increase by such-and-such on average", is that a linear regression model has fixed slope coefficients. In other words, the first derivative with respect to any predictor is a constant, so the impact of a one-unit change is constant.
That is not the case in a logistic regression. As such, logistic regressions are typically used to predict the chance that a certain observation will fall into a certain category. The coefficients in a logistic regression are not regular slope coefficients that can be interpreted as simple unit changes as in linear regression; they are on the logged-odds scale. This makes logistic regressions much less intuitive to interpret. That much you already alluded to in the question.
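To make the non-linearity concrete, here is a minimal sketch in R that reuses the Michelin fit from the first part of this post as a stand-in for your model (the predictor values are arbitrary illustrations). The same one-unit increase in Food moves the predicted probability by different amounts depending on where the other values sit.
# same +1 change in Food, evaluated at two different baseline points
newLow = data.frame(Service = 15, Decor = 15, Food = 15, Price = 30)
newHigh = data.frame(Service = 25, Decor = 25, Food = 25, Price = 60)
pLow = predict(logitSheather, newLow, type = "response")
pLow1 = predict(logitSheather, transform(newLow, Food = Food + 1), type = "response")
pHigh = predict(logitSheather, newHigh, type = "response")
pHigh1 = predict(logitSheather, transform(newHigh, Food = Food + 1), type = "response")
# the two probability changes differ even though the change in log-odds is the same
c(lowBaseline = unname(pLow1 - pLow), highBaseline = unname(pHigh1 - pHigh))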
However, the reason for performing the transform to begin with is unclear. If you had serious multicollinearity, sure. If the spread of the values was huge, maybe it makes sense. If there were several different units, like kilos, pounds, and inches all being used from different systems that measure different features, again, it might make sense. But I'm not convinced you really need to normalize your variables in this situation. I'd recommend trying the following instead:
- Do not standardize your variables before you fit the model
- Exponentiate the coefficients you get after fitting the model. This converts them from logged-odds to odds. If you want, you can further convert them to probabilities to make interpretation even easier. The formula is $$\text{probability} = \text{odds}/(1 + \text{odds}).$$ This will help in interpreting the impact of categorical variables (see the sketch after this list).
- For continuous variables, don't get caught up in trying to find a simple explanation. The actual coefficients can be interpreted as "for an increase of one unit in x, there is an increase of [coefficient] in the logged-odds of y." Most people won't have an intuitive grasp of what this means. If you really want a grasp of how much of an impact a continuous variable is making, compare a model with that variable to a model without it. These are called nested models and are regularly used in these sorts of assessments.
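Here is a hedged sketch in R of those last two points, again reusing the Michelin model fitted in the first part of this post as a stand-in (your own fitted model and variable names would go in its place):
# 1) convert the fitted logged-odds coefficients to odds and then to probabilities
vOdds = exp(coef(logitSheather))
vProb = vOdds / (1 + vOdds)
print(cbind(logOdds = coef(logitSheather), odds = vOdds, probability = vProb))
# 2) a nested-model comparison: refit without one variable and use a likelihood-ratio test
logitNoFood = glm(InMichelin ~ Service + Decor + Price,
                  data = dfSheather, family = binomial)
anova(logitNoFood, logitSheather, test = "Chisq")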
How can I rank the importance of my features?
I'm not sure if you need a formal test of this. If you've got the odds or probabilities, you can use your best judgement to see which one is most impactful based on the magnitude of the coefficients and how many levels they can realistically take.
Can I even compare real-valued features with categorical variables?
Sure, but I wouldn't normalize them first. You can compare different types of variables to each other, just bear in mind the meaning of the different types. If increasing the distance from the goal by 1 meter decreases the probability of making the shot by 1% and having good weather instead of bad increases the probability of making the shot by 2%, that doesn't mean that weather is more impactful--you either have good or bad, so 2% is the max increase, whereas distance could keep increasing substantially and add up. That's just an example, but you can just use common sense like this and explain your reasoning when you're comparing them.
Best Answer
Unlike linear regression, where you can use matrix algebra and ordinary least squares to get the results in closed form, for logistic regression you need to use some kind of optimization algorithm to find the solution with the smallest loss, or greatest likelihood. For this, logistic regression most commonly uses iteratively reweighted least squares, but if you really want to compute it by hand, it is probably easier to use gradient descent. You can find a nice introduction to gradient descent in Lecture 6.5 — Logistic Regression | Simplified Cost Function And Gradient Descent by Andrew Ng (if it is unclear, see the earlier lectures, also available on YouTube or on Coursera). Using his notation, the iteration step is
$$
\mathtt{repeat}\ \{\\
\qquad \theta_j := \theta_j - \alpha \sum_{i=1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\\
\}
$$
where $\theta_j$ is the $j$-th parameter from the vector $(\theta_0, \theta_1, \dots, \theta_k)$, $x^{(i)}$ is the vector of variables for the $i$-th observation $(1, x_1^{(i)},\dots, x_k^{(i)})$, where the leading $1$ comes from the column of ones for the intercept, $h_\theta(x) = \tfrac{1}{1+\exp(-\theta^T x)}$ is the logistic function (the inverse of the logit link), and $\alpha$ is the learning rate. All the $\theta_j$ are updated simultaneously on each pass, and you iterate until convergence.
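For concreteness, here is a minimal R sketch of that iteration applied to the design matrix and response built earlier in this post. The learning rate and iteration count are arbitrary choices, not from the lecture, and typically need tuning; convergence is slow on unscaled predictors.
# batch gradient descent for the logistic regression log-likelihood
sigmoid = function(z) 1 / (1 + exp(-z))
gradientDescentLogit = function(mX, vY, alpha = 1e-6, nIter = 1e5) {
  vTheta = rep(0, ncol(mX))
  for (iter in seq_len(nIter)) {
    vH = sigmoid(mX %*% vTheta)        # h_theta(x^(i)) for every observation
    vGrad = t(mX) %*% (vH - vY)        # sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    vTheta = vTheta - alpha * vGrad    # simultaneous update of all theta_j
  }
  return(vTheta)
}
# with a suitable alpha and enough iterations this approaches the glm/optim fits above;
# as written it only demonstrates the update rule
print(gradientDescentLogit(mX, vY))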