Solved – Details of binary logistic regression, estimating $P(Y=1|X)$

logistic-regression

I am trying to understand logistic regression, but most sources I have found leave the actual computational step as a sort of "black box", e.g. R's glm(y ~ x, …), which obscures the underlying computation. As an exercise, I want to write my own routine for (binary) logistic regression, which requires knowing the details.

For binary outcomes $Y$, and some predictor data $X$, we attempt to model the conditional probability

$$ p(X) = P(Y=1|X) = \frac{1}{1+ e^{-\beta X}} $$

and the corresponding linear model

$$\ell = \log\bigg( \frac{p}{1-p} \bigg) = \beta X$$
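To make the link concrete, here is a tiny R sketch (the helper names sigmoid and logit are my own) showing that applying the logit to $p(X)$ recovers the linear predictor:

sigmoid <- function(eta) 1 / (1 + exp(-eta))  # inverse link: linear predictor -> probability
logit   <- function(p) log(p / (1 - p))       # link: probability -> log odds

eta <- 0.7           # some value of the linear predictor beta X
p   <- sigmoid(eta)  # P(Y=1|X)
logit(p)             # returns 0.7, recovering beta X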

How do we now proceed in practice to compute $\beta$? In my understanding we have something like a linear system, so that for $n$ measurements & $k$ predictive variables

$$l_1 = \beta_0 + \beta_1 x_{11} + \dots + \beta_k x_{1k}$$
$$l_2 = \beta_0 + \beta_1 x_{21} + \dots + \beta_k x_{2k}$$
$$\vdots$$
$$l_n = \beta_0 + \beta_1 x_{n1} + \dots + \beta_k x_{nk}$$

which could then be solved using standard methods (such as SGD for an appropriate cost function). But how do we compute/estimate the values $l_i$? The observed value of $P(Y=1|X)$ would simply be the scalar $p^* = \frac{1}{n}\sum_{i=1}^n y_i$, the same for every row, meaning that the system to be solved would be

$$ \log\bigg(\frac{p^*}{1-p^*}\bigg) (1,1,\dots,1)^T = \beta X$$

Is this correct? I have been testing with the following R code:

set.seed(1234)
x <- rnorm(1000)
# simulate binary outcomes from a true logistic model with beta = (-2.3, 0.1)
y <- rbinom(1000, 1, exp(-2.3 + 0.1*x)/(1+exp(-2.3 + 0.1*x)))

# logistic regression via glm()
fit = glm(y ~ x, family=binomial(link='logit'))
fit$coefficients

# my proposed approach: regress the constant empirical log odds on x
p = sum(y)/length(y)
z = rep(p, 1000)
z_star = log(z/(1-z))
fit2 = lm(z_star ~ x)
fit2$coefficients

which gives the following coefficient estimates for $\beta$:

> fit$coefficients
(Intercept)           x 
 -2.2261215   0.1651474 

> fit2$coefficients
  (Intercept)             x 
-2.219647e+00 -5.338566e-16 

which differ noticeably in the estimate of the $x$ coefficient. I thought this might be because of the different optimization methods used by glm() and lm(), but is my understanding of the modelling procedure correct?

Best Answer

The fit from the linear model is very different because it's a linear probability model, i.e. you're directly estimating

$$ P(Y=1|X=x) = \beta_0 + \beta_1 x $$

whereas in logistic regression you're modeling the log odds as linear in the predictors

$$ \log \left( \frac{P(Y=1|X=x)}{P(Y=0|X=x)} \right) = \beta_0 + \beta_1 x $$

which is very different, i.e., the coefficient $\beta_1$ has a totally different meaning.
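As a quick illustration of that difference (re-simulating the data from the question; the object names here are just illustrative), the slope of a linear probability model fit with lm() is a change in probability per unit of $x$, whereas the exponentiated logistic slope is an odds ratio:

set.seed(1234)
x <- rnorm(1000)
y <- rbinom(1000, 1, exp(-2.3 + 0.1*x)/(1+exp(-2.3 + 0.1*x)))

lpm <- lm(y ~ x)            # linear probability model
coef(lpm)["x"]              # additive change in P(Y=1|X=x) per unit of x

logit_fit <- glm(y ~ x, family = binomial)
exp(coef(logit_fit)["x"])   # multiplicative change in the odds per unit of x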

Also, to clarify: you estimate the coefficients in a logistic regression model by maximum likelihood, i.e. by maximizing the binomial (actually Bernoulli) likelihood, not by solving a system of equations.
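Since you want to implement the computation yourself: the standard approach is Newton-Raphson, which for this model is equivalent to iteratively reweighted least squares (IRLS). Here is a minimal sketch in R; the function logistic_mle is my own illustrative routine, not glm()'s exact internals:

logistic_mle <- function(X, y, tol = 1e-10, max_iter = 100) {
  X <- cbind(1, X)                   # prepend a column of 1s for the intercept
  beta <- rep(0, ncol(X))            # starting values
  for (i in seq_len(max_iter)) {
    p <- 1 / (1 + exp(-X %*% beta))  # current fitted probabilities
    W <- as.vector(p * (1 - p))      # Bernoulli variance weights
    score <- t(X) %*% (y - p)        # gradient of the log-likelihood
    hess  <- t(X) %*% (X * W)        # X'WX, the Fisher information
    step  <- solve(hess, score)      # Newton step
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  drop(beta)
}

logistic_mle(x, y)   # on the simulated data above: matches fit$coefficients

Each Newton step solves a weighted least-squares problem, which is why the procedure is called IRLS; glm() uses essentially this scheme under the hood.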
