Solved – MATLAB implementation of MLE for Logistic Regression

generalized linear modellogisticMATLABmaximum likelihoodself-study

The following is an assignment.

The matrix $\mathbf{x}$ containing the external factors has dimension $4\times1000$, and the vector $\mathbf{y}$ containing the categorical variable has dimension $1\times1000$.

I have to fit a GLM model of the type

$$ \pi(t) = \frac{1}{1 + exp{\left(\beta_0 + \sum_{l=1}^m \beta_l u_t^l\right)}}, $$

where $u_t^l$ is the $l$th external factor at time $t$.

The likelihood function I came up with is the following:

$$ L(\beta) = -\sum_{i=1}^N \left[ y_i \log \left( \frac{1}{1+\exp(-\beta^T\mathbf{x}_i)} \right) +\\ (1-y_i)\log \left( 1 – \frac{1}{1+\exp(-\beta^T\mathbf{x}_i)} \right) \right] .$$

In order to maximize it, I computed the gradient:

$$ \nabla_{\beta} L = \sum_{i=1}^N \left( y_i – \frac{1}{1+\exp(-\beta^T\mathbf{x}_i)} \right)\mathbf{x}_i ,$$

and the updating expression is:

$$ \beta = \beta + \nabla_{\beta} L .$$

I want to implement it in MATLAB and for doing so I would like to first transform this last expression in matrix form:

$$ \beta = \beta + \left( \mathbf{y} – \frac{1}{1+\exp(-\beta^T\mathbf{x})} \right)\mathbf{x} .$$

I've been trying for hours now but the result is not correct (too many $1$s or $0$s according to the initial values).

Here's the question: can somebody explain what the dimensions of the different elements for the matrix form should be?

Best Answer

It should be pretty straightforward to code:

function llik = fun(b, X, Y)  
num = X * b;  
prb = exp(num .* Y) ./ (1 + exp(num));  
llik = -sum(log(prb));  
end

Where: (Y) is a column vector (say 1000 x 1)
(X) is a matrix of predictors (say 1000 x 5)
Key is exp(num .* Y) that will be used to obtain both proba(Y==1) and proba(Y==0)

Given the relative simplicity of this model and the efficiency of Matlab optimisation routines (fmincon/fminunc), I don't think you need to code the gradient, but can easily be done if really needed.

This model has a closed-form solution and then the results should in principle not depend on the choice of starting values. Hope this helps.

=== SCRIPT TO ESTIMATE THE MODEL ===
Here I assume that you first have created a script including the log-likelihood function only - Let's say this script is called "LOGISTIC_LL".

% IMPORT DATA
A = importdata('path.csv')

% DATA SPECIFICATION
Y = A.data(:,1);
X = A.data(:,2:end);

% VARIABLE NAMES
Vnames = {'x1','x2',etc.};

% STARTING VALUES
b = zeros(length(Vnames),1);

% OPTIMISATION
options = optimoptions(@fminunc,'Display','iter','MaxIterations',1e3,'MaxFunctionEvaluations',1e5);
tic;
[paramhat,fval,~,~,grad,hessian] = fminunc(@(b)LOGISTIC_LL(b, X, Y), b, options);

Finally, you could create another function to reshape results and compute other statistics (SEs, Pval, etc).

Related Question