Solved – modeling a mixture of a Gaussian and Uniform (Matlab)

MATLABmixed modelnormal distributionuniform distribution

I'm trying to fit some data to a Gaussian + Uniform mixture model.

This model has three parameters: the mean and standard deviation of the Gaussian, and the relative weights of the distributions (which sum to 1).

The idea underlying this data is that each observation is assumed to be drawn from either of these two distributions. The relative weights determine the probability that each observation was drawn from a particular distribution.

I'm using the mle function in Matlab to attempt to do this estimate (maximum likelihood estimation), and have made good progress, however I've reached a stumbling block, and I think it has to do with the way I've modeled the mixture distribution.

First, I've created some simulated data from these two distributions:

Data = [5+3*randn(2000,1); unifrnd(-100,100,2000,1)];

This creates observations drawn from both a Gaussian with Mean of 5, standard deviation of 3, and from a uniform distribution over the closed interval [-100 100]. As the data set contains equal observations from each distribution (each has 2000 observations), the corresponding relative weights for a model of this data are 0.5 and 0.5 for the Gaussian and Uniform distribution, respectively.

Next, I've defined my custom probability density function, which the mle function will then use to estimate the model parameters:

MixPDF = @(x,m,s,weight) (weight*normpdf(x,m,s) + (1-weight)*(1/length(x)));

here, x refers to the input vector for the function, and m and s are the mean and standard deviation of the Gaussian component. I've attempted to model the uniform component as a constant that is scaled by both the (1-weight) factor, and by the reciprocal of the length of the input vector. This is my attempt at normalizing the area under the uniform component to be 1, when the weight of the uniform component (1-weight) is 1.

However, I'm pretty sure I've not done this properly. I think what I need to do is specify the model in such a way that the area underneath the entire mixed distribution will always be equal to 1. And I'm not sure how to do this.

When I go ahead and use the mle procedure, as follows:

p = mle(Data, 'pdf', MixPDF, 'start', [0 1 0.5],'lowerbound', [-20 0.1 0], 'upperbound', [20 50 1])

I get an output that is somewhat accurate, but the accuracy is heavily dependent upon the way I scale the uniform component in the MixPDF model. The output also varies depending upon the interval over which I simulate the uniform component of the data.

I'd really appreciate any guidance on how to properly specify the model such that the area underneath the distribution is properly normalized (assuming this is my issue, of course!)

Best Answer

Answer compiled from comments per request.

Presume your Uniform distribution has lower limit = LL and upper limit = UL. In your code snippet above, LL = -100 and UL = 100.

Leaving aside the weight, just make sure your Uniform density integrates to 1, i.e, density = 1/(UL - LL) * indicator(LL <= x <= UL). indicator(LL <= x <= UL) is defined to be = 1 when LL <= x <= UL, and = 0 otherwise. Then it will be correct when you apply the weighting.

I don't know what you're doing with 1/length(x), but that doesn't look correct. length(x) in MATLAB is the length of the vector x. If x were a scalar and UL - LL = 1, as with a Uniform[0,1], then it would happen to be "correct" (leaving aside the indicator).

I think you could do something like

 MixPDF = @(x,m,s,LL,UL,weight) (weight*normpdf(x,m,s) + (1-weight) * 1/(UL-LL) * (LL <= x && x <= UL));

Note that the parentheses were in the wrong place in your 1-weight term in your comment above.