MATLAB: How to best determine the probability of a distribution given an outlying observation

distribution, hypothesis, hypothesis testing, probability, Statistics and Machine Learning Toolbox

Hi,

I have a classification problem. I have a set of data from a reference process (let's call that "known") and a set of data from a second process (let's call that "test").

Hypothesis 0 is that the test sample came from the same process as the "known" sample, and will therefore have the same distribution.

Hypothesis 1 is that the test sample came from a different process. However, here is the catch: for all but one sample, this process has an identical distribution to the "known". Just one sample will be "suspiciously" low.

I will add a picture to better explain:

In this case, the red histogram is the reference "known" distribution and the blue histogram is the questioned "test" distribution. Here I already know that the test came from a different process. It might not be completely clear because the histograms overlap, but the distributions match fairly well, except for a single blue sample which is suspiciously low.

What I need now is a method that takes the two distributions and returns the probability that the extremely low blue value would be observed if the test data really came from the "known" distribution. I know how to calculate the probability of a particular single observation, but how do I properly balance this against the number of observations? Would just a KS test be appropriate? It strikes me as stats 101, but it's been a while, and I don't want to get this wrong.

Thanks in advance.

Best Answer

If you know the reference distribution analytically, you can compute its cdf at the smallest observed test value. Call this value p: it is the probability that a single observation from the reference distribution falls at or below that value. Treating each of the N test observations as an independent trial whose "success" is landing at or below that value, the p-value for your test is then one minus the binomial probability of observing no successes in N trials. That is, it would be 1-(1-p)^N.
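
As a rough sketch of that calculation, suppose the reference distribution is normal. That choice, and every number below (mu, sigma, N, xMin), is a made-up assumption for illustration and not something taken from the question. normcdf requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical inputs, for illustration only
mu    = 10;    % assumed mean of the "known" distribution
sigma = 2;     % assumed standard deviation of the "known" distribution
N     = 200;   % assumed number of observations in the "test" sample
xMin  = 2.5;   % assumed smallest observation in the "test" sample

% Probability that a single draw from the reference falls at or below xMin
p = normcdf(xMin, mu, sigma);

% Probability that the minimum of N independent draws is at or below xMin
pValue = 1 - (1 - p)^N;

% Same quantity, written to avoid round-off when p is extremely small
pValueStable = -expm1(N * log1p(-p));

fprintf('single-draw p = %.3g, p-value for the minimum = %.3g\n', p, pValue);
```

The p-value can never exceed N*p, so for very small p it is roughly N times the single-observation probability, which is how the number of observations enters the answer.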

Related Solutions

MATLAB: How to go about finding the standard normal probability based on the z-score

doc normcdf
doc normpdf

When you know what you want but are not sure of the name, try something like

>> lookfor normal
realmin                        - Smallest positive normalized floating point number.
randn                          - Normally distributed pseudorandom numbers.
sprandn                        - Sparse normally distributed random matrix.
surfnorm                       - Surface normals.
isonormals                     - Isosurface normals.
cde                            - cd elliptic function with normalized complex argument.
sne                            - sn elliptic function with normalized complex argument.
addfreqcsmenu                  - Add a cs menu to switch between linear and normalized frequency
convertfrequnits               - converts between Normalized, Hz, kHz, etc
histfit                        - Histogram with superimposed fitted normal density.
jbtest                         - Jarque-Bera hypothesis test of composite normality.
lhsnorm                        - Generate a latin hypercube sample with a normal distribution
logncdf                        - Lognormal cumulative distribution function (cdf).
lognfit                        - Parameter estimates and confidence intervals for lognormal data.
logninv                        - Inverse of the lognormal cumulative distribution function (cdf).
lognlike                       - Negative log-likelihood for the lognormal distribution.
lognpdf                        - Lognormal probability density function (pdf).
lognrnd                        - Random arrays from the lognormal distribution.
lognstat                       - Mean and variance for the lognormal distribution.
mvncdf                         - Multivariate normal cumulative distribution function (cdf).
mvnpdf                         - Multivariate normal probability density function (pdf).
mvnrnd                         - Random vectors from the multivariate normal distribution.
normcdf                        - Normal cumulative distribution function (cdf).
normfit                        - Parameter estimates and confidence intervals for normal data.
norminv                        - Inverse of the normal cumulative distribution function (cdf).
normlike                       - Negative log-likelihood for the normal distribution.
normpdf                        - Normal probability density function (pdf).
normplot                       - Displays a normal probability plot.
normrnd                        - Random arrays from the normal distribution.
normspec                       - Plots normal density between specification limits.
normstat                       - Mean and variance for the normal distribution.
logn3fit                       - Fit a 3-param lognormal dist'n using cumulative probabilities.
wgtnormfit                     - Fitting example for a weighted normal distribution.
wgtnormfit2                    - Fitting example for a weighted normal distribution (log(sigma) parameterization).
>>

Judicious search terms help, but even a broad listing of everything related to "normal" lets you find the two functions of interest (plus many more that might be useful, depending on which toolboxes are installed).
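
For the z-score question itself, a minimal sketch using normcdf and normpdf might look like the following. The z value is an arbitrary example, and with no extra arguments both functions assume the standard normal (mean 0, standard deviation 1).

```matlab
z = 1.96;                          % example z-score, chosen arbitrarily

pLower   = normcdf(z);             % P(Z <= z), lower-tail probability
pUpper   = normcdf(-z);            % P(Z > z), by symmetry of the normal
pTwoSide = 2 * normcdf(-abs(z));   % P(|Z| >= |z|), two-sided probability
density  = normpdf(z);             % height of the density at z (not a probability)

fprintf('lower = %.4f, upper = %.4f, two-sided = %.4f\n', pLower, pUpper, pTwoSide);
```

Using normcdf(-z) for the upper tail, instead of 1 - normcdf(z), keeps more precision when z is large.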

MATLAB: Are there functions for calculating the PDF and CDF of the Pareto distribution in the Statistics Toolbox

This enhancement has been incorporated in Release 14 Service Pack 3 (R14SP3). For previous product releases, the Statistics Toolbox has no functionality for computing the PDF or CDF of the Pareto distribution.
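
The answer does not name the functions. Assuming it refers to the generalized Pareto functions gppdf and gpcdf in the toolbox, a classical Pareto distribution with shape alpha and minimum value xm can be expressed through them as sketched below. The parameter values are hypothetical, and the closed-form lines are included only as a cross-check.

```matlab
% Hypothetical classical Pareto parameters, for illustration only
alpha = 3;            % shape (tail index)
xm    = 2;            % scale, i.e. the smallest possible value
x     = [2.5 4 8];    % points at which to evaluate, all >= xm

% Equivalent generalized Pareto parameters
k     = 1 / alpha;    % GP shape
sigma = xm / alpha;   % GP scale
theta = xm;           % GP threshold (location)

F = gpcdf(x, k, sigma, theta);   % cdf via the toolbox
f = gppdf(x, k, sigma, theta);   % pdf via the toolbox

% Closed-form classical Pareto cdf and pdf, for comparison
Fclosed = 1 - (xm ./ x).^alpha;
fclosed = alpha * xm^alpha ./ x.^(alpha + 1);

disp(max(abs(F - Fclosed)));     % difference should be at round-off level
disp(max(abs(f - fclosed)));
```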
