Solved – How to use/interpret empirical distribution

distributions, java, sampling

First of all, I'd like to apologize for the vague title; I couldn't come up with a better one. Please feel free to change it, or advise me how to change it, so that it better fits the core of the question.

Now, about the question itself: I have been working on a piece of software in which I came across the idea of using an empirical distribution for sampling. Now that it's implemented, however, I am not sure how to interpret the results. Allow me to describe what I have done, and why:

I run a series of calculations for a set of objects, each yielding a final score. The score as it stands is very ad hoc. To make sense of the score of a particular object, I carry out a large number (N = 1000) of score calculations with mock (randomly generated) values, yielding 1000 mock scores. These 1000 mock scores are then used to estimate an empirical "score distribution" for that particular object.
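As a rough illustration of the procedure described above, here is a minimal sketch in plain Java. The class `MockScoreEcdf` and the `scoreOnRandomInput` callback are hypothetical names introduced only for this example; the callback stands in for the real score calculation applied to randomly generated inputs.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class MockScoreEcdf {

    /**
     * Generates n mock scores and returns them sorted in ascending order.
     * scoreOnRandomInput is a hypothetical placeholder for the real scoring
     * code evaluated on randomly generated values.
     */
    static double[] mockScores(int n, Random rng,
                               ToDoubleFunction<Random> scoreOnRandomInput) {
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            scores[i] = scoreOnRandomInput.applyAsDouble(rng);
        }
        Arrays.sort(scores);
        return scores;
    }

    /**
     * Empirical CDF: the fraction of mock scores that are <= s.
     * Expects sortedScores in ascending order.
     */
    static double ecdf(double[] sortedScores, double s) {
        // Binary search for the number of values <= s.
        int lo = 0;
        int hi = sortedScores.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedScores[mid] <= s) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sortedScores.length;
    }
}
```

With N = 1000 mock scores, `1.0 - ecdf(scores, s)` gives the fraction of mock scores strictly greater than an observed score `s`.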

I have implemented this in Java (as the rest of the software is also written in Java) using the Apache Commons Math library, in particular the EmpiricalDistImpl class. According to the documentation, this class uses:

what amounts to the Variable Kernel Method with Gaussian smoothing:

Digesting the input file

  1. Pass the file once to compute min and max.
  2. Divide the range from min-max into binCount "bins."
  3. Pass the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins
  4. Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
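The four steps quoted above can be sketched in plain Java roughly as follows. This is not the Commons Math implementation: the class `BinnedEmpiricalSketch` is a hypothetical name, it works on an in-memory array rather than a file, and the `sample` method (pick a bin with probability proportional to its count, then draw a Gaussian with that bin's mean and standard deviation) is an assumption about how such a digest might be used for sampling.

```java
import java.util.Random;

final class BinnedEmpiricalSketch {
    public final int[] counts;         // number of data points per bin
    public final double[] means;       // mean of the data in each bin
    public final double[] stdDevs;     // sample std dev of the data in each bin
    public final double[] upperBounds; // cumulative fraction up to each bin
    private final double min;
    private final double binWidth;

    public BinnedEmpiricalSketch(double[] data, int binCount) {
        // Step 1: one pass to find min and max.
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double x : data) {
            lo = Math.min(lo, x);
            hi = Math.max(hi, x);
        }
        min = lo;
        // Step 2: split [min, max] into binCount equal-width bins.
        binWidth = (hi - lo) / binCount;

        // Step 3: second pass, accumulating per-bin counts and sums.
        counts = new int[binCount];
        double[] sum = new double[binCount];
        double[] sumSq = new double[binCount];
        for (double x : data) {
            int b = binIndex(x);
            counts[b]++;
            sum[b] += x;
            sumSq[b] += x * x;
        }
        means = new double[binCount];
        stdDevs = new double[binCount];
        upperBounds = new double[binCount];
        int cumulative = 0;
        for (int b = 0; b < binCount; b++) {
            int n = counts[b];
            if (n > 0) {
                means[b] = sum[b] / n;
            }
            if (n > 1) {
                double var = (sumSq[b] - n * means[b] * means[b]) / (n - 1);
                stdDevs[b] = Math.sqrt(Math.max(var, 0.0));
            }
            // Step 4: subinterval of (0,1) with length proportional to the count.
            cumulative += n;
            upperBounds[b] = (double) cumulative / data.length;
        }
    }

    private int binIndex(double x) {
        if (binWidth == 0.0) {
            return 0; // all values identical
        }
        int i = (int) ((x - min) / binWidth);
        return Math.min(i, counts.length - 1); // the max value goes in the last bin
    }

    /** Assumed sampling step: choose a bin by count, then draw a Gaussian in it. */
    public double sample(Random rng) {
        double u = rng.nextDouble();
        int b = 0;
        while (b < upperBounds.length - 1 && u >= upperBounds[b]) {
            b++;
        }
        return means[b] + stdDevs[b] * rng.nextGaussian();
    }
}
```

Note that a value drawn this way is smoothed within its bin, so it can differ from every observed data point, unlike a plain resample of the original values.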

Now my question is: does it make sense to sample from this distribution in order to calculate some sort of expected value? In other words, how much can I trust or rely on this distribution? Could I, for instance, draw conclusions about the significance of observing a score $S$ by checking it against the distribution?

I realize that this is perhaps an unorthodox way of looking at a problem like this, but I think it would be interesting to get a better grip on the concept of empirical distributions, and on how they can and can't be used in analysis.

Best Answer

Empirical distributions are used all the time for inference, so you're definitely on the right track! One of the most common uses of empirical distributions is bootstrapping. In fact, you don't even need any of the machinery you've described above. In a nutshell, you make many draws (with replacement, each original sample being equally likely) from the original samples, recompute your statistic on each resample, and use the spread of the results to calculate confidence intervals for your previously calculated statistical quantities. Furthermore, these resamples have well-developed theoretical convergence properties. Check out the Wikipedia article on bootstrapping for more.
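A minimal sketch of that idea, assuming the statistic of interest is the mean and using the simple percentile method for the interval. The class and method names are hypothetical, and other bootstrap interval constructions exist.

```java
import java.util.Arrays;
import java.util.Random;

final class BootstrapSketch {

    /**
     * Percentile bootstrap confidence interval for the mean.
     * Returns {lower, upper} at the given confidence level (e.g. 0.95).
     */
    static double[] meanConfidenceInterval(double[] sample, int resamples,
                                           double level, Random rng) {
        int n = sample.length;
        double[] means = new double[resamples];
        for (int b = 0; b < resamples; b++) {
            // Draw n values uniformly, with replacement, from the original sample.
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += sample[rng.nextInt(n)];
            }
            means[b] = sum / n;
        }
        Arrays.sort(means);
        // Take the alpha/2 and 1 - alpha/2 quantiles of the resampled means.
        double alpha = 1.0 - level;
        int loIdx = (int) Math.floor((alpha / 2.0) * (resamples - 1));
        int hiIdx = (int) Math.ceil((1.0 - alpha / 2.0) * (resamples - 1));
        return new double[] {means[loIdx], means[hiIdx]};
    }
}
```

Applied to the 1000 mock scores, a call such as `meanConfidenceInterval(scores, 2000, 0.95, new Random())` would give an approximate 95% interval for the expected score without any binning or kernel smoothing.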