Determining continuous vs discrete data sets

descriptive statistics, distributions

When I collect data for my work, I get a set like the following:

[ 133, 183, 185.16, 188, 143, 128, 135.5, 100.55, 117.96, 95.5 ]

These points are part of a continuous scale, but there is no known function associated with them. How does one develop a probability density (continuous) function from this set?

I know I can find higher-order calculations based on this unknown probability density function, such as the value of the expectation operator on this set by way of a basic average. However, this average requires treating the measurements as discrete values, each with the same probability of 1/n, so the data are managed as a discrete probability distribution even though the values are on a continuous scale.
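To make that point concrete, here is a minimal sketch (not part of the original post) showing that the plain average of the measurements is the same number as the expectation under a discrete distribution that puts mass 1/n on each observed value. The variable names are purely illustrative.

```python
# The ten measurements from the question.
data = [133, 183, 185.16, 188, 143, 128, 135.5, 100.55, 117.96, 95.5]
n = len(data)

# Empirical distribution: every observed value gets probability mass 1/n.
weights = [1.0 / n] * n

# Expectation under that discrete distribution: sum of value * probability.
expectation = sum(x * w for x, w in zip(data, weights))

# The ordinary sample mean, computed the usual way.
sample_mean = sum(data) / n

print(expectation, sample_mean)
```

Both quantities come out to roughly 140.97, differing at most by floating-point rounding.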

This is unlike most discrete data problem examples, where dice are used and there is no continuous scale (we either have 1, 2, 3, 4, etc.).

This is the oddity that is bugging me: how to determine whether data are continuous or discrete. If the function that describes the data is known, then my assumption is that we can act on it as a continuous data set, but if not, then it's all simply discrete?

It is highly unlikely that within the precision of my instrumentation I will get the same value for two measurement events; however, this is entirely possible, so is this data simply a massive sample space of all possible measures (down to some finite precision), where it is possible to have overlap (though not probable)?

The stats books and examples I've dealt with so far love to outline discrete data sets as those with a great deal of obvious event overlap, such as dice, cards, and coins, where the values and events are finite and part of a rather small sample space. For continuous data sets, examples always include a function that describes the probability density over some domain range.

This leaves me wondering about this specific situation, where the measurements are clearly discrete in the sense that they are separate points on a line, but this seems to differ from other examples of discrete events, where in contrast, their values are not continuous.

Best Answer

In data analysis, we usually treat any variable that has a lot of distinct ordered values as continuous, even though all real-life measurements are discrete. Ben Kuhn's answer here gives some reasons why. This is an example of how doing good data analysis is about finding a model that is useful for the task at hand, rather than finding the model with the most realistic assumptions.

Applying statistical functionals (like the mean) to an entire infinite population is the kind of thing that you get to do only in the rarefied world of mathematical statistics. In applied data analysis, you have only a sample and you'll never know the true model, so you have to settle for estimation using the wrong model, or nonparametric estimation.
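As one possible illustration of nonparametric estimation, the sketch below builds a Gaussian kernel density estimate from the ten measurements in the question. This is an assumption about what such an estimate might look like, not a method prescribed by the answer: the function name `gaussian_kde_sketch` is hypothetical, and the bandwidth uses a common normal-reference rule of thumb (1.06 times the sample standard deviation times n^(-1/5)), which is only one of many reasonable choices.

```python
import math

# The ten measurements from the question.
data = [133, 183, 185.16, 188, 143, 128, 135.5, 100.55, 117.96, 95.5]


def gaussian_kde_sketch(sample, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate.

    Hypothetical helper for illustration only. If no bandwidth is given,
    a normal-reference rule of thumb is used (an assumption, not a rule
    taken from the original post).
    """
    n = len(sample)
    if bandwidth is None:
        mean = sum(sample) / n
        # Sample standard deviation with the n - 1 denominator.
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)

    # Normalizing constant so that the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


pdf_hat = gaussian_kde_sketch(data)

# Evaluate the estimated density at a few points on the measurement scale.
for x in (100, 140, 185):
    print(x, pdf_hat(x))
```

The result is a smooth function that can be treated like a continuous density, even though it was built from a finite sample. A different bandwidth or kernel would give a different curve, which is exactly the sense in which the model is a useful approximation rather than the truth.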
