Statistical Analysis – Can Survey Responses be Considered Random Variables?

binomial distribution, random variable, survey

I am trying to analyze responses from a survey. The outcome variable is named "reading_proficiency" and is dichotomous with two values 0 and 1. The data set has 3,539 observations so the column "reading_proficiency" has 3,539 observations of either 0 or 1.

I am trying to understand whether I can use the ideas of binomial distribution here. Can the variable "reading_proficiency" be a random variable?

The definition of a random variable I am using is as follows. A random variable is a variable that takes on different values determined by chance. In other words, it is a numerical quantity that varies at random.

Are the values of "reading_proficiency" truly determined by chance? Are two observations of "reading_proficiency" truly independent of one another?

If two observations are from the same survey cluster, the respondents might have attended the same schools and been taught by the same teachers, and thus have the same "reading_proficiency".

Does the fact that many observations are from the same cluster disqualify "reading_proficiency" from being a random variable?

I was reading that the observations of "reading_proficiency" should be mutually independent, but this is not the case for survey data, or is it?

Does it mean survey data cannot be random variables?

Best Answer

Welcome to Cross Validated!

As user2974951 points out, a random variable could "... conceptually represent ... the subjective randomness that results from incomplete knowledge of a quantity". That is, it seems random to you because you simply don't have enough information about the factors that actually influence or determine the outcome.

So that answers the first part of your question: "Are the values of "reading_proficiency" truly determined by chance?"

Basically, regardless of what truly determines "reading proficiency", you do not have the information to know everything about it. It can take on different values, with a particular distribution, and you do not know which value it will take in a particular case. From that perspective, it is certainly a random variable. We're getting a bit into philosophy here, but essentially, if some variable or process generates different results and you do not know everything about what determines an outcome (meaning you have uncertainty), it is a random variable. If this point is unclear, please ask in the comments and we can discuss more.

Let's move on to the second part of your question: "Are two observations of "reading_proficiency" truly independent of one another?"

This is an interesting question (given your clustering concerns). However, let's be clear: independence is not a prerequisite for being a random variable. Each observation of "reading_proficiency" is a random variable whether or not it is correlated with other observations. Dependence describes how several random variables relate to one another, not whether each of them is random. Respondents from the same cluster may well produce similar values, but similar or correlated values are still random. You don't need sample independence for it to be a random variable.
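To put that distinction in symbols, here is a small sketch. The notation is introduced only for illustration and does not come from the question: X_i is the response of respondent i, p is an assumed common probability of a 1, and \rho is an assumed correlation between two respondents in the same cluster.

```latex
\begin{align}
  X_i &\sim \mathrm{Bernoulli}(p), & P(X_i = 1) &= p
      && \text{(each response, clustered or not)} \\
  &\text{independent pair:} & P(X_i = 1,\, X_j = 1) &= p^2 \\
  &\text{same-cluster pair:} & P(X_i = 1,\, X_j = 1) &= p^2 + \rho\, p(1 - p)
\end{align}
```

In both cases each X_i on its own has the same Bernoulli(p) distribution. A positive \rho changes only the joint behaviour of pairs, which is what the independence assumption is about.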

So all in all, yes, your survey data is a random variable.

Perhaps you're starting to confuse "random variable" with "random sample"? It is when you treat the data as a random sample (for example, to apply the binomial distribution, which assumes independent observations with the same success probability) that you need to think about whether the individual observations are independent, and that's where your clustering concerns come in.
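To see why clustering matters for the binomial idea but not for randomness itself, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration rather than something taken from your survey: the function names, the cluster counts and sizes, and the beta-binomial model in which each cluster draws its own success probability from a Beta(a, b) distribution. Under that model the intra-cluster correlation is 1 / (a + b + 1), and the variance of the sample proportion is the binomial variance multiplied by the usual design effect 1 + (m - 1) * ICC for clusters of equal size m.

```python
import random


def simulate_survey(n_clusters, cluster_size, a, b, rng):
    """Return a list of 0/1 responses, generated cluster by cluster."""
    responses = []
    for _ in range(n_clusters):
        # Hypothetical model: everyone in a cluster shares one success probability.
        p_cluster = rng.betavariate(a, b)
        responses.extend(
            1 if rng.random() < p_cluster else 0 for _ in range(cluster_size)
        )
    return responses


def compare_standard_errors(n_clusters=50, cluster_size=20, a=2.0, b=2.0,
                            n_reps=1000, seed=0):
    """Compare the naive binomial SE of a proportion with its simulated SE."""
    rng = random.Random(seed)
    n = n_clusters * cluster_size
    p = a / (a + b)                # marginal probability of a 1
    icc = 1.0 / (a + b + 1.0)      # intra-cluster correlation of this model

    # Repeat the whole survey many times and record the observed proportion.
    proportions = []
    for _ in range(n_reps):
        sample = simulate_survey(n_clusters, cluster_size, a, b, rng)
        proportions.append(sum(sample) / n)

    mean_prop = sum(proportions) / n_reps
    empirical_var = sum((x - mean_prop) ** 2 for x in proportions) / (n_reps - 1)

    naive_se = (p * (1 - p) / n) ** 0.5            # assumes independence
    design_effect = 1 + (cluster_size - 1) * icc   # variance inflation factor
    return {
        "naive_se": naive_se,
        "adjusted_se": naive_se * design_effect ** 0.5,
        "empirical_se": empirical_var ** 0.5,
    }


if __name__ == "__main__":
    for name, value in compare_standard_errors().items():
        print(f"{name}: {value:.4f}")
```

Each simulated response is still a perfectly good Bernoulli random variable. What fails is the independence needed for the plain binomial standard error, which comes out too small here, while the design-effect version should land close to the simulated one. With real survey data you would estimate the intra-cluster correlation rather than assume it, or use a method built for clustered data such as cluster-robust standard errors or a mixed model.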
