Here's a surprisingly vast survey of answer-copying indices, though with little discussion of their merits: http://www.bjournal.co.uk/paper/BJASS_01_01_06.pdf.
There's a field of (educational) psychology called item response theory (IRT) that provides the statistical background for questions like these. If you are American and took the SAT, ACT or GRE, you dealt with a test developed with IRT in mind. The basic postulate of IRT is that each student $i$ is characterized by their ability $a_i$; each question is characterized by its difficulty $b_j$; and the probability that student $i$ answers question $j$ correctly is
$$
\pi(a_i,b_j;c) = {\rm Prob}[\mbox{student $i$ answers question $j$ correctly}] = \Phi( c(a_i-b_j) )
$$
where $\Phi(z)$ is the cdf of the standard normal, and $c$ is an additional sensitivity/discrimination parameter (sometimes it is made question-specific, $c_j$, if there's enough information, i.e., enough test takers, to identify the differences). A hidden assumption here is that, given student $i$'s ability $a_i$, answers to different questions are independent. This assumption is violated if you have a battery of questions about, say, the same paragraph of text, but let's abstract from that for a minute.
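As a quick numeric sketch (not part of the original answer; the ability and difficulty values below are made up for illustration), the normal-ogive probability $\Phi(c(a_i-b_j))$ needs nothing beyond the standard library:

```python
from math import erf, sqrt

def phi(z):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_correct(a_i, b_j, c=1.0):
    """Prob[student i answers question j correctly] = Phi(c * (a_i - b_j))."""
    return phi(c * (a_i - b_j))

# strong student on an easy question vs. weak student on a hard question
print(round(p_correct(1.0, -1.0), 3))   # well above 0.5
print(round(p_correct(-1.0, 1.0), 3))   # well below 0.5
```

Ability above difficulty pushes the argument of $\Phi$ positive, and $c$ scales how sharply the probability responds to that gap.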
For "Yes/No" questions, this may be the end of the story. For questions with more than two choices, we can make the additional assumption that all wrong choices are equally likely; for a question $j$ with $k_j$ choices, the probability of each wrong choice is $\pi'(a_i,b_j;c) = [1-\pi(a_i,b_j;c)]/(k_j-1)$.
For students of abilities $a_i$ and $a_k$, the probability that they match on their answers for a question with difficulty $b_j$ is
$$
\psi(a_i,a_k;b_j,c) = \pi(a_i,b_j;c)\pi(a_k,b_j;c) + (k_j-1)\pi'(a_i,b_j;c)\pi'(a_k,b_j;c)
$$
If you like, you can break this into the probability of matching on the correct answer, $\psi_c(a_i,a_k;b_j,c) = \pi(a_i,b_j;c)\pi(a_k,b_j;c)$, and the probability of matching on an incorrect answer, $\psi_i(a_i,a_k;b_j,c) = (k_j-1)\pi'(a_i,b_j;c)\pi'(a_k,b_j;c)$, although from the conceptual framework of IRT, this distinction is hardly material.
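A minimal sketch of $\psi$ under these assumptions (the abilities, difficulty, and number of choices below are invented numbers, not from the original answer):

```python
from math import erf, sqrt

def phi(z):
    """CDF of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def match_prob(a_i, a_k, b_j, c=1.0, k_j=4):
    """psi: probability that students i and k give the same answer to
    question j, assuming all k_j - 1 wrong choices are equally likely."""
    p_i = phi(c * (a_i - b_j))           # student i's prob. of a correct answer
    p_k = phi(c * (a_k - b_j))
    wrong_i = (1.0 - p_i) / (k_j - 1)    # prob. of any one wrong choice
    wrong_k = (1.0 - p_k) / (k_j - 1)
    match_correct = p_i * p_k                    # psi_c: match on the right answer
    match_wrong = (k_j - 1) * wrong_i * wrong_k  # psi_i: match on some wrong answer
    return match_correct + match_wrong

print(round(match_prob(0.5, 0.5, 0.0, c=1.0, k_j=4), 3))
```

For a two-choice question ($k_j = 2$) with both students at $\pi = 1/2$, this gives $\psi = 1/4 + 1/4 = 1/2$, as it should.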
Now, you can compute the probability of matching on every question, but it will probably be combinatorially minuscule. A better measure may be the information in the pairwise pattern of responses,
$$
I(i,k) = \sum_j \Bigl[ 1\{ \mbox{match}_j \} \ln \psi(a_i,a_k;b_j,c) + 1\{ \mbox{non-match}_j \} \ln [1- \psi(a_i,a_k;b_j,c) ] \Bigr]
$$
and relate it to the entropy
$$
E(i,k) = {\rm E}[ I(i,k) ] = \sum_j \Bigl[ \psi(a_i,a_k;b_j,c) \ln \psi(a_i,a_k;b_j,c) + (1- \psi(a_i,a_k;b_j,c) ) \ln [1- \psi(a_i,a_k;b_j,c) ] \Bigr]
$$
You can do this for all pairs of students, plot them or rank them, and investigate the greatest ratios of information to entropy.
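A hedged end-to-end sketch of that procedure on simulated data (all abilities, difficulties, and the number of choices are invented for illustration): compute $\psi$ for each pair, accumulate $I(i,k)$ and its expectation $E(i,k)$ over the questions, and rank pairs by the ratio:

```python
import math
import random
from itertools import combinations

random.seed(0)

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def psi(a_i, a_k, b_j, c=1.0, k_j=4):
    """Match probability: agree on the correct answer, or on one of
    the k_j - 1 equally likely wrong answers."""
    p_i, p_k = phi(c * (a_i - b_j)), phi(c * (a_k - b_j))
    return p_i * p_k + (1.0 - p_i) * (1.0 - p_k) / (k_j - 1)

# simulated cohort: 10 students, 30 four-choice questions
abilities = [random.gauss(0, 1) for _ in range(10)]
difficulties = [random.gauss(0, 1) for _ in range(30)]

def simulated_answer(a, b, k_j=4):
    """0 codes the correct choice; 1..k_j-1 code the wrong choices."""
    if random.random() < phi(a - b):
        return 0
    return random.randint(1, k_j - 1)

answers = [[simulated_answer(a, b) for b in difficulties] for a in abilities]

def info_and_entropy(i, k):
    I = E = 0.0
    for j, b in enumerate(difficulties):
        p = psi(abilities[i], abilities[k], b)
        I += math.log(p) if answers[i][j] == answers[k][j] else math.log(1.0 - p)
        E += p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return I, E

# rank all pairs by the ratio of observed information to its expectation
ratios = {}
for i, k in combinations(range(len(abilities)), 2):
    I, E = info_and_entropy(i, k)
    ratios[(i, k)] = I / E

suspicious = sorted(ratios, key=ratios.get, reverse=True)[:3]
print(suspicious)
```

In practice the abilities and difficulties would be the estimates from the fitted IRT model rather than known values, and the flagged pairs would be candidates for closer inspection, not proof of copying.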
The parameters of the test $\{c, b_j, j=1, 2, \ldots\}$ and the student abilities $\{a_i\}$ won't fall out of the blue sky, but they are easily estimable in modern software such as R with
lme4
or similar packages:
irt <- glmer(answer ~ 1 + (1 | student) + (1 | question), data = long_data, family = binomial(link = "probit"))
or something very close to this (here long_data stands for a long-format data frame with one row per student-question pair and answer the 0/1 correctness indicator; the probit link matches the normal cdf $\Phi$ above).
A very good question, and one that I myself have had, because I have heard these called buckets, groups, groupings, categories, categorical variables, discrete variables, and bins as I have changed disciplines. In general, use the language that the end-users of your analysis are most comfortable with - in a sense, speak their language (or force them to use yours! ha). There is no wrong answer here, other than that countless statisticians would say you shouldn't be grouping your variables into bins/buckets without a very good reason (or ever!): you are spending degrees of freedom, making arbitrary cutoffs to create your buckets/bins, and losing information that was provided by your once-valuable continuous variables.
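A hedged sketch of the statisticians' objection (simulated data, arbitrary made-up cutoffs): binning a continuous predictor discards information, which shows up as a weaker correlation with the outcome after binning.

```python
import random

random.seed(1)

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

x = [random.gauss(0, 1) for _ in range(2000)]
y = [xi + random.gauss(0, 1) for xi in x]   # outcome linearly related to x

# arbitrary cutoffs at -0.5 and 0.5; each bucket coded 0/1/2
binned = [0.0 if xi < -0.5 else 1.0 if xi < 0.5 else 2.0 for xi in x]

print(round(corr(x, y), 3), round(corr(binned, y), 3))
```

The continuous predictor correlates more strongly with the outcome than its three-bucket version does, and no choice of cutoffs recovers what the coarsening threw away.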
I just took stats last year. Population is, as you described, a complete set of elements (persons or objects) that possess some common characteristic defined by the sampling criteria established by the researcher.
In statistics, Universe is a synonym of Population.
Source: "population." Collins English Dictionary – Complete and Unabridged, 12th Edition, 2014. Retrieved October 20, 2017 from https://www.thefreedictionary.com/population
Confirming the use of Universe and Population as synonyms in modern data science:
https://stats.oecd.org/glossary/detail.asp?ID=2087