Solved – What are good examples to show to undergraduate students

confidence-interval, hypothesis-testing, teaching

I am going to teach statistics as a teaching assistant for the second half of this semester to CS-oriented undergraduate students. Most of the students taking the class have no incentive to learn the subject and only take it to satisfy a major requirement. I want to make the subject interesting and useful, not just a class they take to get a B+ and pass.

As a pure-math PhD student I know little about the applied, real-life side. I want to ask for some real-life applications of undergraduate statistics. The examples I am looking for are ones (in spirit) like:

1) Showing that the central limit theorem is useful for certain large-sample data.

2) Providing a counter-example where the central limit theorem is not applicable (say, data following a Cauchy distribution).

3) Showing how hypothesis testing works in famous real-life examples using a Z-test, a t-test, or something similar.

4) Showing how overfitting or a wrong initial hypothesis can lead to wrong results.

5) Showing how p-values and confidence intervals worked in (well-known) real-life cases, and where they do not work so well.

6) Similarly for type I and type II errors, statistical power, the rejection level $\alpha$, etc.

My trouble is that while I have many examples on the probability side (coin toss, dice toss, gambler's ruin, martingales, random walk, the three prisoners paradox, the Monty Hall problem, probabilistic methods in algorithm design, etc.), I do not know as many canonical examples on the statistics side. What I mean is serious, interesting examples that have some pedagogical value and are not so artificial that they seem detached from real life. I do not want to give students the false impression that the Z-test and t-test are everything. But because of my pure-math background I do not know enough examples to make the class interesting and useful to them. So I am looking for some help.

My students' level is around calculus I and calculus II. They cannot even show from the definition that the standard normal's variance is 1, as they do not know how to evaluate the Gaussian integral. So anything slightly theoretical or computationally hands-on (like the hypergeometric distribution or the arcsine law for the 1D random walk) is not going to work. I want to show examples where they can understand not just "how", but also "why". Otherwise I fear I will be proving what I say by intimidation.

Best Answer

One good approach is to install R (http://www.r-project.org/) and use its built-in examples for teaching. You can open the help page for a function with a command such as "?t.test". At the end of each help file are examples. For t.test, for example:

> t.test(extra ~ group, data = sleep)

        Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 

>  plot(extra ~ group, data = sleep)

[Figure: box plot of extra by group for the sleep data]
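To connect this output to p-values and confidence intervals, the individual pieces can be pulled out of the object that t.test returns. The following is only a minimal sketch using the built-in sleep data; the names `res` and `alpha` are illustrative choices, not part of the help page.

```r
# Welch two-sample t-test on the built-in 'sleep' data (same call as above)
res <- t.test(extra ~ group, data = sleep)

res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two group means

# Decision at the conventional alpha = 0.05 level
alpha <- 0.05
res$p.value < alpha   # FALSE here: the p-value of about 0.079 exceeds 0.05

# To run every example from the t.test help page in one go:
# example(t.test)
```

Note that the confidence interval contains 0, which matches the failure to reject at the 5% level; showing both side by side is one way to let students see that the two statements agree.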

Related Solutions

Solved – Examples of costly consequences from improper use of statistical tools

I'm not sure about data availability, but a great (if that's the right word) example of poor statistics is the Harvard Nurses' Study on the effectiveness of hormone replacement therapy (HRT) in menopausal women.

What's the general idea? The Nurses' Study suggested that HRT was beneficial for post-menopausal women. It turns out that this result arose because the control group was very different from the treatment group, and these differences were not accounted for in the analysis. In subsequent randomized trials, HRT has been linked to cancer, heart attack, stroke, and blood clots. With appropriate corrections, the Nurses' Study reveals these patterns as well.
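The mechanism can be shown to students with simulated data. The sketch below is not the Nurses' Study data: the confounder ("healthy lifestyle"), the probabilities, and the effect sizes are all assumed purely for illustration. It compares a naive observational comparison, a comparison adjusted for the confounder, and a randomized comparison.

```r
set.seed(1)
n <- 100000

# Hypothetical confounder: "healthy lifestyle" (1 = yes, 0 = no)
healthy <- rbinom(n, 1, 0.5)

# Observational setting: healthier women are more likely to take the treatment
treated_obs <- rbinom(n, 1, ifelse(healthy == 1, 0.8, 0.2))

# Randomized setting: a coin flip decides treatment, independent of health
treated_rct <- rbinom(n, 1, 0.5)

# Assumed truth for this simulation: treatment RAISES risk by 2 points,
# while a healthy lifestyle LOWERS it by 10 points
risk <- function(treated) 0.20 + 0.02 * treated - 0.10 * healthy
event_obs <- rbinom(n, 1, risk(treated_obs))
event_rct <- rbinom(n, 1, risk(treated_rct))

# Risk difference: event rate among treated minus event rate among untreated
risk_diff <- function(event, treated) {
  mean(event[treated == 1]) - mean(event[treated == 0])
}

naive_obs <- risk_diff(event_obs, treated_obs)   # confounded comparison
rct       <- risk_diff(event_rct, treated_rct)   # randomized comparison

# Adjust the observational data: compare within each stratum, then average
adjusted_obs <- mean(sapply(0:1, function(h) {
  risk_diff(event_obs[healthy == h], treated_obs[healthy == h])
}))

round(c(naive_obs = naive_obs, adjusted_obs = adjusted_obs, rct = rct), 3)
```

Under these assumed numbers the naive comparison is expected to come out around -0.04 (an apparent benefit), while the adjusted and randomized comparisons are expected to be around +0.02 (the harm that was built in).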

I can't find estimates for US deaths related to HRT, but the number was on the order of tens of thousands. One article links 1000 deaths in the UK to HRT.

This New York Times Magazine article provides good statistical background on the confounding issues present in the study.

There's an academic discussion in this issue of the American Journal of Epidemiology. The articles compare the results of the observational Nurses' study to that of the Women's Health Initiative, based upon randomized trials.

There is also discussion (by many of the same individuals) in an issue of Biometrics. See Freedman and Petitti's comment in particular [prepub version].

Distributions Sampling Teaching – Effective Strategies for Teaching the Sampling Distribution Concepts

In my opinion, sampling distributions are the key idea of statistics 101. You might as well skip the course as skip that issue. However, I am very familiar with the fact that students just don't get it, seemingly no matter what you do. I have a series of strategies. These can take up a lot of time, but I recommend skipping / abbreviating other topics, so as to ensure that they get the idea of the sampling distribution. Here are some tips:

Say it distinctly: I first explicitly mention that there are 3 different distributions that we are concerned with: the population distribution, the sample distribution, and the sampling distribution. I say this over and over throughout the lesson, and then over and over throughout the course. Every time I say these terms I emphasize the distinctive ending: sam-ple, samp-ling. (Yes, students do get sick of this; they also get the concept.)
Use pictures (figures): I have a set of standard figures that I use every time I talk about this. It has the three distributions pictured distinctly, and typically labeled. (The labels that go with this figure are on the powerpoint slide and include short descriptions, so they don't show up here, but obviously it's: population at the top, then samples, then sampling distribution.)
Give the students activities: The first time you introduce this concept, either bring in a roll of nickels (some quarters may disappear) or a bunch of 6-sided dice. Have the students form into small groups and generate a set of 10 values and average them. Then you can make a histogram on the board or with Excel.
Use animations (simulations): I write some (comically inefficient) code in R to generate data & display it in action. This part is especially helpful when you transition to explaining the Central Limit Theorem. (Notice the Sys.sleep() statements; these pauses give me a moment to explain what is going on at each stage.)

N = 10                    # size of each sample
number_of_samples = 1000  # how many samples to draw in total


iterations  = c(3, 7, number_of_samples)  
breakpoints = seq(10, 91, 3)  
meanVect    = vector()  
x           = seq(10, 90)  
height      = 30/dnorm(50, mean=50, sd=10)  
y           = height*dnorm(x, mean=50, sd=10)  

dev.new(height=7, width=5)  # portable replacement for windows(), which exists only on Windows  
par(mfrow=c(3,1), omi=c(0.5,0,0,0), mai=c(0.1, 0.1, 0.2, 0.1))  

for(i in 1:iterations[3]) {  
  plot(x,y, type="l", col="blue", axes=F, xlab="", ylab="")  
  segments(x0=20, y0=0, x1=20, y1=y[11], col="lightgray")  
  segments(x0=30, y0=0, x1=30, y1=y[21], col="gray")  
  segments(x0=40, y0=0, x1=40, y1=y[31], col="darkgray")  
  segments(x0=50, y0=0, x1=50, y1=y[41])  
  segments(x0=60, y0=0, x1=60, y1=y[51], col="darkgray")  
  segments(x0=70, y0=0, x1=70, y1=y[61], col="gray")  
  segments(x0=80, y0=0, x1=80, y1=y[71], col="lightgray")  
  abline(h=0)  

  if(i==1) {  
    Sys.sleep(2)  
  }  
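  sample = rnorm(N, mean=50, sd=10)  
  sample = pmin(pmax(sample, 10), 90)  # clamp rare draws beyond 4 SD; hist() errors if data fall outside breaks  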
  points(x=sample, y=rep(1,N), col="green", pch="*")  

  if(i<=iterations[1]) {  
    Sys.sleep(2)  
  }  
  xhist1 = hist(sample, breaks=breakpoints, plot=F)  
  hist(sample, breaks=breakpoints, axes=F, col="green", xlim=c(10,90),  
       ylim=c(0,N), main="", xlab="", ylab="")  
  if(i==iterations[3]) {  
    abline(v=50)  
  }  

  if(i<=iterations[2]) {  
    Sys.sleep(2)  
  }  
  sampleMean = mean(sample)  
  segments(x0=sampleMean, y0=0, x1=sampleMean,   
           y1=max(xhist1$counts)+1, col="red", lwd=3)  

  if(i<=iterations[1]) {  
    Sys.sleep(2)  
  }  
  meanVect = c(meanVect, sampleMean)  
  hist(meanVect, breaks=x, axes=F, col="red", main="",   
       xlab="", ylab="", ylim=c(0,((N/3)+(0.2*i))))  
  if(i<=iterations[2]) {  
    Sys.sleep(2)  
  }  
}  

Sys.sleep(2)  
xhist2 = hist(meanVect, breaks=x, plot=F)  
xMean  = round(mean(meanVect), digits=3)  
xSD    = round(sd(meanVect), digits=3)  
histHeight = (max(xhist2$counts)/dnorm(xMean, mean=xMean, sd=xSD))  
lines(x=x, y=(histHeight*dnorm(x, mean=xMean, sd=xSD)),   
      col="yellow", lwd=2)  
abline(v=50)  

txt1 = paste("population mean = 50     sampling distribution mean = ",  
             xMean, sep="")  
txt2 = paste("SD = 10     10/sqrt(", N,") = 3.162     SE = ", xSD,  
            sep="")  
mtext(txt1, side=1, outer=T)  
mtext(txt2, side=1, line=1.5, outer=T)
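For students who want to check the numbers printed at the bottom of the animation without the plotting, the same experiment can be condensed to a few lines. This is only a sketch under the same assumptions as the code above (normal population with mean 50 and SD 10); the names `sample_means` and `dice_means` are illustrative, and the dice part mimics the classroom activity rather than anything in the original code.

```r
set.seed(42)
N <- 10                    # size of each sample
number_of_samples <- 1000  # how many samples to draw

# Draw many samples and keep only the mean of each one
sample_means <- replicate(number_of_samples, mean(rnorm(N, mean = 50, sd = 10)))

mean(sample_means)   # should be close to the population mean, 50
sd(sample_means)     # empirical standard error
10 / sqrt(N)         # theoretical standard error, about 3.162

# Classroom-dice version: each "group" rolls N dice and reports the average
dice_means <- replicate(number_of_samples, mean(sample(1:6, N, replace = TRUE)))

sd(dice_means)            # empirical standard error of the dice averages
sqrt(35 / 12) / sqrt(N)   # SD of one fair die divided by sqrt(N), about 0.54
```

Calling hist(sample_means) or hist(dice_means) afterwards draws the corresponding sampling distribution.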

Reinstantiate these concepts throughout the semester: I bring the idea of the sampling distribution up again each time we talk about the next subject (albeit typically only very briefly). The most important place for this is when you teach ANOVA, as the null hypothesis case there really is the situation in which you sampled from the same population distribution several times, and your set of group means really is an empirical sampling distribution. (For an example of this, see my answer to "How does the standard error work?")
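The ANOVA point can be illustrated with a small simulation. The sketch below assumes a balanced design in which every group is drawn from the same normal population; the group count `k`, group size `n`, and the other variable names are arbitrary choices made for illustration.

```r
set.seed(7)
k <- 200     # number of groups (large, so the spread of group means is stable)
n <- 10      # observations per group
sigma <- 10  # population SD

# Every group comes from the SAME population, so the ANOVA null hypothesis holds
dat <- data.frame(
  group = factor(rep(seq_len(k), each = n)),
  y     = rnorm(k * n, mean = 50, sd = sigma)
)

group_means <- tapply(dat$y, dat$group, mean)

# The group means form an empirical sampling distribution of the mean
sd(group_means)    # should be close to sigma / sqrt(n)
sigma / sqrt(n)

# In a balanced design the between-group mean square is n * var(group means)
fit <- anova(lm(y ~ group, data = dat))
ms_between <- fit[["Mean Sq"]][1]
ms_within  <- fit[["Mean Sq"]][2]
c(ms_between = ms_between, n_times_var_of_means = n * var(group_means))

ms_between / ms_within   # F statistic; close to 1 when the null is true
```

The F statistic is therefore a comparison of two estimates of the same population variance: one built from the spread of the group means, one from the spread within groups.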
