Non-Normal Distribution – How Statisticians Make Better Guesses than Using Mean Alone

Tags: games, mean, sufficient-statistics

Let's say we have a game with two players. Both of them know that five samples are drawn from some (non-normal) distribution, but neither knows the parameters used to generate the data. The goal of the game is to estimate the mean of the distribution. The player who comes closer to the true mean wins \$1 (the objective function is the absolute difference between the estimate and the true mean). If the distribution's mean diverges to $+\infty$, the player guessing the larger number wins; for $-\infty$, the one guessing the smaller number wins.

While the first player is given all five samples, the second one is given just the sum of the samples (and they know there were five of them).

What are some examples of distributions where this isn't a fair game and the first player has an advantage? I guess the normal distribution isn't one of them since the sample mean is a sufficient statistic for the true mean.

Note: I asked a similar question here: Mean is not a sufficient statistic for the normal distribution when variance is not known? about the normal distribution and it was suggested I ask a new one for non-normal ones.


EDIT: Two answers so far use a uniform distribution. I would love to hear about more examples if people know of any.

Best Answer

For a uniform distribution between $0$ and $2\mu$, the player who guesses the sample mean does worse than one who guesses $\frac{3}{5} \max_i(x_i)$ (the sample maximum is a sufficient statistic for the mean of a uniform distribution lower-bounded by $0$).
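To see where the factor $\frac{3}{5}$ comes from: for $n$ samples from $U(0, 2\mu)$, the expected maximum is

$$\mathbb{E}\left[\max_i x_i\right] = \frac{n}{n+1} \cdot 2\mu = \frac{5}{6} \cdot 2\mu = \frac{5}{3}\mu \quad (n = 5),$$

so $\frac{3}{5}\max_i x_i$ is an unbiased estimator of $\mu$.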

In this particular case, it can be verified numerically. Without loss of generality, we set $\mu = 0.5$ in the simulation. It turns out that about two-thirds of the time, the $\frac{3}{5}\max$ estimator does better.

Here is a Python simulation demonstrating this.

import numpy as np

Ntrials = 1_000_000
# Five Uniform(0, 1) samples per trial, so the true mean is 0.5.
xs = np.random.random((5, Ntrials))
sample_mean_error = np.abs(xs.mean(axis=0) - 0.5)
better_estimator_error = np.abs(0.6 * xs.max(axis=0) - 0.5)
# Fraction of trials in which the 3/5 * max estimator beats the sample mean
print((sample_mean_error > better_estimator_error).mean())
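As a cross-check, here is a variant of the simulation that also reports each estimator's mean absolute error, not just how often it wins (same Uniform(0, 1) setup; the seed and trial count are arbitrary choices for reproducibility):

```python
import numpy as np

# Simulate many games: 5 draws from Uniform(0, 1), so the true mean is 0.5.
rng = np.random.default_rng(0)
n_trials = 200_000
xs = rng.random((5, n_trials))

mean_err = np.abs(xs.mean(axis=0) - 0.5)      # player using the sample mean
max_err = np.abs(0.6 * xs.max(axis=0) - 0.5)  # player using (3/5) * max

# How often the max-based estimator is strictly closer to the true mean
win_frac = (mean_err > max_err).mean()
print(f"max-based estimator wins {win_frac:.3f} of games")
print(f"MAE sample mean: {mean_err.mean():.4f}, MAE 3/5 max: {max_err.mean():.4f}")
```

The max-based estimator wins roughly two-thirds of the games, and its mean absolute error is noticeably smaller as well.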