Estimating the Number of Pokemon Without Knowing How Many There Are

probability, statistics

I was imagining the following scenario: Suppose you are playing a Pokemon game for the first time and you don't know how many Pokemon there are. You spend some time playing today, encounter some random Pokemon, and record how many unique Pokemon you came across (e.g. you saw 15 total Pokemon, but only 8 of them were unique). Tomorrow, you spend some time playing and encounter some more Pokemon (e.g. 25, but only 7 were unique); you record how many new unique Pokemon you saw that day and add these to the number of unique Pokemon you saw yesterday. You repeat this process a few times, and after collecting a total of "n" samples, you have observed "m" unique Pokemon.

Using this information, are there any mathematical formulas that you can use to estimate the total number of Pokemon that might exist in the game?

I suspect there are already statistical/probability formulas for estimating a population size from finite samples, but so far I have not found any that can be applied directly to this problem. I came across two somewhat related concepts in math (https://en.wikipedia.org/wiki/German_tank_problem, https://en.wikipedia.org/wiki/Coupon_collector%27s_problem), but I am not sure how to apply these concepts to this problem of estimating the number of Pokemon.

To make things a little more concrete, I wrote some code (in the R programming language) that attempts to simulate this problem. Suppose we are playing the original Pokemon game and there are 150 Pokemon:

library(dplyr)
library(ggplot2)

pokemon_id = 1:150

pokemon_names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
          "Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
          "Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
          "Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
          "Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
          "Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
          "Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
          ,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")

pokemon_data = data.frame(pokemon_id, pokemon_names)

Now, suppose you have 20 days to play this game. Each day, you encounter a random number of Pokemon and keep track of which of these Pokemon were unique (note: in the real scenario we don't know there are 150 Pokemon, but I have included this number to facilitate some of the calculations):

pokemon_function <- function() {

  pokemon_results <- list()

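  for (i in 1:20) {
    run_i <- i
    # Number of Pokemon encountered today: uniform on 1..10
    pokemon_caught_i <- sample.int(10, 1)
    # Draw that many distinct Pokemon (no repeats within a single day)
    sample_i <- pokemon_data[sample(nrow(pokemon_data), pokemon_caught_i), ]
    pokemon_tmp <- data.frame(run_i, sample_i)
    pokemon_results[[i]] <- pokemon_tmp
  }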
  results_df <- do.call(rbind.data.frame,   pokemon_results)

  pokemon_int <- data.frame(results_df %>%
                         group_by(pokemon_id) %>%
                         filter(run_i == min(run_i)) %>%
                         distinct)

  pokemon_caught <- data.frame(pokemon_int %>%
                               group_by(run_i) %>%
                               summarise(Count=n()))

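  # Running total of unique Pokemon; days with no new Pokemon produce no row
  cumulative <- cumsum(pokemon_caught$Count)
  pokemon_caught$Cumulative <- cumulative
  # The true total is used here only to compute the "unseen" curve
  pokemon_caught$unseen <- nrow(pokemon_data) - cumulative
  return(pokemon_caught)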
}

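We can now see how many (cumulative) unique Pokemon we encountered over the 20 days (note: I am assuming that there is an equal probability of encountering any given Pokemon). Days on which no new Pokemon appeared are dropped by the code above, which is why a run can show fewer than 20 values: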

[1]  1 11 15 24 27 28 34 35 36 38 45 53 56 59 62 64 68 71
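Since the question is phrased in terms of "n" total encounters and "m" unique Pokemon, it may help to record both for every day. The following is a minimal base-R sketch under the same assumptions as above (150 equally likely Pokemon, 1 to 10 encounters per day, no repeats within a day). The name simulate_days and its arguments are illustrative and not part of the code above.

```r
# Sketch: track cumulative encounters (n) and cumulative unique Pokemon (m).
# simulate_days() and its arguments are hypothetical names for illustration.
simulate_days <- function(n_species = 150, n_days = 20, max_per_day = 10) {
  seen <- integer(0)  # IDs of every Pokemon encountered so far, repeats included
  out <- data.frame(day = seq_len(n_days),
                    n_total = NA_integer_,
                    n_unique = NA_integer_)
  for (d in seq_len(n_days)) {
    k <- sample.int(max_per_day, 1)       # encounters today: uniform on 1..max_per_day
    today <- sample.int(n_species, k)     # k distinct IDs, each species equally likely
    seen <- c(seen, today)
    out$n_total[d] <- length(seen)            # n: total encounters so far
    out$n_unique[d] <- length(unique(seen))   # m: unique Pokemon so far
  }
  out
}

set.seed(42)
res <- simulate_days()
tail(res, 1)  # final n and m after 20 days
```

Unlike the dplyr version, this keeps one row per day even when nothing new is seen, so the curve always has 20 points.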

In theory, we could repeat this simulation experiment many times (e.g. 50 times) and visualize the results:

# Repeat the simulation 50 times

final <- list()
for (i in 1:50) {
  round_i <- i
  s_i <- pokemon_function()
  final_tmp <- data.frame(round_i, s_i)
  final[[i]] <- final_tmp
}

visualization_file <- do.call(rbind.data.frame, final)

visualization_file$round_i = as.factor(visualization_file$round_i)

g1 = ggplot(data=visualization_file, aes(x=run_i, y=Cumulative, group = round_i, colour = round_i)) + geom_line() + labs(y= "Total Pokemon Seen", x = "Iterations") + geom_point() + ggtitle("Pokemon Simulation: Number of Unique Pokemon Seen in Different Simulations")

g2 = ggplot(data=visualization_file, aes(x=run_i, y=unseen, group = round_i, colour = round_i)) + geom_line() +labs(y= "Total Pokemon Not Seen", x = "Iterations") +geom_point() + ggtitle("Pokemon Simulation: Number of Unique Pokemon Not Seen in Different Simulations")

[Figure: Pokemon Simulation]

Obviously, with enough time, we would surely encounter every single unique Pokemon in this game. But is there some mathematical formula that can be used to estimate the total number of Pokemon based on a single random sample? If I have the following measurements (1, 11, 15, 24, 27, 28, 34, 35, 36, 38, 45, 53, 56, 59, 62, 64, 68, 71), is there some mathematical formula that can be used to estimate the total population size?

Thank You!

Best Answer

The general problem of determining the total number of species in an ecosystem based on a limited sample has received a lot of attention from statisticians. I personally have not worked on this problem, but I am vaguely aware of the subdiscipline.

One highly cited review article is Bunge & Fitzpatrick [1]. They describe many different classes of estimators for the total number of species, which all have different requirements and assumptions (e.g. finite vs infinite total number of Pokemon, parametric vs nonparametric).

To make things more concrete, we need to make some specific assumptions. For instance, you wrote:

"I am assuming that there is an equal probability of encountering any given Pokemon"

The assumption that all species exist at an equal frequency greatly simplifies the problem, and Bunge & Fitzpatrick consider this scenario in section 1.3.1 of their paper. In this case, they say that an approximate maximum likelihood estimator for the total number of species can be found by solving

$$ c = C^*(1-e^{-n/C^*})$$

for $C^*$. Here, $n$ is the total number of individuals observed and $c$ is the number of distinct species among them. From eyeballing your graph, it looks like you tend to see about $c=80$ different species at the end of 20 "days", and (I am not an R person but) it looks like you are catching an average of 5.5 Pokemon each day, so you probably catch around $n=20\times5.5=110$ Pokemon in total after 20 days. Plugging these into the above formula and solving it in WolframAlpha we get

$$C^* \approx 163$$

which (in my opinion) is not too far from the true value of 150!
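The same equation can also be solved numerically in R rather than in WolframAlpha. This is a minimal sketch using base R's uniroot, under the equal-frequency assumption. The function name estimate_total and the upper search bound are illustrative choices and do not come from the paper.

```r
# Solve c = C * (1 - exp(-n / C)) for C, given n encounters and c distinct species.
# estimate_total() and the default `upper` bound are hypothetical, for illustration.
estimate_total <- function(n, c_obs, upper = 1e6) {
  # If every encounter were a new species (c_obs == n), no finite C solves the equation.
  stopifnot(c_obs > 0, c_obs < n)
  g <- function(C) C * (1 - exp(-n / C)) - c_obs
  # g is negative at C = c_obs and positive for large C, so the root is bracketed.
  uniroot(g, lower = c_obs, upper = upper, tol = 1e-8)$root
}

est <- estimate_total(n = 110, c_obs = 80)
round(est)  # about 163, in line with the value quoted above
```

The lower bracket is the observed count itself, since the total can never be smaller than the number of species already seen.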

Of course, the assumption of equal species frequencies is a strong one and it is not actually true in Pokemon or in many real world applications. There are other methods for this case (see sections 1.3.2 and later in [1]). A key idea seems to be keeping track of how many species are seen once, how many species are seen twice, how many species are seen three times, and so on. Ref [2] seems to be a more recent paper along these lines with an associated R package.
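To illustrate the bookkeeping behind that idea, the counts of species seen once, twice, and so on can be tabulated directly from an encounter log. This is only a sketch on made-up data: the encounters vector is hypothetical, and the code stops at the frequency counts rather than implementing any estimator from [1] or [2].

```r
# Hypothetical encounter log: one entry per Pokemon encountered (made-up data).
encounters <- c("Pidgey", "Rattata", "Pidgey", "Zubat",
                "Pidgey", "Rattata", "Abra", "Geodude")

# How many times each species was seen.
per_species <- table(encounters)

# f_k: how many species were seen exactly k times (names are k, values are f_k).
freq_of_freq <- table(as.vector(per_species))
freq_of_freq
```

Here three species were seen once, one twice, and one three times; frequency-based estimators take counts of this kind as their input.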

Lastly, one strategy that often comes to mind is to try to fit your graph on the left (of total Pokemon seen versus number of iterations) to some curve and try to extrapolate out the total species number from that. In section 1.6 of the paper, Bunge & Fitzpatrick seem to say that is unlikely to be super effective, echoing some of the comments already on your question.

[1] Bunge, J., & Fitzpatrick, M. (1993). Estimating the number of species: a review. Journal of the American Statistical Association, 88(421), 364-373.

[2] Willis, A., & Bunge, J. (2015). Estimating diversity via frequency ratios. Biometrics, 71(4), 1042-1049.
