Estimating the Number of Pokemon Without Knowing How Many There Are

probability, statistics

I was imagining the following scenario: Suppose you are playing a Pokemon game for the first time and you don't know how many Pokemon there are. You spend some time playing today, encounter some random Pokemon, and record how many unique Pokemon you came across (e.g. you saw 15 Pokemon in total, but only 8 of them were unique). Tomorrow, you spend some time playing and encounter some more Pokemon (e.g. 25, but only 7 were unique); you now record how many unique Pokemon you saw today and add the ones you had not seen before to yesterday's count. You repeat this process a few times, and after collecting "n" samples in total, you have observed "m" unique Pokemon.

Using this information, are there any mathematical formulas that you can use to estimate the total number of Pokemon that might exist in the game?

I suspect that statistical/probability formulas already exist for estimating a population size from finite samples, but so far I have not found any that can be applied directly to this problem. I came across two somewhat related concepts in math (https://en.wikipedia.org/wiki/German_tank_problem, https://en.wikipedia.org/wiki/Coupon_collector%27s_problem), but I am not sure how to apply them to the problem of estimating the number of Pokemon.

To make things a little more concrete, I wrote some R code that attempts to simulate this problem. Suppose we are playing the original Pokemon game and there are 150 Pokemon:

library(dplyr)
library(ggplot2)

pokemon_id = 1:150

pokemon_names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
          "Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
          "Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
          "Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
          "Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
          "Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
          "Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
          ,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")

pokemon_data = data.frame(pokemon_id, pokemon_names)

Now, suppose you have 20 days to play this game. Each day, you encounter a random number of Pokemon and keep track of which of these Pokemon were unique (note: in the real problem we don't know that there are 150 Pokemon, but I have included this number to facilitate some of the calculations):

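pokemon_function <- function() {

  pokemon_results <- list()

  # Simulate 20 days of play
  for (i in 1:20) {

    run_i <- i
    # Number of Pokemon encountered today: uniform on 1..10
    pokemon_caught_i <- sample.int(10, 1)
    # sample() draws without replacement, so the Pokemon within one day are all distinct
    sample_i <- pokemon_data[sample(nrow(pokemon_data), pokemon_caught_i), ]
    pokemon_tmp <- data.frame(run_i, sample_i)
    pokemon_results[[i]] <- pokemon_tmp
  }
  results_df <- do.call(rbind.data.frame, pokemon_results)

  # Keep each Pokemon only on the first day it was seen
  pokemon_int <- data.frame(results_df %>%
                              group_by(pokemon_id) %>%
                              filter(run_i == min(run_i)) %>%
                              distinct)

  # Count newly seen Pokemon per day (days with no new Pokemon are dropped)
  pokemon_caught <- data.frame(pokemon_int %>%
                                 group_by(run_i) %>%
                                 summarise(Count = n()))

  cumulative <- cumsum(pokemon_caught$Count)
  pokemon_caught$Cumulative <- cumulative
  # nrow(pokemon_data) is the true total (150), known only inside the simulation
  pokemon_caught$unseen <- nrow(pokemon_data) - cumulative
  return(pokemon_caught)
}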

We can now see how many (cumulative) unique Pokemon we encountered each day for 20 days (note: I am assuming that there is an equal probability of encountering any given Pokemon):

[1]  1 11 15 24 27 28 34 35 36 38 45 53 56 59 62 64 68 71

In theory, we could repeat this simulation experiment many times (e.g. 50 times) and visualize the results:

# Repeat simulation 50 times:

final <- list()
for (i in 1:50) {
  round_i <- i
  s_i <- pokemon_function()
  final_tmp <- data.frame(round_i, s_i)
  final[[i]] <- final_tmp
}

visualization_file <- do.call(rbind.data.frame, final)

visualization_file$round_i = as.factor(visualization_file$round_i)

g1 = ggplot(data = visualization_file, aes(x = run_i, y = Cumulative, group = round_i, colour = round_i)) +
  geom_line() +
  labs(y = "Total Pokemon Seen", x = "Iterations") +
  geom_point() +
  ggtitle("Pokemon Simulation: Number of Unique Pokemon Seen in Different Simulations")

g2 = ggplot(data = visualization_file, aes(x = run_i, y = unseen, group = round_i, colour = round_i)) +
  geom_line() +
  labs(y = "Total Pokemon Not Seen", x = "Iterations") +
  geom_point() +
  ggtitle("Pokemon Simulation: Number of Unique Pokemon Not Seen in Different Simulations")

Obviously, with enough time, we would surely encounter every single unique Pokemon in this game. But is there some mathematical formula that can be used to estimate the total number of Pokemon based on a single random sample? If I have the following measurements (1, 11, 15, 24, 27, 28, 34, 35, 36, 38, 45, 53, 56, 59, 62, 64, 68, 71), is there some mathematical formula that can be used to estimate the total population size?

Thank You!

Best Answer

The general problem of determining the total number of species in an ecosystem based on a limited sample has received a lot of attention from statisticians. I personally have not worked on this problem, but I am vaguely aware of the subdiscipline.

One highly cited review article is Bunge & Fitzpatrick [1]. They describe many different classes of estimators for the total number of species, which all have different requirements and assumptions (e.g. finite vs infinite total number of Pokemon, parametric vs nonparametric).

To make things more concrete, we need to make some specific assumptions. For instance, you wrote:

"I am assuming that there is an equal probability of encountering any given Pokemon"

The assumption that all species exist at an equal frequency greatly simplifies the problem, and Bunge & Fitzpatrick consider this scenario in section 1.3.1 of their paper. In this case, they say that an approximate maximum likelihood estimator for the total number of species can be found by solving

$$ c = C^*(1-e^{-n/C^*})$$

for $C^*$. Here, $n$ is the total number of individuals observed and $c$ is the number of distinct species among them. From eyeballing your graph, it looks like you tend to see about $c=80$ different species at the end of 20 "days", and (I am not an R person but) it looks like you are catching an average of 5.5 Pokemon each day, so you probably catch around $n=20\times5.5=110$ Pokemon in total after 20 days. Plugging these into the above formula and solving it in WolframAlpha we get

$$C^* \approx 163$$

which (in my opinion) is not too far from the true value of 150!
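
The same equation can also be solved numerically in R instead of WolframAlpha. The following is only a minimal sketch: the function name `estimate_total_species`, the argument names, and the search bounds are illustrative assumptions rather than anything taken from [1], and it assumes $c < n$ (if every encounter were a new species, the equation would have no finite solution). As a rough sanity check on the formula itself: under equal frequencies, the expected number of distinct species after $n$ encounters is $C(1-(1-1/C)^n) \approx C(1-e^{-n/C})$, so the equation amounts to matching the observed $c$ to that expectation.

```r
# Solve c = C * (1 - exp(-n / C)) for C using base R's uniroot().
# n:     total number of individuals observed
# c_obs: number of distinct species among them
estimate_total_species <- function(n, c_obs) {
  stopifnot(c_obs > 0, c_obs < n)  # a finite root requires at least one repeat

  # -expm1(-x) equals 1 - exp(-x) but is more accurate when x is tiny
  f <- function(C) C * -expm1(-n / C) - c_obs

  # f is negative at C = c_obs and positive for very large C,
  # so the root is bracketed by these (assumed) bounds.
  uniroot(f, lower = c_obs, upper = 1e6 * n, tol = 1e-9)$root
}

# Values read off the simulation in the question
estimate_total_species(n = 110, c_obs = 80)
```

With $n=110$ and $c=80$ this should land close to the value of about 163 quoted above.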

Of course, the assumption of equal species frequencies is a strong one and it is not actually true in Pokemon or in many real world applications. There are other methods for this case (see sections 1.3.2 and later in [1]). A key idea seems to be keeping track of how many species are seen once, how many species are seen twice, how many species are seen three times, and so on. Ref [2] seems to be a more recent paper along these lines with an associated R package.
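
To illustrate what keeping track of "how many species are seen once, twice, three times" looks like in practice, here is a small R sketch. The encounter log is made up for illustration, and the code only tabulates the counts; it does not implement any of the estimators from [1] or [2].

```r
# Hypothetical encounter log: one entry per Pokemon encountered
encounters <- c("Pidgey", "Rattata", "Pidgey", "Zubat",
                "Pidgey", "Rattata", "Abra")

# How many times each species was seen
species_counts <- table(encounters)

# How many species were seen exactly 1, 2, 3, ... times
frequency_counts <- table(species_counts)

# The counts that such estimators typically start from
f1 <- sum(species_counts == 1)  # species seen exactly once
f2 <- sum(species_counts == 2)  # species seen exactly twice

print(frequency_counts)
```

In this toy log, Zubat and Abra were each seen once, Rattata twice, and Pidgey three times.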

Lastly, one strategy that often comes to mind is to try to fit your graph on the left (of total Pokemon seen versus number of iterations) to some curve and try to extrapolate out the total species number from that. In section 1.6 of the paper, Bunge & Fitzpatrick seem to say that is unlikely to be super effective, echoing some of the comments already on your question.

[1] Bunge, J., & Fitzpatrick, M. (1993). Estimating the number of species: a review. Journal of the American Statistical Association, 88(421), 364-373.

[2] Willis, A., & Bunge, J. (2015). Estimating diversity via frequency ratios. Biometrics, 71(4), 1042-1049.

Related Solutions

How Many Days Until Two Pokemon Trainers Catch Same Species?

Let $T$ be the number of trainers, each catching one Pokemon per day, and let $P$ be the number of Pokemon species. After $d$ days, there are $Td$ Pokemon caught, and $\binom T2d^2$ pairs of Pokemon owned by different trainers. Each pair has a $1/P$ chance of being the same kind of Pokemon. These collisions are very unlikely when $P$ is large, and mostly independent, and so (just as in the birthday problem) it makes sense to take a Poisson approximation. That is, we model the number of collisions after $d$ days as a Poisson random variable with mean $\binom T2 d^2/P$.

In this model, the probability that there are no collisions after $d$ days is $\exp(-\binom T2d^2/P)$, where I'm writing $\exp(t)$ as a replacement for $e^t$, which makes the power easier to read. If $\mathbf D$ is our random variable - the number of days it takes to see two trainers that share a Pokemon - then it is always true that $$\mathbb E[\mathbf D] = \sum_{d=0}^\infty \Pr[\mathbf D > d]$$ and so our model says that $$\mathbb E[\mathbf D] \approx \sum_{d=0}^\infty \exp\left(-\binom T2 d^2/P\right).$$ If we're happy with an infinite sum as our approximating formula, we can stop here: it is very good. In the table below, the third column shows what this infinite sum gives us for $P=1025$ and $T=2, \dots, 16$; the second column shows the values in the question.

| Number of trainers | Mean days needed | Infinite sum | Formula with $\sqrt\pi$ |
|---|---|---|---|
| 2 | 28.8126 | 28.8971 | 28.8731 |
| 3 | 16.8072 | 16.8556 | 16.8812 |
| 4 | 12.0491 | 12.0833 | 12.0833 |
| 5 | 9.44665 | 9.47236 | 9.47236 |
| 6 | 7.77382 | 7.8259 | 7.8259 |
| 7 | 6.63424 | 6.69152 | 6.69152 |
| 8 | 5.83851 | 5.86201 | 5.86201 |
| 9 | 5.18945 | 5.22885 | 5.22885 |
| 10 | 4.69666 | 4.72961 | 4.72961 |
| 11 | 4.29345 | 4.32583 | 4.32583 |
| 12 | 3.95724 | 3.99249 | 3.99249 |
| 13 | 3.68294 | 3.71262 | 3.71262 |
| 14 | 3.44168 | 3.47431 | 3.47431 |
| 15 | 3.24476 | 3.26893 | 3.26893 |
| 16 | 3.07407 | 3.0901 | 3.0901 |

However, instead of taking an infinite sum, we could take an integral approximation. The integral of $\exp(-\binom T2d^2/P)$ with respect to $d$ is an integral in terms of the error function, but because we're going from $0$ to $\infty$, it's one of the few that have closed forms. The estimate this produces is $$\mathbb E[\mathbf D] \approx \sqrt{\frac{\pi P}{2 T(T-1)}}$$ but it gives much better results when we add $\frac12$ to it; I assume this can be justified by some sort of continuity correction. I have included a fourth column in the table for the approximation $$\mathbb E[\mathbf D] \approx \frac12 + \sqrt{\frac{\pi P}{2 T(T-1)}}.$$
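
The two approximations above are easy to compare numerically. The R sketch below is illustrative only: the function names, the argument name `n_trainers`, and the truncation point `max_d` are assumptions introduced here, and the infinite sum is simply cut off at a point where the remaining terms are negligible.

```r
# Truncated version of sum_{d >= 0} exp(-choose(T, 2) * d^2 / P)
expected_days_sum <- function(n_trainers, P, max_d = 10000) {
  d <- 0:max_d
  sum(exp(-choose(n_trainers, 2) * d^2 / P))
}

# Closed form: 1/2 + sqrt(pi * P / (2 * T * (T - 1)))
expected_days_closed <- function(n_trainers, P) {
  0.5 + sqrt(pi * P / (2 * n_trainers * (n_trainers - 1)))
}

trainers <- 2:16
comparison <- data.frame(
  trainers     = trainers,
  infinite_sum = sapply(trainers, expected_days_sum, P = 1025),
  closed_form  = expected_days_closed(trainers, P = 1025)
)
print(comparison)
```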
