Probability Analysis – Calculating the Probability of a Fire Incident


I am working on a probability problem for a dataset I have been given, and I need some further guidance on the calculation.

For example, I have a dataset with the number of fire incidents for a district. I want to calculate the probability of a fire incident occurring within a 2km radius of a fire station given the number of buildings within that radius.

Let's say the total number of fires in this district is 80 in a month. The total number of buildings within the fire-station radius is 200, and the total number of buildings within the entire district is 15000.

What is the probability that one or more of those fires occurred in buildings within the fire-station radius?

It is not necessarily like rolling dice; however, I thought this approach could be taken: $n = 15000$ and $k = 80$. Given that we are looking at the 200 buildings within the radius, I was thinking of something like:

$\frac{15000!}{(15000-80)\cdot200^{80}}$?

My reasoning for the equation:

I presumed that, given a total population size $n=15000$, we select from the difference between the population size and the $k$ incidents, and finally multiply by $\frac{1}{200^{80}}$, where $80$ represents the selection and $200$ the number of samples of interest. Though perhaps this equation is better suited to fires occurring in the same building, i.e. sampling with replacement?

Secondary Approach:
Perhaps calculating the daily probability rather than the monthly one helps. Hence $80/30 \approx 2.67$ incidents per day, each day being $\frac{1}{30}$ of the month. We know that the sample space is $n = 15000$, and $m = 200$ is a subset of the sample space. What is the probability that, on any given day, more than one incident has occurred within $m$?

Given that a district may contain more than one station, perhaps a hypergeometric distribution would work well in this case?

$\frac{\binom{z}{k}\binom{m}{y}\binom{15000-z-m}{80-k-y}}{\binom{15000}{80}}$

where $z$ represents another fire station in the district, independent of the previous fire station, and $y$ is the proportion of incidents.

Extra information:

  1. We assume each incident is independent and happens to a unique building.
  2. Again, we assume each logged incident is unique.
  3. We do not know whether any building caught fire more than once in the month; again, we assume uniqueness here.
  4. Presumably buildings closer to the station have a lower incident rate. On this assumption, we could weight buildings by distance from the station in 100 m steps: the closer the building, the lower the probability.

Data collection:

Fire stations collected from here:
https://www.datadaptive.com/?pg=5

Fire incidents collected from here:
https://www.gov.uk/government/statistics/fire-statistics-incident-level-datasets

UK administrative boundaries here:
https://www.ordnancesurvey.co.uk/business-government/products/boundaryline

Buildings in UK:
https://www.ordnancesurvey.co.uk/business-government/products/open-zoomstack

Methodology:
Plotting the stations across the UK in QGIS while overlaying the boundaries data to identify the administrative districts. Secondly, loading the buildings from OS Open Zoomstack. Lastly, creating a buffer with a 2 km radius around each station, counting the number of buildings within each buffer, and counting the number of buildings within each district.

Given that the fire incidence dataset does not have coordinates, we just take the total count of fires in a district.

More than one fire station may be within the same district. Is this something to account for, and if so, how could it be interpreted?

Best Answer

Let $n_1$ be the no. buildings in a given district located within a 2 km radius of a fire station, & $x_1$ the no. incidents linked to those buildings; let $n_2$ be the no. buildings in the district located without that radius, & $x_2$ the no. incidents linked to those buildings.

Given the assumptions that the incidents are independent, with each linked to a unique building, & at most one incident per building, you might want to treat whether each building has an incident as an independent Bernoulli random variable with common probability parameter $\pi$, so that $X_1$ & $X_2$ are independent binomial counts; then the joint mass function for $x_1$ & $x_2$ is

$$ \Pr\left(X_1=x_1, X_2=x_2\right)={n_1 \choose x_1}\pi^{x_1}(1-\pi)^{n_1-x_1} \cdot {n_2 \choose x_2}\pi^{x_2}(1-\pi)^{n_2-x_2} $$

Let $T= X_1 +X_2$; you've observed $T=t$, so you need the conditional mass function

$$\Pr\left(X_1=x_1|T=t\right) =\frac{ {n_1 \choose x_1}{n_2 \choose t-x_1}}{n_1+n_2 \choose t} $$

—$X_1|T$ has a known hypergeometric distribution, regardless of the unknown true value of $\pi$.
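
For completeness, a sketch of the step from the joint mass function to the conditional one. It uses only the joint mass function above, in which $X_1$ and $X_2$ are independent binomial counts, together with the fact that their sum $T$ is then binomial with parameters $n_1+n_2$ and $\pi$:

$$
\begin{aligned}
\Pr\left(X_1=x_1|T=t\right)
&= \frac{\Pr\left(X_1=x_1, X_2=t-x_1\right)}{\Pr\left(T=t\right)} \\
&= \frac{{n_1 \choose x_1}\pi^{x_1}(1-\pi)^{n_1-x_1} \cdot {n_2 \choose t-x_1}\pi^{t-x_1}(1-\pi)^{n_2-t+x_1}}{{n_1+n_2 \choose t}\pi^{t}(1-\pi)^{n_1+n_2-t}} \\
&= \frac{{n_1 \choose x_1}{n_2 \choose t-x_1}}{{n_1+n_2 \choose t}}
\end{aligned}
$$

The powers of $\pi$ and $1-\pi$ in the numerator multiply to $\pi^{t}(1-\pi)^{n_1+n_2-t}$, which cancels with the denominator; that is why the result does not depend on $\pi$.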

If you wanted to throw out the assumption of at most one incident per building, you could consider them as independent Poisson random variables with common rate parameter $\lambda$, following @RCarnell; then the joint mass function for $x_1$ & $x_2$ is

$$ \Pr\left(X_1=x_1, X_2=x_2\right)=\frac{(\lambda n_1)^{x_1}\exp(-\lambda n_1)}{x_1!} \cdot \frac{(\lambda n_2)^{x_2}\exp(-\lambda n_2)}{x_2!} $$

and the conditional mass function is

$$\Pr\left(X_1=x_1|T=t\right) ={t \choose x_1} \left(\frac{n_1}{n_1+n_2}\right)^{x_1}\left(1 - \frac{n_1}{n_1+n_2}\right)^{t - x_1} $$

—$X_1|T$ has a known binomial distribution, regardless of the unknown true value of $\lambda$.
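
A sketch of the corresponding step for the Poisson model, using the fact that the sum $T$ of two independent Poisson counts with means $\lambda n_1$ and $\lambda n_2$ is itself Poisson with mean $\lambda(n_1+n_2)$:

$$
\begin{aligned}
\Pr\left(X_1=x_1|T=t\right)
&= \frac{\Pr\left(X_1=x_1, X_2=t-x_1\right)}{\Pr\left(T=t\right)} \\
&= \frac{(\lambda n_1)^{x_1}\exp(-\lambda n_1)}{x_1!} \cdot \frac{(\lambda n_2)^{t-x_1}\exp(-\lambda n_2)}{(t-x_1)!} \cdot \frac{t!}{\left(\lambda (n_1+n_2)\right)^{t}\exp\left(-\lambda (n_1+n_2)\right)} \\
&= {t \choose x_1} \frac{n_1^{x_1}\, n_2^{t-x_1}}{(n_1+n_2)^{t}}
\end{aligned}
$$

The factors $\lambda^{t}$ and $\exp(-\lambda(n_1+n_2))$ cancel, and writing $n_2/(n_1+n_2) = 1 - n_1/(n_1+n_2)$ gives the binomial form.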

Fires serious enough to merit calling out the fire brigade are, fortunately, rather rare occurrences for an individual building; so the two models won't differ greatly. For the example you provide:—

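n1 <- 200          # buildings within 2 km of the fire station
n2 <- 15000 - n1   # buildings in the rest of the district
t <- 80            # total incidents observed in the district
x1 <- 0            # no incidents within the radius

# P(X1 >= 1 | T = t) under the hypergeometric (Bernoulli) model
1 - dhyper(x1, n1, n2, t)
# 0.6592815
# P(X1 >= 1 | T = t) under the binomial (Poisson) model
1 - dbinom(x1, t, n1/(n1 + n2))
# 0.6583067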
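
To see how closely the two models agree beyond the single "one or more" probability, the same functions can tabulate the whole conditional distribution. This is only a sketch: the range `0:5` is an arbitrary choice for illustration, and the total is stored as `t_obs` (a name introduced here, not in the answer) so that it does not mask R's built-in `t()`.

```r
n1 <- 200           # buildings within the 2 km radius
n2 <- 15000 - n1    # buildings elsewhere in the district
t_obs <- 80         # total incidents in the district (hypothetical name)
x1 <- 0:5           # candidate incident counts within the radius

# Conditional probability of each count under the two models
cond <- data.frame(
  x1 = x1,
  hypergeometric = dhyper(x1, n1, n2, t_obs),
  binomial = dbinom(x1, t_obs, n1 / (n1 + n2))
)
print(round(cond, 4))

# Both conditional distributions share the mean t * n1 / (n1 + n2)
expected_x1 <- t_obs * n1 / (n1 + n2)
print(expected_x1)
```

The shared mean is $80 \times 200 / 15000 \approx 1.07$ incidents within the radius, so on these assumptions roughly one of the 80 fires would be expected there.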

I imagine most fires, in the U.K. at any rate, arise from unconnected accidents—you forget about the chip-pan, drop your lit pipe down the side of the arm-chair, leave the wireless on till a valve overheats—& so the independence assumption seems reasonable. Of more concern is the assumption of a common probability/rate parameter; there's no guarantee that the area round a fire station is typical of the district. Buildings might be detached houses, blocks of flats, barns, factories, offices; might be new, or decades old; might be built from straw, or sticks, or solid bricks: such characteristics affecting fire risk will doubtless exhibit some degree of spatial clustering. Depending on what other information you can get, you may sometimes be able to make a case that your calculated probability is about right, or that it's an under- or an over-estimate.
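
One way to gauge how much the common-rate assumption matters is a rough sensitivity check. The sketch below is an extension, not part of the models above: it assumes buildings near the station have Poisson rate $r\lambda$ and the rest have rate $\lambda$, in which case $X_1|T=t$ is binomial with probability $r n_1/(r n_1 + n_2)$. The values of `rel_risk` are purely hypothetical.

```r
n1 <- 200           # buildings within the 2 km radius
n2 <- 15000 - n1    # buildings elsewhere in the district
t_obs <- 80         # total incidents in the district

# Hypothetical relative risks: rate near the station / rate elsewhere
rel_risk <- c(0.5, 1, 2)

# With rates r * lambda and lambda, X1 | T = t is
# Binomial(t, r * n1 / (r * n1 + n2)); lambda itself cancels
p_near <- rel_risk * n1 / (rel_risk * n1 + n2)
p_one_or_more <- 1 - dbinom(0, t_obs, p_near)

print(data.frame(rel_risk, p_one_or_more = round(p_one_or_more, 3)))
```

A relative risk of 1 reproduces the common-rate binomial result; values below or above 1 show in which direction, and roughly how far, the calculated probability would move.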


With $k$ fire stations in a district, each building falls into one of at most $2^k$ sets (e.g. for $k=2$: within 2 km of no fire stations, of the 1st fire station only, of the 2nd fire station only, or of both fire stations); the conditional distribution $\Pr(X_1=x_1, \ldots, X_{2^k-1}=x_{2^k-1}|T=t)$ becomes a multivariate hypergeometric, or a multinomial, distribution. Of course if you want to calculate only, say, the probability of one or more incidents within 2 km of any fire station, you can just define two sets of buildings appropriately.
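
A minimal sketch of the $k=2$ case. The building counts in `n_sets` are made up for illustration (they sum to the 15000 of the example) and the split in `x_split` is arbitrary. Pooling the three sets that lie within 2 km of at least one station reduces the problem to the two-set calculation, while `dmultinom` gives the multinomial probability of one particular split.

```r
# Hypothetical building counts for a district with k = 2 fire stations
n_sets <- c(first_only = 150, second_only = 120, both = 50, neither = 14680)
t_obs <- 80   # total incidents observed in the district

# Pool every set lying within 2 km of at least one station
n_near <- sum(n_sets[c("first_only", "second_only", "both")])
n_far <- n_sets[["neither"]]

# Probability of one or more incidents within 2 km of any station
p_hyper <- 1 - dhyper(0, n_near, n_far, t_obs)
p_binom <- 1 - dbinom(0, t_obs, n_near / (n_near + n_far))
print(c(hypergeometric = p_hyper, binomial = p_binom))

# Multinomial probability of one particular split of the 80 incidents
x_split <- c(1, 1, 0, 78)
print(dmultinom(x_split, prob = n_sets / sum(n_sets)))
```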
