Probability – Literature Review on Low-Probability Events and Infinite Trials

Tags: probability, rare-events, references

Question: What are influential, canonical, or otherwise useful works considering low-probability events?

My background: Applied or computational statistics, not theory or pure statistics. How I describe it to people is that, if you ask me a probability problem, I am going to solve it using simulation instead of being able to write out a solution on paper.

Additional information: I have been thinking about low-probability events in the long run recently. Consider a simple case of a binomial distribution where the probability of an event is $\frac{1}{x}$ where $x$ is arbitrarily large. It seems to me that as $n\to\infty$, the event should be inevitable, which feels like a controversial word to use in probability. Is there work around this idea of how to think about low-probability events? Or even how to judge whether the probability is $0$ or $\frac{1}{x}$ with an arbitrarily large $x$?

Note: When I say "inevitable" in the long run, I'm not saying, "We've done it 100,000 times, so this next trial must surely be the time it happens," as that's the gambler's fallacy, since all trials are independent and have the same probability. I'm thinking a priori here. Consider the probability of an event being 1/100000. Below, I simulate draws from a binomial distribution 5000 times. I do this for scenarios where the number of trials is 1, 100, 10000, or 1000000. I look at the percentage of the simulations where we hit at least one instance of the event happening, and we see that this percentage increases with the number of trials:

set.seed(1839)
iter <- 5000                     # simulations per scenario
p <- 1/100000                    # probability of the event on any single trial
ns <- c(1, 100, 10000, 1000000)  # number of trials per simulation
# TRUE if at least one of n Bernoulli(p) trials comes up 1
any_hits <- function(p, n) any(rbinom(n, 1, p) == 1)
# for each n, the proportion of simulations with at least one hit
res <- sapply(ns, \(n) mean(sapply(seq_len(iter), 
                             \(zzz) any_hits(p, n))))
names(res) <- ns
res
> res
     1    100  10000  1e+06 
0.0000 0.0012 0.0916 1.0000 
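
For comparison, the closed form here is simple: the probability of at least one hit in $n$ independent trials is $1-(1-p)^n$, which agrees with the simulated proportions above (a quick check reusing p and ns from the code):

exact <- 1 - (1 - p)^ns   # exact P(at least one hit in n trials)
names(exact) <- ns
round(exact, 4)           # approximately 0.0000 0.0010 0.0952 1.0000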

What I've done so far: I have done some keyword searching, and it seems like people generally consider high-impact, low-probability (HILP) events for this area of work. I've read a little bit, but it is a new area of interest for me, and so I'm soliciting help searching for the "works you should know" about low-probability events in the long run.

Best Answer

You may be interested in the Blackett Review of High Impact, Low Probability Risks that was undertaken for the UK Government Office for Science. Not a technically heavy document, it gives much attention to risk communication: see particularly the work of David Spiegelhalter, Cambridge University, who was a member of the Blackett panel. "The Norm Chronicles" is a good light read, and the micromort, a one in a million chance of death, is useful for exploring low probability risks. Another vital consideration is the relationship between probability and impact: many HILP events can occur at different degrees of severity. Annex 7 of the Blackett Review sets out a typology of risk classes identified by Renn (2008):

  1. Damocles. Risk sources that have a very high potential for damage but a very low probability of occurrence. e.g. technological risks such as nuclear energy and large-scale chemical facilities.
  2. Cyclops. Events where the probability of occurrence is largely uncertain, but the maximum damage can be estimated. e.g. natural events, such as floods and earthquakes.
  3. Pythia. Highly uncertain risks, where the probability of occurrence, the extent of damage and the way in which the damage manifests itself is unknown due to high complexity. e.g. human interventions in ecosystems and the greenhouse effect.
  4. Pandora. Characterised by both uncertainty in probability of occurrence and the extent of damage, and high persistency, hence the large area that is demarcated in the diagram for this risk type. e.g. organic pollutants and endocrine disruptors.
  5. Cassandra. Paradoxical in that probability of occurrence and extent of damage are known, but there is no imminent societal concern because damage will only occur in the future. There is a high degree of delay between the initial event and the impact of the damage. e.g. anthropogenic climate change.
  6. Medusa. Low probability and low damage events, which due to specific characteristics nonetheless cause considerable concern for people. Often a large number of people are affected by these risks, but harmful results cannot be proven scientifically. e.g. mobile phone usage and electromagnetic fields.

See Renn and Klinke, 2004 for more on how these were conceptualised - and named!

Renn's risk classes

Having suitable language to describe low probability events and communicate the inherent uncertainty (a big theme of Renn's classification) is important, as it's hard to estimate either probability or impact of HILP events empirically! And communication, especially to policy-makers, is a vital part of the skillset of a statistician or scientist.

You may find some of the examples mentioned in the quotation above provide useful jumping-off points. All these areas — climate change, air pollution, nuclear safety, natural disasters — have specialised sub-fields dealing with risk assessment. More examples of monitored and quantified threats, like pandemics and terrorism, appear in the risk matrix of the UK National Risk Register, 2020, which again splits out impact and likelihood:

risk matrix

Another example would be the impact hazard of near-Earth objects (NEOs). This is especially close to what you want, as there's no doubt that, if unmitigated, the probability of a catastrophic event approaches one over time. The Torino Impact Hazard Scale tries to balance probability and likely effect. NASA's site on it doesn't contain much that Wikipedia doesn't, but it gives a reference to Morrison et al. (2004) which may interest you. The scale originates with the work of Richard Binzel, e.g. Binzel (2000). While this scale applies to the risk presented by individual objects, you're more interested in the cumulative probability of a catastrophic impact in the long term: this requires analysis of the geological record and of the current population of NEOs, corrected for observational bias (some types and sizes of object are more easily detected). Much of this material is set out in the Report of the Task Force on Potentially Hazardous Near Earth Objects by Atkinson et al. (2000). The task force was set up to advise the UK government, and provides the following cheerful table:

NEO fatalities

If we view the probability of a Tunguska-scale event in any given year as $1$ in $250$, so that on average it would occur once in $250$ years, then the probability of the Earth lasting a millennium without such a strike is as low as $\left(\frac {249} {250}\right)^{1000}\approx 1.8\%$, which is well approximated as $\exp(-\frac{1000}{250})$ using the Poisson distribution, as @Ben's answer says.
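
A quick sketch of that arithmetic in R, using only the figures quoted above:

p_year <- 1/250      # assumed annual probability of a Tunguska-scale strike
(1 - p_year)^1000    # exact probability of no strike in 1000 years, ~0.018
exp(-1000/250)       # Poisson approximation, also ~0.018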

Although different fields face different problems and utilise different methods to estimate a heterogeneous bunch of (often highly uncertain) probabilities and impacts, there's an overarching bureaucratic approach to dealing with HILP events, into which Atkinson argues the NEO threat should be incorporated:

Impacts from mid-sized Near Earth Objects are thus examples of an important class of events of low probability and high consequence. There are well established criteria for assessing whether such risks are to be considered tolerable, even though they may be expected to occur only on time-scales of thousands, tens of thousands or even hundreds of thousands of years. These criteria have been developed from experience by organisations like the British Health and Safety Executive to show when action should be taken to reduce the risks.

Flood protection, the safety of nuclear power stations, the storage of dangerous chemicals or of nuclear waste are all examples of situations in which rare failures may have major consequences for life or the environment. Once the risk is assessed, plans can be made to reduce it from the intolerable to the lowest reasonably practical levels taking account of the costs involved.

If a quarter of the world’s population were at risk from the impact of an object of 1 kilometre diameter, then according to current safety standards in use in the United Kingdom, the risk of such casualty levels, even if occurring on average once every 100,000 years, would significantly exceed a tolerable level. If such risks were the responsibility of an operator of an industrial plant or other activity, then that operator would be required to take steps to reduce the risk to levels that were deemed tolerable.

One example of such guidance is the surprisingly readable report on The Tolerability of Risk from Nuclear Power Stations (1992 revision) from the UK Health and Safety Executive (HSE), commonly abbreviated to "TOR". TOR analyses what nuclear risks are acceptable by comparison with other sources of risk (as, rather infamously,* did the U.S. Rasmussen Report, WASH-1400) but also endeavoured "to consider the proposition that people feel greater aversion to death from radiation than from other causes, and that a major nuclear accident could have long term health effects." TOR's quantitative approach to decision-making about risk evolved into HSE's "R2P2" framework set out in Reducing risks, protecting people (2001).

Something you'll often see in discussions of catastrophic risk is the F-N diagram, also known as Farmer's Diagram or Farmer Curve (after Frank Farmer of the UK Atomic Energy Authority and Imperial College London). Here $N$ is the number of fatalities and $F$ is the frequency, usually displayed logarithmically so events with very low probability, but potentially enormous consequences, can fit on the same scale as events which are orders of magnitude more probable but less lethal. The Health and Safety Executive Research Report 073: Transport fatal accidents and FN-curves, 1967-2001 has a good explanation and some example diagrams:

Transport FN-curves

What risk is to be deemed "tolerable"? One approach is to draw a "Farmer line", "limit line" or "criterion line" on the F-N diagram. The "Canvey criterion" and "Netherlands criterion" are commonly seen. The Canvey criterion is based on a major 1978-81 HSE study of the risks posed by the industrial installations on Canvey Island in the Thames estuary, where a $1$ in $5000$ chance per annum, i.e. annual probability of $2 \times 10^{-4}$, of a disaster causing $500$ fatalities was deemed politically "tolerable". This is plotted as the "Canvey point" on the F-N axes, and then extended on a risk-neutral basis. For example, the Canvey point is deemed equivalent to a probability of $10^{-4}$ of causing $1000$ fatalities, or $10^{-3}$ chance of $100$ fatalities. TOR notes this latter figure roughly corresponds to the $1$ in $1000$ per year threshold for breaches of "temporary safe refuges", mandated to protect offshore installation workers from fire or explosion following the Piper Alpha oil rig disaster, on the assumption of a hundred workers on a platform and that, conservatively, a breach would be fatal to all. On logarithmic axes this produces a "Canvey line" with slope $-1$, as shown in Fig. D1 of TOR ("ALARP" is "as low as reasonably practicable", the idea being that efforts should still be taken to reduce even tolerable risks, up to the point where the costs of doing so become prohibitive):

TOR D1

HSE has been cautious about adopting limit lines in general, but R2P2 suggests a criterion ten times more conservative than the Canvey point: a $2 \times 10^{-4}$ annual probability of a disaster causing only fifty fatalities, instead of five hundred. Many publications show an "R2P2 line" through the R2P2 criterion, parallel to the Canvey line, but R2P2 doesn't specify a risk-neutral extension even though this sounds rational. In fact the Netherlands criterion is far more risk averse, with a slope of $-2$ indicating a particular aversion to high consequence incidents: if one catastrophic event would cause ten times as many fatalities as another, its tolerated probability is a hundred times lower. Farmer's original 1967 "boundary line as a criterion" had a slope of $-1.5$, but the horizontal axis was curies of iodine-131 released, not deaths.**
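
To make the geometry concrete, here is a minimal sketch of limit lines through the Canvey point with slopes of $-1$ and $-2$ on log-log axes. This is my own illustration built from the figures quoted above, not an official HSE plot:

# Illustrative limit lines on F-N (log-log) axes, anchored at the Canvey point
N <- 10^seq(0, 4, length.out = 100)      # fatalities
canvey_N <- 500; canvey_F <- 2e-4        # Canvey point: 2e-4 per annum, 500 fatalities
F_slope1 <- canvey_F * (canvey_N / N)    # risk-neutral "Canvey line", slope -1
F_slope2 <- canvey_F * (canvey_N / N)^2  # Netherlands-style risk aversion, slope -2

plot(N, F_slope1, type = "l", log = "xy",
     xlab = "N (fatalities)", ylab = "F (frequency per year)")
lines(N, F_slope2, lty = 2)
points(c(500, 50), c(2e-4, 2e-4), pch = c(19, 1))  # Canvey point (filled), R2P2 point (open)
legend("bottomleft", lty = c(1, 2), legend = c("slope -1", "slope -2"))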

HSE RR073 Fig. 4 compares various criteria with transport casualties, breaking out those train accident casualties which may have been prevented by upgrading the train protection system:

HSE RR073 Fig 4

A paper about landslide risk by Sim, Lee and Wong (2022) graphs many limit lines in use globally. More European criteria are shown in Trbojevic (2005) and Jonkman, van Gelder and Vrijling (2003): both papers also survey a wider range of quantitative risk measures and regulatory approaches beyond the F-N curve. Rheinberger and Treich (2016) take an economic perspective on attitudes to catastrophic risk, again looking at many possible regulatory criteria, and examining closely the case for society being "catastrophe averse". If you're interested in microeconomics or behavioural economics (e.g. Kahneman and Tversky's prospect theory) you'll find their paper valuable, especially for its lengthy bibliography.

Regulatory approaches based on limit lines on the F-N diagram aim to control the probability of disaster across the range of possible magnitudes of disasters, but don't limit the overall risk. We may identify several points on the F-N diagram representing credible accident scenarios at a nuclear power plant, whose individual probabilities are sufficiently low (for the harm each would cause) to be on the "tolerable" side of the limit line, yet feel this cluster of points collectively represents an intolerable risk. For that reason, and others, some prefer to use the complementary cumulative distribution function (CCDF) of the harm. This is calculated as one minus the CDF, and represents the probability the harm exceeds a given value. When examining catastrophic risks, the wide range of magnitudes and probabilities makes it conventional to plot on log-log axes, so superficially this resembles an F-N diagram. As the size of the consequences tends to infinity, the CDF tends to one, the CCDF to zero, and log CCDF to negative infinity. In the example below, I highlight the level of harm that's exceeded with probability $0.25$. This might measure fatalities, dose of radiation released, property damage, or some other consequence.

CCDF
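
Here's a minimal sketch of how such a curve can be generated, using a toy lognormal harm distribution of my own choosing rather than the model behind the figure above:

# Toy CCDF of harm on log-log axes (lognormal chosen purely for illustration)
harm <- 10^seq(-1, 4, length.out = 200)
ccdf <- 1 - plnorm(harm, meanlog = log(10), sdlog = 1.5)  # P(harm > x)

plot(harm, ccdf, type = "l", log = "xy",
     xlab = "Harm (fatalities, dose, damage, ...)", ylab = "P(harm exceeds x)")
q75 <- qlnorm(0.75, meanlog = log(10), sdlog = 1.5)  # harm exceeded with probability 0.25
abline(v = q75, h = 0.25, lty = 2)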

Canadian nuclear regulators took the approach of setting safety limits at various points on the CCDF curve: Cox and Baybutt (1981) compare this to Farmer's limit-line criteria. See also chapters 7 and 10 of NUREG/CR-2040, a study of the implications of applying quantitative risk criteria in the licensing of nuclear power plants in the United States for the U.S. Nuclear Regulatory Commission (1981).

Note that the probability of an industrial disaster tends towards one not just over time but also as the number of facilities increases. Jonkman et al. raise the point that each facility may meet risk tolerability criteria yet the national risk becomes intolerable. They propose setting a national limit, then subdividing it between facilities. NUREG/CR-2040 looks at this in Chapters 6 and 9. The authors distinguish risk criteria "per reactor-year" or "per site-year" for a specific plant, versus risk "per year" for the country as a whole. If national risk is to be limited to an agreed level, and site-specific risk criteria imposed to achieve this, the appropriate way to distribute risk across plants is not obvious due to the heterogeneity of sites. The authors suggest tolerating a higher frequency of core melt accidents at older reactors (newer designs are expected to be safer, so arguably should face stricter criteria), or at those in remote areas or with better mitigation systems (as a core melt at such sites should be less harmful).
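
As a toy illustration of the per-site versus national distinction (my own numbers, not those in NUREG/CR-2040):

p_site <- 1e-4          # hypothetical tolerated annual accident probability per site
k <- 100                # hypothetical number of sites
1 - (1 - p_site)^k      # national P(at least one accident per year), ~0.01
# conversely, capping the national annual probability at 1e-4 with equal shares:
p_nat <- 1e-4
1 - (1 - p_nat)^(1/k)   # required per-site annual probability, ~1e-06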

The problem of the probability of catastrophe approaching $1$ in the long run can be addressed by imposing risk criteria over long (even civilisational) time-scales and then solving for the required annual criteria. One proposal examined in NUREG/CR-2040 (page 71) tries to restrict core melt frequencies on the basis of a 95% probability of no such accidents in the entire lifespan of the U.S. nuclear industry. Assuming we'll rely on fission technology for about three centuries with $300$ reactors active in an average year, $100,000$ reactor-years seemed a reasonable guess. Write $\lambda_{\text{melt}}$ for the rate of core melts per reactor-year that needs to be achieved to limit the long-term risk to our tolerated level. Using the Poisson distribution, we solve

$$ \exp\left(-\lambda_\text{melt} \times 10^5\right) = 0.95 \\ \implies \lambda_\text{melt} = - \frac{\ln 0.95}{10^5} \approx 5 \times 10^{-7} \text{ per reactor-year,}$$

so we require reactors that experience core melts at a rate no more frequent than once per two million years or so. Due to the Poisson approximation for rare events, we get a very similar answer if we solve for the annual probability $p_\text{melt}$ at a given reactor, using the binomial distribution:

$$\Pr(\text{0 core melts}) = (1 - p_\text{melt})^{100,000} = 0.95 \\ \implies p_\text{melt} = 1 - \sqrt[100,000]{0.95} \approx 5 \times 10^{-7}.$$
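
A quick numerical check of both routes (nothing here beyond the arithmetic above):

reactor_years <- 1e5
lambda_melt <- -log(0.95) / reactor_years  # Poisson route: rate per reactor-year
lambda_melt                                # ~5.1e-07
1 / lambda_melt                            # ~1.95 million reactor-years between core melts
p_melt <- 1 - 0.95^(1 / reactor_years)     # binomial route: annual probability per reactor
p_melt                                     # ~5.1e-07, essentially identical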

How can regulators establish that the probability of a given magnitude of disaster at a particular installation is tolerably tiny? Conversely, how can scientists estimate the potential casualties from a "once in 200 years" earthquake or flood? Events so rare lie beyond empirical observation. For natural disasters we might extrapolate the F-N curve for observed events (see Sim et al., 2022) or model the physics of catastrophic scenarios coupled to a statistical model of how likely each scenario is (e.g. NASA's PAIR model for asteroid risk, Mathias et al., 2017). In engineering systems, particularly if hyper-reliability is needed, probabilistic risk assessment (PRA) can be used.

For more on PRA in the nuclear industry, including fault tree analysis, the U.S. NRC website includes current practice and a historic overview, as does the Canadian Nuclear Safety Commission. An international comparison is Use and Development of Probabilistic Safety Assessments at Nuclear Facilities (2019) by the OECD's Nuclear Energy Agency.


Footnotes

$(*)$ The Rasmussen Report's executive summary dramatically compared nuclear risks to other man-made and natural risks, including Fig. 1 below, to emphasise that the additional risk to the U.S. population from a nuclear energy programme was negligible in comparison. It was criticised for failing to show the uncertainty of those risks, and ignoring harms other than fatalities (e.g. land contamination), particularly as later estimates of the probability of nuclear disaster were less optimistic. See Frank von Hippel's 1977 response in the Bulletin of the Atomic Scientists and, for a very readable historical overview, NUREG/KM-0010 (2016).

Rasmussen Fig 1

$(**)$ Farmer's 1967 paper is available at pages 303-318 of the Proceedings on a symposium on the containment and siting of nuclear power plants held by the International Atomic Energy Agency in Vienna, 3-7 April, 1967. His colleague J. R. Beattie's paper on "Risks to the population and the individual from iodine releases" follows immediately as an appendix; Beattie converts Farmer's limit-line radiation releases into casualty figures, so the two papers together mark the genesis of the F-N boundary line approach. This is then followed by a lively symposium discussion. Regarding the slope of $-1.5$, Farmer explains "My final curve does not directly show an inverse relationship between hazard and consequence. I chose a steeper line which is entirely subjective."

Farmer is wary of simplistically multiplying probabilities together, and due to lack of empirical data is especially cautious of claims the probability of catastrophe is low due to the improbability of passive safety measures being breached: "if credit of $1000$ or more is being claimed for a passive structure, can you really feel that the possibility of it being as effective as claimed is $999$ out of $1000$. I do not know how we test or ensure that certain conditions will obtain $999$ times out of $1000$, and if we cannot test it, I think we should not claim such high reliability". He prefers to focus on things like components (for which reliability data is available) and minimising the probability of an incident occurring in the first place. Some participants welcome Farmer's probabilistic approach, others prefer the "maximum credible accident" (nowadays evolved into the design-basis event): Farmer dislikes this approach due to the broad range of accidents one might subjectively deem "credible", and the most catastrophic, but plausible, nuclear accident would clearly violate any reasonable safety criteria even if the reactor was sited in a rural area. There's an interesting note of scepticism concerning low probability events from the French representative, F. de Vathaire:

Applying the probability method consists of reasoning like actuaries in calculating insurance premiums, but it is questionable whether we have the right to apply insurance methods to nuclear hazard assessment. We must first of all possess sufficient knowledge of the probability of safety devices failing. ... I might add that the number of incidents which I have heard mentioned in France and other countries — incidents without serious radiological consequences, but which might have had them — is fairly impressive and suggests that the probability of failure under actual plant-operation conditions is fairly high, particularly due to human errors. On the other hand, it is large releases of fission products which constitute the only real safety problem and the corresponding probabilities are very low. What practical signification must be attached to events which occur only once every thousand or million years? Can they really be considered a possibility?

NUREG/KM-0010 recounts an extreme example from the early days of probabilistic risk assessment in the nuclear industry:

...the AEC [Atomic Energy Commission] contracted with Research Planning Corporation in California to create realistic probability estimates for a severe reactor accident. The results were disappointing. While Research Planning’s calculations were good, they were underestimates. Research Planning estimated the probability of a catastrophic accident to be between $10^{-8}$ to $10^{-16}$ occurrences per year. If the $10^{-16}$ estimate were true, that would mean a reactor might operate $700,000$ times longer than the currently assumed age of the universe before experiencing a major accident. The numbers were impossibly optimistic, and the error band was distressingly large. As Dr. Wellock recalled, "the AEC wisely looked at this and recognized that probabilities were not going to solve [the problems with] this report." At this time, the AEC understood that the large error in the obtained probabilities could be attributed to the uncertainty in estimating common-cause accidents.

Clearly we need caution if a tiny probability has been obtained by multiplying many failure probabilities together. Common-cause failures violate statistical independence and undermine the reliability gains of redundancy. Jones (2012) gives an introduction to this topic in the context of the space industry, where NASA readopted PRA after the 1986 Challenger disaster.
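
As a small sketch of the effect (illustrative numbers only), suppose two redundant safety channels each fail with probability $10^{-3}$, but a common cause disables both with probability $10^{-4}$:

p_ind <- 1e-3                 # independent failure probability of each channel
p_cc  <- 1e-4                 # probability of a common cause disabling both channels
p_ind^2                       # naive product of probabilities: 1e-06
p_cc + (1 - p_cc) * p_ind^2   # allowing for the common cause: ~1e-04, dominated by p_cc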

References