Solved – Who invented the decision tree

Tags: cart, history

I'm trying to trace who invented the decision tree data structure and algorithm.

In the Wikipedia entry on decision tree learning there is a claim that "ID3 and CART were invented independently at around the same time (between 1970 and 1980)". ID3 was presented later in:

  • Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106

so I'm not sure that the claim is true.

Using Google Books I found references to a 1959 book, Statistical decision series, and a 1958 collection of Working papers. The context is not clear and they don't seem to present an algorithm; however, they also don't define the data structure, treating it as if it were already well known.

Using Google Scholar I found citations going back to 1853, but these turned out to be parsing errors, not real citations from that date.

Best Answer

Good question. @G5W is on the right track in referencing Wei-Yin Loh's paper. Loh's paper discusses the statistical antecedents of decision trees and, correctly, traces their lineage back to Fisher's (1936) paper on discriminant analysis -- essentially regression classifying multiple groups as the dependent variable -- and from there, through the AID, THAID, CHAID and CART models.

The short answer is that the first article I've been able to find that develops a "decision tree" approach dates to 1959 and a British researcher, William Belson, in a paper titled Matching and Prediction on the Principle of Biological Classification, (JRSS, Series C, Applied Statistics, Vol. 8, No. 2, June, 1959, pp. 65-75), whose abstract describes his approach as one of matching population samples and developing criteria for doing so:

In this article Dr Belson describes a technique for matching population samples. This depends on the combination of empirically developed predictors to give the best available predictive, or matching, composite. The underlying principle is quite distinct from that inherent in the multiple correlation method.

The "long" answer is that other, even earlier streams of thought seem relevant here. For instance, the simple age-gender cohort breakouts employed in actuarial tables of mortality offer a framework for thinking about decisions that dates back several centuries. It could also be argued that Babylonian efforts with quadratic equations, which were nonlinear in the variables though not in the parameters (http://www-history.mcs.st-and.ac.uk/HistTopics/Quadratic_etc_equations.html), have relevance, at least insofar as they presage parametric models of logistic growth (I recognize that this is a stretch; please read on for a fuller motivation). In addition, philosophers have long recognized and theorized about the existence of hierarchically arranged, qualitative information, e.g., Aristotle's book on Categories. The concept and assumption of a hierarchy is key here. Other relevant, much later discoveries pushed beyond the boundaries of 3-D Euclidean space: David Hilbert's development of infinite-dimensional Hilbert space, combinatorics, discoveries in physics related to 4-D Minkowski space, distance and time, the statistical mechanics behind Einstein's theory of special relativity, as well as innovations in the theory of probability relating to models of Markov chains, transitions and processes. The point here is that there can be a significant lag between any theory and its application -- in this case, the lag between theories about qualitative information and developments related to its empirical assessment, prediction, classification and modeling.

A best guess is that these developments can be associated with the history of increasing sophistication of statisticians, mostly in the 20th century, in developing models leveraging scale types other than continuous (e.g., nominal or, more simply, categorical information): count data models (Poisson), cross-classified contingency tables, distribution-free nonparametric statistics, multidimensional scaling (e.g., J. D. Carroll, among others), models with qualitative dependent variables such as two-group logistic regression, as well as correspondence analysis (mostly in Holland and France in the 70s and 80s).

There is a wide literature that discusses and compares two-group logistic regression with two-group discriminant analysis and, for fully nominal features, finds them providing equivalent solutions (e.g., Dillon and Goldstein's Multivariate Analysis, 1984).

J.S. Cramer's article on the history of logistic regression (The History of Logistic Regression, http://papers.tinbergen.nl/02119.pdf) describes it as originating with the development of the univariate, logistic function or the classic S-shaped curve:

The survival of the term logistic and the wide application of the device have been determined decisively by the personal histories and individual actions of a few scholars...

Deterministic models of the logistic curve originated in 1825, when Benjamin Gompertz (https://en.wikipedia.org/wiki/Benjamin_Gompertz) published a paper developing the first truly nonlinear logistic model (nonlinear in the parameters and not just the variables as with the Babylonians) -- the Gompertz model and curve.

I would suggest that another important link in this chain leading to the invention of decision trees was the Columbia sociologist Paul Lazarsfeld's work on latent structure models. His work began in the 30s, continued during WWII with his content analysis of German newspapers for the nascent OSS (later the CIA, as discussed in John Naisbitt's book Megatrends), and was finally published in 1950. Andersen describes it this way (Latent Structure Analysis: A Survey, Erling B. Andersen, Scandinavian Journal of Statistics, Vol. 9, No. 1, 1982, pp. 1-12):

The foundation for the classical theory of latent structure analysis was developed by Paul Lazarsfeld in 1950 in a study of ethnocentrism of American soldiers during WWII. Lazarsfeld was primarily interested in developing the conceptual foundation of latent structure models...The statistical methods developed by Lazarsfeld were, however, rather primitive...An early attempt to derive efficient estimation methods and test procedures was made by Lazarsfeld's colleague at Columbia University, T.W. Anderson, who in a paper (Psychometrika, March 1954, Volume 19, Issue 1, pp 1–10, On estimation of parameters in latent structure analysis), developed an efficient estimation method for the parameters of the latent class model...In order to introduce the framework (of latent class models) we shall briefly outline the basic concepts...and use a notational system developed much later by Goodman (1974a)...The data are given in the form of a multiple contingency table...

There is a useful distinction worth making here, as it can be related to the progression from AID to CHAID (and later CART), between contingency table-based models (all variables in the model are nominally scaled) and more recent latent class models (more precisely, finite mixture models based on "mixtures" of scales and distributions, e.g., Kamakura and Russell, 1989, A Probabilistic Choice Model for Market Segmentation and Elasticity Structure) in how they create the model's residuals. For the older contingency table models, the cell counts inherent in the fully cross-classified table formed the basis for the "replications" and, therefore, the heterogeneity in the model's residuals used in the partitioning into classes. The more recent mixture models, on the other hand, rely on repeated measures across a single subject as the basis for partitioning the heterogeneity in the residuals. This response is not suggesting a direct connection between latent class models and decision trees; rather, in their analysis and modeling of contingency tables, LCMs constitute, in my opinion, an important piece in the puzzle or narrative leading up to the development of decision trees, along with the many other innovations already noted. The relevance to AID and CHAID can be summarized in the statistics employed to evaluate the models: AID uses the continuous F distribution while CHAID uses the chi-square distribution, appropriate for categorical information.
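To make the AID/CHAID contrast above concrete, here is a minimal sketch (not the actual AID or CHAID algorithms, and not any package's API) of the two split statistics mentioned: an F-like between/within variance ratio for a continuous outcome, and the Pearson chi-square statistic on the contingency table induced by a candidate split of a categorical outcome.

```python
def f_statistic(groups):
    """AID-style score: between-group vs within-group variance ratio
    for a continuous outcome split into candidate branches."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)          # grand mean
    k = len(groups)
    # mean square between groups (k - 1 degrees of freedom)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
    # mean square within groups (n - k degrees of freedom)
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups for v in g) / (len(all_vals) - k)
    return between / within

def chi_square(table):
    """CHAID-style score: Pearson chi-square on a contingency table
    whose rows are the candidate branches and columns the outcome classes."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))

# A split that separates the outcome well scores high under either statistic:
print(f_statistic([[1, 2, 3], [7, 8, 9]]))   # continuous outcome, two branches
print(chi_square([[10, 0], [0, 10]]))        # categorical outcome, two branches
```

In both cases the tree-growing algorithm would compute the statistic for every candidate split and keep the best one; the difference the answer points to is only the distribution (F vs chi-square) against which that statistic is judged.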

CHAID was a later development, first proposed in a 1980 PhD dissertation by South African Gordon Kass as outlined in this Wiki piece on CHAID (https://en.wikipedia.org/wiki/CHAID). Of course, CART came a few years later in the 80s with Breiman, et al's, now famous book Classification and Regression Trees.

AID, CHAID and CART all posit tree-like, hierarchically arranged structures as the optimal representation of reality. They just go about this using differing algorithms and methods. To me, the next step in this progressive chain of innovation is the emergence of heterarchical theories of structure. As defined in this Wiki article, heterarchies "are a system of organization where the elements of the organization are unranked (non-hierarchical) or where they possess the potential to be ranked a number of different ways" (https://en.wikipedia.org/wiki/Heterarchy or, for a deeper, more philosophic perspective on heterarchy, see Kontopoulos, The Logics of Social Structure). From an empirical point of view, the analysis and modeling of network structures are most representative of this historical development in the understanding of structure (e.g., Freeman's book The Development of Social Network Analysis). While many network analysts will try to force a hierarchical arrangement on the resulting network, this is more an expression of ingrained and unconscious assumptions than it is a statement about the empirical reality of multiplex network structure in a complex world.

This response is suggesting that the arc of the evolution leading to the development of decision trees created new questions, or dissatisfaction with existing "state-of-the-art" methods, at each step or phase in the process, requiring new solutions and new models. In this case, dissatisfaction can be seen with the limitation of modeling only two groups (logistic regression) and the recognition of a need to widen that framework to more than two groups, as well as with the unrepresentative assumption of an underlying normal distribution (discriminant analysis or AID), in contrast with the relative "freedom" to be found in employing nonparametric, distribution-free assumptions and models (e.g., CHAID and CART).

As suggested, the origins of decision trees almost certainly have a long history that goes back centuries and is geographically dispersed. Multiple streams in human history, science, philosophy and thought can be traced in outlining the narrative leading up to the development of the many flavors of decision trees extant today. I will be the first to acknowledge the significant limitations of my brief sketch of this history.

/** Addendums **/

  1. This 2014 article in the New Scientist is titled Why do we love to organise knowledge into trees? (https://www.newscientist.com/article/mg22229630-800-why-do-we-love-to-organise-knowledge-into-trees/). It's a review of data visualization guru Manuel Lima's book The Book of Trees, which traces the millennia-old use of trees as a visualization and mnemonic aid for knowledge. There seems little question but that the secular and empirical models and graphics inherent in methods such as AID, CHAID and CART represent the continued evolution of this originally religious tradition of classification.

  2. In this video, A Tribute to Leo Breiman (posted online by Salford Systems, implementers of CART software), Breiman talks about the development of his thinking that led to the CART methodology. It all started with a wall plastered with the silhouettes of different WWII-era battleships. https://www.salford-systems.com/videos/conferences/cart-founding-fathers/a-tribute-to-leo-breiman

  3. In reading the introduction to Dénes König's 1936 Theory of Finite and Infinite Graphs, widely viewed as providing the first rigorous, mathematical grounding to a field previously viewed as a source of amusement and puzzles for children, Tutte notes (p. 13) that chapter 4 (beginning on p. 62) of König's book is devoted to trees in graph theory. Tutte's explanation of König's definition of a tree is "where an 'acyclic' graph is a graph with no circuit, a tree is a finite connected acyclic graph...in other words, in a tree there is one and only one path from a given vertex to another..." To me (and I'm neither a graph theorist nor a mathematician), this suggests that graph theory and its precursors in Poincaré's Analysis Situs or Veblen's Cambridge Colloquium lectures on combinatorial topology may have provided the early intellectual and mathematical antecedents for what later became a topic for statisticians.

  4. The first Tree of Knowledge is widely attributed to the Neoplatonic philosopher Porphyry who, around 270 CE, wrote an Introduction to Logic that used a metaphorical tree to describe and organize knowledge ... http://www.historyofinformation.com/expanded.php?id=3857

  5. Just discovered an even earlier reference to a Tree of Knowledge in the Book of Genesis in the Bible, discussed in this Wiki article ... https://en.wikipedia.org/wiki/Tree_of_life_(biblical). Genesis probably dates back to around 1400 BCE based on this reference ... https://www.biblica.com/bible/bible-faqs/when-was-the-bible-written/ Regardless, the Book of Genesis came many centuries before Porphyry.
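König's definition of a tree quoted in the addendum above (a finite connected acyclic graph, equivalently one with exactly one path between any two vertices) can be checked mechanically. A minimal sketch, using the standard equivalent characterization that a graph on n vertices is a tree iff it has exactly n - 1 edges and contains no cycle:

```python
def is_tree(n, edges):
    """True iff the undirected graph on vertices 0..n-1 with the given
    edge list is a tree in König's sense (finite, connected, acyclic)."""
    if n == 0 or len(edges) != n - 1:
        return False                      # a tree on n vertices has n-1 edges
    # Union-find: n-1 edges with no cycle implies exactly one component,
    # so detecting cycles also establishes connectivity.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                  # edge closes a circuit: not acyclic
        parent[ru] = rv
    return True

print(is_tree(4, [(0, 1), (1, 2), (1, 3)]))   # a star: connected, acyclic
print(is_tree(3, [(0, 1), (1, 2), (2, 0)]))   # a triangle: contains a circuit
```

The unique-path property Tutte mentions follows directly: connectivity gives at least one path between any two vertices, and acyclicity rules out a second.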