I don't understand what the role of the triangle is here. What is it trying to communicate or visualize?
All points in the triangle must satisfy two constraints: each coordinate lies between zero and one ($0 \leq \theta_i \leq 1$) and the coordinates sum to one ($\theta_1 + \theta_2 + \theta_3 = 1$).
The way I finally understood it is the following:
So (a) shows a 3-D space with $\theta_1, \theta_2, \theta_3$ as coordinates, each ranging only between 0 and 1.
In (b), a triangle is shown; this is our simplex.
(c) shows two example points that lie on the simplex and therefore also satisfy the second criterion (their coordinates sum to one).
(d) shows another example point on the simplex; the same constraints hold.
In (e), I tried to show a projection of the simplex onto a 2-D triangle, with all the example points from before.
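As a quick sanity check (a minimal sketch assuming NumPy is available; the seed and sample size are arbitrary), we can sample points from a Dirichlet and verify that both constraints hold:

```python
# Sample points from a symmetric Dirichlet and check the two
# simplex constraints described above.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5)  # 5 points on the simplex

print(theta)
print(np.all((theta >= 0) & (theta <= 1)))  # constraint 1: each coordinate in [0, 1]
print(np.allclose(theta.sum(axis=1), 1.0))  # constraint 2: coordinates sum to 1
```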
Hope it makes more sense now :)
With two variables, you are defining a line segment in $\mathbb{R}^2$, as you pointed out. However, due to the simplex constraint, one of these two variables is redundant in terms of specifying the density, since there is a one-to-one relationship between $x_1$ and $x_2$. Therefore, the density is specified over $K-1$ free variables (i.e., since $K=2$ here, in $\mathbb{R}$).
This is actually pointed out in the first line of this section of the Wikipedia article, albeit very subtly.
Therefore, your density function becomes:
$$Dir_{1,1}(x_1,1-x_1)=\frac{\Gamma(2)}{\Gamma(1)^2}(x_1)^0(1-x_1)^0=1$$
Therefore,
$$\int_0^1 Dir_{1,1}(x_1,1-x_1) dx_1 = 1$$
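As a quick numerical check (a minimal sketch assuming SciPy is available; the helper name `dir_density` is my own), we can evaluate this density over its single free variable and confirm it integrates to $1$:

```python
# The two-variable Dirichlet density written over its one free variable x1,
# using the Gamma-function formula above.
from math import gamma
from scipy.integrate import quad

def dir_density(x1, a1=1.0, a2=1.0):
    # Dir_{a1,a2}(x1, 1 - x1) = Γ(a1 + a2) / (Γ(a1) Γ(a2)) * x1^(a1-1) * (1 - x1)^(a2-1)
    norm = gamma(a1 + a2) / (gamma(a1) * gamma(a2))
    return norm * x1 ** (a1 - 1) * (1.0 - x1) ** (a2 - 1)

print(dir_density(0.3))                 # 1.0 everywhere for alpha = (1, 1)
print(quad(dir_density, 0.0, 1.0)[0])   # integrates to 1.0 over x1
```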
Response to OP Comment
Due to the simplex constraints, the two-variable Dirichlet density is actually degenerate in $\mathbb{R}^2$, as shown by my construction above (it only requires one variable). While it is true it has a density of $1$, it does not have a density of $1$ on the line segment connecting $(1,0)$ with $(0,1)$. What the above construction shows is that the marginal density has a value of $1$. Your confusion comes from thinking of $x_2$ as a free variable, in which case the support of the Dirichlet on $\mathbb{R}^2$ would have a non-zero area. This intuition is fine in cases like the bivariate Gaussian, where the two variables are not perfectly correlated, but not in this case.
We can formally derive this as follows:
Let $L$ be some number in $[0,\sqrt{2}]$ specifying the distance from $(1,0)$ toward $(0,1)$ along the connecting line segment. Thus, each value of $L$ identifies a unique $(x_1,x_2)$ pair. Using this notation, your assumption that the density is $1$ along this line boils down to:
$$P(L \in [a,b])=b-a, \qquad [a,b] \subset [0,\sqrt{2}]$$
However, we can show this is not the case through a formal treatment of the joint density of $x_1,x_2$:
$$P_L(L\in [a,b])=P_{X_1,X_2}[(x_1,x_2) \in A_{[a,b]}]$$
Where $A_{[a,b]}:= \{(u,v): u \in [1-\frac{b}{\sqrt{2}},1-\frac{a}{\sqrt{2}}],\; v = 1-u\}$
Now, let's calculate $P_L(L\in [a,b])$:
$$P_L(L\in [a,b])= \int_{A_{[a,b]}} dP_{X_1,X_2}= \int_{A_{[a,b]}} dP_{X_1}\,dP_{X_2|X_1} =\int_{A_{[a,b]}} 1 \;dP_{X_1} = \int_{1-\frac{b}{\sqrt{2}}}^{1-\frac{a}{\sqrt{2}}}1\; du = \left(1-\frac{a}{\sqrt{2}}\right) - \left(1-\frac{b}{\sqrt{2}}\right) = \frac{b-a}{\sqrt{2}}$$
Where the third equality comes about because $dP_{X_2|X_1} = 1$ for $X_2=1-X_1$ (i.e., it's not a density, but a point probability mass at $1-X_1$).
As you can see, we've recovered the $\frac{1}{\sqrt{2}}$ normalizing constant for the density along the line segment in $\mathbb{R}^2$. Effectively, this (degenerate) joint density is just a linear transformation of one of the two marginals (either one will work). The transformation stretches the domain of the probability density from length $1$ to length $\sqrt{2}$, hence the density must decrease to compensate.
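We can also confirm this by Monte Carlo (a sketch assuming NumPy; the sample size and the interval $[a,b]$ are arbitrary choices). With $L = \sqrt{2}\,(1 - x_1)$ being the arc length from $(1,0)$, the empirical probability matches $\frac{b-a}{\sqrt{2}}$:

```python
# Monte Carlo check that P(L in [a, b]) = (b - a)/sqrt(2) under Dirichlet(1, 1),
# where L is the distance from (1, 0) along the segment to (0, 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.dirichlet([1.0, 1.0], size=1_000_000)  # rows are (x1, x2) pairs
L = np.sqrt(2) * (1.0 - x[:, 0])               # arc length from (1, 0)

a, b = 0.2, 0.9
print(np.mean((L >= a) & (L <= b)))            # empirical probability
print((b - a) / np.sqrt(2))                    # theoretical value, ≈ 0.495
```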
Best Answer
A Dirichlet distribution is often used to probabilistically categorize events among several categories. Suppose that the probabilities of weather events follow a Dirichlet distribution. We might then think that tomorrow's weather has probability of being sunny equal to 0.25, probability of rain equal to 0.5, and probability of snow equal to 0.25. Collecting these values in a vector creates a vector of probabilities.
Another way to think about a Dirichlet distribution is the process of breaking a stick. Imagine a stick of unit length. Break that stick anywhere and retain one of the two pieces. Then break the remaining piece into two pieces and continue this as long as you desire. All of the pieces together must sum to unit length, and allocating pieces of different lengths to different events represents the probability of that event.
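Here is a minimal sketch of that stick-breaking process (assuming NumPy; the helper `stick_breaking` and the example alphas are my own). If each break fraction is drawn from $\text{Beta}(\alpha_k, \sum_{j>k} \alpha_j)$, the resulting piece lengths are exactly $\text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$ distributed:

```python
# Break a unit stick piece by piece; the pieces sum to 1 by construction,
# and with Beta(alpha_k, sum of remaining alphas) breaks they follow a Dirichlet.
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha):
    alpha = np.asarray(alpha, dtype=float)
    remaining, pieces = 1.0, []
    for k in range(len(alpha) - 1):
        frac = rng.beta(alpha[k], alpha[k + 1:].sum())  # where to break
        pieces.append(remaining * frac)                 # keep this piece
        remaining -= pieces[-1]                         # keep breaking the rest
    pieces.append(remaining)                            # last piece closes the stick
    return np.array(pieces)

theta = stick_breaking([2.0, 3.0, 5.0])
print(theta, theta.sum())  # lengths sum to 1
```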
If you're familiar with the beta distribution, the Dirichlet distribution might become even more clear. A beta distribution is often used to describe a distribution of probabilities of dichotomous events, so it's restricted to the unit interval. For example, for a Bernoulli trial, there is only a parameter $\theta$ describing the probability of a "success." Often we think of $\theta$ as being fixed, but if we are uncertain about the "true" value of $\theta$, we could think about a distribution over all possible $\theta$s, with larger likelihood for those we consider more plausible. Perhaps $\theta \sim \text{B}(\alpha, \beta)$, where $\alpha>\beta$ concentrates more of the mass near 1 and $\beta > \alpha$ concentrates more of the mass near 0.
One might object that the beta distribution only describes the probability of a single probability, that is, for example, that $P(\theta<0.25)=0.5$, which is a scalar. But keep in mind that the beta distribution describes dichotomous outcomes, so by applying Kolmogorov's second axiom, we also know that $P(\theta \ge 0.25)=0.5$. Collecting these results in a vector gives us a vector of probabilities.
Extending the beta distribution to three or more categories gives us the Dirichlet distribution; indeed, the PDF of the Dirichlet for two groups is exactly the beta distribution.
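A quick check of that last equivalence (a sketch assuming SciPy; the parameter values are arbitrary examples):

```python
# For two groups, the Dirichlet(a, b) density at (x, 1 - x) equals
# the Beta(a, b) density at x.
from scipy.stats import beta, dirichlet

a, b, x = 2.0, 5.0, 0.3
print(beta.pdf(x, a, b))                          # Beta density at x
print(dirichlet.pdf([x, 1.0 - x], alpha=[a, b]))  # same value
```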