[Math] Mathematical Explanation of the Difference between SQL Joins: Inner, Outer, Left, Right

computer sciencediscrete mathematics

Question

This question calls for a mathematically sound & intuitive explanation of SQL joins that clearly shows the difference between the following:

  • Inner Join
  • Left Join
  • Right Join
  • Full Outer Join

The explanation of joins should not misuse Venn diagrams. This is key. It should also be as accessible as possible to a computer programmer or mathematical beginner. We don't want to scare programmers away from mathematical concepts by using too much jargon. Of course, a little bit of maths is always necessary.

Motivation

The internet is rife with usages of Venn diagrams to explain SQL joins. As pointed out in the following articles, this leads to a grave misunderstanding of either Venn diagrams, SQL joins or both:

As a website that many students of mathematics and computer science consult as a source of truth, it is our responsibility as a community to try everything in our power to propagate truth. Unfortunately, Venn diagram usage to explain a concept which is really Cartesian product at its core is all to rife.

Our own sister site, StackOverflow, is unfortunately part of this problem: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join/38578#38578. While there are many amazing answers under that question, the prevailing belief on that site appears to be that joins are intersections/unions and Venn diagrams are appropriate to explain them. The top ranked and accepted answer uses Venn diagrams and intersection/union to explain joins.

While there may be some cases where join coincides with intersections and unions, it is not in general the case. I fear that people are simply seeing the special case and accepting the Venn diagram explanation. I fear they are then walking away with improper understanding of SQL joins and set theory.

I am hoping that by posting a question here, even a small percentage of people might be directed here instead of to another site that has SQL joins incorrectly explained using Venn diagrams. I am hoping that at least one of the Stack Exchange websites can have an accepted answer explaining SQL joins that is mathematically accurate, and potentially many other good alternative answers alongside it to provide different perspectives.

To be clear: I think I understand SQL joins myself. The purpose of this question is to create visibility and a source of truth for those new students of computer science and mathematics who might not understand them fully.

Related

Is Cartesian Product same as SQL Full Outer Join?

Best Answer

Let $A, B$ be sets. We think of $A$ and $B$ as tables, and their elements as rows. Each element of $x\in A$ is a list of data entries, one for each column of $A$.

(Edit: WLOG assume $A$ and $B$ do not have duplicate entries. If they do, add a unique index column to each.)

Let $R$ be any relation, that is, a subset $R \subseteq A \times B$, where we write $a \sim \, b$ if $(a,b) \in R$. In SQL $R$ corresponds to the statement that appears after "ON", e.g., A.name = B.name corresponds to the relation $x \sim y$ if and only if the entry in the name column of for a row $x \in A$ is the same as the name column in a row of $y \in A$.

Then $$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, |\, a \sim b\}\, (=R).$$

(Edit: Here $(a,b)$ represents the concatenation of the entries of rows $a$ and $b$, corresponding to SELECT * FROM A JOIN B ON R. Of course the actual output may differ depending on the implementation.)

But here, if $a \in A$ is such that there is no corresponding $b$ such that $a \sim b$, then $a$ will not appear in the join. If you take a left join, you want every $a$ to appear regardless. So you add a special element $\operatorname{NULL}$ and add it to your relation. $\operatorname{NULL}$ obeys the rules

$a \sim \operatorname{NULL}$ iff there is no $b \in B$ with $a \sim b$

$\operatorname{NULL} \sim b$ iff there is no $a \in A$ with $a \sim b$

Now let $$\hat{A} = A \cup \{\operatorname{NULL}\},$$ $$\hat{B} = B \cup \{\operatorname{NULL}\}.$$

Then we have

$$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, | a \sim b\}$$ $$A \operatorname{ LEFT JOIN } B \operatorname{ON} R = \{(a,b) \in A \times \hat{B} \, | a \sim b\}$$ $$A \operatorname{ RIGHT JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times B \, | a \sim b\}$$ $$A \operatorname{ OUTER JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times \hat{B} \, | a \sim b\}.$$

Thus we'll have the pairs $(a, \operatorname{NULL})$ appear on the left join whenever $a$ doesn't match any $b$, and $(\operatorname{NULL}, b)$ whenever $b$ doesn't match any $a$ in the right join. (note that we don't have $\operatorname{NULL} \sim \operatorname{NULL}$, so we never have $(\operatorname{NULL}, \operatorname{NULL})$.)

The reason that Venn diagrams are used to depict joins is that usually joins are usually done on relations as simple as the one given above, $R$ corresponding to A.name = B.name. In that case, if $\text{names}(T)$ is the set of names that appear in a table $T$, that is, $\text{names}(T)$ = SELECT DISTINCT names FROM T, then

\begin{align*}\text{names}(A\operatorname{ INNER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cap \text{names}(B) \\ \text{names}(A\operatorname{ LEFT JOIN } B \operatorname{ON} R) &= \text{names}(A)\\ \text{names}(A\operatorname{ RIGHT JOIN } B \operatorname{ON} R) &= \text{names}(B)\\ \text{names}(A\operatorname{ OUTER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cup \text{names}(B).\end{align*}

However, this completely loses sight of the fact that joins may be one-to-one, many-to-one, or many-to-many, and personally I've found those Venn diagrams more confusing than helpful when learning about joins.

Related Question