Drawing partial trees of different sizes side-by-side

Tags: forest, graphs, qtree, tikz-trees, trees

There exists an elaborate thread about how to draw rooted trees in LaTeX, e.g. for use in natural language applications. There exist packages other than TikZ to do this, like qtree or forest. However, a common theme in all these trees is that the leaves eventually join together in one overarching root. There are applications where this isn't the case; one example is hierarchical segmentation like byte-pair encoding. Another could be cutting across an agglomerative clustering hierarchy.

The net result is that you get multiple trees, but the leaves of one tree must be aligned and the leaves of different trees must also be aligned. Here is a hypothetical example I drew in Paint for aggregating the letters of discombobulate, ending at a segmentation discom+bobulate:

[Image: BPE tokenisation tree for discombobulate]

How would one reproduce this in LaTeX, given a list of merges? There are several freedoms:

  • You are allowed to choose any specification format for the merges/tree(s). The list of merges comes from a Python program anyway, so it doesn't take much to adapt to the format you think is easiest. I have a programmatic representation of the tree, so it's trivial to convert it to something like Newick format which is almost literal forest code.
  • You are free to choose the height at which a non-leaf node is placed, as long as it is (1) discretised and (2) above its constituents.
  • If you prefer right-angled branches like in a dendrogram, that is fine too.
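
To make the Newick remark concrete, here is a minimal sketch of the correspondence, assuming the plain forest package with its default settings. The Newick string is shown only as a comment; the bracket form beneath it is forest input for the same hierarchy, with the label written before the children instead of after them.

```latex
\documentclass[margin=3mm]{standalone}
\usepackage{forest}

\begin{document}
% Newick: ((d,(i,s)),(c,(o,m)))discom;
% forest: square brackets instead of parentheses and commas,
%         and a node's label precedes its children.
\begin{forest}
[discom
  [ [d] [ [i] [s] ] ]
  [ [c] [ [o] [m] ] ]
]
\end{forest}
\end{document}
```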

There are several requirements I would like a solution to have, to make it sufficiently general:

  • Should allow more than two constituents for one node (e.g. "d", "i", "s" merge together immediately into "dis" without passing through "is"). Hence, please don't use a package that only allows drawing binary trees.
  • Should have some control over horizontal and vertical compactness, i.e. how closely packed the layers are and how far the leaves are from each other.
  • Should allow turning off intermediate node names.

The latter would produce something like the following image:

[Image: byte-pair encoding merge tree without intermediate nodes]


Note to editors: a better title to this question is always appreciated. I don't like the title I came up with.

Best Answer

Using the forest package:

  • with forked edge:
\documentclass[margin=3mm, varwidth]{standalone}
\usepackage[edges]{forest}

\begin{document}
    \begin{figure}[ht]
\forestset{
  LT/.style = {% linguistic-style tree
    delay={where content={}{shape=coordinate}{}}, % nodes without a label become bare coordinates
    where n children=0{tier=word, baseline}{},     % leaves share one tier; baseline is set at leaf level
    for tree={
      text height = 2ex,
      text depth  = 0.5ex,
      inner ysep  = 0pt,
      inner xsep  = 1pt,
      forked edge,    % right-angled (forked) edges
      s sep = 1mm,    % sibling distance
    },
  },
}
    
\begin{forest}  LT
[discom
    [
        [d]
            [
                [i]
                [s]
            ]
    ]
    [
        [c]
            [
                [o]
                [m]
            ]
    ]
]
\end{forest}
\quad
\begin{forest}  LT
[bobulate
    [
        [b]
            [
                [o]
                [b]
            ]
    ]
    [
        [
            [
                [u]
                [l]
            ]
        ]
        [
            [
                [a]
                [t]
            ]
            [e]
        ]
    ]
]
\end{forest}
    \end{figure}
\end{document}


  • as linguistic tree:
\documentclass[margin=3mm, varwidth]{standalone}
\usepackage[linguistics]{forest}

\begin{document}
    \begin{figure}[ht]
\forestset{
  LT/.style = {% linguistic-style tree
    delay={where content={}{shape=coordinate}{}}, % nodes without a label become bare coordinates
    where n children=0{tier=word, baseline}{},     % leaves share one tier; baseline is set at leaf level
    for tree={
      text height = 2ex,
      text depth  = 0.5ex,
      inner ysep  = 0pt,
      inner xsep  = 1pt,
      s sep = 1mm,    % sibling distance
    },
  },
}
    
\begin{forest}  LT
[discom
    [
        [d]
            [
                [i]
                [s]
            ]
    ]
    [
        [c]
            [
                [o]
                [m]
            ]
    ]
]
\end{forest}
\quad
\begin{forest}  LT
[bobulate
    [
        [b]
            [
                [o]
                [b]
            ]
    ]
    [
        [
            [
                [u]
                [l]
            ]
        ]
        [
            [
                [a]
                [t]
            ]
            [e]
        ]
    ]
]
\end{forest}
    \end{figure}
\end{document}

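
The question also asks for nodes with more than two constituents and for optional intermediate node names. A minimal sketch of both, assuming the same LT style as above: because LT only turns nodes with empty content into coordinates, writing a label into an intermediate node is enough to display it, and a node may simply list three or more children. The tree shape here (d+i+s merging directly into dis) is a hypothetical variant chosen for illustration.

```latex
\documentclass[margin=3mm, varwidth]{standalone}
\usepackage[linguistics]{forest}

\begin{document}
\forestset{
  LT/.style = {% same style as in the examples above
    delay={where content={}{shape=coordinate}{}},
    where n children=0{tier=word, baseline}{},
    for tree={
      text height = 2ex,
      text depth  = 0.5ex,
      inner ysep  = 0pt,
      inner xsep  = 1pt,
      s sep = 1mm,
    },
  },
}
% Ternary merge d+i+s -> dis; intermediate nodes carry their names.
% Delete a label (leave the brackets) to turn that node back into a bare joint.
\begin{forest} LT
[discom
  [dis [d] [i] [s]]
  [com [c] [om [o] [m]]]
]
\end{forest}
\end{document}
```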

Addendum:
If you want the distance between the letters at the bottom of the trees to be the same as the distance between the trees, first define a new command for this distance, for example

\newcommand\SD{2mm}

and then replace

  • \quad with \hskip \SD
  • s sep = ... with s sep = \SD + 1mm

MWE:

\documentclass[margin=3mm, varwidth]{standalone}
\usepackage[linguistics]{forest}

\begin{document}
    \begin{figure}[ht]

\newcommand\SD{1 mm}        % <-------   
\forestset{
  LT/.style = {% linguistic-style tree
    delay={where content={}{shape=coordinate}{}}, % nodes without a label become bare coordinates
    where n children=0{tier=word, baseline}{},     % leaves share one tier; baseline is set at leaf level
    for tree={
      text height = 2ex,
      text depth  = 0.5ex,
      draw,           % outlines the nodes so the distances are visible; remove in a real document
      inner ysep  = 0pt,
      inner xsep  = 1pt,
      s sep = \SD + 1mm, % <-------
    },
  },
}
    
\begin{forest}  LT,
[discom
    [
        [d]
            [
                [i]
                [s]
            ]
    ]
    [
        [c]
            [
                [o]
                [m]
            ]
    ]
]
\end{forest}
\hskip \SD                  % <------- 
\begin{forest}  LT
[bobulate
    [
        [b]
            [
                [o]
                [b]
            ]
    ]
    [
        [
            [
                [u]
                [l]
            ]
        ]
        [
            [
                [a]
                [t]
            ]
            [e]
        ]
    ]
]
\end{forest}
    \end{figure}
\end{document}


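For better visibility of the distances, the nodes in this MWE are drawn with a border (the draw option); remove it in a real document.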
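
The examples above tune only the horizontal spacing (s sep). For the vertical compactness asked about in the question, a possible sketch is to also set forest's l sep and l options. The style name LTcompact and the values below are assumptions chosen for illustration, and the comments reflect my reading of the forest manual, so check the result against your own trees.

```latex
\documentclass[margin=3mm, varwidth]{standalone}
\usepackage[linguistics]{forest}

\begin{document}
\forestset{
  LTcompact/.style = {% hypothetical name: LT with tighter layers
    delay={where content={}{shape=coordinate}{}},
    where n children=0{tier=word, baseline}{},
    for tree={
      text height = 2ex,
      text depth  = 0.5ex,
      inner ysep  = 0pt,
      inner xsep  = 1pt,
      s sep = 1mm,  % horizontal: minimum gap between siblings
      l sep = 2mm,  % vertical: minimum gap between a parent and its children
      l     = 0pt,  % lower the minimum level distance so that l sep decides
    },
  },
}
\begin{forest} LTcompact
[discom
  [ [d] [ [i] [s] ] ]
  [ [c] [ [o] [m] ] ]
]
\end{forest}
\end{document}
```

Increasing l sep spreads the layers apart again; because the leaves are pinned to one tier, the levels above them stretch or shrink accordingly.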
