Given DAG find smallest DT

graph theory

I stumbled upon this graph theory problem, and I think i managed to solve it, but I need some help to making my solution rigorous and some help for prove. I am pretty new to graph theory, so I may use use the language correctly

The problem:
Given a labelled directed, acyclic and rooted graph $\mathcal{G}$, find a labelled rooted, directed tree (DT) $\mathcal{D}$, in which multiple vertices may be labelled identically, so that there is a directed edge between two vertices in $\mathcal{D}$, if there is a directed edge in $\mathcal{G}$ with the same two labels, that has as few nodes as possible.

Here rooted means, that is has a node, which can reach all other nodes. And by directed tree i mean a DAG whose underlying undirected graph is a tree.

My solution:

"Cut" the graph, if a node have n>1 entries, then copy it n times, and distribute one entry to each node.

Examples: See picture. I would like to make this procedure more rigorous

Proof: Consider a node, call it N, with n>1 entries. Since we need n paths to N, we need at least n-1 copies of N, making our method optimal. This is very rigorous either.
example

Best Answer

Say $\mathcal G$ has $m$ edges; then $\mathcal D$ must also have at least one copy of those $m$ edges. Additionally, when $\mathcal G$ has $s$ source vertices, $\mathcal D$ must contain at least $s-1$ edges that are not in $\mathcal G$: except for one source, which can be the root, $\mathcal D$ should have an edge into each of the other source vertices.

Therefore, $\mathcal D$ must contain at least $m+s-1$ edges. Each non-root vertex of a directed tree has $1$ edge going into it, and the root has $0$, so $\mathcal D$ must contain at least $m+s$ vertices. The procedure below will create a tree with exactly $m+s$ vertices, so it will be optimal.


Here is one way we can make your "cut" procedure rigorous:

Go through the vertices in an arbitrary order. When we get to vertex $v$, if there are edges $u_1 v, u_2 v, \dots, u_k v$ pointing into $v$, leave one of them (say, $u_1 v$) alone. Create $k-1$ clones of $v$ (call them $v^{(2)}, \dots, v^{(k)}$ to disambiguate) and replace the edges $u_2v, \dots, u_k v$ with edges $u_2 v^{(2)}, \dots, u_k v^{(k)}$.

All other edges should be left alone until later steps; in particular, the edges out of $v$ should remain edges out of $v$, and the clones $v^{(2)}, \dots, v^{(k)}$ should have no edges leaving them.

As an exception, since we want $\mathcal D$ to be a rooted tree, and therefore connected, we perform one more step in case there are no edges into $v$ (that is, if $k=0$). Such nodes are called sources. The first source that we see, we should leave alone; call that vertex $r$ (it will be the root of $\mathcal D$). For each subsequent vertex $v$ that's a source, we should add an edge from $r$ to $v$ in $\mathcal D$. This is not the only way to connect $\mathcal D$, but it is the most straightforward.

If $\mathcal G$ had $s$ source vertices, then there were $s-1$ steps in which we created a new edge of $\mathcal D$, instead of modifying an existing edge of $\mathcal G$. So we have exactly achieved the lower bound of $m+s-1$ edges, and $m+s$ vertices, mentioned at the beginning of this answer.

Related Question