As some of the information you provided states, the two are not the same. I like better the terminology of conditional (on covariates) and unconditional (marginal) estimates. There is a very subtle language problem that clouds the issue greatly. Analysts who tend to love "population average effects" have a dangerous tendency to try to estimate such effects from a sample with no reference to any population distribution of subject characteristics. In this sense the estimates should not be called population average estimates but instead should be called sample average estimates. It is very important to note that sample average estimates have a low chance of being transportable to the population from which the sample came or in fact to any population. One reason for this is the somewhat arbitrary selection criteria for how subjects get into studies.
As an example, if one compared treatment A and treatment B in a binary logistic model adjusted for sex, one obtains a treatment effect that is specific to both males and females. If the sex variable is omitted from the model, a sample average odds ratio effect for treatment is obtained. This in effect is a comparison of some of the males on treatment A with some of the females on treatment B, due to non-collapsibility of the odds ratio. If one had a population with a different female:male frequency, this average treatment effect coming from a marginal odds ratio for treatment, will no longer apply.
So if one wants a quantity that pertains to individual subjects, full conditioning on covariates is required. And these conditional estimates are the ones that transport to populations, not the so-called "population average" estimates.
Another way to think about it: think of an ideal study for comparing treatment to no treatment. This would be a multi-period randomized crossover study. Then think about the next best study: a randomized trial on identical twins where one of the twins in each pair is randomly selected to get treatment A and the other is selected to get treatment B. Both of these ideal studies are mimicked by full conditioning, i.e., full covariate adjustment to get conditional and not marginal effects from the more usual parallel group randomized controlled trial.
I'll try to break it down a bit.. I think most of the confusion when studying potential outcomes (ie $Y_0,Y_1$) is to realize that $Y_0,Y_1$ are different than $Y$ without bringing in the covariate $X$. The key is to realize that every individual $i$ has potential outcomes $(Y_{i1},Y_{i0})$, but you only observe $Y_{iT}$ in the data.
Ignorability says
$$(Y_0,Y_1) \perp \!\!\! \perp T|X$$
which says that conditional on $X$, then the potential outcomes are independent of treatment $T$. It is not saying that $Y$ is independent of $T$. As you point out, that makes no sense. In fact, a classic way to re-write $Y$ is as
$$Y = Y_1T + Y_0(1-T)$$
which tells us that for every individual, we observe $Y_i$ which is either $Y_{i1}$ or $Y_{i0}$ depending on the value of treatment $T_i$. The reason for potential outcomes is that we want to know the effect $Y_{i1} - Y_{i0}$ but only observe one of the two objects for everyone. The question is: what would have $Y_{i0}$ been for the individuals $i$ who have $T_i=1$ (and vice versa)? Ignoring the conditional on $X$ part, the ignorability assumption essentially says that treatment $T$ can certainly affect $Y$ by virtue of $Y$ being equal to $Y_1$ or $Y_0$, but that $T$ is unrelated to the values of $Y_0,Y_1$ themselves.
To motivate this, consider a simple example where we have only two types of people: weak people and strong people. Let treatment $T$ be receiving medication, and $Y$ is health of patient (higher $Y$ means healthier). Strong people are far healthier than weak people. Now suppose that receiving medication makes everyone healthier by a fixed amount.
First case: suppose that only unhealthy people seek out medication. Then those with $T=1$ will be mostly the weak people, since they are the unhealthy people, and those with $T=0$ will be mostly strong people. But then ignorability fails, since the values of $(Y_1,Y_0)$ are related to treatment status $T$: in this case, both $Y_1$ and $Y_0$ will be lower for $T=1$ than for $T=0$ since $T=1$ is filled with mostly weak people and we stated that weak people just are less healthy overall.
Second case: suppose that we randomly assign medication to our pool of strong and weak people. Here, ignorability holds, since $(Y_1,Y_0)$ are independent of treatment status $T$: weak and strong people are equally likely to receive treatment, so the values of $Y_1$ and $Y_0$ are on average the same for $T=0$ and $T=1$. However, since $T$ makes everyone healthier, clearly $Y$ is not independent of $T$.. it has a fixed effect on health in my example!
In other words, ignorability allows that $T$ directly affects whether you receive $Y_1$ or $Y_0$, but treatment status is not related to the these values. In this case, we can figure out what $Y_0$ would have been for those who get treatment by looking at the effect of those who didn't get treatment! We get a treatment effect by comparing those who get treatment to those who don't, but we need a way to make sure that those who get treatment are not fundamentally different from those who don't get treatment, and that's precisely what the ignorability condition assumes.
We can illustrate with two other examples:
A classic case where this holds is in randomized control trials (RCTs) where you randomly assign treatment to individuals. Then clearly those who get treatment may have a different outcome because treatment affects your outcome (unless treatment really has no effect on outcome), but those who get treatment are randomly selected and so treatment receival is independent of potential outcomes, and so you indeed do have that $(Y_0,Y_1) \perp \!\!\! \perp T$. Ignorability assumption holds.
For an example where this fails, consider treatment $T$ be an indicator for finishing high school or not, and let the outcome $Y$ be income in 10 years, and define $(Y_0,Y_1)$ as before. Then $(Y_0,Y_1)$ is not independent of $T$ since presumably the potential outcomes for those with $T=0$ are fundamentally different from those with $T=1$. Maybe people who finish high school have more perseverance than those who don't, or are from wealthier families, and these in turn imply that if we could have observed a world where individuals who finished high school had not finished it, their outcomes would still have been different than the observed pool of individuals who did not finish high school. As such, ignorability assumption likely does not hold: treatment is related to potential outcomes, and in this case, we may expect that $Y_0 | T_i = 1 > Y_0 | T_i = 0$.
The conditioning on $X$ part is simply for cases where ignorability holds conditional on some controls. In your example, it may be that treatment is independent of these potential outcomes only after conditioning on patient history. For an example where this may happen, suppose that individuals with higher patient history $X$ are both sicker and more likely to receive treatment $T$. Then without $X$, we run into the same problem as described as above: the unrealized $Y_0$ for those who receive treatment may be lower than the realized $Y_0$ for those who did not receive treatment because the former are more likely to be unhealthy individuals, and so comparing those with and without treatment will cause issues since we are not comparing the same people. However, if we control for patient history, we can instead assume that conditional on $X$, treatment assignment to individuals is again unrelated to their potential outcomes and so we are good to go again.
Edit
As a final note, based on chat with OP, it may be helpful to relate the potential outcomes framework to the DAG in OP's post (Noah's response covers a similar setting with more formality, so definitely also worth checking that out). In these type of DAGs, we fully model relationships between variables. Forgetting about $X$ for a it, suppose we just have that $T \rightarrow Y$. What does this mean? Well it means that the only effect of T is through $T = 1$ or $T= 0$, and through no other channels, so we immediately have that T affects $Y_1T+ Y_0(1-T)$ only through the value of $T$. You may think "well what if T affects Y through some other channel" but by saying $T \rightarrow Y$, we are saying there are no other channels.
Next, consider your case of $X \rightarrow T \rightarrow Y \leftarrow X$. Here, we have that T directly affects Y, but X also directly affects T and Y.
Why does ignorability fail? Because T can be 1 through the effect of X, which will also affect Y, and so $T = 1$ could affect $Y_0$ and $Y_1$ for the group where $T=1$, and so T affects $Y_1T + Y_0(1-T)$ both through 1. the direct effect of the value of T, but 2. T now also affects $Y_1$ and $Y_0$ through the fact that $X$ affects $Y$ and $T$ at the same time.
Best Answer
SUTVA says that $\delta_i(d)$ doesn't depend on treatment assignment for individuals other than $i$. Ignorability says it doesn't depend on treatment assignment for individual $i$.