You construct the policy dummy the way you first describe it, i.e. create a column of zeroes. Then, for each firm, replace the zeroes with ones whenever the firm is in the treatment group AND the observation falls in the post-treatment period. Something like this:
$$
\begin{array}{ccccc}
\text{firm} & \text{time} & \text{treated} & \text{post} & \text{policy} \\
\hline
1 & 1 & 0 & 0 & 0 \\
1 & 2 & 0 & 0 & 0 \\
1 & 3 & 0 & 1 & 0 \\
1 & 4 & 0 & 1 & 0 \\
\hline
2 & 1 & 1 & 0 & 0 \\
2 & 2 & 1 & 0 & 0 \\
2 & 3 & 1 & 1 & 1 \\
2 & 4 & 1 & 1 & 1 \\
\hline
3 & 1 & 1 & 0 & 0 \\
3 & 2 & 1 & 0 & 0 \\
3 & 3 & 1 & 0 & 0 \\
3 & 4 & 1 & 1 & 1 \\
\end{array}
$$
where $\text{post}$ is an indicator for the post-treatment period. In your equation above, $\alpha_0$ and $\text{Treat}_i$ will be absorbed by the firm fixed effects.
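As a minimal sketch of that construction (written in Python purely for illustration; the names `rows` and `policy` are assumptions, not part of your data), the policy dummy is simply switched on for rows that are both treated and post-treatment:

```python
# Each row is (firm, time, treated, post), copied from the table above.
rows = [
    (1, 1, 0, 0), (1, 2, 0, 0), (1, 3, 0, 1), (1, 4, 0, 1),
    (2, 1, 1, 0), (2, 2, 1, 0), (2, 3, 1, 1), (2, 4, 1, 1),
    (3, 1, 1, 0), (3, 2, 1, 0), (3, 3, 1, 0), (3, 4, 1, 1),
]

# Start from a column of zeroes, then switch on the treated-and-post rows.
policy = [0] * len(rows)
for k, (firm, time, treated, post) in enumerate(rows):
    if treated == 1 and post == 1:
        policy[k] = 1

# Print the table with the new column appended.
for (firm, time, treated, post), p in zip(rows, policy):
    print(firm, time, treated, post, p)
```

The loop is equivalent to multiplying the two indicators, `treated * post`, row by row.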
Regarding the interpretation, this setting makes an assumption which I probably did not state in the previous answer. The assumption is that the treatment effect is the same across all periods. This means that if a firm is treated yesterday and has a gain of 2, then a firm which is treated today also has a gain of 2 (relative to firms which are never treated). I made a graph to show what this assumption means
In case you would like a reference for this, you can check out Jeff Wooldridge's notes on difference in differences and the section on extensions for multiple groups and time periods: http://www.nber.org/WNE/Slides7-31-07/slides_10_diffindiffs.pdf (What’s New in Econometrics? Lecture 10 Difference-in-Differences Estimation, Wooldridge 2007).
A nice feature of difference-in-differences (DiD) is actually that you don't need panel data for it. Given that the treatment happens at some sort of level of aggregation (in your case cities), you only need to sample random individuals from the cities before and after the treatment. This allows you to estimate
$$
y_{ist} = A_g + B_t + \beta D_{st} + c X_{ist} + \epsilon_{ist}
$$
and get the causal effect of the treatment as the expected post-pre outcome difference for the treated minus the expected post-pre outcome difference for the control.
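To illustrate that no individual is followed over time, here is a small sketch in Python with made-up numbers (the `sample` data and the helper `cell_mean` are hypothetical). Without the covariates $X_{ist}$, the DiD estimate reduces to a difference of four cell means computed from repeated cross-sections:

```python
from statistics import mean

# Hypothetical repeated cross-sections: (city_treated, post, outcome).
# Different individuals are sampled before and after; nobody is tracked.
sample = [
    (0, 0, 10.0), (0, 0, 12.0),   # control cities, before
    (0, 1, 13.0), (0, 1, 15.0),   # control cities, after
    (1, 0, 20.0), (1, 0, 22.0),   # treated cities, before
    (1, 1, 28.0), (1, 1, 30.0),   # treated cities, after
]

def cell_mean(treated, post):
    """Average outcome in one (group, period) cell."""
    return mean(y for g, t, y in sample if g == treated and t == post)

# Post-pre change for the treated minus post-pre change for the control.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(did)
```

With these invented numbers the treated cities change by 8 and the control cities by 3, so the sketch prints 5.0.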
There is a case in which people use individual fixed effects instead of a treatment indicator and this is when we don't have a well-defined level of aggregation at which the treatment occurs. In that case you would estimate
$$
y_{it} = \alpha_i + B_t + \beta D_{it} + cX_{it}+\epsilon_{it}
$$
where $D_{it}$ is an indicator for the post-treatment period for individuals who received the treatment (for example, a job market program which happens all over the place). For more information on this see these lecture notes by Steve Pischke.
In your setting, adding individual fixed effects should not change anything with respect to the point estimates. The treatment indicator $A_g$ will just be absorbed by the individual fixed effects. However, these fixed effects might soak up some of the residual variance and therefore potentially reduce the standard error of your DiD coefficient.
Here is a code example which shows that this is the case. I use Stata but you can replicate this in the statistical package of your choice. The "individuals" here are actually countries but they are still grouped according to some treatment indicator.
    * load the data set (requires an internet connection)
    use "http://dss.princeton.edu/training/Panel101.dta"

    * generate the time and treatment group indicators and their interaction
    gen time = (year>=1994) & !missing(year)
    gen treated = (country>4) & !missing(country)
    gen did = time*treated

    * do the standard DiD regression
    reg y_bin time treated did

    ------------------------------------------------------------------------------
           y_bin |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |       .375   .1212795     3.09   0.003     .1328576    .6171424
         treated |   .4166667   .1434998     2.90   0.005       .13016    .7031734
             did |  -.4027778   .1852575    -2.17   0.033    -.7726563   -.0328992
           _cons |         .5   .0939427     5.32   0.000     .3124373    .6875627
    ------------------------------------------------------------------------------

    * now repeat the same regression but also including country fixed effects
    areg y_bin did time treated, a(country)

    ------------------------------------------------------------------------------
           y_bin |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |       .375    .120084     3.12   0.003     .1348773    .6151227
         treated |          0  (omitted)
             did |  -.4027778   .1834313    -2.20   0.032    -.7695713   -.0359843
           _cons |   .6785714    .070314     9.65   0.000       .53797    .8191729
    -------------+----------------------------------------------------------------
So you see that the DiD coefficient remains the same when the individual fixed effects are included (`areg` is one of the available fixed effects estimation commands in Stata). The standard errors are slightly tighter and our original treatment indicator was absorbed by the individual fixed effects and therefore dropped in the regression.
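To see why `treated` is dropped, note that it is an exact linear combination of the country dummies. Writing $\mathcal{T}$ for the set of treated countries (notation introduced here only for illustration), the treatment indicator for country $i$ satisfies

$$
\text{treated}_i = \sum_{j \in \mathcal{T}} \mathbf{1}[i = j]
$$

so once a full set of country fixed effects is in the regression, `treated` carries no independent variation and cannot be separately identified. The interaction `did`, by contrast, varies within a country over time, which is why its coefficient survives.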
In response to the comment
I mentioned the Pischke example to show when people use individual fixed effects rather than a treatment group indicator. Your setting has a well-defined group structure, so the way you have written your model is perfectly fine. Standard errors should be clustered at the city level, i.e. the level of aggregation at which the treatment occurs (I haven't done this in the example code, but in DiD settings the standard errors need to be corrected, as demonstrated by the Bertrand et al. paper).
Regarding the movers, they don't have much of a role to play here. The treatment indicator $D_{st}$ is equal to 1 for people who live in a treated city $s$ in the post-treatment period $t$. To compute the DiD coefficient, we actually just need to compute four conditional expectations, namely
$$
\beta = \left[ E(y_{ist}|s=1,t=1) - E(y_{ist}|s=1,t=0)\right] - \left[ E(y_{ist}|s=0,t=1) - E(y_{ist}|s=0,t=0)\right]
$$
So if you have 4 post-treatment periods for an individual who lives in a treated city for the first two, and then moves to a control city for the remaining two periods, the first two of those observations will be used in the computation of $E(y_{ist}|s=1,t=1)$ and the last two in $E(y_{ist}|s=0,t=1)$. To make it clear why identification comes from the group differences over time and not from the movers you can visualize this with a simple graph. Suppose the change in the outcome is truly only because of the treatment and that it has a contemporaneous effect. If we have an individual who lives in a treated city after the treatment starts but then moves to a control city, their outcome should go back to what it was before they were treated. This is shown in the stylized graph below.
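Separately from the graph, the bookkeeping for such a mover can be sketched in a few lines of Python. The outcome values and the names `mover` and `cells` are hypothetical. Each observation is assigned to a cell by where the person lives at that time, not by who the person is:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical post-treatment observations (t = 1 throughout) for one mover:
# (period, lives_in_treated_city s, outcome y).
mover = [
    (1, 1, 7.0),  # lives in a treated city
    (2, 1, 7.0),
    (3, 0, 5.0),  # has moved to a control city
    (4, 0, 5.0),
]

# Sort each observation into the cell given by current city status s and t = 1.
cells = defaultdict(list)
for period, s, y in mover:
    cells[(s, 1)].append(y)

for (s, t), ys in sorted(cells.items()):
    print(f"E(y | s={s}, t={t}) receives {len(ys)} observations, mean {mean(ys)}")
```

Under the contemporaneous-effect assumption used in this illustration, the mover's outcome in the control city equals the untreated level, so these observations do not distort either conditional expectation.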
You might still want to think about movers for other reasons though. For instance, if the treatment has a lasting effect (i.e. it still affects the outcome even though the individual has moved)
Best Answer
A triple difference-in-difference is the correct specification for this problem. I'll present a conceptual explanation and then a mathematical one.
Conceptually, the standard (double) difference-in-difference can also be thought of as estimating a heterogeneous treatment effect. In this perspective, `time` is the "treatment", and we want to estimate how `time` affects the outcome differentially across two groups. (Of course, `time` itself doesn't cause anything. It's just a stand-in for the real treatment that happens between the two time periods.) Thus, we can extend the standard D-in-D into a triple D-in-D if we want to add another layer of heterogeneous treatment effects (i.e. the heterogeneity across big firms vs. small firms in your case).
Mathematically, the specification would be as follows:
\begin{equation} Y = \alpha + \beta_1 T + \beta_2 G + \beta_3 B + \gamma_1 TG + \gamma_2 GB + \gamma_3 TB + \delta_1 TGB \end{equation}
with
\begin{align} T &= \text{treatment time} \\ G &= \text{treatment group} \\ B &= \text{big firms} \end{align}
The DD estimate for the treatment effect in small firms is $\gamma_1$ (exactly the same as the standard DD).
The DD estimate for the treatment effect in big firms is $\gamma_1 + \delta_1$.
Thus the treatment effect for big and small firms differs by $(\gamma_1 + \delta_1) - \gamma_1 = \delta_1$, which is also the coefficient of the triple interaction term, or the DDD estimate.
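The algebra can be checked numerically. The following Python sketch plugs arbitrary, purely hypothetical coefficient values into the specification above, computes the implied cell means, and confirms that the DD within each firm-size group and their difference match the stated coefficients:

```python
# Hypothetical coefficient values, chosen only to make the arithmetic visible.
alpha, b1, b2, b3 = 1.0, 0.5, 2.0, 3.0      # intercept, T, G, B
g1, g2, g3, d1 = 4.0, 0.25, 0.75, 1.5       # TG, GB, TB, TGB

def cell_mean(T, G, B):
    """Expected outcome implied by the triple-difference specification."""
    return (alpha + b1 * T + b2 * G + b3 * B
            + g1 * T * G + g2 * G * B + g3 * T * B + d1 * T * G * B)

def dd(B):
    """Ordinary difference-in-differences among firms with size indicator B."""
    treated_change = cell_mean(1, 1, B) - cell_mean(0, 1, B)
    control_change = cell_mean(1, 0, B) - cell_mean(0, 0, B)
    return treated_change - control_change

dd_small, dd_big = dd(0), dd(1)
ddd = dd_big - dd_small
print(dd_small, dd_big, ddd)
```

With these values the small-firm DD equals `g1`, the big-firm DD equals `g1 + d1`, and their difference equals `d1`; the main effects and the `GB` and `TB` terms cancel out along the way.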