I am not a fan of the Angrist and Pischke book, but they do have a flair for phrasing, and as they say, fuzzy RD is IV (Sec. 6.2). This is obscured by the fact that the instrument is essentially a nonlinear transformation (a step function) of one of the included exogenous variables, which, by virtue of the conditional exogeneity assumption, is a valid instrument.
Assume that each subject is characterized by the tuple of random variables, $\{Y_{0i}, Y_{1i}, D_i, X_i\}$, where $Y_{0i}$ and $Y_{1i}$ are the potential outcomes under non-treatment and treatment respectively, $D_i$ is an indicator variable of whether treatment is administered (which governs which of the potential outcomes is observed for a subject), and $X_i$ is the so-called forcing variable which deterministically or stochastically determines treatment. Usually, the fuzzy RD [FRD] model is stated as the rather concise set of specifications
$$
\begin{align}
\lim_{x\downarrow x_0} \mathbb{E}(D_i\mid X_i = x) &\neq \lim_{x\uparrow x_0} \mathbb{E}(D_i\mid X_i = x)\\
\lim_{x\downarrow x_0} \mathbb{E}(Y_{0i}\mid X_i = x) &= \lim_{x\uparrow x_0} \mathbb{E}(Y_{0i}\mid X_i = x)\\
\end{align}
$$
which are intuitively transparent, but are hard to work with.
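These two conditions can be checked numerically on simulated data. The sketch below is purely illustrative (the data-generating process, the jump size of 0.4, and all parameter values are my own assumptions, not from the post): the treatment probability jumps at $x_0$ while the mean of $Y_{0i}$ does not.

```python
import numpy as np

# Illustrative FRD data-generating process (all numbers are assumptions):
# the propensity jumps from 0.3 to 0.7 at x0 = 0, while the untreated
# outcome Y0 varies smoothly (and symmetrically) around x0.
rng = np.random.default_rng(0)
n, x0, delta = 200_000, 0.0, 0.1

X = rng.uniform(-1, 1, n)
p = 0.3 + 0.4 * (X >= x0)               # discontinuous propensity score
D = rng.binomial(1, p)
Y0 = 1 + X**2 + rng.normal(0, 0.1, n)   # E(Y0 | X) is continuous at x0

right = (X >= x0) & (X <= x0 + delta)
left = (X >= x0 - delta) & (X < x0)

jump_D = D[right].mean() - D[left].mean()     # first condition: jump ~ 0.4
jump_Y0 = Y0[right].mean() - Y0[left].mean()  # second condition: jump ~ 0
print(jump_D, jump_Y0)
```

The first difference of side means is far from zero, the second is not, which is exactly the pair of limit conditions above evaluated at a small bandwidth.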
Potential outcomes framework
We can use the familiar potential outcomes model to unpack these specifications. For simplicity of exposition, we exclude all exogenous variables other than the forcing variable, $X_i$, which deterministically (in the case of sharp RD) or stochastically (in the case of FRD) determines the treatment assignment ($D_i=1$). The conditional mean of the outcome in terms of the observable variables is given by
$$
\begin{align}
\mathbb{E}(Y_i \mid X_i, D_i) &= \mathbb{E}(Y_{0i}\mid X_i, D_i) + D_i\left(\mathbb{E}(Y_{1i}\mid X_i, D_i)-\mathbb{E}(Y_{0i}\mid X_i, D_i)\right) \\
\end{align}
$$
Here we make no parametric assumptions about the form of the conditional expectation functions. Note that all of these specifications are restricted to the locality of $x_0$, that is $X_i\in [x_0-\Delta_n, x_0+\Delta_n]$, where the indexing by the sample size is for pragmatic reasons (it becomes relevant when we define the estimator).
Recall that in the sharp RD case, we can write $D_i=\mathbf{1}_{[X_i\geq x_0]}$, where $x_0$ is the point of discontinuity. In the FRD case, this relationship is no longer deterministic; instead, the conditional mean of treatment is modelled in terms of the discontinuity
$$
\begin{align}
\mathbb{E}(D_i\mid X_i) &= \mathbb{P}\left[D_i=1\mid X_i\right]\\
&=(1-\mathbf{1}_{[X_i\geq x_0]})\mathbb{P}\left[D_i=1\mid X_i,\, X_i< x_0\right] + \mathbf{1}_{[X_i\geq x_0]}\mathbb{P}\left[D_i=1\mid X_i,\, X_i\geq x_0\right]
\end{align}
$$
Note that since $X_i$ is exogenous in the system, so is the random variable $\mathbf{1}_{[X_i\geq x_0]}$ -- it acts as the excluded exogenous variable in the specification of the conditional mean of the endogenous variable $D_i$.
Estimation
This is then a valid just-identified IV model, with one endogenous variable $D_i$, and one excluded exogenous variable $\mathbf{1}_{[X_i\geq x_0]}$. A direct and general estimator with no further parametric assumptions is the nonparametric Wald estimator.
$$
\dfrac{\widehat{\mathbb{E}}\left(Y_i \mid x_0 \leq X_i\leq x_0+ \Delta_n \right)-\widehat{\mathbb{E}}\left(Y_i \mid x_0- \Delta_n \leq X_i< x_0\right)}{\widehat{\mathbb{P}}\left[D_i=1\mid x_0 \leq X_i\leq x_0+ \Delta_n \right]-\widehat{\mathbb{P}}\left[D_i=1\mid x_0- \Delta_n \leq X_i< x_0\right]}
$$
Typically, local smoothers, such as the local linear smoother, are used to estimate the conditional mean functions.
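Staying with a simulated example (the data-generating process below and the true local effect of 2 are my own assumptions), the Wald estimator can be computed directly from side means within the window. Computing the IV slope with the binary instrument $\mathbf{1}_{[X_i\geq x_0]}$ on the same window, $\widehat{\mathrm{cov}}(Z,Y)/\widehat{\mathrm{cov}}(Z,D)$, returns the identical number, which is the "fuzzy RD is IV" point in miniature.

```python
import numpy as np

# Simulated fuzzy-RD data (the DGP and all numbers are assumptions):
# true local treatment effect tau = 2, propensity jumps 0.3 -> 0.7 at x0.
rng = np.random.default_rng(1)
n, x0, delta, tau = 200_000, 0.0, 0.1, 2.0

X = rng.uniform(-1, 1, n)
D = rng.binomial(1, 0.3 + 0.4 * (X >= x0))
Y = 1 + X**2 + tau * D + rng.normal(0, 0.1, n)

# restrict to the window [x0 - delta, x0 + delta]
w = (X >= x0 - delta) & (X <= x0 + delta)
Dw, Yw = D[w], Y[w]
Z = (X[w] >= x0).astype(float)   # the excluded instrument 1[X >= x0]

# nonparametric Wald estimator: ratio of the jumps in the side means
wald = (Yw[Z == 1].mean() - Yw[Z == 0].mean()) / (
    Dw[Z == 1].mean() - Dw[Z == 0].mean())

# IV slope with the binary instrument: cov(Z, Y) / cov(Z, D)
iv = np.cov(Z, Yw)[0, 1] / np.cov(Z, Dw)[0, 1]

print(wald, iv)  # numerically identical, both close to tau = 2
```

The algebraic equality of the two expressions for a binary instrument is a standard result; the simulation just makes it visible.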
ATE interpretation
Note that in order to interpret the given estimator as the average treatment effect [ATE] in the locality of $x_0$, we have used the implausible but routine conditional (on $X_i$) independence of $D_i$ and $Y_{1i}-Y_{0i}$. This allows us to remove the conditioning on $D_i$ in the conditional mean function of the outcome in a mathematically convenient way. For more details, see Hahn, Todd & van der Klaauw (2001), which is an excellent and readable reference for RD models. They also provide interpretations of the parameter being estimated under weaker assumptions.
For reasons explained in my comment, you will get identical estimates for the treatment coefficient.
Here's a numerical example of "sharp" RDD using Stata. We will use an experimental dataset of 12 cars. Each car was run once without a beneficial fuel additive (condition 1) and once with it (condition 2). The outcome is miles per gallon. This setup is similar to your cross-border pairs, where one member of each pair is treated but the pairs are otherwise similar.
. use http://www.stata-press.com/data/r14/fuel, clear
. gen diff = mpg2-mpg1
. list, clean noobs
mpg1 mpg2 diff
20 24 4
23 25 2
21 21 0
25 22 -3
18 23 5
17 18 1
18 17 -1
24 28 4
20 24 4
24 27 3
23 21 -2
19 23 4
. reg diff
Source | SS df MS Number of obs = 12
-------------+---------------------------------- F(0, 11) = 0.00
Model | 0 0 . Prob > F = .
Residual | 80.25 11 7.29545455 R-squared = 0.0000
-------------+---------------------------------- Adj R-squared = 0.0000
Total | 80.25 11 7.29545455 Root MSE = 2.701
------------------------------------------------------------------------------
diff | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 1.75 .7797144 2.24 0.046 .0338602 3.46614
------------------------------------------------------------------------------
. gen pair_id = _n
. reshape long mpg, i(pair_id) j(treat)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 12 -> 24
Number of variables 4 -> 4
j variable (2 values) -> treat
xij variables:
mpg1 mpg2 -> mpg
-----------------------------------------------------------------------------
. reg mpg i.treat i.pair_id
Source | SS df MS Number of obs = 24
-------------+---------------------------------- F(12, 11) = 4.03
Model | 176.5 12 14.7083333 Prob > F = 0.0139
Residual | 40.125 11 3.64772727 R-squared = 0.8148
-------------+---------------------------------- Adj R-squared = 0.6127
Total | 216.625 23 9.41847826 Root MSE = 1.9099
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.treat | 1.75 .7797144 2.24 0.046 .0338602 3.46614
|
pair_id |
2 | 2 1.909902 1.05 0.317 -2.203667 6.203667
3 | -1 1.909902 -0.52 0.611 -5.203667 3.203667
4 | 1.5 1.909902 0.79 0.449 -2.703667 5.703667
5 | -1.5 1.909902 -0.79 0.449 -5.703667 2.703667
6 | -4.5 1.909902 -2.36 0.038 -8.703667 -.2963331
7 | -4.5 1.909902 -2.36 0.038 -8.703667 -.2963331
8 | 4 1.909902 2.09 0.060 -.2036669 8.203667
9 | 1.03e-15 1.909902 0.00 1.000 -4.203667 4.203667
10 | 3.5 1.909902 1.83 0.094 -.7036669 7.703667
11 | 1.15e-15 1.909902 0.00 1.000 -4.203667 4.203667
12 | -1 1.909902 -0.52 0.611 -5.203667 3.203667
|
_cons | 21.125 1.40565 15.03 0.000 18.03118 24.21882
------------------------------------------------------------------------------
. xtreg mpg i.treat, fe
Fixed-effects (within) regression Number of obs = 24
Group variable: pair_id Number of groups = 12
R-sq: Obs per group:
within = 0.3141 min = 2
between = . avg = 2.0
overall = 0.0848 max = 2
F(1,11) = 5.04
corr(u_i, Xb) = 0.0000 Prob > F = 0.0463
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.treat | 1.75 .7797144 2.24 0.046 .0338602 3.46614
_cons | 21 .5513413 38.09 0.000 19.78651 22.21349
-------------+----------------------------------------------------------------
sigma_u | 2.6809513
sigma_e | 1.9099024
rho | .66334557 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(11, 11) = 3.94 Prob > F = 0.015
All three methods yield an estimated marginal improvement of 1.75 miles per gallon, with a standard error of 0.78.
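The equivalence is easy to check outside Stata as well. The sketch below replicates two of the three methods in Python on the same 12 pairs listed above: the regression of the difference on a constant (its coefficient is just the mean difference), and the within (fixed-effects) estimator, which collapses to the same mean difference when every group has exactly two observations.

```python
import numpy as np

# The 12 (mpg1, mpg2) pairs from the Stata fuel dataset listed above.
mpg1 = np.array([20, 23, 21, 25, 18, 17, 18, 24, 20, 24, 23, 19], float)
mpg2 = np.array([24, 25, 21, 22, 23, 18, 17, 28, 24, 27, 21, 23], float)

diff = mpg2 - mpg1
n = len(diff)

# method 1: regressing diff on a constant = the mean difference
beta_fd = diff.mean()
se_fd = diff.std(ddof=1) / np.sqrt(n)   # matches Stata's .7797144

# method 2: within (fixed-effects) estimator -- demean the outcome and the
# treatment dummy within each pair, then take the OLS slope by hand
mpg = np.stack([mpg1, mpg2])             # 2 x 12: rows are conditions
within = mpg - mpg.mean(axis=0)          # subtract each pair's mean
treat = np.array([[0.0], [1.0]]) * np.ones((2, n))
treat_w = treat - treat.mean(axis=0)
beta_fe = (treat_w * within).sum() / (treat_w ** 2).sum()

print(beta_fd, se_fd, beta_fe)  # point estimates both equal 1.75
```

With two observations per pair, the within deviations are just plus/minus half the pair difference, so the fixed-effects slope is algebraically the mean difference.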
Best Answer
I'd advise you to read Bertanha's recent (not yet published) article that can be found at: http://economics.nd.edu/assets/153411/bertanha_marinho_jmp.pdf
Not only does it go over the standard normalization and pooling procedure, but it also provides a lengthy discussion on how to overcome the potential heterogeneity issues associated with this method.