I am using Gaussian processes (GPs) for regression.
In my problem it is quite common for two or more data points $\vec{x}^{(1)},\vec{x}^{(2)},\ldots$ to be close to each other, relative to the length scales of the problem. Observations can also be extremely noisy. To speed up computation and improve effective measurement precision, it seems natural to merge/integrate clusters of points that are close to each other, as long as I only care about predictions on a larger length scale.
I wonder what a fast but semi-principled way of doing this would be.
If two data points overlap perfectly, $\vec{x}^{(1)} = \vec{x}^{(2)}$, and the observation noise (i.e., the likelihood) is Gaussian, possibly heteroskedastic but known, the natural way to proceed seems to be to merge them into a single data point with:

$\vec{\bar{x}} \equiv \vec{x}^{(k)}$, for $k=1,2$.

Observed value $\bar{y}$, the average of the observed values $y^{(1)}, y^{(2)}$ weighted by their relative precisions: $\bar{y} = \frac{\sigma_y^2(\vec{x}^{(2)})}{\sigma_y^2(\vec{x}^{(1)}) + \sigma_y^2(\vec{x}^{(2)})} y^{(1)} + \frac{\sigma_y^2(\vec{x}^{(1)})}{\sigma_y^2(\vec{x}^{(1)}) + \sigma_y^2(\vec{x}^{(2)})} y^{(2)}$.
Noise associated with the merged observation equal to: $\sigma_y^2(\vec{\bar{x}}) = \frac{\sigma_y^2(\vec{x}^{(1)}) \sigma_y^2(\vec{x}^{(2)})}{\sigma_y^2(\vec{x}^{(1)}) + \sigma_y^2(\vec{x}^{(2)})}$.
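For concreteness, the coincident-point merge above is plain inverse-variance weighting; a minimal sketch in Python (the function name `merge_coincident` is my own):

```python
def merge_coincident(y1, var1, y2, var2):
    """Precision-weighted merge of two noisy observations at the same input.

    Each observation is weighted by its relative precision (inverse noise
    variance); the merged variance is the harmonic combination of the two.
    """
    w1 = var2 / (var1 + var2)  # weight on y1
    w2 = var1 / (var1 + var2)  # weight on y2
    y_bar = w1 * y1 + w2 * y2
    var_bar = var1 * var2 / (var1 + var2)
    return y_bar, var_bar
```

With equal noise variances this reduces to a simple average with half the variance, as expected.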
However, how should I merge two points that are close but not overlapping?

I think that $\vec{\bar{x}}$ should still be a weighted average of the two positions, again using the relative reliabilities. The rationale is a center-of-mass argument (i.e., think of a very precise observation as a stack of less precise observations).

For $\bar{y}$, I would use the same formula as above.
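A minimal sketch of this merge rule, assuming (my assumption) that the same precision weights are applied to both the positions and the values (the function name `merge_close` is my own):

```python
import numpy as np

def merge_close(x1, var1, y1, x2, var2, y2):
    """Merge two nearby observations: precision-weighted position and value.

    The center-of-mass argument: a more precise observation behaves like a
    heavier stack of observations, so it pulls the merged location toward it.
    """
    w1 = var2 / (var1 + var2)
    w2 = var1 / (var1 + var2)
    x_bar = w1 * np.asarray(x1, float) + w2 * np.asarray(x2, float)
    y_bar = w1 * y1 + w2 * y2
    var_bar = var1 * var2 / (var1 + var2)  # before any relocation correction
    return x_bar, y_bar, var_bar
```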

For the noise associated with the observation, I wonder whether, in addition to the formula above, I should add a correction term because I am moving the data point around. Essentially, I would incur an increase in uncertainty related to $\sigma_f^2$ and $\ell^2$ (respectively, the signal variance and length scale of the covariance function). I am not sure of the exact form of this term, but I have some tentative ideas for how to compute it from the covariance function.
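One tentative way to compute such a term (an assumption of mine, not an established recipe): under the GP prior with a stationary kernel, $\mathrm{Var}[f(\vec{a}) - f(\vec{b})] = 2\left(\sigma_f^2 - k(\vec{a},\vec{b})\right)$, so relocating an observation by a distance $d$ could be charged this prior variance as extra noise. For a squared-exponential kernel this behaves like $\sigma_f^2 d^2/\ell^2$ for small $d$, matching the intuition that the correction involves $\sigma_f^2$ and $\ell^2$:

```python
import numpy as np

def relocation_variance(d, sigma_f2, ell):
    """Tentative extra noise for moving an observation a distance d.

    Under a GP prior with a squared-exponential kernel,
        Var[f(x) - f(x + d)] = 2 * sigma_f2 * (1 - exp(-d^2 / (2 ell^2))),
    which for small d is approximately sigma_f2 * d^2 / ell^2.
    """
    d = np.linalg.norm(np.atleast_1d(d))
    return 2.0 * sigma_f2 * (1.0 - np.exp(-d**2 / (2.0 * ell**2)))
```

This term would simply be added to the merged noise variance $\sigma_y^2(\vec{\bar{x}})$ for each relocated point; I am not claiming this is the right correction, only that it has the expected scaling.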
Before proceeding, I wondered whether something like this already exists; and whether this seems a sensible way of proceeding, or there are better quick methods.
The closest thing I could find in the literature is this paper:
E. Snelson and Z. Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs", NIPS 2005; but their method is (relatively) involved, requiring an optimization to find the pseudo-inputs.
Best Answer
Great question, and what you are suggesting sounds reasonable. However, personally I would proceed differently in order to be efficient. As you said, two points that are close provide little additional information, so the effective degrees of freedom of the model are fewer than the number of observed data points. In such a case it may be worth using the Nyström method, which is described well in GPML (the chapter on sparse approximations can be seen at http://www.gaussianprocess.org/gpml/). The method is very easy to implement and has recently been shown to be highly accurate by Rudi et al. (http://arxiv.org/abs/1507.04717).
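To illustrate, here is a minimal sketch of the Nyström idea for the posterior mean: approximate the Gram matrix as $K \approx K_{nm} K_{mm}^{-1} K_{mn}$ using $m$ landmark points and apply the Woodbury identity, so the cost is $O(nm^2)$ instead of $O(n^3)$. Function names and defaults are my own; this is an illustration, not GPML's code:

```python
import numpy as np

def se_kernel(A, B, sigma_f2=1.0, ell=1.0):
    """Squared-exponential kernel matrix between row-vector inputs A, B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_f2 * np.exp(-0.5 * d2 / ell**2)

def nystrom_gp_mean(X, y, Z, Xstar, noise=0.1, ell=1.0, jitter=1e-8):
    """GP posterior mean with the Nystrom low-rank approximation.

    Replaces K with Knm Kmm^-1 Kmn (Z = m landmark points) and solves
    (K_approx + noise*I)^-1 y via the Woodbury identity: only an m x m
    system is ever factored, so the cost is O(n m^2).
    """
    Knm = se_kernel(X, Z, ell=ell)
    Kmm = se_kernel(Z, Z, ell=ell) + jitter * np.eye(len(Z))
    # Woodbury: (noise*I + Knm Kmm^-1 Kmn)^-1 y
    #         = (y - Knm (noise*Kmm + Kmn Knm)^-1 Kmn y) / noise
    A = noise * Kmm + Knm.T @ Knm
    alpha = (y - Knm @ np.linalg.solve(A, Knm.T @ y)) / noise
    return se_kernel(Xstar, X, ell=ell) @ alpha
```

A natural choice of landmarks here would be the cluster centers the question is already computing; with `Z = X` the approximation is exact and the usual GP regression mean is recovered.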