Use a robust fit, such as lmrob in the robustbase package. This particular one can automatically detect and downweight up to 50% of the data if they appear to be outlying.
To see what can be accomplished, let's simulate a nasty dataset with plenty of outliers in both the $x$ and $y$ variables:
library(robustbase)
set.seed(17)
n.points <- 17520
n.x.outliers <- 500
n.y.outliers <- 500
beta <- c(50, .3, -.05)
x <- rnorm(n.points)
y <- beta[1] + beta[2]*x + beta[3]*x^2 + rnorm(n.points, sd=0.5)
y[1:n.y.outliers] <- rnorm(n.y.outliers, sd=5) + y[1:n.y.outliers]
x[sample(1:n.points, n.x.outliers)] <- rnorm(n.x.outliers, sd=10)
Most of the $x$ values should lie between $-4$ and $4$, but there are some extreme outliers.
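One quick way to see this (a minimal sketch; any summary or plot of the simulated values would do) is to inspect the contaminated data directly:

summary(x)                                                 # the min and max reveal the x outliers
hist(x, breaks=100, main="Distribution of x", xlab="x")    # long tails from the sd=10 contamination
plot(x, y, pch=".", main="Simulated data", xlab="x", ylab="y")   # outliers visible in both variables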
Let's compare ordinary least squares (lm) to the robust coefficients:
summary(fit <- lm(y ~ 1 + x + I(x^2)))
summary(fit.rob <- lmrob(y ~ 1 + x + I(x^2)))
lm reports fitted coefficients of $49.94$, $0.00805$, and $0.000479$, compared to the expected values of $50$, $0.3$, and $-0.05$. lmrob reports $49.97$, $0.274$, and $-0.0229$, respectively. Neither of them estimates the quadratic term accurately (because it makes a small contribution and is swamped by the noise), but lmrob comes up with a reasonable estimate of the linear term while lm doesn't even come close.
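For a compact side-by-side view, a small sketch that simply lines up the coefficients already computed above against the true values:

cbind(True = beta, OLS = coef(fit), Robust = coef(fit.rob))   # intercept, linear, and quadratic terms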
Let's take a closer look:
i <- abs(x) < 10 # Window the data from x = -10 to 10
w <- fit.rob$rweights[i] # Extract the robustness weights (each between 0 and 1)
plot(x[i], y[i], pch=".", cex=4, col=hsv((w + 1/4)*4/5, w/3+2/3, 0.8*(1-w/2)),
main="Least-squares and robust fits", xlab="x", ylab="y")
lmrob reports weights for the data. In this zoomed-in plot, the weights are shown by color: light greens for highly downweighted values, dark maroons for values with full weight. Clearly the lm fit is poor: the $x$ outliers have too much influence. Although its quadratic term is a poor estimate, the lmrob fit nevertheless closely follows the correct curve throughout the range of the good data ($x$ between $-4$ and $4$).
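To see this on the plot above, one option (a sketch; the colors, line types, and legend position are arbitrary choices) is to overlay the two fitted curves and the true curve:

curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2, add=TRUE, lwd=2, lty=2)                   # OLS fit
curve(coef(fit.rob)[1] + coef(fit.rob)[2]*x + coef(fit.rob)[3]*x^2, add=TRUE, lwd=2, col="blue")  # robust fit
curve(beta[1] + beta[2]*x + beta[3]*x^2, add=TRUE, lwd=2, col="red")                              # true curve
legend("topleft", legend=c("OLS", "lmrob", "true"), lty=c(2, 1, 1), col=c("black", "blue", "red"), lwd=2)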
With data like these (indeed almost any data) the first step is a graphic that really helps to see what is going on. Crowding of data points on default scales makes that difficult to achieve.
The occurrence of exact zeros on $Y$ inhibits logarithmic transformation. Some would add a constant first to get round that. I would suggest here a square root scale instead.
Similarly, but not identically, the occurrence of exact $100$%s inhibits logit transformation of $X$, which is a kind of default for fractions not equal to zero or unity. I would suggest here a folded root transformation, $\sqrt{X} - \sqrt{100 - X}$ for the percents, which stretches out the high percents. (See, e.g., Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.)
Here's a graph for set 1 only (all posted at the time of writing). I have used transformed scales, but labelled in terms of the original values. I have to say that I see no structure here, so the essentially flat regression line does seem unsurprising.
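Here is a sketch of how such a plot can be built in R. The vector names pct (the $X$ percents) and yval (the $Y$ response) are placeholders for whichever dataset is used, and the tick positions are only illustrative:

froot <- function(p) sqrt(p) - sqrt(100 - p)    # folded root for percents in [0, 100]

plot(froot(pct), sqrt(yval), pch=16, xaxt="n", yaxt="n",
     xlab="X (percent, folded root scale)", ylab="Y (square root scale)")
x.ticks <- c(0, 10, 25, 50, 75, 90, 100)        # label the transformed axes in the original units
axis(1, at=froot(x.ticks), labels=x.ticks)
y.ticks <- c(0, 1, 5, 20, 50, 100)
axis(2, at=sqrt(y.ticks), labels=y.ticks)
abline(lm(sqrt(yval) ~ froot(pct)))             # regression line on the transformed scales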
EDIT It may be reassuring to people unfamiliar with this transformation to see how it works. Folding means that the transformation is symmetric around the middle of the range. The transformation is conservative insofar as it affects the shape of the relationship minimally, except for values near $0$ and $100$%, which are stretched out. (The curvature is useful in this example for values between about $70$ and $100$%.) A small but often useful virtue is that the transformation is defined for exact zeros and $100$s. Apart from a trivial prefactor, $\sqrt{X} - \sqrt{1 - X}$ behaves identically when $X$ is instead defined as a proportion or fraction between $0$ and $1$.
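A quick numeric illustration of those two properties, reusing the froot helper from the sketch above:

round(froot(c(0, 1, 49, 50, 51, 99, 100)), 2)   # a 1-unit step near 0 or 100 moves froot by about 1.05, versus about 0.14 near 50
p <- seq(0, 100, by = 5)
all.equal(froot(p), -froot(100 - p))            # folding: antisymmetric about the midpoint, 50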
Best Answer
You could use weighted regression to achieve your goal. In addition to the data, you supply a weight for each observation that indicates how important the error for that observation is. More information and an example here: How to use weights in function lm in R?
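As a self-contained sketch of the idea using lm's weights argument (the data and the weights here are made up purely for illustration):

set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)          # generating line: intercept 2, slope 0.5
y[18:20] <- y[18:20] + 5                        # a few observations we suspect are unreliable
w <- rep(1, 20)
w[18:20] <- 0.1                                 # give the suspect points little influence
coef(lm(y ~ x))                                 # ordinary fit, pulled toward the suspect points
coef(lm(y ~ x, weights = w))                    # weighted fit, closer to the generating line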