[Math] linear regression multiple points vs average

least squares, linear programming, regression

I have data related to a grocery store. I have 4 price points – 2.5, 3, 4, 5 – and for each price point I have the number of units that were sold on various days. Let's say that

price 2.5 was active for 100 days
price 3 was active for 120 days
price 4 was active for 200 days
price 5 was active for 300 days

I would like to understand the relationship between price and quantity sold per day. Let's assume that I can only fit a linear model of quantity vs. price.

I can do 2 things:

  1. I can take the average daily sales for each price point. For example, for 2.5 I will have the average sale over 100 days. I can then fit a regression line against the price points, which gives me only 4 data points (1 for each price point).
  2. I can keep the day-level information and fit a regression line against the price points (I will have 720 data points).

Which method is better and will give a better regression equation, and why?
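
In R the two options might look roughly like the sketch below, assuming the day-level data sit in a data frame sales with one row per day and columns price and qty (both names are placeholders):

# method 1: average qty per price point, then regress the 4 averages on price
avg_sales = aggregate(qty ~ price, data = sales, FUN = mean)
fit_avg = lm(qty ~ price, data = avg_sales)

# method 2: regress all 720 day-level observations on price
fit_all = lm(qty ~ price, data = sales)

coef(fit_avg)
coef(fit_all)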

Update 1

I wrote the R code below. I wanted to test the claim made in the answer below that if I have the same number of observations for each price point, then I would get the same regression coefficients from both methods. From the results below, it seems that the claim is not true. Am I doing something wrong?

# linear regression: average qty per price point vs. all quantities
# x = simulated daily quantities, y = price points (30 observations per price point)

x1 = rnorm(30, 20, 1);   y1 = rep(3, 30)
x2 = rnorm(30, 17, 1.5); y2 = rep(4, 30)
x3 = rnorm(30, 12, 2);   y3 = rep(4.5, 30)
x4 = rnorm(30, 6, 3);    y4 = rep(5.5, 30)
x = c(x1, x2, x3, x4)
y = c(y1, y2, y3, y4)

# method 2: fit on all 120 day-level observations
plot(y, x)
cor(y, x)
fit = lm(x ~ y)
attributes(fit)
summary(fit)

# method 1: one point per price, using the nominal means 20, 17, 12, 6
xdum = c(20, 17, 12, 6)
ydum = c(3, 4, 4.5, 5.5)
plot(ydum, xdum)
cor(ydum, xdum)
fit1 = lm(xdum ~ ydum)
attributes(fit1)
summary(fit1)


> summary(fit)

Call:
lm(formula = x ~ y)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3572 -1.6069 -0.1007  2.0222  6.4904 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.0952     1.1570   34.65   <2e-16 ***
y            -6.1932     0.2663  -23.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared:  0.8209,    Adjusted R-squared:  0.8194 
F-statistic: 540.8 on 1 and 118 DF,  p-value: < 2.2e-16

> summary(fit1)

Call:
lm(formula = xdum ~ ydum)

Residuals:
      1       2       3       4 
-0.9615  1.8077 -0.3077 -0.5385 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  38.2692     3.6456  10.497  0.00895 **
ydum         -5.7692     0.8391  -6.875  0.02051 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared:  0.9594,    Adjusted R-squared:  0.9391 
F-statistic: 47.27 on 1 and 2 DF,  p-value: 0.02051

Update 2: corrected R code

I corrected my R code and the results are below.
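
The change that makes the two fits agree (a sketch of just the corrected lines, assuming the rest of the script from update 1 is unchanged) is presumably to build xdum from the sample means of the simulated data rather than the nominal means:

xdum = c(mean(x1), mean(x2), mean(x3), mean(x4))  # sample means instead of 20, 17, 12, 6
ydum = c(3, 4, 4.5, 5.5)
fit1 = lm(xdum ~ ydum)
summary(fit1)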

 > summary(fit)

    Call:
    lm(formula = x ~ y)

    Residuals:
       Min     1Q Median     3Q    Max 
    -5.470 -1.520  0.007  1.635  5.406 

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  37.5556     1.0140   37.04   <2e-16 ***
    y            -5.5314     0.2334  -23.70   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.305 on 118 degrees of freedom
    Multiple R-squared:  0.8264,    Adjusted R-squared:  0.8249 
    F-statistic: 561.7 on 1 and 118 DF,  p-value: < 2.2e-16

    > summary(fit1)

    Call:
    lm(formula = xdum ~ ydum)

    Residuals:
          1       2       3       4 
    -0.9781  2.2648 -0.9520 -0.3347 

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)   37.556      4.542   8.268   0.0143 *
    ydum          -5.531      1.045  -5.291   0.0339 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.885 on 2 degrees of freedom
    Multiple R-squared:  0.9333,    Adjusted R-squared:    0.9 
    F-statistic: 27.99 on 1 and 2 DF,  p-value: 0.03392

Best Answer

In theory the second approach will generally be better, because the regression can account for the fact that you have different numbers of observations in each group.

As an extreme case, imagine you only had two observations for one price and a hundred observations for another. The first method would treat those two prices identically and constrain the line to be as close as possible to both means. This might leave the line very close to the mean of the two-observation price and far from the mean of the hundred-observation price. The second method would be much more willing to draw a line that is a long way from the mean of the two-observation price but closer to the mean of the hundred-observation price. This is a good thing, because the true mean is highly uncertain with just two observations, so it is better to accept more error there and less on the more confident estimate.
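
To make this concrete, here is a small R check (not part of the original answer) of the unequal-counts case: the day-level fit coincides with a weighted fit on the group means (weights = number of days per price), but not with the unweighted fit on the means.

set.seed(1)
# 2 days at one price, 100 days at another (simulated daily quantities)
d = data.frame(price = c(rep(2.5, 2), rep(5, 100)),
               qty   = c(rnorm(2, 20, 3), rnorm(100, 6, 3)))

means = aggregate(qty ~ price, data = d, FUN = mean)  # one average per price point
means$ndays = as.vector(table(d$price))               # 2 and 100

coef(lm(qty ~ price, data = d))                       # day-level fit (method 2)
coef(lm(qty ~ price, data = means, weights = ndays))  # weighted fit on means: identical
coef(lm(qty ~ price, data = means))                   # unweighted fit on means (method 1): differs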

If you had exactly the same number of observations for each price then the results should be identical, and if you have close to the same number the results should be similar; in that case just do whichever is easiest.

EDIT - re the last point and update 1: What I've said is provably true if you're doing least squares regression. This is because the sums of squared errors of the two fits are linearly related, so any least-squares error (even for a non-linear fit) is minimised by the same fit. The crux of the proof is the identity $\sum_{i} (y_{i} - x)^2 = \sum_{i} (y_{i} - \mu)^2 + n(x - \mu)^2$, where $\mu$ is the mean of the $y_{i}$ and $n$ is the number of observations at that price.
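
Spelling that step out: adding and subtracting $\mu$ inside the square, the cross term vanishes because the residuals about the mean sum to zero,

$$\sum_{i=1}^{n}(y_i - x)^2 = \sum_{i=1}^{n}\big((y_i - \mu) - (x - \mu)\big)^2 = \sum_{i=1}^{n}(y_i - \mu)^2 + n(x - \mu)^2.$$

Summed over the price groups, the first term does not depend on the fitted line, so minimising the day-level sum of squares is the same as minimising $\sum_g n_g (x_g - \mu_g)^2$, where $x_g$ is the fitted value at price $g$ and $\mu_g$ is the mean of the observations at that price. When every $n_g$ is equal, that is just the ordinary least-squares fit to the group means, which is why the two methods then give identical coefficients.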

If R is giving a different answer then either there's something wrong with the R code or R isn't doing least squares linear regression. It might, for instance, be doing some sort of outlier removal, or a different algorithm altogether. I personally can't stand R (and this is just giving me yet another reason), so I can't motivate myself either to debug your R code or to search R's (as best I can tell non-existent) documentation to find out what R is actually doing. What I can do, though, is counter with my own Matlab code, which shows entirely consistent results between the two every time I run it:

% 100 simulated observations at each of the 4 prices (one column per price)
vals = zeros(100, 4);
vals(:, 1) = normrnd(2.5, 1, 100, 1);
vals(:, 2) = normrnd(3, 0.8, 100, 1);
vals(:, 3) = normrnd(4, 1.2, 100, 1);
vals(:, 4) = normrnd(5, 1.6, 100, 1);

valmeans = mean(vals);   % column means: one average per price

x_short = [2.5, 3, 4, 5];
x_long = [repmat(2.5, 1, 100), repmat(3, 1, 100), repmat(4, 1, 100), repmat(5, 1, 100)];

polyfit(x_long', vals(:), 1)    % fit on all 400 observations
polyfit(x_short, valmeans, 1)   % fit on the 4 means: same coefficients