How to fit a line to data using weighted least squares (WLS) regression

Tags: regression, weighted-regression

I am new to WLS regression. I have been asked to fit a line to data using WLS, and I am working in Minitab. My data are as follows:
cost (independent variable, x-axis);
production (dependent variable, y-axis).
A small sample of the data is as follows:

  Cost           Production
  200               4000
   50               1000
  350               3500
 1000               1000
  500               3500
  100                500
  800               2000

What I have done so far: (1) Outlier detection. (2) Using the cost and production data, I computed the unstandardized residuals. (3) I took the absolute values of the residuals. (4) Using cost as the independent (x-axis) variable and the absolute residuals as the dependent (y-axis) variable, I computed the unstandardized predicted values. (5) I computed the weights as the reciprocal of the square of the predicted values.

Now I want to plot the data and fit a line to it using WLS. This may be very basic for most of you, but I cannot figure out what to do after step (5) or how to fit a line using WLS.

After reading on the web, I understand that in Minitab I need to run Regression >> Regression >> Fit Regression Model, supply the x and y data and the estimated weights, and check 'Fits' in the Storage tab. Once the regression has run, I need to create a scatterplot and add a 'calculated line' using the fits and the relevant x- or y-axis variable.

Q1. Am I doing the entire process, up to estimating the weights and fits, correctly?

Q2. Am I fitting the line using WLS correctly? That is, do the fits need to be plotted on the graph to draw the WLS line?

Q3. In the plot, should *'fits'* be assigned in place of the dependent variable or the independent variable?

Wherever I am going wrong, it would be helpful if someone could tell me the relevant steps to follow in SPSS or Minitab.

[For more details about the kind of graph/ plot that I need][1]

[1]: https://onlinecourses.science.psu.edu/stat501/node/397/

At the above URL, look at the fourth figure, i.e., the scatterplot of cost vs. num.responses, where the black line shows OLS and the red line shows WLS. I need such a scatterplot with two lines. For this, I need to fit a line to my data using WLS.

Best Answer

Edit: I've re-written my post, and noticed that I made an error in computing the weights, where I used residuals instead of fitted values in the calculation. The error is now fixed.

Also note that I assume the poster is asking about the mechanics of fitting a WLS model.

I am not familiar with Minitab, so I have instead recreated the process in the link that you have provided using Stata, including the example dataset from the linked website. I have included selected output where appropriate.

Start by entering the data and verifying it.

clear *
cls

* Input the data
input id    num_responses   cost
1   16  77
2   14  70
3   22  85
4   10  50
5   14  62
6   17  70
7   10  55
8   13  63
9   19  88
10  12  57
11  18  81
12  11  51
end

* Verify the data
list, clean

Let's fit a simple OLS model and plot the model overlaid on a scatter plot of the data.

* Run OLS regression (cost ~ num_responses)
reg cost num_responses

The OLS model results:


. reg cost num_responses

      Source |       SS           df       MS      Number of obs   =        12
-------------+----------------------------------   F(1, 10)        =     80.19
       Model |  1695.47339         1  1695.47339   Prob > F        =    0.0000
    Residual |  211.443277        10  21.1443277   R-squared       =    0.8891
-------------+----------------------------------   Adj R-squared   =    0.8780
       Total |  1906.91667        11  173.356061   Root MSE        =    4.5983

-------------------------------------------------------------------------------
         cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
num_responses |   3.268908   .3650515     8.95   0.000     2.455522    4.082293
        _cons |   19.47269   5.516184     3.53   0.005     7.181865    31.76351
-------------------------------------------------------------------------------

This model matches the one described in the link. Now let's look at the scatter plot with overlaid OLS model, and the residual vs predictor plot. Note that the "resid" option of predict computes the residuals. In the case of OLS, residuals are nothing more than the observed values (y_obs) minus the fitted values (y_hat), resid = y_obs - y_hat.

* Produce a scatter plot and overlay the fitted OLS regression line (corresponding to Plot 1)
sc cost num_responses || lfit cost num_responses

* Plot residuals versus the predictor (in this case, num_responses is our only predictor).
* (Corresponds to Plot 2.) Note: if only the graph is needed, the Stata command is rvpplot.
* Here the residuals are stored with predict so they can be reused below.
predict ols_resid, resid
sc ols_resid num_responses, yline(0, lpattern(dash))

The plots are:

[Figure: Scatter plot and OLS model]

[Figure: Residuals of OLS model vs predictor plot]

To prepare for WLS, start by taking the absolute values of the OLS residuals computed above. Then plot the absolute residuals against the predictor.

gen double abs_res = abs(ols_resid)
sc abs_res num_responses

[Figure: Absolute OLS residuals vs predictor]

Now let's use those absolute residuals to compute weights for WLS. To get the weights, fit an OLS regression of the absolute residuals on the predictor (abs_res ~ num_responses); the fitted values from that regression serve as estimates of the error standard deviation at each value of the predictor.

* In Stata, the xb option is the predicted values (fitted values) of the model. Let's call it lp
* for linear predictor. You can check this for yourself by plugging in values to the fitted equation,
* y = b_0 + b_1 * x
reg abs_res num_responses
predict lp, xb
* Compute the weights as, w = 1 / (fitted values)^2.
gen double w = 1 / (lp^2)

The fitted model of absolute residuals using num_responses as a sole predictor is:

. reg abs_res num_responses
...
-------------------------------------------------------------------------------
      abs_res |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
num_responses |   .3226291   .1099286     2.93   0.015     .0776928    .5675653
        _cons |  -.9048621   1.661099    -0.54   0.598    -4.606021    2.796297
-------------------------------------------------------------------------------

You can verify the linear predictor (lp) manually by computing the predicted value from this model for the first two observations.

. list id cost num_responses abs_res lp in 1/2, noobs
  +---------------------------------------------+
  | id   cost   num_re~s     abs_res         lp |
  |---------------------------------------------|
  |  1     77         16   5.2247901   4.257203 |
  |  2     70         14   4.7626052   3.611945 |
  +---------------------------------------------+

. display _b[_cons] + _b[num_responses]*num_responses[1]
4.2572029
. display _b[_cons] + _b[num_responses]*num_responses[2]
3.6119448
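
In symbols, the weight step and the fit that follows can be written as below. The notation is introduced only for this note and is not taken from the linked page: x_i is num_responses, y_i is cost, and the gamma-hats are the coefficients of the abs_res regression, so that s-hat_i is the value stored in lp.

$$
\hat{s}_i = \hat{\gamma}_0 + \hat{\gamma}_1 x_i, \qquad w_i = \frac{1}{\hat{s}_i^{2}}
$$

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} w_i \,\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2
$$

$$
\hat{\beta}_1 = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_i w_i (x_i - \bar{x}_w)^2}, \qquad
\hat{\beta}_0 = \bar{y}_w - \hat{\beta}_1 \bar{x}_w, \qquad
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \quad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}
$$

For the first two observations listed above, this gives w ≈ 1/4.2572² ≈ 0.0552 and w ≈ 1/3.6119² ≈ 0.0767, so the observation with the smaller estimated spread counts more heavily in the fit.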

Finally, we can use the weights to fit a WLS model, and then plot the OLS and WLS models over the original data.

reg cost num_responses [aweight=w]

* And now plot the data, and overlay the fitted OLS and WLS models.
twoway sc cost num_responses ///
   || (lfit cost num_responses, lcol(black)) ///
   || (lfit cost num_responses [aweight=w], lcol(red)), ///
   ytitle("cost") legend(rows(1) order(1 2 3) label(1 "Data") label(2 "OLS") label(3 "WLS"))

The WLS model is:

. reg cost num_responses [aweight=w]
(sum of wgt is 1.080673134694825)

      Source |       SS           df       MS      Number of obs   =        12
-------------+----------------------------------   F(1, 10)        =     85.35
       Model |  1273.86068         1  1273.86068   Prob > F        =    0.0000
    Residual |  149.251849        10  14.9251849   R-squared       =    0.8951
-------------+----------------------------------   Adj R-squared   =    0.8846
       Total |  1423.11253        11  129.373866   Root MSE        =    3.8633

-------------------------------------------------------------------------------
         cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
num_responses |   3.421106     .37031     9.24   0.000     2.596004    4.246208
        _cons |   17.30064   4.827736     3.58   0.005      6.54377     28.0575
-------------------------------------------------------------------------------
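
As a cross-check on what the weighted regression is doing, the same coefficients can be obtained from an ordinary regression on variables rescaled by the square root of the weights, fitted without a constant. This is only a sketch, and it assumes w, cost and num_responses still exist in memory; the names sw, y_star and x_star are hypothetical and introduced just for this check.

```stata
* WLS as OLS on rescaled variables (sw, y_star, x_star are illustrative names)
gen double sw     = sqrt(w)
gen double y_star = cost * sw
gen double x_star = num_responses * sw

* The coefficient on x_star plays the role of the WLS slope,
* and the coefficient on sw plays the role of the WLS intercept.
reg y_star x_star sw, noconstant
```

The coefficients and their standard errors should agree with the aweight fit above. R-squared and Root MSE are not expected to match, because this regression has no constant and aweight rescales the weights internally.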

Note that the model and plot match what is reported in the linked page.

[Figure: OLS and WLS models]
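
To tie this back to the data in the question, here is a sketch that repeats the same steps on the seven sample rows, with cost as the predictor and production as the response. It is only meant to show the mechanics: seven points are too few to estimate weights reliably, and the names sd_hat and wls_fit are hypothetical names chosen for this example. The stored fits are predicted values of the dependent variable, so they go on the y-axis, plotted against cost on the x-axis.

```stata
clear
* Sample rows from the question (cost = x, production = y)
input cost production
 200 4000
  50 1000
 350 3500
1000 1000
 500 3500
 100  500
 800 2000
end

* Steps (2)-(3): OLS residuals and their absolute values
reg production cost
predict double ols_resid, resid
gen double abs_res = abs(ols_resid)

* Steps (4)-(5): regress |residuals| on cost, then w = 1 / (fitted values)^2
reg abs_res cost
predict double sd_hat, xb
* Guard: a fitted value at or below zero would give a meaningless weight
assert sd_hat > 0
gen double w = 1 / sd_hat^2

* WLS fit; store the fits (the equivalent of Minitab's stored 'Fits')
reg production cost [aweight=w]
predict double wls_fit, xb

* Fits on the y-axis, cost on the x-axis; OLS in black, WLS in red
twoway (scatter production cost) ///
       (lfit production cost, lcolor(black)) ///
       (line wls_fit cost, sort lcolor(red)), ///
       ytitle("production") ///
       legend(rows(1) order(1 2 3) label(1 "Data") label(2 "OLS") label(3 "WLS"))
```

Plotting the stored fits with line gives the same red line as lfit with [aweight=w]; the stored-fits version is shown because it mirrors the 'calculated line' approach described for Minitab.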