Solved – Comparing regression coefficients of same model across different data sets

Tags: regression, regression coefficients

I'm evaluating two (2) refrigerants (gases) that were used in the same refrigeration system. I have saturated suction temperature ($S$), condensing temperature ($D$), and amperage ($Y$) data for the evaluation. There are two (2) sets of data: 1st refrigerant ($R_1$) & 2nd refrigerant ($R_2$). I'm using a linear, multivariate ($S$ & $D$), 3rd-order polynomial model for the regression analyses. I would like to determine how much less/more amperage (or some similar performance metric) on average, as a percentage, is drawn by the second refrigerant.

My first thought was:

  1. Determine the model to use: $Y = b_0 + b_1S + b_2D + b_3SD + b_4S^2 + b_5D^2 + b_6S^2D + b_7D^2S + b_8D^3 + b_9S^3$
  2. Derive coefficients ($b_i$) from the baseline data ($R_1$).
  3. Using those coefficients, for each $S$ & $D$ in the $R_2$ data set, calculate each expected amp draw ($\hat{Y}$) and then average.
  4. Compare the $\hat{Y}$ average to the actual average amp draw ($Y_2$) of the $R_2$ data.
  5. $\text{percent (\%) change} = (Y_2 - \hat{Y}) / \hat{Y}$
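As a concrete illustration of steps 2–5, here is a minimal Python sketch. The data frames `df1`/`df2`, the synthetic numbers, and the column names are assumptions made purely for illustration; `statsmodels` is used for the ordinary-least-squares fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: df1 is the R1 baseline, df2 the R2 data set; each
# has columns S (saturated suction temp), D (condensing temp), Y (amps).
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"S": rng.uniform(20.0, 50.0, 200),
                    "D": rng.uniform(80.0, 120.0, 200)})
df1["Y"] = 5 + 0.10 * df1["S"] + 0.05 * df1["D"] + rng.normal(0, 0.2, 200)
df2 = df1.assign(Y=df1["Y"] * 0.95)  # pretend R2 draws ~5% less

# Step 1: the full 3rd-order polynomial in S and D.
formula = ("Y ~ S + D + S:D + I(S**2) + I(D**2) + I(S**2 * D) "
           "+ I(D**2 * S) + I(D**3) + I(S**3)")

# Step 2: fit the coefficients b_i on the baseline (R1) data.
fit1 = smf.ols(formula, data=df1).fit()

# Steps 3-4: predict R1-expected amps at each R2 operating point,
# then compare the average prediction to the actual R2 average.
y_hat = fit1.predict(df2)

# Step 5: percent change.
pct_change = (df2["Y"].mean() - y_hat.mean()) / y_hat.mean()
print(f"percent change: {pct_change:.1%}")
```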

However, since the 2nd refrigerant has slightly different thermal properties & small changes were made to the refrigeration system (TXV & superheat adjustments), I don't believe this 'baseline comparison method' is accurate.

My next thought was to do two (2) separate regression analyses:
\begin{align}
Y_1 &= a_{0} + a_{1}S_1 + a_{2}D_1 + a_{3}S_1D_1 + a_{4}S_1^2 + a_{5}D_1^2 + a_{6}S_1^2D_1 + a_{7}D_1^2S_1 + a_{8}D_1^3 + a_{9}S_1^3 \\
Y_2 &= b_{0} + b_{1}S_2 + b_{2}D_2 + b_{3}S_2D_2 + b_{4}S_2^2 + b_{5}D_2^2 + b_{6}S_2^2D_2 + b_{7}D_2^2S_2 + b_{8}D_2^3 + b_{9}S_2^3
\end{align}

and then, for saturated suction temp ($S$), compare coefficients ($a_{1}$ vs $b_{1}$) like so:
$$
\text{\% change} = \frac{b_{1} - a_{1}}{a_{1}}
$$

However, again, these coefficients carry different weight in each model, so a direct comparison of $a_1$ vs. $b_1$ would give skewed results.

I believe I could use a z-test to determine how differently weighted the coefficients are, but I'm not sure I fully understand the meaning of the output: $z = (a_{1} - b_{1}) / \sqrt{SE_{a_{1}}^2 + SE_{b_{1}}^2}$. But that still wouldn't give me a performance metric, which is the overall objective.
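For what it's worth, the $z$-statistic itself is cheap to compute once you have the two fits. A sketch continuing the block above (so `formula`, `df1`, and `df2` are the assumed objects defined there):

```python
import numpy as np
from scipy.stats import norm
import statsmodels.formula.api as smf

# Fit the same polynomial separately to each data set.
fit1 = smf.ols(formula, data=df1).fit()  # R1 -> a coefficients
fit2 = smf.ols(formula, data=df2).fit()  # R2 -> b coefficients

a1, se_a1 = fit1.params["S"], fit1.bse["S"]
b1, se_b1 = fit2.params["S"], fit2.bse["S"]

# z for the S coefficients, assuming the two fits are independent.
z = (a1 - b1) / np.sqrt(se_a1**2 + se_b1**2)
p_value = 2 * norm.sf(abs(z))  # two-sided tail probability under N(0, 1)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```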

Best Answer

From the ideal gas law, $PV=nRT$, a proportional model is suggested; make sure your units are in absolute temperature. Asking for a proportional result would also imply a proportional error model. Consider, perhaps, $Y=a D^b S^c$; for multiple linear regression one can then use $\ln (Y)=\ln (a)+b \ln (D)+c \ln (S)$ by taking the logarithms of the $Y$, $D$, and $S$ values, so that this looks like $Y_l=a_l+b D_l+c S_l$, where the $l$ subscripts mean "logarithm of." This may work better than the linear model you are using, and the answers are then of the relative-error type.
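A minimal sketch of fitting that log-transformed power-law model, again assuming the hypothetical `df1` from the first sketch, with $S$ and $D$ already converted to absolute temperature (e.g., Kelvin or Rankine):

```python
import numpy as np
import statsmodels.formula.api as smf

# Fit ln(Y) = ln(a) + b*ln(D) + c*ln(S); patsy applies np.log in-formula.
log_fit = smf.ols("np.log(Y) ~ np.log(D) + np.log(S)", data=df1).fit()
a = np.exp(log_fit.params["Intercept"])
b = log_fit.params["np.log(D)"]
c = log_fit.params["np.log(S)"]
print(f"Y ~ {a:.3g} * D^{b:.3g} * S^{c:.3g}")
```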

To verify what type of model to use, try one and check whether the residuals are homoscedastic. If they are not, you have a biased model; then do something else, such as modeling the logarithms (as above), taking one or more reciprocals of the $x$ or $y$ data, square roots, squares, exponentials, and so forth, until the residuals are homoscedastic. If no model yields homoscedastic residuals, then use multiple linear Theil regression, with censoring if needed.
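One way to run that check, continuing the sketch above (`log_fit` is the log-model fit from the previous block; the Breusch–Pagan test from `statsmodels` is shown as one common formal test, not necessarily the only appropriate one):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Eyeball check: residuals vs. fitted should show no fan/funnel shape.
plt.scatter(log_fit.fittedvalues, log_fit.resid, s=8)
plt.axhline(0, color="gray")
plt.xlabel("fitted ln(Y)")
plt.ylabel("residual")
plt.show()

# Formal check: a small p-value suggests heteroscedastic residuals.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(log_fit.resid,
                                              log_fit.model.exog)
print(f"Breusch-Pagan p = {lm_p:.3f}")
```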

Normality of the data on the $y$-axis is not required, but outliers can, and often do, markedly distort the regression parameter results. If homoscedasticity cannot be achieved, then ordinary least squares should not be used and some other type of regression needs to be performed, e.g., weighted regression, Theil regression, least squares in $x$, Deming regression, and so forth. Also, the errors should not be serially correlated.
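On the Theil option: `scipy.stats.theilslopes` only handles a single predictor, but scikit-learn's `TheilSenRegressor` accepts several, so a robust refit of the log model might look like the sketch below (reusing the hypothetical `df1`; this is a substitute estimator for illustration, not necessarily the exact censored Theil procedure meant above):

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

# Robust (Theil-Sen) fit of ln(Y) on ln(D), ln(S).
X = np.log(df1[["D", "S"]].to_numpy())
y = np.log(df1["Y"].to_numpy())
ts = TheilSenRegressor(random_state=0).fit(X, y)
print("b, c =", ts.coef_, " ln(a) =", ts.intercept_)
```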

The meaning of the output, $z = (a_{1} - b_{1}) / \sqrt{SE_{a_{1}}^2 + SE_{b_{1}}^2}$, may or may not be relevant. This assumes that the total variance is the sum of two independent variances. To put this another way, independence is orthogonality (perpendicularity) on an $x,y$ plot: the total variability (variance) then follows the Pythagorean theorem, $H=+\sqrt{A^2+O^2}$, which may or may not be the case for your data. If it is, then the $z$-statistic is a relative distance, i.e., a difference of means (a distance) divided by the Pythagorean (a.k.a. vector) addition of the standard errors (SEs), which are standard deviations (SDs) divided by $\sqrt{N}$; the SEs are themselves distances. Dividing one distance by the other then normalizes them, i.e., the difference in means divided by the total (standard) error, which is then in a form where one can apply $N(0,1)$ to find a probability.

Now, what happens if the measures are not independent, and how can one test for that? You may remember from geometry that triangles that are not right-angled add their sides via the law of cosines, $C^2=A^2+B^2-2 A B \cos (\theta)$, where $\theta =\angle(A,B)$. That is, when there is something other than a 90-degree angle between the axes, we have to include that angle in the calculation of total distance. Recall also that correlation is standardized covariance. For total distance $\sigma _T$ and correlation $\rho_{A,B}$, this becomes $\sigma _T^2=\sigma _A^2+\sigma _B^2-2 \sigma _A \sigma _B \rho_{A,B}$. In other words, if the estimates are correlated (e.g., pairwise), they are not independent.
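As a worked example with made-up numbers (every value below is an assumption for illustration, not from the question), the correlation-corrected total error and $z$ would be:

```python
import numpy as np

# Made-up illustration values: coefficient difference, the two
# standard errors, and their correlation.
diff = 0.30           # a1 - b1
se_a, se_b = 0.12, 0.15
rho = 0.4

# Law-of-cosines form for the variance of a difference of correlated
# estimates: sigma_T^2 = se_a^2 + se_b^2 - 2*rho*se_a*se_b
se_total = np.sqrt(se_a**2 + se_b**2 - 2 * rho * se_a * se_b)
z = diff / se_total
print(f"se_total = {se_total:.3f}, z = {z:.2f}")
```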
