Least Squares – Accurate Estimation of Ratio of Regression Coefficients

efficiency, estimators, least-squares, ratio, unbiased-estimator

What is the best method of estimating a ratio of regression coefficients $\beta_1/\beta_2$, under the usual assumptions and in practice? I have two well-characterized signals $X_1, X_2$ that have been multiplied by unknown coefficients $\beta_1, \beta_2$. The output has then been corrupted by additive noise that is approximately normal with zero mean, yielding the standard regression model, where $Y$ is the observed data:

$$Y=\beta_1X_1+\beta_2X_2+\epsilon.$$

Estimating the ratio by standard OLS is not straightforward, since (assuming independence of $\hat \beta_1$ and $\hat \beta_2$, which should not be assumed in general)
$$\mathbb{E}\!\left(\frac{\hat \beta_1}{\hat \beta_2}\right) =\mathbb{E}(\hat \beta_1)\,\mathbb{E}\!\left(\frac{1}{\hat \beta_2}\right),$$
and the second factor does not exist: the reciprocal of a normally distributed variable has no finite mean (let alone variance). So the trivial estimator $\hat \beta_1/\hat \beta_2$ does not even have a finite expectation, and in particular is not unbiased for $\beta_1 /\beta_2 = \mathbb{E}(\hat \beta_1)/\mathbb{E}(\hat \beta_2)$.
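
To spell out the standard argument: if $\hat \beta_2 \sim N(\beta_2, \sigma^2)$, its density is bounded below by some $c > 0$ on a neighborhood $(-\delta, \delta)$ of zero, so
$$\mathbb{E}\left|\frac{1}{\hat \beta_2}\right| = \int_{-\infty}^{\infty} \frac{1}{|b|}\,\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(b-\beta_2)^2/(2\sigma^2)}\,db \;\ge\; c \int_{-\delta}^{\delta} \frac{db}{|b|} = \infty,$$
hence $1/\hat \beta_2$ has no finite mean, and consequently $\hat \beta_1/\hat \beta_2$ has neither a mean nor a variance.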

What is an unbiased and efficient way of estimating the ratio? Is there a source of rigorously developed estimators for this scenario? In particular, I am interested in obtaining as sharp a confidence interval (CI) as possible.

Below is straightforward code for trying different estimators. Decreasing the frequency difference between the sinusoids at 1600 Hz and 1623 Hz makes the problem harder and increases the bias. Note that the toy example may not generalize to more complicated scenarios, but it serves as a useful workbench.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

def ratioestimator(beta1=0.5, beta2=1.0, N=40000, n=500):
    b_strap = np.zeros(N)
    t = np.arange(0, n)
    # Two sinusoids, 1600 Hz and 1623 Hz, at a 44.1 kHz sampling rate
    X1 = np.sin(t * 2 * np.pi / 44100 * 1600)
    X2 = np.sin(t * 2 * np.pi / 44100 * 1623)
    for i in range(N):
        eps = np.random.normal(0, 3, n)
        Y = X1 * beta1 + X2 * beta2 + eps

        # ESTIMATOR AND ESTIMATES:
        M = sm.OLS(Y, np.column_stack((np.ones(len(X1)), X1, X2))).fit()
        b_strap[i] = M.params[1] / M.params[2]
        # //ESTIMATOR AND ESTIMATES
    return b_strap

# Analysis:
b_strap = ratioestimator()
bins = np.arange(0, 1, 0.05)
plt.hist(b_strap, bins=bins)
plt.show()
print("Mean of estimates: ", np.mean(b_strap))
print("Median of estimates: ", np.median(b_strap))
print("10% CI of estimates:", np.percentile(b_strap, 45), np.percentile(b_strap, 55))
print("50% CI of estimates:", np.percentile(b_strap, 25), np.percentile(b_strap, 75))
print("80% CI of estimates:", np.percentile(b_strap, 10), np.percentile(b_strap, 90))
print("95% CI of estimates:", np.percentile(b_strap, 2.5), np.percentile(b_strap, 97.5))

Best Answer

Reformulate the problem and maybe use nonlinear regression. If your regression model is $$ y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i $$ and the parameter of interest is $\theta =\frac{\beta_1}{\beta_2}$, write the model as $$y_i=\beta_0 + \theta \beta_2 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \beta_0 + \beta_2 \left( \theta x_{i1} +x_{i2} \right) + \epsilon_i. $$ This model is no longer linear in the parameters, but it can be fitted via nonlinear least squares, as in Confidence interval for GLM or the maximum of a function?. Similar ideas can be adapted to other kinds of regression models.
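
For the Python toy example in the question, here is a minimal sketch of the same reparameterized fit using scipy.optimize.curve_fit (the function name ratio_model, the seed, and the starting values p0 are my illustrative choices, not part of the original post):

import numpy as np
from scipy.optimize import curve_fit

# Reparameterized model: Y = b0 + b2*(theta*x1 + x2) + eps,
# so theta = beta1/beta2 is estimated directly as a parameter.
def ratio_model(X, b0, b2, theta):
    x1, x2 = X
    return b0 + b2 * (theta * x1 + x2)

t = np.arange(500)
x1 = np.sin(t * 2 * np.pi / 44100 * 1600)
x2 = np.sin(t * 2 * np.pi / 44100 * 1623)
rng = np.random.default_rng(0)
y = 0.5 * x1 + 1.0 * x2 + rng.normal(0, 3, t.size)  # true theta = 0.5

popt, pcov = curve_fit(ratio_model, (x1, x2), y, p0=[0.0, 1.0, 1.0])
print("theta_hat:", popt[2], "Wald SE:", np.sqrt(pcov[2, 2]))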

Then, finally, you can do inference for $\theta$ using profile likelihood; search this site, and also see the linked post above.
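
As a sketch of what the profiling amounts to here (my own illustration, not taken from the linked post): for each fixed $\theta$ the model is linear in $(\beta_0, \beta_2)$, so those can be concentrated out by OLS, and the profile confidence set inverts the usual one-degree-of-freedom extra-sum-of-squares $F$ test over a grid of $\theta$ values:

import numpy as np
from scipy import stats

def profile_theta_ci(y, x1, x2, grid, level=0.95):
    # For fixed theta, y = b0 + b2*(theta*x1 + x2) + eps is linear,
    # so (b0, b2) are concentrated out by OLS, keeping only RSS(theta).
    n = y.size
    rss = np.empty(grid.size)
    for j, th in enumerate(grid):
        Z = np.column_stack((np.ones(n), th * x1 + x2))
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        rss[j] = resid @ resid
    # Accept theta where the 1-df F statistic is below the cutoff:
    # (RSS(theta) - RSS_min) / (RSS_min / (n - 3)) <= F_{1, n-3}(level)
    cutoff = rss.min() * (1 + stats.f.ppf(level, 1, n - 3) / (n - 3))
    accepted = grid[rss <= cutoff]
    return accepted.min(), accepted.max()

The endpoints are only as fine as the grid, and the acceptance region need not be a single interval when $\hat \beta_2$ is close to zero.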

Edit

Adding an example with comparisons, first simulating some data in R:

b0 <- 1
b1 <- 1
b2 <- -1   # theta=-1

set.seed(7*11*13) # My public seed

x1 <- seq(from=-5,  to=5,  by=1/3)
x2 <- sample(x1)   # Random permutation

Y <- b0 + b1*x1  +  b2*x2  +  rnorm(length(x1), 0, 2)

mydata <- data.frame(Y, x1, x2)

mod_lm <- lm(Y ~ x1 + x2, data=mydata)

mod0 <- nls(Y ~ b0  + b2 * (theta*x1+x2), data=mydata,
            start=list(b0=0, b2=0.5, theta=-2))  

Then, using mod0, construct a confidence interval via profile likelihood:

confint(mod0, parm ="theta")   
Waiting for profiling to be done...
      2.5%      97.5% 
-2.4700532 -0.8152139 

Then comparing with the delta method:

car::deltaMethod(mod_lm, "x1/x2")  
     Estimate       SE    2.5 %  97.5 %
x1/x2 -1.35072  0.34126 -2.01957 -0.6819
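
For reference, the delta method rests on a first-order expansion of $g(\beta_1,\beta_2)=\beta_1/\beta_2$, giving
$$\operatorname{Var}\!\left(\frac{\hat\beta_1}{\hat\beta_2}\right) \approx \frac{\operatorname{Var}(\hat\beta_1)}{\beta_2^2} - \frac{2\beta_1\operatorname{Cov}(\hat\beta_1,\hat\beta_2)}{\beta_2^3} + \frac{\beta_1^2\operatorname{Var}(\hat\beta_2)}{\beta_2^4},$$
an approximation that deteriorates as $\beta_2 \to 0$, which is one reason the two intervals can disagree.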

These intervals are quite different, and I am somewhat surprised by the magnitude of the difference. It is a useful exercise to run simulations comparing the quality of the intervals; a sketch of one such simulation follows below.
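
One possible way to set up that exercise, sketched in Python under the same design as the R example (fixed permuted design, true $\theta=-1$; the grid, replication count, and seed are arbitrary choices of mine, and the profile interval is the grid-based sketch from above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7 * 11 * 13)
b0, b1, b2, sigma = 1.0, 1.0, -1.0, 2.0   # true theta = -1
x1 = np.arange(-5, 5 + 1e-9, 1 / 3)
x2 = rng.permutation(x1)                  # fixed design across replications
n, theta_true = x1.size, b1 / b2
grid = np.linspace(-6, 4, 401)            # grid for profiling theta
z = stats.norm.ppf(0.975)
f_cut = stats.f.ppf(0.95, 1, n - 3)
X = np.column_stack((np.ones(n), x1, x2))
XtX_inv = np.linalg.inv(X.T @ X)

cover_delta = cover_profile = 0
reps = 500
for _ in range(reps):
    y = b0 + b1 * x1 + b2 * x2 + rng.normal(0, sigma, n)
    # Delta-method interval from the OLS fit
    bh = XtX_inv @ (X.T @ y)
    s2 = np.sum((y - X @ bh) ** 2) / (n - 3)
    g = np.array([0.0, 1 / bh[2], -bh[1] / bh[2] ** 2])  # gradient of b1/b2
    se = np.sqrt(s2 * g @ XtX_inv @ g)
    lo, hi = bh[1] / bh[2] - z * se, bh[1] / bh[2] + z * se
    cover_delta += lo <= theta_true <= hi
    # Profile-RSS interval: accept theta with a small 1-df F statistic
    rss = np.empty(grid.size)
    for j, th in enumerate(grid):
        Z = np.column_stack((np.ones(n), th * x1 + x2))
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss[j] = np.sum((y - Z @ coef) ** 2)
    ok = grid[rss <= rss.min() * (1 + f_cut / (n - 3))]
    cover_profile += ok.min() <= theta_true <= ok.max()

print("delta coverage:", cover_delta / reps)
print("profile coverage:", cover_profile / reps)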

For this example I used a value of $\beta_2$ that makes 0 a quite unlikely value. If you redo the simulations with $\beta_2$ closer to zero, the functions used here will run into problems. Below I will look into that problem ... (later)