Solved – Positive linear regression coefficient

linear model, MATLAB, r, regression

I am trying to use R to find the optimal solution for my problem, with the constraint that the coefficients be positive. Here are my data:

      th inp      tcyc        tinst     tmem      tcom
  1   2   2  26219765385  1975872868  52449810   782964
  2   2   4  38080459431  3155342008  76744867  1878903
  3   2   8  64572439641  6230494010 137754355  4351706
  4   2  16 140168021516 13757989992 285524252 10605705
  5   2  32 308925389816 31497131498 628391048 26040711
  6   4   2  13206650786   988226883  25631315   844126
  7   4   4  19078145632  1577873809  37085281  2125333
  8   4   8  33742095874  3114415906  65962626  5222236
  9   4  16  70956149286  6881357755 134957687 12180392
  10  4  32 153411672670 15754506070 296548768 31057252
  11  8   2   6572843040   494094967  12380740   808816
  12  8   4   9452222628   788984621  17538152  2034061
  13  8   8  16765943294  1557329849  30549900  5016827
  14  8  16  34677550217  3440679505  61614420 12493699
  15  8  32  74852648112  7876116794 133525620 29824686
  16 16   2   3252373719   247026385   5958559   672396
  17 16   4   4669800482   394452497   8097991  1676579
  18 16   8   8269859136   778889584  13651458  4196829
  19 16  16  16353025378  1720301596  26775255 10393194
  20 16  32  37113657641  3938965759  55505822 25011009
  21 32   2   1630888153   123512114   2683400   461526
  22 32   4   2293598746   197173135   3682504  1213596
  23 32   8   4045995970   389408822   5858031  3055324
  24 32  16   8217603991   860041282  10973460  7502244
  25 32  32  17978101850  1969647650  22909347 17953100
  26 48   2   1064344042    82295143   1822133   381178
  27 48   4   1523091067   131488491   2331228   949354
  28 48   8   2677097592   259536252   3552229  2381626
  29 48  16   5400541381   573140686   6489032  5875310
  30 48  32  11837404077  1313066425  13318331 13968230

I use linear regression in R, s <- lm(tcyc ~ 0+tinst+tmem+tcom, data=fit), to fit the model with the intercept fixed at 0. But I get negative coefficients, which do not make any sense.

coef(s)

 tinst      tmem      tcom 
20.8745 -281.2288 -320.7204 

I am not sure whether this is the best way to model the data and find the optimal parameters for tinst, tmem and tcom. How do you find positive coefficients for the model?

Explaining this problem in further detail:

Background:
Trying to predict the execution time of an application on future many-core systems empirically, by learning the application's behavior. As it is a multithreaded program, it will hit a communication-contention bottleneck if the application demands heavy inter-core communication. The general system equation looks like

Total execution time in cycles (T_cyc) = total cycles spent on instructions (T_inst) + total cycles spent on memory instructions (T_mem) + total cycles spent on communication (T_com),

i.e., T_cyc = T_inst + T_mem + T_com.

If I use a simulator, I can get T_inst, T_mem and T_com directly and find the independent contribution of each component to T_cyc. But on real hardware, I can only get event counts, i.e., N_inst, N_mem and N_com.
So what I have is

T_cyc = a*N_inst + b*N_mem + c*N_com,

where a, b and c have to be determined.

I tried solving the problem using lsqnonneg (non-negative least squares) in MATLAB to find a, b and c. At times the data give b and c values of ZERO, which is totally meaningless.

Things to notice:
N_inst is a very large value, while N_mem and N_com are much smaller in magnitude, and hence I face this problem of b and c coming out as ZERO; see the sketch below.
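
For reference, rescaling the predictors to comparable magnitudes and mapping the coefficients back is one way to check whether the scale gap, rather than the model itself, is producing the zeros. A minimal sketch, assuming the data frame fit shown above:

X   <- as.matrix(fit[, c("tinst", "tmem", "tcom")])
scl <- apply(X, 2, max)            ## one scale factor per column
Xs  <- sweep(X, 2, scl, "/")       ## columns rescaled into [0, 1]
b.s <- coef(lm(fit$tcyc ~ Xs - 1)) ## fit on the scaled columns
b   <- b.s / scl                   ## coefficients on the original scale

In exact arithmetic the rescaling only changes the units of the coefficients, so if the zeros persist they come from the fit itself rather than from numerical conditioning.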

Questions:
1. Is this the proper tool for solving such a linear system? If not, what else should I try?
2. Is the problem due to the sample size fed to the solver?
3. I see that for most applications the trends of N_cyc, N_inst and N_mem are monotonic, but N_com is non-monotonic. Can that affect the solved values? If so, how do I isolate this component and find its contribution individually?

Best Answer

It is often the case that suppressing the intercept leads to regression coefficients that don't make sense. In my experience, there are rarely cases where suppressing the intercept is justified, even when scientific plausibility suggests that it might be (such as stopping distance versus cruising speed, or creatinine clearance versus kidney mass in grams: you LEAVE the intercept IN with such analyses!). This is a problem of extrapolation.

Just eyeballing these data, I imagine that the estimated intercept would be largely non-zero. Since these data appear to come from some sort of computing-time measurement, comparing flops versus elapsed time, etc., the non-zero intercept could have a host of interpretations, such as the boot time for running a process, system lag as memory is allocated for an operation, or any other non-negligible system processes that aren't measured as part of an experimental run. Furthermore, and more subtly, there may be non-linear effects influencing your results. The regression coefficient from intercept-in OLS still provides a great way of estimating the first-order linear trend through those data, even if the trend is curvilinear... but only when you leave the intercept IN.
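
For comparison, a minimal sketch of the intercept-in fit, using the same data frame fit as in the question:

s2 <- lm(tcyc ~ tinst + tmem + tcom, data = fit) ## intercept left IN
summary(s2) ## the (Intercept) row absorbs fixed costs not in the predictors

The sign and size of the estimated intercept are themselves informative here, for the reasons given above.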

My first recommendation is to look at the output of pairs(fit), and just look at the trends.

Nonetheless, if your goal is simply to find optimal positive coefficients for the model, you can do so using by-hand optimization, either ML or Gibbs sampling, though don't be surprised if those results make no sense. Example of by-hand optimization:

X <- model.matrix(~ tinst+tmem+tcom-1, data=fit)
y <- fit$tcyc
negLogLik <- function(b) {
  b <- exp(b)          ## exponentiate to restrict to positive-only values
  yhat <- X %*% b      ## calculate fitted values
  sum((y - yhat)^2)    ## objective function: residual sum of squares
}

opt <- nlm(negLogLik, c(1,1,1)) ## minimize the objective function
exp(opt$estimate)               ## back-transform to the positive coefficients
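
If you want hard non-negativity constraints instead, with exact zeros allowed on the boundary, the nnls package provides an R counterpart to MATLAB's lsqnonneg. A minimal sketch, assuming the package is installed (install.packages("nnls")):

library(nnls)
X <- model.matrix(~ tinst+tmem+tcom-1, data=fit)
nn <- nnls(X, fit$tcyc)
nn$x ## non-negative coefficients, in the column order of X

Note the difference from the exp() trick above: nnls can return exact zeros (the behavior you saw with lsqnonneg), whereas the log-scale parameterization keeps every coefficient strictly positive.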