I am trying to use R to find the optimal solution for my problem with positive coefficients. Here are my data:
th inp tcyc tinst tmem tcom
1 2 2 26219765385 1975872868 52449810 782964
2 2 4 38080459431 3155342008 76744867 1878903
3 2 8 64572439641 6230494010 137754355 4351706
4 2 16 140168021516 13757989992 285524252 10605705
5 2 32 308925389816 31497131498 628391048 26040711
6 4 2 13206650786 988226883 25631315 844126
7 4 4 19078145632 1577873809 37085281 2125333
8 4 8 33742095874 3114415906 65962626 5222236
9 4 16 70956149286 6881357755 134957687 12180392
10 4 32 153411672670 15754506070 296548768 31057252
11 8 2 6572843040 494094967 12380740 808816
12 8 4 9452222628 788984621 17538152 2034061
13 8 8 16765943294 1557329849 30549900 5016827
14 8 16 34677550217 3440679505 61614420 12493699
15 8 32 74852648112 7876116794 133525620 29824686
16 16 2 3252373719 247026385 5958559 672396
17 16 4 4669800482 394452497 8097991 1676579
18 16 8 8269859136 778889584 13651458 4196829
19 16 16 16353025378 1720301596 26775255 10393194
20 16 32 37113657641 3938965759 55505822 25011009
21 32 2 1630888153 123512114 2683400 461526
22 32 4 2293598746 197173135 3682504 1213596
23 32 8 4045995970 389408822 5858031 3055324
24 32 16 8217603991 860041282 10973460 7502244
25 32 32 17978101850 1969647650 22909347 17953100
26 48 2 1064344042 82295143 1822133 381178
27 48 4 1523091067 131488491 2331228 949354
28 48 8 2677097592 259536252 3552229 2381626
29 48 16 5400541381 573140686 6489032 5875310
30 48 32 11837404077 1313066425 13318331 13968230
I use linear regression in R with the intercept fixed at 0 to get the optimal values:
s <- lm(tcyc ~ 0+tinst+tmem+tcom, data=fit)
But I get negative coefficients, which do not make any sense:
coef(s)
tinst tmem tcom
20.8745 -281.2288 -320.7204
I am not sure whether this is the best way to model the problem and find the optimal parameters for tinst, tmem and tcom. How do you find positive coefficients for the model?
Further explanation of the problem in detail:
Background:
I am trying to predict the execution time of an application on future many-core systems empirically, by learning the application's behavior. As it is a multithreaded program, it will have a communication contention bottleneck if the application demands heavy inter-core communication. The general system equation looks like
Total execution time in cycles (T_cyc) = Total cycles spent in instructions (T_inst) + Total cycles spent in memory instructions (T_mem) + Total cycles spent in communication (T_com)
i.e., T_cyc = T_inst + T_mem + T_com.
If I use a simulator, I can get T_inst, T_mem and T_com directly and find the independent contribution of each component to T_cyc. But on real hardware I can only get the counts (numbers of events), i.e., N_inst, N_mem and N_com.
So what I have is
T_cyc = a*N_inst + b*N_mem + c*N_com
where a, b and c have to be determined.
I tried solving the problem using lsqnonneg (non-negative least squares) in MATLAB to find a, b and c. For some data sets I get b and c equal to zero, which is meaningless.
Things to notice:
N_inst is a very large value. N_mem and N_com are somewhat lower in magnitude, and hence I face this problem of b and c coming out as zero.
Questions:
1. Is this a proper tool to solve such a linear equation system? If not, what else should I try?
2. Is it a problem due to the sample size fed to the solver?
3. I see that for most applications the trends of N_cyc, N_inst and N_mem are monotonic but N_com is non-monotonic. Can this affect the solved values? If so, how can I isolate this component and find its contribution individually?
Best Answer
It is often the case that suppressing the intercept leads to regression coefficients that don't make sense. In my experience, there are rarely cases where suppressing the intercept makes sense, even if the scientific plausibility suggests that it might be justifiable (such as stopping distance versus cruising speed or creatinine clearance versus kidney mass in grams: you LEAVE the intercept IN with such analyses!). This is a problem of extrapolation.
Just eyeballing these data, I imagine that the estimated intercept would be substantially different from zero. Since these data appear to come from some sort of computing-time measurement, comparing flops versus elapsed time, etc., the non-zero intercept could have a host of interpretations, such as a boot time for running a process, a system lag as memory is allocated for an operation, or any other non-negligible system processes that aren't measured as part of an experimental run. Furthermore, and more subtly, there may be non-linear effects influencing your results. The regression coefficient from intercept-in OLS still provides a great way of estimating the first-order linear trend through those data, even if the trend is curvilinear... but only when you leave the intercept IN.
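A quick way to check the point about the intercept is to fit both versions side by side. The sketch below is only an illustration: it rebuilds the data frame fit from the table in the question (the read.table call is an assumption about how the data were loaded; the row index column is dropped), refits the intercept-suppressed model, and then fits the same model with the intercept left in. The names s0 and s1 are placeholders that do not appear in the question.

```r
# Rebuild the question's data (assumption: same columns, row index omitted).
fit <- read.table(header = TRUE, text = "th inp tcyc tinst tmem tcom
2 2 26219765385 1975872868 52449810 782964
2 4 38080459431 3155342008 76744867 1878903
2 8 64572439641 6230494010 137754355 4351706
2 16 140168021516 13757989992 285524252 10605705
2 32 308925389816 31497131498 628391048 26040711
4 2 13206650786 988226883 25631315 844126
4 4 19078145632 1577873809 37085281 2125333
4 8 33742095874 3114415906 65962626 5222236
4 16 70956149286 6881357755 134957687 12180392
4 32 153411672670 15754506070 296548768 31057252
8 2 6572843040 494094967 12380740 808816
8 4 9452222628 788984621 17538152 2034061
8 8 16765943294 1557329849 30549900 5016827
8 16 34677550217 3440679505 61614420 12493699
8 32 74852648112 7876116794 133525620 29824686
16 2 3252373719 247026385 5958559 672396
16 4 4669800482 394452497 8097991 1676579
16 8 8269859136 778889584 13651458 4196829
16 16 16353025378 1720301596 26775255 10393194
16 32 37113657641 3938965759 55505822 25011009
32 2 1630888153 123512114 2683400 461526
32 4 2293598746 197173135 3682504 1213596
32 8 4045995970 389408822 5858031 3055324
32 16 8217603991 860041282 10973460 7502244
32 32 17978101850 1969647650 22909347 17953100
48 2 1064344042 82295143 1822133 381178
48 4 1523091067 131488491 2331228 949354
48 8 2677097592 259536252 3552229 2381626
48 16 5400541381 573140686 6489032 5875310
48 32 11837404077 1313066425 13318331 13968230")

# Model from the question: intercept suppressed with "0 +".
s0 <- lm(tcyc ~ 0 + tinst + tmem + tcom, data = fit)

# Same predictors, intercept left in (R's default).
s1 <- lm(tcyc ~ tinst + tmem + tcom, data = fit)

# Print both coefficient vectors for comparison.
print(coef(s0))
print(coef(s1))
```

Comparing the two coefficient vectors, together with summary(s1) for the standard error of the intercept, should give a sense of how much the zero-intercept constraint is driving the estimates.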
My first recommendation is to look at the output from running
pairs(fit)
and just look at the trends. Nonetheless, if your goal is simply to find optimal positive coefficients in the model, you can do so using by-hand optimization, either maximum likelihood (ML) or Gibbs sampling, though don't be surprised if those results make no sense. Example of by-hand optimization: