Model Building: Missing Data or Large Gap between data points

lognormal distribution, regression

I am currently trying to build a model using a data set that has a large gap between data points. When I look at the correlation, I clearly see a negative regression line, but I am worried about the gap that exists between the points.

I built a simple linear model. Although it has a high R squared, I don't think simple linear regression is the best model for these data; the relationship looks like a negative exponential. I am posting here to get some expert thoughts on two things: what should I do when the data have a large gap between points, and do these data show a linear relationship or a strongly nonlinear one?

Data Set:

   density  co2
1     20.4 38.8
2     27.4 31.5
3    106.2 10.6
4     80.4 16.1
5    141.3  7.7
6    130.9  8.3
7    121.7  8.5
8    106.5 11.1
9    130.5  8.6
10   101.1 11.1
11   123.9  9.8
12   144.2  7.8
13    29.5 31.8
14    30.8 31.6
15    26.5 34.0
16    35.7 28.9
17    30.0 28.8
18   106.2 10.5
19    97.0 12.3
20    90.1 13.2
21   106.7 11.4
22    99.3 11.2
23   107.2 10.3
24   109.1 11.4

Plot:
[Image: simple linear regression plot]

Summary of Linear Model:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.12948    1.21768   31.31  < 2e-16 ***
density     -0.24247    0.01261  -19.22 3.04e-15 ***
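
For anyone who wants to reproduce the numbers above, here is a minimal sketch in plain Python. The summary looks like R output, so Python is a substitution here, and the helper name `ols` is hypothetical. It refits co2 on density by ordinary least squares using the 24 rows listed in the data set, and it also reports the widest gap between neighbouring density values, to put a number on the gap I am worried about.

```python
# Data copied from the table in the question.
density = [20.4, 27.4, 106.2, 80.4, 141.3, 130.9, 121.7, 106.5,
           130.5, 101.1, 123.9, 144.2, 29.5, 30.8, 26.5, 35.7,
           30.0, 106.2, 97.0, 90.1, 106.7, 99.3, 107.2, 109.1]
co2 = [38.8, 31.5, 10.6, 16.1, 7.7, 8.3, 8.5, 11.1,
       8.6, 11.1, 9.8, 7.8, 31.8, 31.6, 34.0, 28.9,
       28.8, 10.5, 12.3, 13.2, 11.4, 11.2, 10.3, 11.4]


def ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared).

    Hypothetical helper written for this sketch, not a library function.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b, sxy ** 2 / (sxx * syy)


intercept, slope, r2 = ols(density, co2)
print(f"intercept={intercept:.5f} slope={slope:.5f} R^2={r2:.3f}")

# Widest gap between neighbouring density values once they are sorted.
xs = sorted(density)
gap_width, gap_lo, gap_hi = max(
    (xs[i + 1] - xs[i], xs[i], xs[i + 1]) for i in range(len(xs) - 1)
)
print(f"widest gap: {gap_lo} to {gap_hi} (width {gap_width:.1f})")
```

The intercept and slope should agree with the summary table to the printed precision. The gap check only describes where observations are missing; it says nothing about which functional form is right inside the gap.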

In addition, if I log-transform both density and co2, I see the following behavior. Since data are missing in the middle, it's really hard to choose between a log-transformed model and the base model.

[Image: plot of the log-transformed variables]

Best Answer

Presumably co2 means "carbon dioxide" and density means what it says. Even so, it would help to have more detail on what is happening here. Is there no physics or chemistry or engineering background to help us, or you, or everyone?

Why is there a gap? Is there no hint from the background to the data?

Are these the results of an experiment in which one variable is controlled, or something else? Which variable do you want to predict and/or regard as the response or outcome (dependent variable, if you will)? You appear to be regarding co2 as the outcome. Is that prescribed by the problem?

Some rough experiments indicate that logging just one variable might make sense too. Linear is a lousy model because if you extrapolate you soon produce negative predictions for one or other variable, which is surely unphysical.
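
To make the extrapolation point concrete, here is a rough sketch in plain Python, not necessarily the experiments referred to above. It assumes co2 is the response, fits three candidate forms by least squares (straight line, log of co2 only, and log of both variables), and then predicts at a density of 160. That value is a hypothetical one chosen just beyond the largest observed density of 144.2, and the helper name `ols` is also hypothetical.

```python
import math

# Data copied from the table in the question.
density = [20.4, 27.4, 106.2, 80.4, 141.3, 130.9, 121.7, 106.5,
           130.5, 101.1, 123.9, 144.2, 29.5, 30.8, 26.5, 35.7,
           30.0, 106.2, 97.0, 90.1, 106.7, 99.3, 107.2, 109.1]
co2 = [38.8, 31.5, 10.6, 16.1, 7.7, 8.3, 8.5, 11.1,
       8.6, 11.1, 9.8, 7.8, 31.8, 31.6, 34.0, 28.9,
       28.8, 10.5, 12.3, 13.2, 11.4, 11.2, 10.3, 11.4]


def ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy ** 2 / (sxx * syy)


log_density = [math.log(v) for v in density]
log_co2 = [math.log(v) for v in co2]

fits = {
    "co2 ~ density": ols(density, co2),
    "log(co2) ~ density": ols(density, log_co2),
    "log(co2) ~ log(density)": ols(log_density, log_co2),
}
# Caution: R^2 is computed on each model's own response scale, so the
# values for logged and unlogged co2 are not strictly comparable.
for name, (a, b, r2) in fits.items():
    print(f"{name:26s} a={a:.4f} b={b:.4f} R^2={r2:.3f}")

x_new = 160.0  # hypothetical density, just beyond the observed range

a, b, _ = fits["co2 ~ density"]
linear_pred = a + b * x_new  # a straight line can cross zero

a, b, _ = fits["log(co2) ~ density"]
semilog_pred = math.exp(a + b * x_new)  # back-transformed, always positive

a, b, _ = fits["log(co2) ~ log(density)"]
loglog_pred = math.exp(a + b * math.log(x_new))  # also always positive

print(f"predicted co2 at density {x_new}: linear={linear_pred:.2f}, "
      f"log(co2)={semilog_pred:.2f}, log-log={loglog_pred:.2f}")
```

With the fitted line from the question, the linear prediction at 160 is already below zero, whereas either model with a logged response can only return positive co2. None of this settles which form is right across the gap; that still needs the background on how the data arose.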
