Solved – Why does GBM predict different values for the same data

Tags: boosting, machine-learning, predictive-models, r

I am new to R. I am building a predictive model with the gbm package. My problem is that I get different predictions for a row taken from the data frame that was used to build the model and for a separate data frame containing the same values.

I randomly split my data into two sets; the training set is loaded into `head`:

head <- read.csv(…)

I build a model with gbm:

fit1000x3 <- gbm(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10+V11, data=head, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)

When I create a data frame with values equal to head[1,]:

xxx <- data.frame(V1=…)

I receive different values for:

predict(fit1000x3, newdata=head[1,], n.trees=100)

and

predict(fit1000x3, newdata=xxx, n.trees=100)

Here is the series of commands I have run:

> head <- read.csv(...)
> fit1000x3 <- gbm(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10+V11, data=head, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        0.1707            -nan     0.1000    0.0152
     2        0.1581            -nan     0.1000    0.0122
     3        0.1478            -nan     0.1000    0.0100
     4        0.1395            -nan     0.1000    0.0079
     5        0.1326            -nan     0.1000    0.0067
     6        0.1267            -nan     0.1000    0.0056
     7        0.1211            -nan     0.1000    0.0052
     8        0.1168            -nan     0.1000    0.0039
     9        0.1133            -nan     0.1000    0.0032
    10        0.1103            -nan     0.1000    0.0027
   100        0.0773            -nan     0.1000   -0.0002
   200        0.0734            -nan     0.1000   -0.0002
   300        0.0714            -nan     0.1000   -0.0002
   400        0.0695            -nan     0.1000   -0.0002
   500        0.0681            -nan     0.1000   -0.0002
   600        0.0672            -nan     0.1000   -0.0002
   700        0.0663            -nan     0.1000   -0.0002
   800        0.0655            -nan     0.1000   -0.0002
   900        0.0648            -nan     0.1000   -0.0001
  1000        0.0643            -nan     0.1000   -0.0001

> predict(fit1000x3, newdata=head[1,], n.trees=100)
    [1] 0.1420456
> head[1,]
      V1   V2                            V3  V4 V5   V6   V7            V8       V9
    1  0 0.35 m01xrfn2 Effective resolution 5.1 Nu null null niceCharacter unitName
       V10    V11
    1 null nextag
> xxx <- data.frame(V1=0, V2=0.35, V3="m01xrfn2 Effective resolution",   V4="5.1", V5="Nu", V6="null", V7="null", V8="niceCharacter", V9="unitName", V10="null", V11="nextag")
> xxx
      V1   V2                            V3  V4 V5   V6   V7            V8        V9
    1  0 0.35 m01xrfn2 Effective resolution 5.1 Nu null null niceCharacter unitName
       V10    V11
    1 null nextag
> head[1,]
      V1   V2                            V3  V4 V5   V6   V7            V8       V9
    1  0 0.35 m01xrfn2 Effective resolution 5.1 Nu null null niceCharacter unitName
       V10    V11
    1 null nextag
> predict(fit1000x3, newdata=xxx, n.trees=100)
    [1] 0.2068787

> str(head[1,])
    'data.frame':   1 obs. of  11 variables:
     $ V1 : int 0
     $ V2 : num 0.35
     $ V3 : Factor w/ 113 levels "m01t_ Contains",..: 4
     $ V4 : Factor w/ 884 levels ".","0","01","02",..: 503
     $ V5 : Factor w/ 11 levels "aN","aNu","aU",..: 4
     $ V6 : Factor w/ 4 levels "null","propertyAlias",..: 1
     $ V7 : Factor w/ 9 levels "attach","block",..: 6
     $ V8 : Factor w/ 8 levels "attach","block",..: 5
     $ V9 : Factor w/ 4 levels "null","propertyAlias",..: 4
     $ V10: Factor w/ 2 levels "null","undef": 1
     $ V11: Factor w/ 368 levels "101reviews","123football",..: 223
> str(xxx)
    'data.frame':   1 obs. of  11 variables:
     $ V1 : num 0
     $ V2 : num 0.35
     $ V3 : Factor w/ 1 level "m01xrfn2 Effective resolution": 1
     $ V4 : Factor w/ 1 level "5.1": 1
     $ V5 : Factor w/ 1 level "Nu": 1
     $ V6 : Factor w/ 1 level "null": 1
     $ V7 : Factor w/ 1 level "null": 1
     $ V8 : Factor w/ 1 level "niceCharacter": 1
     $ V9 : Factor w/ 1 level "unitName": 1
     $ V10: Factor w/ 1 level "null": 1
     $ V11: Factor w/ 1 level "nextag": 1

Best Answer

The factors, as always. The model does not appear to use the label of a factor value, but rather its integer code, that is, its position among the factor's levels.

I was able to reproduce your error with the OrchardSprays data set:
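To make the "position in the levels" idea concrete, here is a minimal base-R sketch that needs no gbm at all. The name `train_levels` is made up for illustration and stands in for the levels a training column would carry (A to H, as in OrchardSprays$treatment). It only shows how the integer code behind the same label changes with the level set; it does not claim to reproduce what gbm does internally.

```r
# Levels as they would exist in a training column (assumed here: A..H).
train_levels <- LETTERS[1:8]

# A factor built on its own only knows the single level "D",
# so its underlying integer code is 1.
alone <- factor("D")

# The same label built against the training levels gets code 4.
aligned <- factor("D", levels = train_levels)

as.integer(alone)    # 1
as.integer(aligned)  # 4

# "A" against the training levels has code 1 as well,
# i.e. the same code as the stand-alone "D".
as.integer(factor("A", levels = train_levels))  # 1
```

If a model looks only at the code, the stand-alone "D" and a properly levelled "A" are indistinguishable.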

library(gbm)
data(OrchardSprays)

model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)

firstrow <- OrchardSprays[1,]
str(firstrow)

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818

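The last two calls return the same value because "A" is at position 1 in the levels of OrchardSprays$treatment, and "D" is likewise at position 1 in the one-level factor that data.frame() created for manualFirstrow. Passing the full set of levels when constructing the data frame does the trick: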

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
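With many factor columns, as in the original question (V3 to V11), setting the levels by hand for each one gets tedious. Below is a rough sketch of a helper that copies the levels across from the training data. `align_factor_levels` is a hypothetical name, not part of gbm or base R, and it assumes the new data uses the same column names as the training data. Note that from R 4.0.0 on, data.frame() keeps strings as character by default, so the helper converts through as.character() to handle both cases.

```r
# Hypothetical helper (not part of gbm): rebuild each column of `newdata`
# that is a factor in `train`, using the levels of the training column.
align_factor_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      # Labels that never occurred in the training data become NA.
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

data(OrchardSprays)
manual <- data.frame(decrease = 57, rowpos = 1, colpos = 1, treatment = "D")
fixed  <- align_factor_levels(manual, OrchardSprays)

# The manual row now carries the same levels as the training column.
identical(levels(fixed$treatment), levels(OrchardSprays$treatment))
```

The resulting `fixed` data frame can then be passed as `newdata` to predict(). Any NA produced by an unseen label is worth checking before trusting the prediction.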