Solved – Do we have to scale new unseen feature data for prediction

caretmachine learningmultidimensional scalingr

In machine learning most algorithms require some kind of scaling to decrease error. This is my code:

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
head(iris)
X=scale(iris[,-5])
X=data.frame(X)
head(X)
y=iris[,5]
y=data.frame(y)
head(y)
X=cbind(X,y)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=5, repeats=1)
# train the model 
  model <- train(y~., data=X, method="svmLinear2", trControl=control, tuneLength=5)
  # summarize the model
  print(model)
#saving model
save(model, file="model.Rdata")

#loading model
supmod<-load("model.Rdata")

#new data
# Sepal.Length Sepal.Width Petal.Length Petal.Width 
# 4.2             3.2          1.7         0.23  
new<-c(4.2,3.2,1.7,0.23)
pre<-predict(supmod,new)
#dont know how to predict this model with unseen data

In the above code I have two question one related to scaling of the new data and other related to coding error passing the new data to the loaded model.

The real iris feature data looks like this

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

But before passing to svm algorithm we have to scale the data and i use scale() to scale data and its look like this.

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8976739  1.01560199    -1.335752   -1.311052
2   -1.1392005 -0.13153881    -1.335752   -1.311052
3   -1.3807271  0.32731751    -1.392399   -1.311052
4   -1.5014904  0.09788935    -1.279104   -1.311052
5   -1.0184372  1.24503015    -1.335752   -1.311052
6   -0.5353840  1.93331463    -1.165809   -1.048667

It is this scaled data that we use for training and testing our model. lets say I have successfully trained the model and use it for prediction of new unseen data (eg this one row).

Sepal.Length Sepal.Width Petal.Length Petal.Width 
 4.2             3.2          1.7         0.23

Do I need to scale this new data? or I just have to pass this data directly to my model?
The next question is related to a coding error

predict(supmod,new) returns this error

Error in UseMethod("predict") : no applicable method for 'predict'
applied to an object of class "character"

Best Answer

1) You should scale the new data as well. You can scale all the data, training and new data together, if possible. Or you store the scaling function and apply it later to the new data. If you have data d that is normally distributed with, lets say mean=m and sd=s, you scale the data by: (d-m)/s. Just apply this function to the new data as well, using the same mean and sd.

2) You can't assign the data you load directly.

#loading model
supmod<-load("model.Rdata")

The resulting variable does only contain the string "model".

Try this:

load("model.Rdata")

This loads the model, the name of the variable is "model".

3) Futher, you have to pass a data.frame (with the same rownames as the training dataset) to predict:

new <- data.frame(Sepal.Length=4.2, Sepal.Width=3.2, Petal.Length=1.7, Petal.Width=0.23)

pre<-predict(model,new)

Related Solutions

Solved – How to one use the predict function on a lm object where the IVs have been dynamically scaled

When you use the predict with newdata argument you must supply the the data.frame with the same column names. In your code you have

newdata <- expand.grid(X1=c(-1,0,1),X2=c(-1,0,1))
names(newdata) <- c("scale(xxA)","scale(xxB)")

But the formula supplied to lm object is

out ~ scale(xxA)*scale(xxB)

So when you call the predict, it tries to find objects xxA and xxB in your data and apply function scale as per your initial request. But all R finds are objects scale(xxA) and scale(xxB). So naturally it produces the error.

Now if you supply correctly named newdata

newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1))

and try to use it for prediction

newdata$Y <- predict(lm.res.scale,newdata)

R will remember the how it scaled original data and apply the same scaling to your new data. In this case supplied value -1 for xxA will be subtracted the original mean of xxA and divided by the original standard value of xxA. If you want to get prediction of 1 S.D below the mean, you will need to supply this value. In your case then newdata should look like this:

newdata <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))

I gathered all the solutions in one place to compare:

##Prepare data
set.seed(1)
dat <- data.frame(xxA = rnorm(20,10), xxB = rnorm(20,20))
dat$out <- with(dat,xxA+xxB+xxA*xxB+rnorm(20,20))
dat <- within(dat,{
    X1 <- as.numeric(scale(xxA))
    X2 <- as.numeric(scale(xxB))
    })

##Estimate the models
lm.res.scale <- lm(out ~ scale(xxA)*scale(xxB),data=dat)
lm.res.correct <- lm(out~X1*X2,data=dat)
lm.mod <- lm(out ~ I(scale(xxA))*I(scale(xxB)), data=dat)
rms.res <- Glm(out ~ scale(xxA)*scale(xxB),data=dat)


##Build data for prediction
newdata <- expand.grid(xxA=c(-1,0,1),xxB=c(-1,0,1))
newdata$X1<-newdata$xxA
newdata$X2<-newdata$xxB

##Gather the predictions
newdata$Yscaled <- predict(lm.res.scale,newdata)
newdata$Ycorrect <- predict(lm.res.correct,newdata)
newdata$YwithI <- predict(lm.mod,newdata)
newdata$Ywithrms <- Predict(rms.res,xxA=c(-1,0,1),xxB=c(-1,0,1),conf.int=FALSE)[,3]

##Build alternative data for prediction
newdata2 <- expand.grid(xxA=mean(dat$xxA)+sd(dat$xxA)*c(-1,0,1),xxB=mean(dat$xxB)+sd(dat$xxB)*c(-1,0,1))

#Predict
newdata$Yorigsc <- predict(lm.res.scale,newdata2)

I used set.seed so the results should be the same if you try to repeat it. The newdata looks like this:

> newdata
  xxA xxB X1 X2  Yscaled Ycorrect   YwithI Ywithrms  Yorigsc
1  -1  -1 -1 -1 25.79709 225.9562 221.7517 221.7517 225.9562
2   0  -1  0 -1 25.63030 244.5181 243.0404 243.0404 244.5181
3   1  -1  1 -1 25.46351 263.0800 264.3291 264.3291 263.0800
4  -1   0 -1  0 25.36341 234.6981 231.7012 231.7012 234.6981
5   0   0  0  0 26.21499 254.0704 254.0704 254.0704 254.0704
6   1   0  1  0 27.06657 273.4427 276.4396 276.4396 273.4427
7  -1   1 -1  1 24.92972 243.4400 241.6507 241.6507 243.4400
8   0   1  0  1 26.79967 263.6227 265.1004 265.1004 263.6227
9   1   1  1  1 28.66962 283.8054 288.5501 288.5501 283.8054

As expected Yscaled produces not the result we need since the original scaling is applied. In the case when we scale data before lm (Ycorrect) and when we supply alternative unscaled values (Yorigsc) results coincide and are the ones needed.

Now the other prediction methods give different results. This happens since R is forced to forget the original scaling using formula

out ~ I(scale(xxA))*I(scale(xxB))

or package rms. But when we use predict, the values are still scaled, but now according to supplied values of xxA and xxB. This is best illustrated by following statement, which in some way mimics what predict does with the data:

> eval(expression(cbind(scale(xxA),scale(xxB))),env=as.list(newdata))
           [,1]      [,2]
 [1,] -1.154701 -1.154701
 [2,]  0.000000 -1.154701
 [3,]  1.154701 -1.154701
 [4,] -1.154701  0.000000
 [5,]  0.000000  0.000000
 [6,]  1.154701  0.000000
 [7,] -1.154701  1.154701
 [8,]  0.000000  1.154701
 [9,]  1.154701  1.154701

We can see that in this case, scaling does not change original values too much, but this is even worse, since the values from predict look reasonable, when in fact they are wrong.

Solved – Visualizing SVM results

Usually a dimension reduction technique is employed to visualize fit on many variables.

Usually again SVD is used to reduce dimensions and keep 2 components, and visualize.

Here's how it might look like - enter image description here

Note that the x and y axes are the top 2 components of the SVD decomposition.

I haven't used R much lately, so I used python for creating the picture above.

from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# To visualize the actual data in top 2 dimensions
iris=load_iris()
x,y=iris.data,iris.target

model=SVC().fit(x,y)
predicted=model.predict(x)

svd=TruncatedSVD().fit_transform(x)

from matplotlib import pyplot as plt
plt.figure(figsize=(16,6))
plt.subplot(1,2,0)
plt.title('Actual data, with errors highlighted')
colors=['r','g','b']
for t in [0,1,2]:
    plt.plot(svd[y==t][:,0],svd[y==t][:,1],colors[t]+'+')

errX,errY=svd[predicted!=y],y[predicted!=y]
for t in [0,1,2]:
    plt.plot(errX[errY==t][:,0],errX[errY==t][:,1],colors[t]+'o')


# To visualize the SVM classifier across
import numpy as np
density=15
domain=[np.linspace(min(x[:,i]),max(x[:,i]),num=density*4 if i==2 else density) for i in range(4)]

from itertools import product
allxs=list(product(*domain))
allys=model.predict(allxs)

allxs_svd=TruncatedSVD().fit_transform(allxs)

plt.subplot(1,2,1)
plt.title('Prediction space reduced to top two SVD\'s')
plt.ylim(-3,3)
for t in [0,1,2]:
    plt.scatter(allxs_svd[allys==t][:,0],allxs_svd[allys==t][:,1],color=colors[t],alpha=0.2/density,edgecolor='None')

Best Answer

Related Solutions

Solved – How to one use the predict function on a lm object where the IVs have been dynamically scaled

Solved – Visualizing SVM results

Related Question