Solved – Do we have to scale new unseen feature data for prediction

caret, machine learning, multidimensional scaling, r

In machine learning, most algorithms require some kind of feature scaling to reduce error. This is my code:

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
head(iris)
X=scale(iris[,-5])
X=data.frame(X)
head(X)
y=iris[,5]
y=data.frame(y)
head(y)
X=cbind(X,y)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=5, repeats=1)
# train the model
model <- train(y~., data=X, method="svmLinear2", trControl=control, tuneLength=5)
# summarize the model
print(model)
#saving model
save(model, file="model.Rdata")

#loading model
supmod<-load("model.Rdata")

#new data
# Sepal.Length Sepal.Width Petal.Length Petal.Width 
# 4.2             3.2          1.7         0.23  
new<-c(4.2,3.2,1.7,0.23)
pre<-predict(supmod,new)
# don't know how to predict with this model on unseen data

In the above code I have two questions: one related to scaling the new data, and the other related to a coding error when passing the new data to the loaded model.

The raw (unscaled) iris feature data looks like this:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

But before passing the data to the SVM algorithm we have to scale it. I use scale() for this, and the result looks like this:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8976739  1.01560199    -1.335752   -1.311052
2   -1.1392005 -0.13153881    -1.335752   -1.311052
3   -1.3807271  0.32731751    -1.392399   -1.311052
4   -1.5014904  0.09788935    -1.279104   -1.311052
5   -1.0184372  1.24503015    -1.335752   -1.311052
6   -0.5353840  1.93331463    -1.165809   -1.048667

It is this scaled data that we use for training and testing the model. Let's say I have successfully trained the model and want to use it to predict new unseen data, e.g. this one row:

Sepal.Length Sepal.Width Petal.Length Petal.Width 
 4.2             3.2          1.7         0.23  
  1. Do I need to scale this new data, or can I pass it to my model directly?
  2. The next question is related to a coding error.

predict(supmod,new) returns this error:

Error in UseMethod("predict") : no applicable method for 'predict'
applied to an object of class "character"

Best Answer

1) You should scale the new data as well. You can scale all the data, training and new data together, if possible. Otherwise, store the scaling parameters and apply them to the new data later. If you have data d that is normally distributed with, let's say, mean = m and sd = s, you scale it by (d - m)/s. Apply the same transformation to the new data, using the same mean and sd computed from the training data.
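For example, here is a minimal sketch of the second option in R, reusing the column means and standard deviations that scale() attaches to its result as attributes (the new observation values are the ones from the question):

# reuse the training means/SDs that scale() stores as attributes
X_scaled <- scale(iris[,-5])
train_center <- attr(X_scaled, "scaled:center")   # per-column means
train_scale <- attr(X_scaled, "scaled:scale")     # per-column standard deviations

# new observation with the same feature columns as the training data
new <- data.frame(Sepal.Length=4.2, Sepal.Width=3.2, Petal.Length=1.7, Petal.Width=0.23)

# apply the *training* parameters, not statistics computed from the new row
new_scaled <- as.data.frame(scale(new, center=train_center, scale=train_scale))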

2) You can't assign the object you load directly:

#loading model
supmod<-load("model.Rdata")

The resulting variable only contains the string "model", i.e. the name of the loaded object.

Try this:

load("model.Rdata")

This loads the model; the variable is named "model".
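To illustrate (a small sketch reusing the file name from the question), load() only returns the names of the restored objects as a character vector, while the objects themselves are placed in the workspace under their original names:

loaded_names <- load("model.Rdata")
print(loaded_names)   # "model" -- just a character vector of names
class(model)          # the actual fitted object restored by load()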

3) Further, you have to pass a data.frame (with the same column names as the training data) to predict:

new <- data.frame(Sepal.Length=4.2, Sepal.Width=3.2, Petal.Length=1.7, Petal.Width=0.23)

pre<-predict(model,new)
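Putting the pieces together (a sketch assuming the model was trained on the scale()d features, as in the question's code, and using the train_center and train_scale vectors from the sketch above), the new row should be scaled with the training parameters before it is passed to predict():

load("model.Rdata")   # restores the fitted object under the name "model"
new <- data.frame(Sepal.Length=4.2, Sepal.Width=3.2, Petal.Length=1.7, Petal.Width=0.23)
new_scaled <- as.data.frame(scale(new, center=train_center, scale=train_scale))
pre <- predict(model, new_scaled)
pre   # predicted Species for the new observation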