Solved – Error: too many ties in knn in R

classification, k-nearest-neighbour, machine-learning, r

I am trying to use the KNN algorithm from the class package in R.

I have used it before on the same dataset without normalizing one of the features, but it performed poorly (0.35 precision). Now I have tried to train the model with normalized features, but I get the error "too many ties in knn".

I am trying to predict user ratings of movies (using the MovieLens data set).

I want to predict the rating, which can be 1, 2, 3, 4 or 5. I tried different values of k that are not multiples of 5. I even tried k = 1, but I still get the same error.

The data consist mostly of binary attributes (19 movie genres and the user's gender) and only one numeric attribute (user age), and I think that is the problem.
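
To check whether ties are plausible with features like these, one can count how many training rows sit at exactly the same distance from a query point. The sketch below uses simulated data (the row count, column count and probabilities are made-up assumptions, not the MovieLens values). With only binary features, the squared Euclidean distance can take just a handful of distinct values, so very many rows end up tied for nearest neighbour, which is the kind of situation in which `knn` reports too many ties.

```r
set.seed(1)

n <- 20000   # hypothetical number of training rows
p <- 4       # hypothetical number of binary features

# Simulated binary feature matrix (a stand-in for genre/gender flags)
train <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n, ncol = p)

# A single query point
query <- c(0, 1, 0, 1)

# Squared Euclidean distance from the query to every training row
d2 <- rowSums(sweep(train, 2, query)^2)

# With p binary features the distance can only take the values 0..p
tie_counts <- table(d2)
print(tie_counts)

# Number of training rows tied for nearest neighbour
n_tied_nearest <- sum(d2 == min(d2))
print(n_tied_nearest)
```

A continuous feature such as age breaks some of these ties, but rows that share both the same age and the same flags remain tied.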

Best Answer

The version of the data set I found lacks some of the columns you have (age, gender, etc.), so please combine the changes I've made below with that additional data.

library(class)   # provides knn()
library(caret)
unzip("ml-100k.zip")
setwd("ml-100k")
# u.data is tab-separated and has no header row
movies        <- read.csv("u.data", sep = "\t", header = FALSE)
names(movies) <- c("user id", "movie id", "rating", "ts")
movies$"user id" <- NULL

idx      <- rbinom(99999, 2, .6)
training <- movies[idx,]
testing  <- movies[-idx,]
x        <- training
y        <- training$rating
x1       <- testing
y1       <- testing$rating

# Too many ties
knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = FALSE, use.all = TRUE)

# Still no joy
knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = FALSE, use.all = FALSE)

# This works ####
# u.data is tab-separated and has no header row
movies        <- read.csv("u.data", sep = "\t", header = FALSE)
names(movies) <- c("user id", "movie id", "rating", "ts")
movies$"user id" <- NULL

# Replace the timestamp with a 0/1 flag:
# 1 if the rating is newer than the mean timestamp, 0 otherwise
movies$ts  <- as.POSIXct(movies$ts, origin = "1970-01-01")
movies$new <- 0
movies$new[movies$ts > mean(movies$ts)] <- 1
movies$ts  <- NULL

# Group movie IDs into 20 equal-width bins
movies$movie      <- cut(movies$"movie id", 20, labels = 1:20)
movies$"movie id" <- NULL

# The renaming below was part of an experiment with recoding the outcome,
# but it is not important here
movies$y      <- movies$rating
movies$rating <- NULL


#movies$y <- as.data.frame(scale(movies$y))

idx      <- rbinom(nrow(movies), 2, .5)
training <- movies[idx,]
testing  <- movies[-idx,]
x        <- training
y        <- training$y
x1       <- testing
y1       <- testing$y

knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = TRUE, use.all = FALSE)
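
A note on the split used above: `rbinom(n, 2, p)` returns counts of 0, 1 or 2 rather than row numbers, so `movies[idx,]` repeats rows 1 and 2 and `movies[-idx,]` drops only those two rows. A minimal sketch of a random split by row number follows, assuming a 60/40 split and a hypothetical helper named `split_rows`. It also keeps the outcome column out of the feature matrix, which the code above does not do. The toy data frame is made up and only stands in for `movies`.

```r
set.seed(123)

# Toy stand-in for the `movies` data frame (values are made up)
toy <- data.frame(
  new   = rbinom(10, 1, 0.5),
  movie = sample(1:20, 10, replace = TRUE),
  y     = sample(1:5, 10, replace = TRUE)
)

# Hypothetical helper: draw distinct row numbers for the training set
split_rows <- function(n, train_frac = 0.6) {
  sample(seq_len(n), size = floor(train_frac * n))
}

train_idx <- split_rows(nrow(toy))

# Features exclude the outcome column `y`
features <- setdiff(names(toy), "y")
x_train  <- toy[train_idx, features, drop = FALSE]
x_test   <- toy[-train_idx, features, drop = FALSE]
y_train  <- factor(toy$y[train_idx])
y_test   <- toy$y[-train_idx]

# Every row lands in exactly one of the two sets
stopifnot(nrow(x_train) + nrow(x_test) == nrow(toy))
```

These objects could then be passed to `knn(train = x_train, test = x_test, cl = y_train, k = 1)`. Whether the ties error goes away still depends on how many identical rows the chosen features produce.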