Solved – Error: too many ties in knn in R

classification, k nearest neighbour, machine learning, r

I am trying to use the KNN algorithm from the class package in R.

I have used it before on the same dataset without normalizing one of the features, but it performed poorly (0.35 precision). Now I have tried to train the model with normalized features, but I get the error "too many ties in knn".

I am trying to predict user ratings of movies (using the MovieLens dataset).

I want to predict the rating, which can be 1, 2, 3, 4, or 5. I tried different values of k that are not multiples of 5. I even tried k = 1, but I still get the same error.

The data consists mostly of binary attributes (19 movie genres and the user's gender) with only one numeric attribute (user age), and I suspect that is the problem.
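One way to probe that suspicion is to count how many training rows lie at exactly the same distance from a query point. The sketch below is plain Python rather than R, and the rows are made up (two binary flags plus an age scaled to [0, 1], all hypothetical). It only illustrates that rows sharing the same binary pattern and age are equidistant from any query, which is the kind of situation the "too many ties" message points to.

```python
from collections import Counter

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def count_nearest_ties(train_rows, query):
    """Return how many training rows share the smallest distance to query."""
    distances = [squared_distance(row, query) for row in train_rows]
    nearest = min(distances)
    return Counter(distances)[nearest]

# Hypothetical rows: (is_comedy, is_female, age scaled to [0, 1])
train_rows = [
    (1, 0, 0.25),
    (1, 0, 0.25),
    (1, 0, 0.25),
    (0, 1, 0.50),
]
query = (1, 0, 0.5)
ties = count_nearest_ties(train_rows, query)  # three identical rows are tied
```

With real data the same count can be taken per test row; a large number of rows tied at the nearest distance would support the idea that the binary columns are responsible.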

Best Answer

Please combine the changes below with the additional columns in your dataset (age, gender, etc.), which are missing from the version I found.

library(class)  # provides knn()
library(caret)
unzip("ml-100k.zip")
setwd("ml-100k")
movies        <- read.csv("u.data", sep = "\t")
names(movies) <- c("user id", "movie id", "rating", "ts")
movies$"user id" <- NULL

idx      <- rbinom(99999, 2, .6)
training <- movies[idx,]
testing  <- movies[-idx,]
x        <- training
y        <- training$rating
x1       <- testing
y1       <- testing$rating

# Too many ties
knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = FALSE, use.all = TRUE)

# Still no joy
knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = FALSE, use.all = FALSE)

# This works ####
movies        <- read.csv("u.data", sep = "\t")
names(movies) <- c("user id", "movie id", "rating", "ts")
movies$"user id" <- NULL

# Fix timestamp
movies$ts  <- as.POSIXct(movies$ts, origin = "1970-01-01") 
movies$new <- 0
movies$new[movies$ts > mean(movies$ts)] <- 1
movies$ts  <- NULL

# Group movie IDs into 20 bins
movies$movie      <- cut(movies$"movie id", 20, labels = 1:20)
movies$"movie id" <- NULL

# The renaming below was part of an experiment with recoding the outcome
#        but it's not important here
movies$y      <- movies$rating
movies$rating <- NULL


#movies$y <- as.data.frame(scale(movies$y))

idx      <- rbinom(nrow(movies), 2, .5)
training <- movies[idx,]
testing  <- movies[-idx,]
x        <- training
y        <- training$y
x1       <- testing
y1       <- testing$y

knn(train = x, test = testing, cl = y, k = 1, l = 0, prob = TRUE, use.all = FALSE)
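One thing worth checking when adapting this code: rbinom(n, 2, p) returns only the values 0, 1 and 2, so using it as a row index keeps selecting rows 1 and 2 instead of drawing a random subset. In R, sample(nrow(movies), size) draws distinct row numbers. The sketch below shows that kind of split in plain Python (not R); the function name, the 60/40 fraction and the seed are illustrative assumptions, not part of the answer.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Split rows into disjoint training and test lists without duplicates."""
    rng = random.Random(seed)
    n_train = round(len(rows) * train_fraction)
    # Draw distinct positions so no row is selected twice
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test

rows = list(range(10))  # stand-in for the rows of the ratings table
train, test = train_test_split(rows)
```

Every row ends up in exactly one of the two lists, so the training set contains no repeated copies of the same row.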

Related Solutions

Solved – Predicting with Restricted Boltzmann Machines for Collaborative Filtering

The inputs for missing movies are all zero.

In a vanilla RBM, once you go to the hidden layer and then come back to the visible layer, you get reconstructions for all movies, not just the ones the current user has interacted with. During training it is important to ignore those reconstructions so that they don't affect the weight matrix and the visible-layer bias in the update step. In this context, "ignore" means setting the value to 0. If you don't set them to 0, you will apply corrections to more weights than you should.

On the other hand, to get predictions, you need to use the full matrix to obtain all reconstructions.
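A minimal sketch of that masking step, in plain Python with hypothetical names: each movie has five softmax visible units, and a per-movie flag records whether the user rated it. It illustrates zeroing the reconstructions of unrated movies and is not a full RBM training loop.

```python
def mask_reconstruction(reconstruction, rated):
    """Zero the reconstructed units of movies the user has not rated.

    reconstruction: one list of 5 probabilities per movie
    rated: one boolean per movie (True if the user rated it)
    """
    return [
        list(units) if is_rated else [0.0] * len(units)
        for units, is_rated in zip(reconstruction, rated)
    ]

reconstruction = [
    [0.1, 0.1, 0.2, 0.3, 0.3],  # movie the user rated
    [0.2, 0.2, 0.2, 0.2, 0.2],  # movie the user did not rate
]
masked = mask_reconstruction(reconstruction, rated=[True, False])
```

Since both the input and the masked reconstruction are zero for an unrated movie, the difference between them is zero as well, so those units should contribute nothing to the update. For prediction, the unmasked reconstruction is used instead.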

As you pointed out, you calculate the activations for all visible units (using the softmax function), pick the five units that belong to a specific item, and then apply the last formula, which gives the expected rating for movie $i$:

$p_i = \sum_{k} p(v_{i}^{k} = 1 | h) k$
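The formula can be sketched in a few lines of plain Python. The activations below are hypothetical and the function names are illustrative; the five values stand for the five rating units of one movie.

```python
import math

def softmax(scores):
    """Convert raw activations into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_rating(probabilities):
    """Expected rating: sum over k of k * p(v_i^k = 1 | h), with k = 1..K."""
    return sum(k * p for k, p in enumerate(probabilities, start=1))

activations = [0.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical activations for one movie
prediction = expected_rating(softmax(activations))  # uniform case, close to 3.0
```

Equal activations give equal probabilities, so the expected rating falls in the middle of the 1 to 5 scale.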

If the ratings in your dataset are natural numbers, you can simply round() the predicted values.

Solved – Why is this nearest neighbors algorithm classifier implementation giving low accuracy

You should probably try to reduce the number of variables to a sensible set before trying to classify using nearest neighbors. Otherwise you'll fall victim to the curse of dimensionality, which is referenced in the Wikipedia article on $k$-nearest neighbors. You might also consider some sort of scaling of the variables so that no particular attribute has an undue influence on your classifications.
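As one possible form of scaling, the sketch below rescales every column to the range [0, 1] in plain Python. Min-max scaling is only one option among several, and the rows (an age and a binary flag) are hypothetical.

```python
def min_max_scale(rows):
    """Rescale every column of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

rows = [[18, 0], [40, 1], [62, 1]]  # hypothetical (age, binary flag) rows
scaled = min_max_scale(rows)
```

After scaling, the age column spans the same range as the binary flag, so it no longer dominates the distance computation.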

Your Python code could also be simplified quite a bit. Instead of defining these functions you could use the inner product function from numpy:

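import math
import numpy as np

# example vectors (placeholders for your own data)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# inner product
inner = np.dot(a, b)

# cosine similarity
cosine = np.dot(a, b) / math.sqrt(np.dot(a, a) * np.dot(b, b))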
