Solved – Categorical response variable prediction

anova, categorical data, logistic, multinomial-distribution, r

I have the following kind of data (coded in R):

v.a = c('cat', 'dog', 'dog', 'goat', 'cat', 'goat', 'dog', 'dog')
v.b = c(1, 2, 1, 2, 1, 2, 1, 2)
v.c = c('blue', 'red', 'blue', 'red', 'red', 'blue', 'yellow', 'yellow')
set.seed(12)
v.d = rnorm(8)
aov(v.a ~ v.b + v.c + v.d) # Error

I would like to know whether the value of v.b or the value of v.c has any ability to predict the value of v.a. I would run an ANOVA (as shown above), but I do not think that makes sense, since my response variable is not numeric (it is categorical, and not even ordinal). What should I do?

Best Answer

You could use any classifier: linear discriminant analysis, multinomial logit (as Bill pointed out), support vector machines, neural nets, CART, random forests, or C5 trees. There is a whole world of models that can help you predict v.a from v.b and v.c. Here is an example using the R implementation of random forest:

# packages
library(randomForest)

#variables
v.a= c('cat','dog','dog','goat','cat','goat','dog','dog')
v.b= c(1,2,1,2,1,2,1,2)
v.c= c('blue', 'red', 'blue', 'red', 'red', 'blue', 'yellow', 'yellow')

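# model fit
# note that categorical variables must be converted to factors or R won't
# treat them as categorical; the predictors go in a data frame because
# cbind() would coerce the factor to its integer codes
model <- randomForest(y = as.factor(v.a), x = data.frame(v.b = v.b, v.c = as.factor(v.c)), ntree = 10)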

#plot of model accuracy by class
plot(model)


# model confusion matrix
model$confusion

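Judging by the confusion matrix, these variables do not show a strong relation with v.a, although with only eight observations no firm conclusion is possible.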
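
The multinomial logit mentioned above can be sketched in a similar way. This is a minimal illustration, not a definitive analysis: it assumes the nnet package (which ships with standard R installations) and uses its multinom() function. The data frame d and the model names fit0 and fit1 are introduced here for illustration only. Comparing an intercept-only model with one that includes v.b and v.c gives a likelihood-ratio test of whether the predictors jointly help predict v.a. With eight observations the estimates will be unstable, so treat the output as a demonstration of the mechanics.

```r
# Multinomial logit sketch (assumes the nnet package is available)
library(nnet)

# Same data as in the question; categorical variables stored as factors
d <- data.frame(
  v.a = factor(c('cat', 'dog', 'dog', 'goat', 'cat', 'goat', 'dog', 'dog')),
  v.b = c(1, 2, 1, 2, 1, 2, 1, 2),
  v.c = factor(c('blue', 'red', 'blue', 'red', 'red', 'blue', 'yellow', 'yellow'))
)

# Intercept-only model versus a model with both predictors
fit0 <- multinom(v.a ~ 1, data = d, trace = FALSE)
fit1 <- multinom(v.a ~ v.b + v.c, data = d, trace = FALSE)

# Likelihood-ratio test of the two nested models
anova(fit0, fit1)

# In-sample predicted classes against the observed ones
table(observed = d$v.a, predicted = predict(fit1))
```

A small p-value in the anova() table would suggest that v.b and v.c together carry some information about v.a. The in-sample table is optimistic, because the model is evaluated on the same rows it was fitted to.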
