Solved – R package for combining factor levels for datamining

many-categories, r

Has anyone run across a package or function in R that combines the levels of a factor whose share of all observations falls below some threshold? Specifically, one of the first steps in my data preparation is to collapse sparse factor levels together (say, into a level called 'Other') when they do not make up at least, say, 2% of the total. This is done unsupervised, and only when the objective is to model some activity in marketing (not fraud detection, where those very small occurrences could be extremely important). I am looking for a function that will collapse levels until some threshold proportion is met.

UPDATE:

Thanks to these great suggestions, I wrote a function pretty easily. I did realize, though, that collapsing all levels with proportion < the minimum can still leave the recoded level itself < the minimum, in which case the lowest level with proportion > the minimum has to be added as well. It can likely be made more efficient, but it appears to work. The next enhancement would be to figure out how to capture the "rules" so that the collapse logic can be applied to new data (a validation set or future data).

collapseFactors <- function(tableName, minPercent = 5, fillIn = "RECODED")
{
    minProp <- minPercent / 100

    for (i in seq_len(ncol(tableName)))
    {
        if (is.factor(tableName[[i]])) # process just factors
        {
            # level proportions, sorted from rarest to most common
            sortedTable <- sort(prop.table(table(tableName[[i]])))
            numberToCollapse <- sum(sortedTable < minProp)

            # if the pooled level would still be < minPercent,
            # also absorb the smallest level that is >= minPercent
            if (numberToCollapse > 0 &&
                sum(sortedTable[seq_len(numberToCollapse)]) < minProp)
            {
                numberToCollapse <- numberToCollapse + 1
            }

            if (numberToCollapse > 1) # if not > 1 then nothing to collapse
            {
                lf <- names(sortedTable)[seq_len(numberToCollapse)]
                levels(tableName[[i]])[levels(tableName[[i]]) %in% lf] <- fillIn
            }
        } # end if a factor
    } # end for loop

    return(tableName)
} # end function
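
The enhancement mentioned in the update, capturing the collapse "rules" so they can be reused on a validation set or future data, could be sketched roughly as below. This is only a minimal illustration under some assumptions: the helper names learnCollapseRules() and applyCollapseRules() are hypothetical (they do not come from any package), the rule stored per factor is simply the set of levels below the threshold in the training data, and the extra "add the next level" step from collapseFactors() is left out for brevity.

```r
# Hypothetical helpers; the names are illustrative, not from any package.

# Learn, for each factor column, which levels fall below minPercent
# in the training data. Returns a named list of character vectors.
learnCollapseRules <- function(train, minPercent = 5) {
  isFactor <- vapply(train, is.factor, logical(1))
  lapply(train[isFactor], function(f) {
    props <- prop.table(table(f))
    names(props)[as.vector(props) < minPercent / 100]
  })
}

# Apply previously learned rules to a data frame with the same columns.
# Levels not listed in the rules are left untouched.
applyCollapseRules <- function(newData, rules, fillIn = "RECODED") {
  for (col in names(rules)) {
    f <- newData[[col]]
    if (is.factor(f)) {
      levels(f)[levels(f) %in% rules[[col]]] <- fillIn
      newData[[col]] <- f
    }
  }
  newData
}

# Toy data: levels c, d and e each make up 1% of the training set.
train <- data.frame(g = factor(c(rep("a", 60), rep("b", 37), "c", "d", "e")))
valid <- data.frame(g = factor(c("a", "b", "c", "e")))

rules <- learnCollapseRules(train, minPercent = 5)
validCollapsed <- applyCollapseRules(valid, rules)
levels(validCollapsed$g)
```

Because the rules are computed once on the training data, the validation set is recoded with exactly the same level mapping, regardless of how frequent each level happens to be in the new data. Levels that appear only in the new data are not handled here and would need a separate decision.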

Best Answer

It seems it's just a matter of "releveling" the factor; no need to compute partial sums or make a copy of the original vector. E.g.,

set.seed(101)
a <- factor(LETTERS[sample(5, 150, replace=TRUE, 
                           prob=c(.1, .15, rep(.75/3,3)))])
p <- 1/5
lf <- names(which(prop.table(table(a)) < p))
levels(a)[levels(a) %in% lf] <- "Other"

Here, the original factor levels are distributed as follows:

 A  B  C  D  E 
18 23 35 36 38 

and then it becomes

Other     C     D     E 
   41    35    36    38 

This can be conveniently wrapped into a function. There is also a combine_factor() function in the reshape package, which might be useful here.
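
Wrapping the releveling step into a function might look something like the following sketch. The function name collapse_rare_levels() and its arguments p and other are made up for illustration; the body is just the three lines from the example above applied to an arbitrary factor.

```r
# Hypothetical wrapper; the name and arguments are illustrative only.
# f:     a factor
# p:     minimum proportion a level needs in order to be kept
# other: label given to the pooled rare levels
collapse_rare_levels <- function(f, p = 0.05, other = "Other") {
  stopifnot(is.factor(f))
  props <- prop.table(table(f))
  lf <- names(props)[as.vector(props) < p]
  levels(f)[levels(f) %in% lf] <- other
  f
}

# Toy factor: A and B make up 2% and 3% of the observations.
x <- factor(rep(c("A", "B", "C", "D"), times = c(2, 3, 45, 50)))
y <- collapse_rare_levels(x, p = 0.05)
table(y)
```

Applied to a data frame, the same function could be mapped over the factor columns only, for instance with lapply() on the columns for which is.factor() is TRUE.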

Also, as you seem interested in data mining, you might have a look at the caret package. It has a lot of useful features for data preprocessing, including functions like nearZeroVar() that flag predictors with a very imbalanced distribution of observed values (see the vignette on example data, pre-processing functions, visualizations and other functions, p. 5, for an example of use).