Solved – Dumthe coding categorical variables with lots of unique values using log2

categorical datacategorical-encodinglogarithmmany-categories

I'm trying to understand the logic behind this binary encoder.

It automatically takes categorical variables and dummy codes them (similar to one-hot-encoding on sklearn), but reduces the number of output columns equal to the log2 of the length of unique values.

Basically, when i used this library, i noticed that my dummy variables are limited to only a few of the unique values. Upon further investigation i noticed this @staticmethod, which take the log2 of the len of unique values in a categorical variable.

My question is WHY!? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this? How does taking the log2 determine how many digits are needed to represent the data?

 def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int( np.ceil(np.log2(len(X[col].unique()))) )

Full source code:

"""Binary encoding"""

import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input

__author__ = 'willmcginnis'


[docs]class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------

    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories.  This can causes
        unexpected changes in dimension in some cases.

    Example
    -------
    >>>from category_encoders import *
    >>>import pandas as pd
    >>>from sklearn.datasets import load_boston
    >>>bunch = load_boston()
    >>>y = bunch.target
    >>>X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>>enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>>numeric_dataset = enc.transform(X)
    >>>print(numeric_dataset.info())

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None

    """
    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}

[docs]    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------

        self : encoder
            Returns self.

        """

        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]



        return self


[docs]    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]

        Returns
        -------

        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.

        """

        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values


[docs]    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """

        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X


[docs]    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int( np.ceil(np.log2(len(X[col].unique()))) )




[docs]    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """

        if col is None or float(col) < 0.0:
            return None
        else:

            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col

Best Answer

It's spelled out in the docstring:

Binary encoding for categorical variables, similar to onehot, 
but stores categories as binary bitstrings.

A bitstring is a binary integer, so all digits are zeros and ones. The base two logarithm, rounded up, of an integer is the number of digits in its base two representation.

n = 3, binary = 11, log2 = 1.58
n = 7, binary = 111, log2 = 2.81
n = 11, binary = 1011, log2 = 3.46
n = 55, binary = 110111, log2 = 5.78
n = 78, binary = 1001110, log2 = 6.29

So the code is calculating the base two logarithm to determine how many bits it needs to use to encode all of the categories.

I have a variable called "City" that contains 87 unique values. If I dummy code them, I would then increase the number of columns by 87, so that each column represented whether that row included the city (1) or not(0). Using this encoder, I only get 7 columns, how do I know which cities are represented?

They are all represented, not just seven.

In a standard encoding, you would create 87 new columns. For each city, exactly one column would be one, and all 86 other columns would be zero. In this encoding, each city is represented by a unique combination of zeros and ones in the seven new columns.

Related Solutions

Categorical Encoding – Coding Categorical Variables for Regression

Here is an example using the employee data.sav data, which comes with standard installation. Suppose salary is the dependent variable, job category, jobcat, is the categorical independent variable, and beginning salary, salbegin, is the continuous independent variable. Using GLM, you can perform pairwise comparisons between each pair of job categories. The steps are as follow:

With the data set open, go to Analyze > General Linear Model > Univariate.
Put the dependent variable and independent variable into the correct slots. Categorical independent variables go to "Fixed Factor(s)" and continuous ones go to "Covariate(s)." Do not worry about the Random Factors. When it's all set, click the "Model" button.
In the Model panel, highlight the two independent variables, then change the build term to "Main effects," and then click the arrow button (indicated by the red circle) to bring the two variables over. When all set, click "Continue."
Now, click the "Option" button.
In the Option panel, do the followings: 1) Highlight jobcat, 2) bring it over to the right by clicking the arrow button, 3) Check "Compare Main Effects", 4) Specify the adjustment you'd like to make for the multiple pairwise comparisons. I left it as LSD which does not adjust for multiple tests, 5) Check "Parameter Estimates" so that you'll also get the regression coefficients. When it's all done, click Continue and then OK to submit the test.
Here is the regression coefficient table:
Scroll down a bit and you'll find the pairwise comparisons table:

Solved – Estimation process in OLS with categorical variables and dumthe coding

Just to answer one part of your question:

Now I am having difficulties understanding how this is done with categorical variables. I have read about dummy coding, and that a categorical variable with k levels is divided into k−1 dummy variables and so on, but I fail to see how this is actually implemented with regard to the actual OLS estimation (formulas above). How would the matrix of coefficients above look, if we are dealing with a categorical variable and dummy coding?

Hopefully this block of code will help. Look at the X matrix:

set.seed(123987)
n    <- 6
df   <- data.frame(x=runif(n), categorical=factor(letters[1:3]))
df$y <- rnorm(n) + df$x + ifelse(df$categorical == "a", 0,
                                 ifelse(df$categorical == "b", 2, 10))
fit  <- lm(y ~ x + categorical, data=df)
fit$coefficients  # Around -0.1, 2.5, 1.1 and 10.3

X      <- matrix(1, nrow=n, ncol=length(fit$coefficients))
X[, 2] <- df$x
X[, 3] <- 1*(df$categorical == "b")
X[, 4] <- 1*(df$categorical == "c")
colnames(X) <- c("constant", "x", "indicator for b", "indicator for c")  # Aka dummies
Y <- matrix(df$y, ncol=1)

beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
max(abs(beta_hat - fit$coefficients))          # Very small -- essentially equal
isTRUE(all.equal(beta_hat, fit$coefficients))  # ...well, not equal enough for all.equal

The matrix X has one column of 1s (the constant); a column of df$x (a continuous predictor); a column that is 1 when the categorical variable equals "b", and zero otherwise; and similarly for "c". The value "a" is omitted, since we have a constant.

Edit: spacing in the code block is messed up for some reason, not sure why.

Best Answer

Related Solutions

Categorical Encoding – Coding Categorical Variables for Regression

Solved – Estimation process in OLS with categorical variables and dumthe coding

Related Question