Thinking about the discontinuity introduced in vector spaces by one-hot encoding

categorical-encoding · linear algebra

Consider a case where you have two features: feature 1 (f1) is numerical and can take any real value, while feature 2 (f2) is categorical with 3 unique values. Say we one-hot encode feature 2, generating vectors that look like bit strings of length 3. Combining these with the values of f1 gives the full dataset. An example data point looks like this: (0.56, 0, 1, 0).
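
For concreteness, here is a minimal sketch of that encoding; the `encode` helper and the category labels "a"/"b"/"c" are made up for illustration:

```python
import numpy as np

# Hypothetical setup: f2 has three made-up levels "a", "b", "c".
categories = ["a", "b", "c"]

def encode(f1, f2):
    """Concatenate f1 with a one-hot encoding of f2."""
    one_hot = np.zeros(len(categories))
    one_hot[categories.index(f2)] = 1.0
    return np.concatenate(([f1], one_hot))

print(encode(0.56, "b"))  # [0.56 0.   1.   0.  ]
```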

Now, if we take the example data point and add it to itself, we get (1.12, 0, 2, 0). But this vector can never appear in our dataset, because the value 2 cannot occur in any of the last three positions. This means that no matter how many samples we take and transform (via one-hot encoding), our data always lives in discontinuous pockets of a 4-dimensional real vector space and can never form a vector space by itself, because it violates closure under addition (the sum of two vectors in a vector space must be another vector in the same space). Is this correct?
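
A quick check of this closure failure, continuing the sketch above (the `is_valid_encoding` helper is hypothetical):

```python
import numpy as np

x = np.array([0.56, 0.0, 1.0, 0.0])   # the example point from above
y = x + x
print(y)                               # [1.12 0.   2.   0.  ]

def is_valid_encoding(v):
    """Hypothetical check: the last three coordinates must be one-hot."""
    tail = v[1:]
    return set(tail.tolist()) <= {0.0, 1.0} and tail.sum() == 1.0

print(is_valid_encoding(x))  # True
print(is_valid_encoding(y))  # False: a 2 appears, so closure fails
```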

A couple more questions:

  1. When feeding a model these vectors, are we forcing it to assume that our data is continuous, and that we merely happen to observe a subset with only two values (0, 1) in three of its dimensions?
  2. Another image that comes to mind is sampling each dimension with different rules: dimension one is sampled from a uniform distribution over all the reals, and the other three dimensions are each sampled from a Bernoulli distribution. Does this even make sense?

Best Answer

  1. You are correct that such an embedding of your data fails to be a vector space. But that's generally not a problem. For continuous variables, the sum doesn't necessarily make sense either (the sum of two people's heights may be far beyond any actually observed height, but so what?).

  2. No, the model generally doesn't know that you've one-hot encoded anything. But again, that probably doesn't matter. For a linear model, for example, all that matters is that the indicator can be multiplied by a coefficient and added as the "contribution" of having that value (see the first sketch after this list). Knowing that the model never sees anything between 0 and 1, or beyond those values, tells you something about how the coefficient will be fit; it also tells you how the model would handle an errant input of 2, though in practice that won't happen anyway.

  3. Absolutely! You can take the product of probability spaces without caring too much about which distribution each coordinate follows. One caveat: the three indicator coordinates are not three independent Bernoulli draws. Jointly they form a single categorical draw (exactly one of them is 1), even though each is marginally Bernoulli (see the second sketch below).
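
For point 2, here is a sketch with synthetic data showing one additive coefficient per indicator; the slope of 3 and the category effects 0.0/1.5/-2.0 are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: f1 is real-valued, f2 is one of 3 categories.
n = 200
f1 = rng.normal(size=n)
cat = rng.integers(0, 3, size=n)
one_hot = np.eye(3)[cat]

# Made-up additive effect for each category, plus a slope of 3 on f1.
effects = np.array([0.0, 1.5, -2.0])
y = 3.0 * f1 + effects[cat] + rng.normal(scale=0.1, size=n)

# Fit without an intercept so the three indicator coefficients are
# identifiable (with an intercept they are collinear: the dummy trap).
X = np.column_stack([f1, one_hot])
model = LinearRegression(fit_intercept=False).fit(X, y)

print(model.coef_)  # ≈ [3.0, 0.0, 1.5, -2.0]: one contribution per category
```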
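
And for point 3, a minimal sampling sketch of that product construction, assuming a standard normal for the continuous coordinate (there is no proper uniform distribution over all of ℝ) and a single categorical draw for the indicators:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point():
    # Continuous coordinate: a standard normal stands in here, since a
    # proper uniform distribution over all of R does not exist.
    f1 = rng.normal()
    # Indicator coordinates: one joint categorical draw; marginally each
    # coordinate is Bernoulli, but jointly exactly one of them is 1.
    one_hot = np.eye(3)[rng.integers(0, 3)]
    return np.concatenate(([f1], one_hot))

print(sample_point())  # e.g. [ 0.126  0.  0.  1. ]
```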
