Solved – Feature naming conventions

feature-engineering, notation, software

I'm curious to know what others tend to see as a suitable naming convention for model features or variables, particularly as they relate to their use and reference in software applications.

For instance, given two inputs: age and income, we could construct features around various transformations and discretizations of their individual raw values, as well as capture their interaction in a number of ways.

Realizing that we want to make these names concise yet descriptive, would the following seem reasonable? Are they too verbose?

  • gt_100k_income
  • is_missing_income
  • lg10_income
  • ge_20_lt_25_age
  • zscale_age
  • ratio_ln_income_ln_age

Is it worth trying to (explicitly or implicitly) denote the return type of a feature value? How about naming a feature that is derived from, say, five or more other features?

Best Answer

With a lot of variables, at some point you are going to want to figure out what's what, and this will be easier if the names are meaningful when put in alphabetical order. You are less likely to group them by whether they are logged or not than by whether they are in the same "family". So, I'd rearrange your example like this:

Original                     Rearranged
gt_100k_income               income_gt_100k
is_missing_income            income_missing
lg10_income                  income_lg10
ge_20_lt_25_age              age_ge_20_lt_25
zscale_age                   age_zscale
ratio_ln_income_ln_age       income_ln_over_age_ln
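
To make the alphabetical-order argument concrete, here is a small sketch that sorts both sets of names and groups the rearranged ones by family. The `family` helper is hypothetical and assumes that the family is simply the first underscore-separated token of a name.

```python
from itertools import groupby

# Names from the question (transformation first).
prefix_style = [
    "gt_100k_income",
    "is_missing_income",
    "lg10_income",
    "ge_20_lt_25_age",
    "zscale_age",
    "ratio_ln_income_ln_age",
]

# The rearranged names (family first).
family_first = [
    "income_gt_100k",
    "income_missing",
    "income_lg10",
    "age_ge_20_lt_25",
    "age_zscale",
    "income_ln_over_age_ln",
]


def family(name):
    # Assumption: the family is the first underscore-separated token.
    return name.split("_", 1)[0]


# Sorting the prefix-style names interleaves age and income features.
sorted_prefix = sorted(prefix_style)

# Sorting the family-first names keeps each family contiguous.
sorted_family = sorted(family_first)

# groupby only merges adjacent items, so it relies on the sorted order.
grouped = {key: list(names) for key, names in groupby(sorted_family, key=family)}

for key, names in grouped.items():
    print(key, names)
```

With the prefix style, a plain sort puts `ge_20_lt_25_age` next to `gt_100k_income`. With the family-first style, the same sort lists every age feature before every income feature.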

I realize that this is exactly the opposite of what some software does automatically (such as Excel pivot tables or Alteryx Summaries), but Bill Gates isn't right all the time.

It's probably more important to apply your method consistently than which particular method you choose.
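
One way to stay consistent is to build every name through a single helper instead of typing names by hand. The sketch below is only an illustration: `feature_name` is a hypothetical function, and the lowercase, underscore-separated, family-first format is an assumption taken from the rearranged names above.

```python
import re


def feature_name(family, *parts):
    """Build a family-first feature name from a base variable and its tokens.

    Hypothetical helper: lowercases each token, replaces runs of characters
    other than a-z and 0-9 with an underscore, and joins with underscores.
    """
    tokens = [family, *parts]
    cleaned = [
        re.sub(r"[^0-9a-z]+", "_", str(token).lower()).strip("_")
        for token in tokens
    ]
    # Drop tokens that end up empty after cleaning.
    return "_".join(token for token in cleaned if token)


print(feature_name("income", "gt", "100k"))
print(feature_name("Age", "ge", 20, "lt", 25))
print(feature_name("income", "ln", "over", "age", "ln"))
```

Because every name goes through the same function, changing the convention later means editing one place rather than renaming features by hand.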
