Solved – Feature naming conventions – Math Solves Everything

I'm curious to know what others tend to see as a suitable naming convention for model features or variables, particularly as they relate to their use and reference in software applications.

For instance, given two inputs: age and income, we could construct features around various transformations and discretizations of their individual raw values, as well as capture their interaction in a number of ways.

Realizing that we want to make these names concise yet descriptive, would the following seem reasonable? Are they too verbose?

gt_100k_income
is_missing_income
lg10_income
ge_20_lt_25_age
zscale_age
ratio_ln_income_ln_age
…
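For concreteness, here is a minimal sketch of how a few of these features might be computed for a single record. The function name `build_features`, the use of `None` for a missing income, and the sample values are assumptions for illustration only. `zscale_age` is left out because it needs population statistics.

```python
import math

def build_features(age, income):
    """Compute a few of the example features for one record.

    Assumptions: `income` is None when missing, otherwise > 0;
    `age` is > 1 so that ln(age) is nonzero.
    """
    features = {
        "is_missing_income": income is None,
        "ge_20_lt_25_age": 20 <= age < 25,
    }
    if income is not None:
        features["gt_100k_income"] = income > 100_000
        features["lg10_income"] = math.log10(income)
        features["ratio_ln_income_ln_age"] = math.log(income) / math.log(age)
    return features

# Hypothetical sample record.
row = build_features(age=22, income=150_000)
```

Note that the names already hint at the value type: `is_` and `gt_`/`ge_` features are boolean, while `lg10_` and `ratio_` features are floats.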

Is it worth trying to (explicitly or implicitly) denote the return type of a feature value? How about naming a feature that is derived from, say, five or more other features?

Best Answer

With a lot of variables, at some point you are going to want to figure out what's what, and this will be easier if the names are meaningful when put in alphabetical order. You are less likely to group them by whether they are logged or not than by whether they are in the same "family". So, I'd rearrange your example like this:

Original                  Rearranged
gt_100k_income            income_gt_100k
is_missing_income         income_missing
lg10_income               income_lg10
ge_20_lt_25_age           age_ge_20_lt_25
zscale_age                age_zscale
ratio_ln_income_ln_age    income_ln_over_age_ln

I realize that this is exactly the opposite of what some software does automatically (such as Excel pivot tables or Alteryx Summaries), but Bill Gates isn't right all the time.

It's probably more important to be consistent with your method than which particular method you pick.
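To illustrate the alphabetical-order point, the sketch below sorts the same features under both naming styles. The `KNOWN_FAMILIES` tuple and the `family_of` helper are hypothetical, added only to show one way a convention could be checked for consistency.

```python
# Names from the question (transformation first) and the rearranged
# versions (family first).
prefix_style = ["gt_100k_income", "is_missing_income", "lg10_income",
                "ge_20_lt_25_age", "zscale_age"]
family_first = ["income_gt_100k", "income_missing", "income_lg10",
                "age_ge_20_lt_25", "age_zscale"]

# Hypothetical list of raw inputs that define the feature "families".
KNOWN_FAMILIES = ("age", "income")

def family_of(name):
    """Return the leading family token, or None if the name does not start with one."""
    prefix = name.split("_", 1)[0]
    return prefix if prefix in KNOWN_FAMILIES else None

sorted_prefix = sorted(prefix_style)   # age and income features interleave
sorted_family = sorted(family_first)   # features cluster by family

for name in sorted_family:
    print(family_of(name), name)
```

With the family-first names, a plain sort puts all `age_` features together, followed by all `income_` features. A check like `family_of(name) is None` is one simple way to flag names that break the convention.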
