I have this data set from https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names, which gives a good summary of the attributes I'm using. Some of the observations have missing values, and I have already recoded the last (target) column from +/- to 0/1. I am not sure how to proceed with KNN from here, given the missing values and some attributes that have 10-15 different categories. I don't think KNN works with missing values, because I keep getting errors saying NAs were introduced by coercion. Should I remove these attributes, as well as trying to impute the missing numerical values?
Solved – How to implement KNN in R with missing values
categorical data, data mining, r
Related Solutions
If your dataset has a time series character you can have a look at this paper comparing methods for univariate time series imputation in R: http://arxiv.org/abs/1510.03924
But I think your title is misleading: data is usually called univariate only if it has a single attribute.
As I understand it, you have 52 attributes (1 numeric, 51 binary), so you do not need special algorithms for univariate imputation. The MICE package should be fine for this task, even with 40% missing data. Perhaps you can post your MICE code, so that we can see what is going wrong.
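For comparison, here is a minimal sketch of what a basic MICE call might look like. The data frame below is made up (one numeric column and two binary factors, with about 40% of each column removed at random), and it assumes the mice package is installed; it is not the asker's actual code.

```r
library(mice)  # assumption: the mice package is installed

set.seed(42)
n <- 200

# Hypothetical data: one numeric column and two binary factors
df <- data.frame(
  x  = rnorm(n),
  b1 = factor(rbinom(n, 1, 0.5)),
  b2 = factor(rbinom(n, 1, 0.3))
)

# Remove 40% of each column at random to mimic the missingness
for (col in names(df)) {
  df[sample(n, size = 0.4 * n), col] <- NA
}

# With the default settings, mice uses predictive mean matching for
# numeric columns and logistic regression for two-level factors
imp <- mice(df, m = 5, seed = 42, printFlag = FALSE)

# Extract the first of the five imputed data sets
completed <- complete(imp, 1)
```

In a real analysis you would normally fit the model on all five imputed data sets and pool the results, rather than keep only the first one.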
One possible solution: discretize the "months since last delinquency" variable into categories such as:
- Never delinquent (based on NAs)
- Last delinquent 0 to 12 months ago
- Last delinquent 13 to 34 months ago
- And so on...
Or use some variation of these bins, then include dummy variables for the categories in the regression so that a parameter is estimated for each one. It is worth experimenting with the category definitions to find appropriate cut points.
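As a rough sketch of the binning described above, in base R: the values in months_since and the cut points are invented for illustration, and the variable names are hypothetical.

```r
# Hypothetical values: months since last delinquency, NA = never delinquent
months_since <- c(NA, 3, 12, 13, 30, 34, 50, NA, 81, 0)

# Bin the observed values; these cut points are illustrative only
binned <- cut(
  months_since,
  breaks = c(0, 12, 34, Inf),
  labels = c("0-12 months", "13-34 months", "35+ months"),
  include.lowest = TRUE
)

# Give the NAs their own level instead of imputing them
delinq <- addNA(binned)
levels(delinq)[is.na(levels(delinq))] <- "Never delinquent"

# Use "Never delinquent" as the reference category
delinq <- relevel(delinq, ref = "Never delinquent")

# Dummy variables as a regression would see them:
# an intercept plus one column per non-reference category
dummies <- model.matrix(~ delinq)
```

Passing delinq directly to lm() or glm() builds the same dummy columns automatically; model.matrix() is used here only to make them visible.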
Pros:
- It incorporates all the available information without imputation
- It may have the benefit of capturing different effect sizes of time since delinquency (rather than assuming a linear slope).
Cons:
- It does require more parameters to be estimated in your model
- Coefficients for dummy variables won't have the standard linear coefficient interpretation. But they're still relatively straightforward to interpret.
Edit: The NAs here are not missing data in the usual sense (they mean "never delinquent"), so imputation is not justified at a theoretical level. At a more practical level, if you were to impute the NAs, they would be assigned values based on the people whose values are already observed. This would give the "non-delinquents" values of around 0 to 81, meaning the imputed data would essentially say that these people actually are delinquents. This is untrue and will bias your model. Additionally, your model will only be applicable to "delinquents", since it will only have data for "delinquents" in it (imputed or otherwise).
Best Answer
You should convert your categorical variables to a one-hot encoding and then use a custom distance metric on the encoded data. Regarding the NAs: yes, you should impute the missing values. A common approach is to replace each missing value with the mean or median of its column, but you can also design your own imputer based on what you have learned about your data. Using trees for imputation is also common.
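A minimal sketch of that pipeline in R, under several assumptions: the data frame is a made-up stand-in for the credit data, missing categories are filled with the most frequent level (one simple choice, not the only one), and plain Euclidean distance on the scaled one-hot columns stands in for a custom metric. It uses knn() from the class package, which ships with standard R installations.

```r
library(class)  # provides knn()

set.seed(1)
n <- 120

# Hypothetical stand-in for the credit data
df <- data.frame(
  num1   = rnorm(n),
  num2   = runif(n, 0, 100),
  cat1   = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  target = factor(rbinom(n, 1, 0.5))
)
df$num1[sample(n, 10)] <- NA
df$cat1[sample(n, 10)] <- NA

# Median for numeric columns, most frequent level for factors
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
predictors <- as.data.frame(lapply(df[c("num1", "num2", "cat1")], impute_column))

# One-hot encode: dropping the intercept keeps one column per level of cat1
X <- model.matrix(~ . - 1, data = predictors)

# Scale so that no single column dominates the Euclidean distance
X <- scale(X)

# Hold out 30 rows and classify them from the other 90
train_idx <- sample(n, 90)
pred <- knn(
  train = X[train_idx, ],
  test  = X[-train_idx, ],
  cl    = df$target[train_idx],
  k     = 5
)
```

For an honest accuracy estimate, the medians, most frequent levels and scaling parameters should be computed on the training rows only and then applied to the test rows; the sketch computes them on all rows to stay short.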