Solved – Cleaning data of inconsistent format in R

data preprocessingr

I often deal with messy survey data which requires a lot of cleaning up before any statistics can be done. I used to do this "manually" in Excel, sometimes using Excel formulas, and sometimes checking entries one-by-one. I started doing more and more of these tasks by writing scripts to do them in R, which has been very beneficial (benefits include having a record of what was done, less chance of mistakes, and being able to reuse code if the data set is updated).

But there are still some types of data that I have trouble handling efficiently. For example:

> d <- data.frame(subject = c(1,2,3,4,5,6,7,8,9,10,11),
+   hours.per.day = c("1", "2 hours", "2 hr", "2hr", "3 hrs", "1-2", "15 min", "30 mins", "a few hours", "1 hr 30 min", "1 hr/week"))
> d
   subject hours.per.day
1        1             1
2        2       2 hours
3        3          2 hr
4        4           2hr
5        5         3 hrs
6        6           1-2
7        7        15 min
8        8       30 mins
9        9   a few hours
10      10   1 hr 30 min
11      11     1 hr/week

hours.per.day is meant to be the average number of hours per day spent on a certain activity, but what we have is exactly what the subject wrote. Suppose I make some decisions on what to do with ambiguous responses, and I want the tidied variable hours.per.day2 as follows.

   subject hours.per.day hours.per.day2
1        1             1      1.0000000
2        2       2 hours      2.0000000
3        3          2 hr      2.0000000
4        4           2hr      2.0000000
5        5         3 hrs      3.0000000
6        6           1-2      1.5000000
7        7        15 min      0.2500000
8        8       30 mins      0.5000000
9        9   a few hours      3.0000000
10      10   1 hr 30 min      1.5000000
11      11     1 hr/week      0.1428571

Assuming that the number of cases is quite large (say 1000) and knowing that the subjects were free to write anything they liked, what is the best way to approach this?

Best Answer

I would use gsub() to identify the strings that I know and then perhaps do the rest by hand.

test <- c("15min", "15 min", "Maybe a few hours", 
          "4hr", "4hour", "3.5hr", "3-10", "3-10")
new_var <- rep(NA, length(test))

my_sub <- function(regex, new_var, test){
    t2 <- gsub(regex, "\\1", test)
    identified_vars <- which(test != t2)
    new_var[identified_vars] <- as.double(t2[identified_vars])
    return(new_var)    
}

new_var <- my_sub("([0-9]+)[ ]*min", new_var, test)
new_var <- my_sub("([0-9]+)[ ]*(hour|hr)[s]{0,1}", new_var, test)

To get work with the ones that you need to change by hand I suggest something like this:

# Which have we not found
by.hand <- which(is.na(new_var))

# View the unique ones not found
unique(test[by.hand])
# Create a list with the ones
my_interpretation <- list("3-10"= 5, "Maybe a few hours"=3)
for(key_string in names(my_interpretation)){
    new_var[test == key_string] <- unlist(my_interpretation[key_string])
}

This gives:

> new_var
[1] 15.0 15.0  3.0  4.0  4.0  3.5  5.0  5.0

Regex can be a little tricky, every time I'm doing anything with regex I run a few simple tests. Se ?regex for the manual. Here's some basic behavior:

> # Test some regex
> grep("[0-9]", "12")
[1] 1
> grep("[0-9]", "12a")
[1] 1
> grep("[0-9]$", "12a")
integer(0)
> grep("^[0-9]$", "12a")
integer(0)
> grep("^[0-9][0-9]", "12a")
[1] 1
> grep("^[0-9]{1,2}", "12a")
[1] 1
> grep("^[0-9]*", "a")
[1] 1
> grep("^[0-9]+", "a")
integer(0)
> grep("^[0-9]+", "12222a")
[1] 1
> grep("^(yes|no)$", "yes")
[1] 1
> grep("^(yes|no)$", "no")
[1] 1
> grep("^(yes|no)$", "(yes|no)")
integer(0)
> # Test some gsub, the \\1 matches default or the found text within the ()
> gsub("^(yes|maybe) and no$", "\\1", "yes and no")
[1] "yes"
Related Question