Solved – the best way to Reshape/Restructure Data

data preprocessingexcelr

I am a research assistant for a lab (volunteer). I and a small group have been tasked with the data analysis for a set of data pulled from a large study. Unfortunately the data were gathered with an online app of some sort, and it was not programmed to output the data in the most usable form.

The pictures below illustrates the basic problem. I was told that this is called a "Reshape" or "Restructure".

Question: What is the best process for going from Picture 1 to Picture 2 with a large data set with over 10k entries?

This is a sample, the actual file has over 10k entries. I've color coded to help with clarity

This is what it all needs to be converted into.

Best Answer

As I noted in my comment, there isn't enough detail in the question for a real answer to be formulated. Since you need help even finding the right terms and formulating your question, I can speak briefly in generalities.

The term you are looking for is data cleaning. This is the process of taking raw, poorly formatted (dirty) data and getting it into shape for analyses. Changing and regularizing formats ("two" $\rightarrow 2$) and reorganizing rows and columns are typical data cleaning tasks.

In some sense, data cleaning can be done in any software and can be done with Excel or with R. There will be pros and cons to both choices:

Excel: Excel is almost certainly the most common choice for data cleaning (see R fortunes #59 pdf). It is also considered a poor choice by statisticians. The primary reason is that it is hard to ensure that you have caught everything, or that you have treated everything identically, and there is no record of the changes that you have made, so you can't revisit those changes later. The upside of using Excel is that it will be easier to see what you are doing, and you don't have to know much to make changes. (Statisticians will consider the latter an additional con.)
R: R will require a steep learning curve. If you aren't very familiar with R or programming, things that can be done quite quickly and easily in Excel will be frustrating to attempt in R. On the other hand, if you ever have to do this again, that learning will have been time well spent. In addition, the ability to write and save your code for cleaning the data in R will alleviate the cons listed above. The following are some links that will help you get started with these tasks in R:

You can get a lot of good information on Stack Overflow:
- How does one reorder colums in R?
- R: How can I reorder the rows of a matrix, data.frame or vector acording to another one?
Quick-R is also a valuable resource:
- sorting
Getting numbers into numerical mode:
- Convert written number to number in R
- ?strtoi is a specialized function for converting from hexidicimal, etc., if necessary
Another invaluable source for learning about R is UCLA's stats help website:
- working with factor variables (for your "mostly agree", etc.)
Lastly, you can always find a lot of information with good old Google:
- This search: data cleaning in r, brings up a number of tutorials (none of which I've worked through, FTR).

Update: This is a common issue regarding the structure of your dataset when you have multiple measurements per 'study unit' (in your case, a person). If you have one row for every person, your data are said to be in 'wide' form, but then you will necessarily have multiple columns for your response variable, for example. On the other hand, you can have just one column for your response variable (but have multiple rows per person, as a result), in which case your data are said to be in 'long' form. Moving between these two formats is often called 'reshaping' your data, especially in the R world.

The standard R function for this is ?reshape. There is a guide to using reshape() on UCLA's stats help website.
Many people think reshape is hard to work with. Hadley Wickham has contributed a package called reshape2, which is intended to simplify the process. Hadley's personal website for reshape2 is here, the Quick-R overview is here, and there is a nice-looking tutorial here.
There are very many questions on SO about how to reshape data. Most of them are about going from wide to long, because that is typically what data analysts are faced with. Your question is about going from long to wide, which is much less common, but there are still many threads about that, you can look through them with this search.
If your heart is set on trying to do this with Excel, there is a thread about writing a VBA macro for Excel to replicate the reshape functionality here: melt / rehshape in Excel using VBA?

Related Solutions

Solved – Cleaning data of inconsistent format in R

I would use gsub() to identify the strings that I know and then perhaps do the rest by hand.

test <- c("15min", "15 min", "Maybe a few hours", 
          "4hr", "4hour", "3.5hr", "3-10", "3-10")
new_var <- rep(NA, length(test))

my_sub <- function(regex, new_var, test){
    t2 <- gsub(regex, "\\1", test)
    identified_vars <- which(test != t2)
    new_var[identified_vars] <- as.double(t2[identified_vars])
    return(new_var)    
}

new_var <- my_sub("([0-9]+)[ ]*min", new_var, test)
new_var <- my_sub("([0-9]+)[ ]*(hour|hr)[s]{0,1}", new_var, test)

To get work with the ones that you need to change by hand I suggest something like this:

# Which have we not found
by.hand <- which(is.na(new_var))

# View the unique ones not found
unique(test[by.hand])
# Create a list with the ones
my_interpretation <- list("3-10"= 5, "Maybe a few hours"=3)
for(key_string in names(my_interpretation)){
    new_var[test == key_string] <- unlist(my_interpretation[key_string])
}

This gives:

> new_var
[1] 15.0 15.0  3.0  4.0  4.0  3.5  5.0  5.0

Regex can be a little tricky, every time I'm doing anything with regex I run a few simple tests. Se ?regex for the manual. Here's some basic behavior:

> # Test some regex
> grep("[0-9]", "12")
[1] 1
> grep("[0-9]", "12a")
[1] 1
> grep("[0-9]$", "12a")
integer(0)
> grep("^[0-9]$", "12a")
integer(0)
> grep("^[0-9][0-9]", "12a")
[1] 1
> grep("^[0-9]{1,2}", "12a")
[1] 1
> grep("^[0-9]*", "a")
[1] 1
> grep("^[0-9]+", "a")
integer(0)
> grep("^[0-9]+", "12222a")
[1] 1
> grep("^(yes|no)$", "yes")
[1] 1
> grep("^(yes|no)$", "no")
[1] 1
> grep("^(yes|no)$", "(yes|no)")
integer(0)
> # Test some gsub, the \\1 matches default or the found text within the ()
> gsub("^(yes|maybe) and no$", "\\1", "yes and no")
[1] "yes"

Solved – Best way to display data with outliers

The standard traditional tool is a histogram. You can do this with the analysis tool pack in Excel, but I'd recommend using a stats package instead.

An extension of the histogram is a line plot showing the density - this is basically your idea of shwoing the bell curve, and it is probably the right one. From here there are various options such as drawing vertical lines to show the mean, median, 95th percentile, etc. To do this you will definitely want a stats package.

Some examples are below, including the code in R (which is free) that generated the data and drew the plots. You can see it that's not necessarily that hard to do this sort of thing in a stats package, if you're prepared to move beyond Excel.

# generate data
times <- rgamma(1000,1,1)

# draw histogram, showing counts
hist(times, col="grey")

# draw a density line plot
plot(density(times), bty="l")

# add vertical lines for the median and 95th percentile
abline(v=quantile(times, c(0.5, 0.95)), lty=2:3)

enter image description here

Best Answer

Related Solutions

Solved – Cleaning data of inconsistent format in R

Solved – Best way to display data with outliers

Related Question