Solved – the best way to Reshape/Restructure Data

data preprocessingexcelr

I am a research assistant for a lab (volunteer). I and a small group have been tasked with the data analysis for a set of data pulled from a large study. Unfortunately the data were gathered with an online app of some sort, and it was not programmed to output the data in the most usable form.

The pictures below illustrates the basic problem. I was told that this is called a "Reshape" or "Restructure".

Question: What is the best process for going from Picture 1 to Picture 2 with a large data set with over 10k entries?

This is a sample, the actual file has over 10k entries. I've color coded to help with clarity

This is what it all needs to be converted into.

Best Answer

As I noted in my comment, there isn't enough detail in the question for a real answer to be formulated. Since you need help even finding the right terms and formulating your question, I can speak briefly in generalities.

The term you are looking for is data cleaning. This is the process of taking raw, poorly formatted (dirty) data and getting it into shape for analyses. Changing and regularizing formats ("two" $\rightarrow 2$) and reorganizing rows and columns are typical data cleaning tasks.

In some sense, data cleaning can be done in any software and can be done with Excel or with R. There will be pros and cons to both choices:

  • Excel: Excel is almost certainly the most common choice for data cleaning (see R fortunes #59 pdf). It is also considered a poor choice by statisticians. The primary reason is that it is hard to ensure that you have caught everything, or that you have treated everything identically, and there is no record of the changes that you have made, so you can't revisit those changes later. The upside of using Excel is that it will be easier to see what you are doing, and you don't have to know much to make changes. (Statisticians will consider the latter an additional con.)
  • R: R will require a steep learning curve. If you aren't very familiar with R or programming, things that can be done quite quickly and easily in Excel will be frustrating to attempt in R. On the other hand, if you ever have to do this again, that learning will have been time well spent. In addition, the ability to write and save your code for cleaning the data in R will alleviate the cons listed above. The following are some links that will help you get started with these tasks in R:

    You can get a lot of good information on Stack Overflow:

    Quick-R is also a valuable resource:

    Getting numbers into numerical mode:

    Another invaluable source for learning about R is UCLA's stats help website:

    Lastly, you can always find a lot of information with good old Google:

    • This search: data cleaning in r, brings up a number of tutorials (none of which I've worked through, FTR).

Update: This is a common issue regarding the structure of your dataset when you have multiple measurements per 'study unit' (in your case, a person). If you have one row for every person, your data are said to be in 'wide' form, but then you will necessarily have multiple columns for your response variable, for example. On the other hand, you can have just one column for your response variable (but have multiple rows per person, as a result), in which case your data are said to be in 'long' form. Moving between these two formats is often called 'reshaping' your data, especially in the R world.

  • The standard R function for this is ?reshape. There is a guide to using reshape() on UCLA's stats help website.
  • Many people think reshape is hard to work with. Hadley Wickham has contributed a package called reshape2, which is intended to simplify the process. Hadley's personal website for reshape2 is here, the Quick-R overview is here, and there is a nice-looking tutorial here.
  • There are very many questions on SO about how to reshape data. Most of them are about going from wide to long, because that is typically what data analysts are faced with. Your question is about going from long to wide, which is much less common, but there are still many threads about that, you can look through them with this search.
  • If your heart is set on trying to do this with Excel, there is a thread about writing a VBA macro for Excel to replicate the reshape functionality here: melt / rehshape in Excel using VBA?
Related Question