Solved: good software for data collection à la CDC's Epi Info

dataset, excel, open source, software

I am a medical student working on a project that requires a large amount of data, collected manually by data entry operators. Although I am using the traditional tool of epidemiologists, Epi Info, I wanted to know if anyone could recommend good alternatives. They need not be free/open source, though that is preferred.

The data has already been collected, but it is all on paper, spanning five years. It contains ~22 fields per record and covers many individuals (>100,000). I am from India; although I am at a medical college, this project is our own endeavor (i.e., self-funded). A double data entry proposal is on the table, but a final call on check mechanisms will be made only after the pilot study.

The whole dataset is handwritten (literally scribbled) and is inconsistent in many places. I have almost finished designing the whole form in Epi Info, with drop-down boxes, radio buttons, and check boxes, and it takes advantage of Check Code wherever required. My concerns with Epi Info: the present version (EI7) promises geocoding, but unfortunately I could not get it to work, even after getting separate keys from Bing. As a potential workaround I would use the Google geocoding API, which limits queries to ~2,500/day/IP (yes, there are workarounds).
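To sketch what I mean by the Google API step, here is roughly what I have in mind (Python; the file names, the key placeholder, and the very simple daily-cap handling are all just illustrative, not a final design):

```python
# geocode_google.py - minimal sketch: geocode addresses with Google's
# Geocoding API while staying under the ~2,500 requests/day free quota.
# Assumes a plain-text file with one address per line; names are placeholders.
import csv
import time
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"   # placeholder
DAILY_LIMIT = 2500
URL = "https://maps.googleapis.com/maps/api/geocode/json"

with open("addresses.txt") as src, open("geocoded.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["address", "lat", "lng", "status"])
    for i, line in enumerate(src):
        if i >= DAILY_LIMIT:          # stop before hitting the daily quota
            break
        address = line.strip()
        resp = requests.get(URL, params={"address": address, "key": API_KEY})
        data = resp.json()
        if data["status"] == "OK":
            loc = data["results"][0]["geometry"]["location"]
            writer.writerow([address, loc["lat"], loc["lng"], "OK"])
        else:
            writer.writerow([address, "", "", data["status"]])
        time.sleep(0.2)               # be polite; avoid per-second rate limits
```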

Secondly, Epi Info is painfully slow, and I am having a hard time placing confidence in it for such a large dataset, especially when merging will be required and when the data will have to be re-imported after editing it in Excel (geocoding and cleaning other discrepancies). Personally, I find the latest version's Dashboard very good for data analysis, but many of us favour SPSS.

Lastly, we are not in favour of a web-based application (of course, because we do not have internet everywhere here, and it would have to be fast as well); an offline solution is required.

Best Answer

Very interesting question. The three tasks you have at hand (data entry, geocoding, and data analysis) are all things that can be done by one program or by three (or more) completely separate programs. This isn't an "answer", exactly, but I've outlined my experiences below.

Data Entry:

  • MS Access: The old standby. Build forms and enter data. Potentially hazardous if you have multiple users entering data simultaneously. I've used this for some small projects and would prefer to avoid it in the future - I had weird problems with records linked across tables, though it sounds like your data fits in a single table. You'd need sufficient licenses and computers running Windows. I find it slow, but most of my Access DBs are on a network drive across campus, so that's part of the problem.
  • SurveyMonkey: SurveyMonkey can make a reasonably good data entry tool for simple, form-based input - there are a few configuration tweaks necessary, but I've used this for entering a few thousand surveys. It's web-based, so you'd need a sufficiently reliable Internet connection, but otherwise you're using SurveyMonkey's hardware. Shouldn't be any problem with multiple simultaneous data entry, and it has several options for data export. You'd need at least the Select plan (US$17 per month) to get unlimited questions and respondents.
  • REDCap: Vanderbilt University runs a consortium around its REDCap software, which is purpose-built for research (including medical research). You have to be part of the consortium to use it, though, and I think most consortium members host their own servers - but you might be able to piggyback on someone else's.
  • A homebrew solution built on a web framework (e.g., Django or Rails): Provides maximum control, but also has the highest technical capacity requirements. I've played around with Django a bit and I think you could get a 22-field form up pretty quickly (see the sketch after this list). I don't think 22 x 100k counts as "big" where Django is concerned.
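To give a flavour of the homebrew option, here is a minimal Django sketch - the field names are hypothetical placeholders, since I don't know your actual 22 fields:

```python
# models.py - minimal sketch; field names are hypothetical placeholders
from django.db import models

class Record(models.Model):
    # just 4 of the ~22 fields, as an illustration
    patient_id = models.CharField(max_length=20)
    visit_date = models.DateField()
    district = models.CharField(max_length=50)
    diagnosis = models.CharField(max_length=100, blank=True)

# forms.py - a ModelForm gives you the whole entry form almost for free
from django import forms

class RecordForm(forms.ModelForm):
    class Meta:
        model = Record
        fields = "__all__"   # render every model field as a form widget
```

Hook RecordForm up to a view and a template and every data entry operator gets a browser form; Django's built-in admin would give you a serviceable entry screen with even less code.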

Data Analysis:

Does Epi Info support writing code for analysis, or is it limited to the widgets and other menu-driven choices I saw in the tutorial video? Coding up an analysis is key for being able to reproduce results and find errors, so if Epi Info doesn't have that, SPSS would be an improvement. Better still, R - it's free, has a really robust community, and has packages that will let you do just about any kind of analysis you'd like. All three of these should be able to import data from whatever data entry option you choose, so don't worry about that.
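To illustrate what I mean by "coding up an analysis", here is a tiny sketch - in Python/pandas, though the equivalent few lines exist in R; the file and column names are hypothetical:

```python
# analysis.py - sketch of a scripted, reproducible analysis step.
# The point is that the whole analysis lives in a re-runnable file
# rather than in a sequence of menu clicks.
import pandas as pd

df = pd.read_csv("entered_data.csv")          # export from the entry tool

# simple cross-tabulation, e.g. diagnosis counts by district
counts = pd.crosstab(df["district"], df["diagnosis"])
counts.to_csv("diagnosis_by_district.csv")    # results regenerate on demand
print(counts)
```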


Geocoding (and, I'm assuming, some mapping):

The maps in Epi Info looked nice! That was the segment of the intro video that really drew my attention. But there are many ways to geocode your addresses, and it may be easier to do data entry in your system of choice and then bulk-geocode the addresses afterward (that'll save you 100,000 clicks of that 'Get Coordinates' button, anyway). There are several options for that - Bing and Google, of course, and many others (check out the geocoding questions on our sister site, GIS.StackExchange). I think it would be especially worthwhile to check out bulk geocoders that explicitly advertise support for Indian addresses - many geocoders (e.g., SmartyStreets) are nation-specific, and many others are just going to return crappy results. A number of geocoding APIs are available through packages in R, and there are a variety of packages available for mapping.
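As a sketch of the bulk approach, here is one way to do it with Python's geopy package and the free OSM/Nominatim geocoder - whether Nominatim's coverage of Indian addresses is good enough is exactly what you'd want to test first, and the file and column names are placeholders:

```python
# bulk_geocode.py - sketch: geocode a whole CSV after data entry,
# using geopy + OSM's Nominatim (India coverage would need testing)
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="epi-project-geocoder")  # placeholder name
# Nominatim's usage policy asks for at most 1 request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

df = pd.read_csv("entered_data.csv")        # placeholder file/column names
df["location"] = df["address"].apply(geocode)
df["lat"] = df["location"].apply(lambda loc: loc.latitude if loc else None)
df["lng"] = df["location"].apply(lambda loc: loc.longitude if loc else None)
df.drop(columns="location").to_csv("entered_data_geocoded.csv", index=False)
```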

So - I would definitely consider using Epi Info only for the things it's doing well for you (the form creation and data entry look nice, but if it's slow, what can you do?), and reaching out to other tools for the things it's not doing well. My ideal version of this would probably be double data entry into a Django database, automatic geocoding through an API or by sending a file to a service (whichever gets the most accurate results), and analysis and mapping in R.