Solved – Analysis of Categorical and Likert-Like Survey Data

categorical data, count-data, r, survey

I'm about to have data from intercept surveys conducted in parks. The goal of the survey is to determine which characteristics of parks users find most important to park quality (do they care a lot about safety, a little about the facilities, and not at all about who else is there?).

We've designed a survey with open-ended questions to answer this question. The current plan is to take down the responses, and then, once we have them, group them into categories (safety, facilities, social environment, accessibility, etc).

For example, one question on the survey asks the user why they came to the park.

Each user's response (we're allowing them to list as many reasons as they like, but are asking for primary reasons first, then secondary reasons and so on) will then be associated with some field coding. For one user it might be, say, facilities and park aesthetics, for another it might be easy access. We'll also have some demographic data (age, sex, ethnicity, activity at the park) for each user.

Question 1: We want to determine which of the categories is most important to users, and if possible, by how much. I've never done any categorical data analysis, and I have no idea what to do here. For some questions we're just going to have counts: 16 people came for facilities, 10 for open spaces, etc.

Question 2: A separate series of questions asks users to categorize park quality on a Likert-like scale (low to high quality), and also to rate sub-components of park quality in the same way (quality of facilities, from low to high, and so on). We want to determine which predictors have the largest effect on perceived park quality here as well.

I want to know what type of models to fit to our data, and why.

I'm presuming we want some categorical analogue of regression. I want to pick up theoretical underpinnings, learn how to fit models in R, and also how to perform diagnostics on them.

Once I've decided on the appropriate analysis and have picked up the necessary background, I'd like to pre-register my data analysis plan. I've never done this before and am curious what the convention is for this.

Some details about the sample of parks: the city Parks and Recreation department has selected 10 parks for us to visit. Their park selection criteria are not entirely known, but I think they want to visit some well-developed and some underdeveloped parks. There are five pairs of parks that the Parks department thinks are comparable. In each pair of parks, one has recently undergone renovation, and the other hasn't.

My background:

I have taken a first course in math-stat, a course on linear regression, and am halfway through a course on experimental design/ANOVA/EM/Bootstrap. I have some pure math, multi, lin-alg and optimization background as well. I have some limited experience in R as well.

Best Answer

In brief, for question 1 a good option to explore would be multinomial logistic regression models, and for question 2 a good option to explore would be ordinal logistic regression models.

  • Multinomial models allow you to predict the probabilities of the different possible categorical outcomes, and how those probabilities depend on explanatory variables (e.g. demographic and park/area characteristics). However, be aware that you would need to look at users' primary, secondary, tertiary, etc. preferences for different park priorities separately. This is because in multinomial logistic regression the outcome for each observation (user) must be a single choice from the possible categories.
  • Ordinal models allow you to predict probabilities of getting different responses on an ordered outcome (e.g. a Likert scale), and how those probabilities depend on explanatory variables (e.g. demographic and park/area characteristics).
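To make the two model types concrete, here is a minimal Python sketch (standard library only) of how each one turns a linear predictor into category probabilities. The category names, coefficient values and cut-points are invented for illustration and are not estimates from any data. The ordinal part assumes the common proportional-odds (cumulative logit) form. In R, models of these types are commonly fitted with functions such as multinom() in the nnet package and polr() in the MASS package.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def multinomial_probs(linear_predictors):
    """Multinomial logit: softmax over one linear predictor per category.

    The reference category is the one whose linear predictor is fixed at 0.
    """
    largest = max(linear_predictors.values())  # subtract for numerical stability
    exps = {k: math.exp(v - largest) for k, v in linear_predictors.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def ordinal_probs(cutpoints, eta):
    """Proportional-odds model: P(Y <= j) = logistic(cutpoint_j - eta).

    Category probabilities are differences of consecutive cumulative
    probabilities; a larger eta shifts mass towards higher categories.
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical linear predictors for one user's primary reason for visiting
# ("facilities" is the reference category).
reason_probs = multinomial_probs(
    {"facilities": 0.0, "open_space": -0.5, "access": -1.2}
)

# Hypothetical 5-point quality rating: 4 increasing cut-points, one linear predictor.
cutpoints = [-2.0, -0.5, 0.8, 2.2]
rating_probs = ordinal_probs(cutpoints, eta=0.6)

print({k: round(v, 3) for k, v in reason_probs.items()})
print([round(p, 3) for p in rating_probs])
```

In both cases the explanatory variables enter only through the linear predictor, which is why the fitted coefficients are interpreted on the log-odds scale.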

Clustering

In addition, if you are sampling data on users of different parks, then that suggests there is highly likely to be clustering in the data at the park-level (e.g. responses are likely to vary systematically between parks).

To use a potentially facetious example, users of parks near wealthier areas might be more worried about who else is using the park, whilst those from less wealthy areas might be less worried. This would mean there is variability occurring at the level of parks, which would need to be accounted for in any analysis to avoid biased results (ignoring such variation typically results in standard errors that are too small, and therefore confidence intervals that are too narrow and p-values that are too small).
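A rough feel for how much clustering matters comes from the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) × ICC, where m is the number of respondents per park and ICC is the intraclass correlation. The Python sketch below is illustrative only: the 30 users per park and the ICC of 0.05 are assumed values, not figures from this survey.

```python
import math


def design_effect(cluster_size, icc):
    """Approximate variance inflation from clustering (equal cluster sizes)."""
    return 1.0 + (cluster_size - 1) * icc


n_parks = 10          # from the question: 10 parks
users_per_park = 30   # assumption for illustration
icc = 0.05            # assumption for illustration

deff = design_effect(users_per_park, icc)
n_total = n_parks * users_per_park
effective_n = n_total / deff    # sample size the data are "worth" after clustering
se_inflation = math.sqrt(deff)  # factor by which naive standard errors are too small

print(f"design effect:         {deff:.2f}")
print(f"effective sample size: {effective_n:.0f} of {n_total}")
print(f"SE inflation factor:   {se_inflation:.2f}")
```

Even a small intraclass correlation can shrink the effective sample size substantially when many respondents come from each park.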

You may simply want to account for clustering as a nuisance issue, but you may also find you want to explicitly examine it, because there can be lots of interesting questions related to it. Either way, a very good and standard option would be to use multilevel (or mixed effects) versions of the model types described above.
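As a minimal sketch of what the simplest multilevel version (a random-intercept model) assumes, the Python code below gives each park its own normally distributed shift on the log-odds scale. The overall intercept and the between-park standard deviation are invented values. The intraclass correlation uses the usual latent-variable formula for logit models, in which the level-1 variance is π²/3. In R, a mixed-effects ordinal model of this kind can be fitted with, for example, clmm() in the ordinal package.

```python
import math
import random


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


random.seed(1)  # fixed seed so the sketch is reproducible

n_parks = 10
overall_intercept = -0.4  # assumed log-odds of a "high quality" rating
park_sd = 0.7             # assumed between-park standard deviation

# One random intercept per park, shared by every respondent in that park.
park_effects = [random.gauss(0.0, park_sd) for _ in range(n_parks)]
park_probs = [logistic(overall_intercept + u) for u in park_effects]

# Share of latent-scale variance attributable to parks.
latent_icc = park_sd ** 2 / (park_sd ** 2 + math.pi ** 2 / 3)

for park, p in enumerate(park_probs, start=1):
    print(f"park {park:2d}: P(high quality) = {p:.2f}")
print(f"latent-scale ICC = {latent_icc:.3f}")
```

Treating clustering as a nuisance means reporting only the fixed effects from such a model, whereas examining it explicitly means interpreting the between-park variance itself.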

Resources

The University of Bristol has a truly excellent and freely available (after registering) online course called LEMMA (Learning environment for multilevel methodology and applications), which provides materials on a wide range of multilevel models. See https://www.cmm.bris.ac.uk/lemma/

I would suggest looking here first. They have modules that cover both single-level and multilevel models for both multinomial and ordinal logistic regression. The materials include documents on the conceptual background with examples, and also fully worked examples using real data, with code provided for a range of software packages including Stata, R and MLwiN (although sometimes code is provided for only one or two of these packages).

The materials are also aimed at researchers rather than pure statisticians, so they sound like an ideal starting place for you. They also provide further relevant references, so they would allow you to expand your learning and skills further as required.
