Solved – Survey analysis with missing data by design

Tags: missing-data, survey

I have a survey with 400 responses looking at the satisfaction of customers with a company's service overall, as well as on various specific aspects (website, account manager, invoicing, etc.). Ratings are done on a 10-point anchored scale.

The issue is that respondents only rate an aspect if they have interacted with it. For instance, only 14 of the 400 respondents have visited the website, so I only have 14 ratings for it.

The client would like to understand what drives overall satisfaction. I'd run a Shapley Value Regression using the relaimpo package in R, but I can't run any type of analysis without first handling all the missing data. Any advice on doing so?

Best Answer

You could start with the paper by Pokropek (2011), which describes the idea of data that are missing by design. In such cases, as the paper explains, several methods for imputing the missing data are possible.

As I understand it, your survey consisted of several parts, each a group of questions dealing with a specific subject, and not all customers took part in every part. I also assume that assigning groups of customers to survey parts was part of your design and was not decided by the study participants themselves (if it was, you may have non-random missing data issues to deal with).

Let me use a synthetic example. Imagine a survey taken by individuals divided into groups (G1,...,G5), where the survey consisted of different parts (P1,...,P4).

[Figure: diagram showing which groups (G1,...,G5) answered which survey parts (P1,...,P4)]

The parts deal with comparable subjects (i.e. you can assume strong correlation between the parts, so they are in some sense exchangeable; for example, different groups of students answer different mathematics tests). Some groups answered the same parts, e.g. G1, G2 and G5 answered P1 (see the diagram). If this is what your study looks like, then you can use methods for test equating (see Kolen and Brennan, 2004; Von Davier et al., 2004). Equating methods let you rescale a score from test $X$ to the scale of test $Y$, so that you can infer what score an individual would have obtained had they taken test $Y$ instead of $X$. Simple methods such as linear equating match the means and standard deviations of the two tests, while more advanced methods such as equipercentile equating (cf. Livingston, 2004) match their empirical cumulative distribution functions. Item Response Theory based methods are also used. In R you can use the equate package (Albano, 2014), which implements several basic equating methods.
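To make the linear case concrete, here is a minimal sketch of the idea in plain Python (not the equate package, and not necessarily how that package implements it). It maps a score $x$ on test $X$ to $\mu_Y + (\sigma_Y / \sigma_X)(x - \mu_X)$, so that the rescaled $X$ scores have the same mean and standard deviation as the $Y$ scores. The function name `linear_equate` and the ratings are made up for illustration, and the sketch assumes the groups that answered $X$ and $Y$ are comparable samples.

```python
from statistics import mean, pstdev


def linear_equate(x_scores, y_scores):
    """Return a function that maps a score on test X to the scale of test Y.

    The mapping matches the mean and standard deviation of the two
    score distributions (population SD is used for both).
    """
    mu_x, mu_y = mean(x_scores), mean(y_scores)
    sd_x, sd_y = pstdev(x_scores), pstdev(y_scores)
    if sd_x == 0:
        raise ValueError("scores on test X have zero variance")
    slope = sd_y / sd_x
    return lambda x: mu_y + slope * (x - mu_x)


# Hypothetical ratings, invented purely for illustration.
x_scores = [2, 4, 4, 6, 8]  # ratings observed on part X
y_scores = [5, 6, 7, 8, 9]  # ratings observed on part Y

to_y = linear_equate(x_scores, y_scores)
equated = [to_y(x) for x in x_scores]  # X ratings expressed on the Y scale

print(equated)
```

After the transformation, `equated` has the same mean and standard deviation as `y_scores`, which is all that linear equating guarantees. It does not match the shapes of the two distributions; that is what equipercentile methods aim at.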

This could be an option if you can assume that the different parts of the survey are exchangeable and that the groups share common parts. In your case, the assumption that parts dealing with different aspects of customer satisfaction are exchangeable is disputable, but you can still consider some of these methods, since they were designed for similar problems. The topic is quite broad, so I would suggest reviewing the literature before going any further.


Pokropek, A. (2011). Missing by design: Planned missing-data designs in social science. ASK. Research & Methods, 20, 81-105.

Kolen, M.J., and Brennan, R.L. (2004). Test equating, scaling, and linking. New York: Springer.

Von Davier, A.A., Holland, P.W., and Thayer, D.T. (2004). The kernel method of test equating. Springer Science & Business Media.

Livingston, S.A. (2004). Equating test scores. ETS.

Albano, A.D. (2014). equate: An R Package for Observed-Score Linking and Equating. R package version 2.
