In brief, for question 1) a good option to explore would be multinomial logistic regression models, and for question 2) a good option to explore would be ordinal logistic regression models.
- Multinomial models allow you to predict probabilities of occurrence
for different possible categorical outcomes, and how those
probabilities depend on explanatory variables (e.g. demographic and
park/area characteristics). However, be aware that you would need to
look at users' top, secondary, tertiary, etc. preferences for different
park priorities separately. This is because in multinomial logistic
regression the outcome for each observation (user) must be a single
choice from the possible categorical options.
- Ordinal models allow you to predict probabilities of getting
different responses on an ordered outcome (e.g. a Likert scale), and
how those probabilities depend on explanatory variables (e.g.
demographic and park/area characteristics).
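To make the difference between the two model types concrete, here is a minimal sketch of how each one turns a linear predictor into outcome probabilities. It is not a fitted model: the coefficients, cutpoints, outcome categories and the age variable are all made-up assumptions for illustration, and in practice you would estimate them with a statistics package.

```python
import math

def multinomial_probs(linear_predictors):
    """Multinomial logit: softmax over the categories.

    The first (reference) category has its linear predictor fixed at 0;
    `linear_predictors` holds the values for the remaining categories.
    """
    scores = [0.0] + list(linear_predictors)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ordinal_probs(linear_predictor, cutpoints):
    """Proportional-odds (cumulative logit) model.

    P(Y <= k) = logistic(cutpoint_k - linear_predictor), with increasing cutpoints.
    Category probabilities are differences of successive cumulative probabilities.
    """
    cumulative = [1 / (1 + math.exp(-(c - linear_predictor))) for c in cutpoints]
    cumulative.append(1.0)  # the highest category closes the scale
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

# Hypothetical multinomial example: top park priority is one of
# "safety" (reference), "facilities" or "nature", depending on age.
def priority_probs(age):
    return multinomial_probs([0.5 - 0.02 * age, -1.0 + 0.03 * age])

# Hypothetical ordinal example: a 4-point agreement scale, depending on age.
def agreement_probs(age):
    return ordinal_probs(0.02 * age, cutpoints=[-1.0, 0.0, 1.5])

for age in (30, 60):
    print(age, [round(p, 3) for p in priority_probs(age)])
    print(age, [round(p, 3) for p in agreement_probs(age)])
```

In both cases the probabilities for a given user sum to 1. The multinomial model has a separate set of coefficients for each non-reference category, whereas the ordinal model shares one set of coefficients across the whole ordered scale.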
Clustering
In addition, if you are sampling data on users of different parks, then that suggests there is highly likely to be clustering in the data at the park-level (e.g. responses are likely to vary systematically between parks).
For example, to use a somewhat facetious illustration, users of parks near wealthier areas might be more worried about who else is using the park, whilst those from less wealthy areas might be less worried. That would mean there is variability at the level of parks, which would need to be accounted for in any analysis to avoid biased results (ignoring such variation typically results in falsely small standard errors, and therefore falsely narrow confidence intervals and falsely small p-values).
You may simply want to account for clustering as a nuisance issue, but you may also find you want to explicitly examine it, because there can be lots of interesting questions related to it. Either way, a very good and standard option would be to use multilevel (or mixed effects) versions of the model types described above.
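To illustrate why ignoring park-level clustering understates uncertainty, here is a small sketch with made-up data (the scores and park names are hypothetical). It compares a naive standard error for the overall mean, which treats every user as independent, with a simple cluster-aware one that treats the parks as the independent units. This is only a toy calculation, not a multilevel model, and the park-means approach assumes equal numbers of users per park.

```python
import math
import statistics

# Hypothetical 1-5 "worry" scores from five users in each of four parks.
scores_by_park = {
    "park_a": [1, 2, 1, 2, 1],
    "park_b": [4, 5, 4, 5, 4],
    "park_c": [2, 2, 3, 2, 3],
    "park_d": [5, 4, 5, 5, 4],
}

all_scores = [s for scores in scores_by_park.values() for s in scores]

# Naive standard error: treats all 20 users as independent observations.
naive_se = statistics.stdev(all_scores) / math.sqrt(len(all_scores))

# Cluster-aware standard error: treats the park means as the independent units.
park_means = [statistics.mean(scores) for scores in scores_by_park.values()]
cluster_se = statistics.stdev(park_means) / math.sqrt(len(park_means))

print(f"naive SE:   {naive_se:.3f}")
print(f"cluster SE: {cluster_se:.3f}")
```

Because most of the variation in these made-up scores lies between parks rather than within them, the cluster-aware standard error comes out more than twice as large as the naive one. A multilevel model handles this more generally, including unequal park sizes and explanatory variables.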
Resources
The University of Bristol has a truly excellent and freely available (after registering) online course called LEMMA (Learning environment for multilevel methodology and applications), which provides materials on a wide range of multilevel models. See https://www.cmm.bris.ac.uk/lemma/
I would suggest looking here first. They have modules that cover both single-level and multilevel models for both multinomial and ordinal logistic regression. The materials include documents on the conceptual background with examples, as well as fully worked examples using real data, with code provided for a range of software packages including Stata, R and MLwiN (although sometimes code is only provided for one or two of these packages).
The materials are also aimed at researchers rather than pure statisticians, so they sound like an ideal starting place for you. They also provide further relevant references, which would allow you to expand your learning/skills as required.
Best Answer
You have to aggregate the data at the level of the 50 stores. Then you can apply your clustering algorithm to these aggregated data.
Regarding the categorical variables, I would not use the modes. Instead, I would recode all the categorical variables into binary 0/1 variables and compute their means per store. For example, if a variable equals 1 when a customer is a man and 0 otherwise, its mean gives you the proportion of men among the customers who visited a particular shop. Set up your data as follows: if a categorical variable has two categories, recode it into a single 0/1 variable; if it has more than two categories, create a binary (0/1) variable for each category of the original variable.
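The recoding and aggregation described above could be sketched as follows. The customer records, variable names and store labels are made up for illustration, and the sketch uses only the Python standard library; the resulting per-store proportions are what you would then pass to the clustering algorithm.

```python
from collections import defaultdict

# Hypothetical customer-level records (one row per customer).
customers = [
    {"store": "S1", "gender": "male", "payment": "card"},
    {"store": "S1", "gender": "female", "payment": "cash"},
    {"store": "S1", "gender": "female", "payment": "card"},
    {"store": "S1", "gender": "male", "payment": "voucher"},
    {"store": "S2", "gender": "female", "payment": "cash"},
    {"store": "S2", "gender": "female", "payment": "cash"},
]

def indicator_levels(records, categorical_vars):
    """Choose which categories get a 0/1 indicator.

    Two-category variables get a single indicator (for the second category
    in sorted order); variables with more categories get one per category.
    """
    levels = {}
    for var in categorical_vars:
        found = sorted({r[var] for r in records})
        levels[var] = found[1:] if len(found) == 2 else found
    return levels

def store_profiles(records, categorical_vars):
    """Return, for each store, the mean of every 0/1 indicator (a proportion)."""
    levels = indicator_levels(records, categorical_vars)
    sums = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for r in records:
        counts[r["store"]] += 1
        for var in categorical_vars:
            for level in levels[var]:
                sums[r["store"]][f"{var}={level}"] += int(r[var] == level)
    return {
        store: {name: total / counts[store] for name, total in indicators.items()}
        for store, indicators in sums.items()
    }

profiles = store_profiles(customers, ["gender", "payment"])
for store, profile in profiles.items():
    print(store, profile)
```

Each store ends up as one row of proportions, for example the share of male customers and the share paying by each method. If you work in pandas, `pd.get_dummies` followed by a `groupby` on the store and `mean()` gives the same kind of table.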