One thing to keep in mind before turning to weights is that gender can be considered a "swamping" variable in a two-step cluster analysis. Differences between genders are often large, and thus overpower weaker, but still substantively interesting, heterogeneity in your data.
Instead of down-weighting gender, you could consider looking into a finite mixture regression model. Finite mixture models are a form of model-based cluster analysis (clusters are usually assumed to be multivariate Gaussian), and a finite mixture regression model essentially combines a cluster analysis with a regression. In your case, you could use gender as a predictor, perform this analysis, and detect clusters while taking into account the predictive power of gender (as well as other variables of interest). More information can be found in the documentation of the flexmix R package.
Using the "option 1" dataset, you effectively declare that similar respondents are those who bought (or stole) the same items on the same turn or visit. I don't believe this is your research goal. In addition, it creates problematic NA responses.
The "option 2" dataset is what you should use, but do recode 2 into 1 to make the data binary. You can then enter these variables into TwoStep as categorical variables. But here another doubt arises. TwoStep is for nominal categorical variables only, and you probably will not want to treat the binary variables as dichotomous nominal. Treating them as nominal means that, for you, respondents who did not buy the same item are as similar to each other as those who did buy the same item. Rather, you'll want to treat those "0 and 0" respondents as neither similar nor dissimilar - this suggests using similarity measures such as the Jaccard measure. But that in turn precludes using TwoStep and requires hierarchical clustering or other clustering methods suited to binary data (those others, unfortunately, are not found in SPSS).
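To make that asymmetry concrete, here is a minimal pure-Python sketch of the Jaccard measure on made-up binary purchase vectors (the function and data are illustrative, not anything built into SPSS):

```python
# Hedged sketch: Jaccard similarity for binary (bought = 1 / not = 0) rows.
def jaccard(u, v):
    """Shared 1s divided by positions where either vector has a 1.
    Joint absences (0, 0) are ignored entirely."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return both / either if either else 0.0

r1 = [1, 0, 1, 0, 0, 0]  # toy respondents
r2 = [1, 1, 0, 0, 0, 0]
r3 = [0, 0, 0, 0, 1, 1]

print(jaccard(r1, r2))  # 1 shared item out of 3 ever bought -> 0.333...
print(jaccard(r1, r3))  # no overlap -> 0.0
```

Appending more joint-zero columns to both vectors leaves the score unchanged, which is exactly the "0 and 0 is neither similar nor dissimilar" behavior described above.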
But you can't do hierarchical clustering of 75,000 respondents - that's too many, both practically and theoretically, for a hierarchical method. You see, you've got trapped.
One way out is to look for clustering algorithms (outside SPSS) designed for big (large) sparse binary data. Search this site for that word combination: you'll get a few related questions to read.
Another way out may be to do hierarchical clustering on random subsets of your data (subsets of size, say, 500 respondents). Clustering subsamples with cross-validation is beneficial, as it escapes the threat of overfitting. But, in the context of clustering, it is quite a lot of work. I recommend you read papers on cluster analysis by subsamples.
A third, and the easiest, way is to do K-means clustering of your data. It solves the problem of the big dataset. However, K-means is often seen as theoretically inappropriate for binary as well as count variables: because the method involves computing floating-point geometric centroids, it is said to require interval, ideally continuous, variables. That said, people use K-means with binary or count data "all the time". It appears to me that in high-dimensional settings such as yours (1500 variables) the relative contributions of "continuity" and "dimensionality" to the formation of centroids shift towards dimensionality anyway - even if you had quite fine-grained Likert-scale variables. That seems to excuse, to an extent, applying K-means to a wide binary dataset such as yours.
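For intuition about what K-means does with binary rows, here is a bare-bones pure-Python sketch of Lloyd's algorithm on toy data (the function and data are invented for illustration; for a 75,000 × 1,500 dataset you would of course use an optimized implementation such as SPSS's K-means procedure):

```python
import random

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's K-means, for illustration only."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(data, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in data:
            d = [sum((x - c) ** 2 for x, c in zip(row, cen)) for cen in centroids]
            clusters[d.index(min(d))].append(row)
        for j, members in enumerate(clusters):
            if members:
                n = len(members)
                # the centroid is a floating-point mean, hence non-binary -
                # this is the theoretical objection mentioned in the text
                centroids[j] = [sum(col) / n for col in zip(*members)]
    labels = []
    for row in data:
        d = [sum((x - c) ** 2 for x, c in zip(row, cen)) for cen in centroids]
        labels.append(d.index(min(d)))
    return labels, centroids

# toy binary data with two planted groups (made up)
data = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1]]
labels, cents = kmeans(data, 2)
print(labels)  # rows 0-2 end up together, rows 3-5 together
```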
If you choose K-means, then I recommend normalizing each row (respondent) of the dataset to unit sum of squares (= unit sum, since the data are binary) prior to clustering. Why do it? When you normalize vector magnitudes (L2 norms), the euclidean distance between the vectors comes to reflect the cosine similarity between them directly: $d^2=2(1-\cos)$. And it is cosine similarity (= the Ochiai binary measure) which is a justifiable alternative to the Jaccard measure I mentioned earlier. Both Jaccard and Ochiai treat "0 and 0" respondents as neither similar nor dissimilar - which is what you need. So, using K-means on data normalized this way is, in a sense, analogous to using hierarchical clustering on the Ochiai measure.
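A quick numeric check of that identity, in pure Python with made-up binary vectors:

```python
import math

# Sketch: after L2-normalizing the rows, squared euclidean distance
# equals 2*(1 - cosine); on binary data, cosine is the Ochiai measure.
def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))  # for 0/1 data: sqrt(row sum)
    return [x / norm for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

u = [1, 0, 1, 1, 0]  # toy respondents
v = [1, 1, 0, 1, 0]

d2 = sum((a - b) ** 2 for a, b in zip(l2_normalize(u), l2_normalize(v)))
print(d2, 2 * (1 - cosine(u, v)))  # the two quantities coincide
```

Here both vectors have three 1s and share two of them, so cosine = Ochiai = $2/\sqrt{3 \cdot 3} = 2/3$, and the squared distance between the normalized rows is $2(1 - 2/3) = 2/3$.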
Best Answer
Bear in mind that CV is not intended to be a resource for software-specific questions. Given that, I don't know SPSS either but, having done my share of clustering, may still be able to provide some useful, general guidelines. As with all unsupervised, exploratory methods, there is typically no "ground truth" against which to validate the results. Use statistical metrics and common sense to derive solutions that are actionable.
The two-step process is generating seeds in step one for input into the second, k-means step. Does SPSS provide any options for filtering those seeds? For instance, being able to set a minimum seed size would eliminate outliers or splinter seeds and help to stabilize the results.
Next, I've found that playing around with the number of predictors used by the cluster algorithm can be hugely important in generating useful results. Since k-means assumes continuously distributed inputs and relies on least-squares estimation (i.e., it's not scale invariant), it is typically a good idea to pass the raw features through a PCA or EFA to reduce redundancy and smooth the information. Then, adjusting the number of components used by eliminating the smaller-eigenvalued factors can sometimes clarify the resulting partitions. For instance, if your EFA returns 8 factors and you don't get useful results running the algorithm on all of them, try dropping the lowest-loading components.
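As a toy illustration of the eigenvalue-screening step (two made-up features, closed-form eigenvalues of the 2×2 covariance matrix; real data would go through a proper PCA/EFA routine):

```python
import math

# Two nearly collinear toy features: almost all variance should land
# on one component, so the second component could be dropped.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# eigenvalues of the 2x2 covariance matrix, in closed form
tr, det = sxx + syy, sxx * syy - sxy * sxy
gap = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + gap) / 2, (tr - gap) / 2
print(lam1, lam2)  # lam1 >> lam2: nearly all variance on one component
```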
Evaluate the lumpiness of the solution, i.e., the frequency distribution of the groupings. For instance, solutions with clusters containing much more than 40% of your data are probably not giving good results.
If SPSS provides some sort of summary metrics like pseudo-rsquares, then run different solutions that request sequential numbers of clusters on the same inputs, e.g., 3 to 30 clusters. Find the inflection point at which those summary metrics "roll over" and stop growing from one iteration to the next. Use that as a starting point for a deeper dive.
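A sketch of that stopping rule in pure Python, with invented metric values and an arbitrary gain threshold (both are assumptions for illustration, not SPSS output):

```python
# Locate the point at which a summary fit metric "rolls over":
# the first k whose incremental gain falls below a small threshold.
ks = list(range(3, 13))
pseudo_r2 = [0.21, 0.34, 0.45, 0.53, 0.58, 0.60, 0.61, 0.62, 0.62, 0.63]

gains = [b - a for a, b in zip(pseudo_r2, pseudo_r2[1:])]
elbow = next(k for k, g in zip(ks[1:], gains) if g < 0.03)
print(elbow)  # first cluster count where the metric stops growing much
```

The returned k is only a starting point for a deeper dive, as the text says, not a final answer.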
At this point, you can triangulate to a solution. Once you've got a solution that you like, try to validate it. There are a number of ways of doing this. For instance, you want solutions that "replicate." One method for this is to employ a train and test split of the data to see if the clusters are recoverable based on the misclassification error rate in cross-validation. There are different rules of thumb about this error rate that go as low as slightly better than random assignment (50% error) for weakly predictive models. Another approach is to make a judgment as to whether the resulting segments "feel" real or are representative, modal profiles of the space being clustered. Of course, this is a highly subjective and qualitative decision.
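A minimal sketch of that replication check, under simplifying assumptions: the cluster labels for each half are planted here (in practice they would come from running the clustering separately on each half), and the classifier is a simple nearest-centroid rule. All names and data are invented for illustration.

```python
from itertools import permutations

def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def assign(row, cents):
    d = [sum((x - c) ** 2 for x, c in zip(row, cen)) for cen in cents]
    return d.index(min(d))

# toy split: (row, cluster label) pairs; labels planted for the sketch
train = [([1, 1, 0, 0], 0), ([1, 1, 1, 0], 0),
         ([0, 0, 1, 1], 1), ([0, 1, 1, 1], 1)]
test = [([1, 0, 0, 0], 0), ([1, 1, 0, 1], 0),
        ([0, 0, 0, 1], 1), ([0, 0, 1, 0], 1)]

k = 2
cents = [centroid([r for r, lab in train if lab == j]) for j in range(k)]
pred = [assign(r, cents) for r, _ in test]
true = [lab for _, lab in test]

# cluster labels are arbitrary, so score the best relabelling
err = min(sum(p_map[p] != t for p, t in zip(pred, true)) / len(true)
          for p_map in [dict(enumerate(perm))
                        for perm in permutations(range(k))])
print(err)  # -> 0.0: the partition replicates on the held-out half
```

A low error rate suggests the partition replicates; an error near random assignment suggests it does not.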