One problem with your scenario, unrelated to clustering, is that you want to show that your results are similar, and that is always harder than showing that they are different. Most statistical tools are built to detect differences larger than some threshold, and they make a poor argument when you need to claim the opposite.
Here's one idea, though. Say you study your first dataset and decide that it seems to have 3 clusters. You justify this selection properly, using one of the standard validity tests. Then you run k-means clustering (or EM clustering) without randomization and assign each point of dataset 1 to a cluster.
Then you augment your dataset, adding points from dataset 2 to it one at a time and rerunning the same clustering algorithm each time. A single point will not skew your clusters much, so essentially you will label dataset 2 using the clusters defined by dataset 1.
Then you run the same clustering algorithm on dataset 2 alone. Now, for each point, you know which of the 3 clusters it was assigned to based on dataset 1, and which one based on dataset 2. You can calculate the share of points for which these assignments overlap, which is already telling. You can also calculate a p-value for the hypothesis that this overlap is random (but most probably it will not be random unless your clusters are really spurious, so this test would not be too useful).
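As a sketch of that overlap computation (the function and the toy label vectors below are mine, not from any library): cluster numbers are arbitrary, so the two labelings have to be aligned before counting agreements.

```python
import itertools
import numpy as np

def overlap_share(labels_a, labels_b, k=3):
    """Fraction of points whose cluster assignment agrees between two
    labelings, maximized over all relabelings of the k cluster ids
    (cluster numbering is arbitrary, so labels must be aligned first)."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    best = 0.0
    for perm in itertools.permutations(range(k)):
        remapped = np.array([perm[l] for l in labels_b])
        best = max(best, float(np.mean(labels_a == remapped)))
    return best

# Hypothetical assignments for the same 6 points of dataset 2:
from_d1 = [0, 0, 1, 1, 2, 2]   # clusters inherited from dataset 1
from_d2 = [1, 1, 0, 0, 2, 2]   # clusters found in dataset 2 alone
print(overlap_share(from_d1, from_d2))  # 1.0 -- perfect agreement up to relabeling
```

Brute-force alignment over all $k!$ permutations is fine for small $k$; for many clusters one would use the Hungarian algorithm or a chance-corrected index such as the adjusted Rand index instead.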
Arguing that the difference is small is harder. In principle, you first have to define how you expect your data to differ, if it differs at all, and then specify the largest difference you are willing to tolerate. Then you somehow generate a set of surrogate datasets that are "different" by exactly this critical difference, and see how likely you are to observe the actual overlap between dataset 2 clusters and dataset 1 clusters on such a dataset. But to do this, first and foremost, you need a good, well-thought-through null hypothesis $H_0$.
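The logic of that equivalence-style test can be sketched as follows. The observed overlap and the surrogate overlaps below are placeholder numbers; in practice both would come out of your clustering pipeline, with the surrogates built to differ by exactly your critical difference.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder inputs (in practice: computed, not invented):
# - observed_overlap: share of matching assignments between datasets 1 and 2
# - surrogate_overlaps: the same statistic on datasets generated to differ
#   by exactly the largest difference you are willing to tolerate
observed_overlap = 0.93
surrogate_overlaps = rng.normal(loc=0.80, scale=0.04, size=1000)

# One-sided p-value under H0 "the datasets differ by the critical amount":
# how often does a critically-different surrogate reach the observed overlap?
p_value = float(np.mean(surrogate_overlaps >= observed_overlap))
print(p_value)
```

A small p-value here means the observed agreement is implausibly high for datasets that truly differ by the critical amount, which is the kind of evidence you need to argue "essentially the same."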
You can use other sorts of measurements as well: for example, you could compute the average separation (the mean of all pairwise distances) between the points from "matching clusters" in datasets 1 and 2. Or you could use entropy-based approaches. But regardless of the approach, proving that the clusters are different would be easy, while proving that they are "essentially the same" would require introducing additional assumptions and a good, solid $H_0$.
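The average-separation measurement is straightforward; here is a minimal numpy sketch (the toy clusters are made up for illustration):

```python
import numpy as np

def mean_separation(points_a, points_b):
    """Mean of all pairwise Euclidean distances between a cluster from
    dataset 1 (points_a) and its matching cluster from dataset 2 (points_b)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    diffs = a[:, None, :] - b[None, :, :]        # shape (n_a, n_b, dim)
    return float(np.linalg.norm(diffs, axis=2).mean())

# Two toy "matching clusters": first sitting on top of each other, then far apart
c1 = np.array([[0.0, 0.0], [1.0, 0.0]])
c2_near = np.array([[0.0, 0.0], [1.0, 0.0]])
c2_far = c2_near + [10.0, 0.0]
print(mean_separation(c1, c2_near))  # 0.5
print(mean_separation(c1, c2_far))   # 10.0
```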
This is because PCA scores are simply the original data in a rotated coordinate frame.
Below, on the left, I show some example data (100 points in 2D), and on the right the corresponding PCA scores. The data cloud simply gets rotated clockwise by approximately 45 degrees.
If it is not completely clear to you how one gets from the first subplot to the second one or why PCA amounts to rotation, take a look at our very informative thread Making sense of principal component analysis, eigenvectors & eigenvalues. In my answer there I am using exactly the same toy dataset as displayed here. Some other answers are very much worth reading too.
Now, to your question.
Clustering methods are usually based on Euclidean distances between points. The points that lie close to each other get clustered together; the ones that are far away get assigned to different clusters. As you can see above, all distances between all points stay exactly the same after PCA.
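This distance preservation is easy to verify numerically. A minimal sketch (the random cloud here is my own stand-in, not the exact toy dataset from the figure): PCA amounts to multiplying the centered data by an orthogonal eigenvector matrix, and orthogonal transformations leave all pairwise distances unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])  # correlated 2D cloud
Xc = X - X.mean(axis=0)                      # center the data

# PCA via eigendecomposition of the covariance matrix; the eigenvector
# matrix V is orthogonal, so projecting onto it is a pure rotation
# (possibly combined with a reflection).
cov = np.cov(Xc, rowvar=False)
_, V = np.linalg.eigh(cov)
scores = Xc @ V

# All pairwise distances are unchanged by an orthogonal transformation:
d_before = np.linalg.norm(Xc[:, None] - Xc[None, :], axis=2)
d_after = np.linalg.norm(scores[:, None] - scores[None, :], axis=2)
same = np.allclose(d_before, d_after)
print(same)  # True
```

Since k-means only ever looks at these distances, feeding it the scores instead of the raw data cannot change the resulting partition.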
Hence the identical clustering results. Here are both representations clustered with k-means with $k=3$:
As you see, the clustering results are identical.
Can PCA make any difference at all?
Yes. One can use it in two ways:
- Standardize all scores to unit variance; or
- Use only a subset of principal components, usually the ones that explain the most variance.
Here is how it looks in the same toy example. On the left I am using standardized scores (note how different the clusters become); on the right I am using only PC1.
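For concreteness, here is one way to compute those two variants of the scores (a minimal numpy sketch on a made-up elongated cloud; note that `np.linalg.eigh` sorts eigenvalues in ascending order, so the last column is PC1):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # elongated cloud
Xc = X - X.mean(axis=0)

# PCA scores via eigendecomposition of the covariance matrix:
_, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ V

# Variant 1: standardize every component to unit variance. This is NOT an
# orthogonal transform, so pairwise distances -- and hence the k-means
# partition -- can change.
standardized = scores / scores.std(axis=0)

# Variant 2: keep only PC1; clustering then happens in 1D.
pc1_only = scores[:, [-1]]

print(standardized.std(axis=0))  # [1. 1.]
print(pc1_only.shape)            # (100, 1)
```

Either variant changes the geometry that k-means sees, which is exactly why the clusters in the two panels above differ from the original ones.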
Here is an idea: treat your 2 clusters as 2 classes. Build a binary classification model with one cluster as the reference class. The most significant variables are the ones driving the partition. You can look at statistics such as p-values and chi-square to determine which variables are relatively more significant.
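A simpler screen in the same spirit (my own sketch, not the full binary model: a per-variable two-sample t-test between the two clusters, whereas a logistic regression with Wald tests would be the complete version) could look like this. The toy data are invented so that only the first feature separates the clusters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Toy data: 2 clusters of 50 points each; feature 0 separates them,
# feature 1 is pure noise.
cluster = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 2))
X[cluster == 1, 0] += 4.0   # only feature 0 differs between the clusters

# Test each feature for a difference between the two clusters; small
# p-values flag the variables that drive the partition.
pvals = []
for j in range(X.shape[1]):
    t, p = stats.ttest_ind(X[cluster == 0, j], X[cluster == 1, j])
    pvals.append(p)
    print(f"feature {j}: p = {p:.3g}")
# Feature 0 comes out highly significant; feature 1 typically does not.
```

With many variables you would also want to correct the p-values for multiple testing before ranking them.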