Set aside Kullback-Leibler divergence for a moment and consider the following: it's perfectly possible for the Kolmogorov-Smirnov p-value to be small and, at the same time, for the corresponding Kolmogorov-Smirnov distance to be small.
Specifically, that can easily happen with large sample sizes, where even small differences are still larger than we'd expect to see from random variation.
The same thing will tend to happen when comparing some other suitable measure of divergence to the Kolmogorov-Smirnov p-value; it arises quite naturally at large sample sizes.
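To see this concretely, here is a minimal sketch in base R (the sample size and the tiny mean shift are arbitrary choices of mine) using ks.test:

set.seed(1)
x <- rnorm(1e5)               # 100,000 draws from N(0, 1)
y <- rnorm(1e5, mean = 0.05)  # N(0.05, 1): a practically negligible shift
ks.test(x, y)                 # D is small (~0.02), yet the p-value is tiny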
[If you don't wish to confound the distinction between Kolmogorov-Smirnov distance and p-value with the difference in what the two things are looking at, it might be better to explore the differences in the two measures ($D_{KS}$ and $D_{KL}$) directly, but that's not what is being asked here.]
Unfortunately, no; assessing the optimality of a perplexity parameter through the corresponding $KL(P||Q)$ divergence is not a valid approach. As I explained in this question: "The perplexity parameter increases monotonically with the variance of the Gaussian used to calculate the distances/probabilities $P$. Therefore, as you increase the perplexity parameter you will get smaller distances in absolute terms and, subsequently, smaller KL-divergence values." This is described in detail in the original 2008 JMLR paper, Visualizing Data using $t$-SNE, by van der Maaten and Hinton.
You can easily see this programmatically with a toy dataset too. Say, for example, you want to use $t$-SNE on the famous iris dataset and you try different perplexities, e.g. $10, 20, 30, 40$. What would the empirical distribution of the scores be? Well, we are lazy, so let's have the computer do the job for us and run the Rtsne routine with a few ($50$) different starting values and see what we get:
(Note: I use the Barnes-Hut implementation of $t$-SNE (van der Maaten, 2014) but the behaviour is the same.)
library(Rtsne)

REPS <- 50  # number of random starts per perplexity

# Run t-SNE REPS times for each perplexity, varying only the random seed
per10 <- sapply(1:REPS, function(u) { set.seed(u);
  Rtsne(iris, perplexity = 10, check_duplicates = FALSE) }, simplify = FALSE)
per20 <- sapply(1:REPS, function(u) { set.seed(u);
  Rtsne(iris, perplexity = 20, check_duplicates = FALSE) }, simplify = FALSE)
per30 <- sapply(1:REPS, function(u) { set.seed(u);
  Rtsne(iris, perplexity = 30, check_duplicates = FALSE) }, simplify = FALSE)
per40 <- sapply(1:REPS, function(u) { set.seed(u);
  Rtsne(iris, perplexity = 40, check_duplicates = FALSE) }, simplify = FALSE)

# Lowest KL divergence reached during each run
costs <- c(sapply(per10, function(u) min(u$itercosts)),
           sapply(per20, function(u) min(u$itercosts)),
           sapply(per30, function(u) min(u$itercosts)),
           sapply(per40, function(u) min(u$itercosts)))
perplexities <- c(rep(10, REPS), rep(20, REPS), rep(30, REPS), rep(40, REPS))

# Kernel density of the KL scores, one curve per perplexity
plot(density(costs[perplexities == 10]), xlim = c(0, 0.3), ylim = c(0, 250), lwd = 2,
     main = 'KL scores from different perplexities on the same dataset')
grid()
lines(density(costs[perplexities == 20]), col = 'red',     lwd = 2)
lines(density(costs[perplexities == 30]), col = 'blue',    lwd = 2)
lines(density(costs[perplexities == 40]), col = 'magenta', lwd = 2)
legend('topright', col = c('black', 'red', 'blue', 'magenta'),
       legend = c('perp. = 10', 'perp. = 20', 'perp. = 30', 'perp. = 40'), lwd = 2)
Looking at the picture, it is clear that smaller perplexity values correspond to higher $KL$ scores, exactly as expected from the passage of the original paper quoted above. Using the $KL$ scores to pick the optimal perplexity is therefore not very helpful. You can still use them to pick the optimal run for a given perplexity though!
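For example, here is a minimal sketch (reusing the per10 runs from the code above; the object name best10 is just my choice) of selecting the lowest-KL run at perplexity $10$:

# Pick the run with the lowest KL cost among the 50 restarts
best10 <- per10[[which.min(sapply(per10, function(u) min(u$itercosts)))]]
# Plot its 2-D embedding, coloured by species for reference
plot(best10$Y, col = iris$Species, pch = 19,
     main = 'Lowest-KL run at perplexity = 10')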
Best Answer
Dimensionality reduction techniques are often motivated by finding new representations of the data in order to discover hidden variables or structure. The aim of SNE is to take a different approach (compared to PCA, for example) by preserving local structure, which it achieves by exploiting the asymmetry of the KL-divergence.
Conditional probabilities as inverse distance
Looking at Eq (1), notice that the conditional probability can be interpreted as an "inverse distance": close points (low distance) are assigned high probabilities, and far points (high distance) are assigned low probabilities.
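For reference, the conditional probability of Eq (1) in van der Maaten and Hinton (2008) is
$$ p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} ,$$
where $\sigma_i$ is the bandwidth of the Gaussian centred on $x_i$ (the quantity the perplexity controls).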
(Note: the name "inverse distance" is not accurate in a stricter mathematical sense, because effectively a larger set of numbers, $ \mathbb{R} $, is mapped to a smaller set of numbers, $ [0,1] $.)
Taking advantage of asymmetry in KL
Compared to a symmetric cost function, the cost in Eq (2) treats two scenarios very differently: representing nearby datapoints (high $p_{j|i}$) by widely separated map points (low $q_{j|i}$) incurs a large cost, whereas representing widely separated datapoints (low $p_{j|i}$) by nearby map points incurs only a small one.
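For reference, the cost function of Eq (2) is the sum of the KL-divergences over all datapoints,
$$ C = \sum_i KL(P_i||Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} .$$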
Thus, the asymmetry of the KL-divergence, together with the definition of the conditional probability, constitutes the key idea of this dimensionality reduction technique. Below, you can see that this is exactly why the other distances fail to be good substitutes.
So then, what is the problem with the other distance metrics?
The Jensen-Shannon divergence is effectively a symmetrisation of the KL-divergence via the mixture $M_i = \frac{1}{2}(P_i + Q_i)$:
$$ JSD(P_i||Q_i) = \frac{1}{2}KL(P_i||M_i) + \frac{1}{2} KL(Q_i||M_i) .$$
This symmetry loses exactly the property of preserving local structure, so JSD is not a good substitute.
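You can verify the symmetry with a small sketch in R (the toy three-outcome distributions are my own arbitrary choices):

kl  <- function(p, q) sum(p * log(p / q))  # discrete KL-divergence
jsd <- function(p, q) { m <- (p + q) / 2; 0.5 * kl(p, m) + 0.5 * kl(q, m) }
p <- c(0.7, 0.2, 0.1); q <- c(0.1, 0.3, 0.6)
c(kl(p, q), kl(q, p))    # asymmetric: the two directions differ
c(jsd(p, q), jsd(q, p))  # symmetric: identical either way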
The Wasserstein distance can intuitively be seen as the cost of rearranging one histogram into another. The rearrangement cost is the same in both directions, so the Wasserstein metric is also symmetric and likewise lacks this desirable property.
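In one dimension this is easy to check: for two equal-size samples, the empirical Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples (a standard identity; the helper name w1 is my own):

w1 <- function(x, y) mean(abs(sort(x) - sort(y)))  # 1-D W1 for equal-size samples
x <- rnorm(1000); y <- rnorm(1000, mean = 1)
c(w1(x, y), w1(y, x))  # identical both ways: the metric is symmetric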
The Kolmogorov-Smirnov distance is nonparametric, which would mean not assuming any probability distribution; but the structure we want to preserve is in fact precisely the one described by the distribution in Eq (1), so it discards exactly the information SNE relies on.