To understand how the kmeans() function works, you need to read the documentation and/or inspect the underlying code. That said, without even bothering to check, I am confident it does not take a distance matrix. You could write your own function to do k-means clustering from a distance matrix, but it would be an awful hassle.
The k-means algorithm is meant to operate over a data matrix, not a distance matrix. It only minimizes squared Euclidean distances (cf. Why does k-means clustering algorithm use only Euclidean distance metric?). It only makes sense when squared Euclidean distance is a meaningful measure of dissimilarity for your data. This has been true ever since the algorithm was invented, but few people seem to be aware of it, with the result that k-means is probably the most misused algorithm in machine learning.
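To make that objective concrete, here is a minimal sketch (using scikit-learn's KMeans in Python as a stand-in for R's kmeans(); an assumption, not the function asked about) verifying that what k-means minimizes is exactly the within-cluster sum of squared Euclidean distances:

```python
# Sketch: k-means operates on a data matrix and minimizes the
# within-cluster sum of *squared Euclidean* distances to the centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # a data matrix, NOT a distance matrix

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared Euclidean distance from each
# point to its assigned centroid, summed over all points.
wcss = sum(
    np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
    for k in range(3)
)
print(np.isclose(wcss, km.inertia_))   # True: inertia_ is exactly that sum
```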
Euclidean distance doesn't make any sense for sparse categorical data (text mining), so I wouldn't even try anything like this. You first need to figure out what distance metric is appropriate for your data (@ttnphns explains some possible measures here: What is the optimal distance function for individuals when attributes are nominal?). Then you can compute the distance matrix and use a clustering algorithm that can operate over one (e.g., k-medoids (PAM), various hierarchical algorithms, etc.).
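As one possible sketch of that workflow, here is SciPy-based hierarchical clustering operating directly on a precomputed distance matrix; the Jaccard measure below is just an illustrative choice, not a recommendation for your particular data:

```python
# Sketch: cluster from a precomputed distance matrix with SciPy's
# hierarchical (average-linkage) clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 10))      # toy sparse/binary data

# Jaccard dissimilarity as one example of a non-Euclidean measure;
# pdist returns the "condensed" (upper-triangle) form linkage expects.
D = pdist(X, metric="jaccard")

Z = linkage(D, method="average")           # hierarchical clustering on D
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                              # cluster assignment per row

# If you already have a full square matrix, condense it first:
# D = squareform(D_square, checks=False)
```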
Yes, it is possible to assign topics to sentences, or, more generally, to give each sentence a probability of belonging to each topic. Many LDA inference methods provide a probability of each word belonging to each topic; you can average these to get the probability of each sentence belonging to each topic. If you want to assign a single topic to each sentence, choose the topic with the highest probability; how you break ties is up to you.
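A toy illustration of that aggregation step (the per-word topic probabilities below are invented numbers, not output from a real model):

```python
# Rows = words of one sentence, columns = topics. Average the rows to get
# the sentence's topic distribution; argmax gives a hard assignment.
import numpy as np

word_topic_probs = np.array([
    [0.7, 0.2, 0.1],    # word 1
    [0.5, 0.4, 0.1],    # word 2
    [0.1, 0.8, 0.1],    # word 3
])

sentence_topic_probs = word_topic_probs.mean(axis=0)
print(sentence_topic_probs)            # probability of the sentence per topic
print(sentence_topic_probs.argmax())   # single-topic assignment (ties: your call)
```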
I am not an expert in gensim, but that project appears to use variational inference for LDA. In that case, you want the variational parameter giving a probability distribution over topics for each word, but glancing through the docs/source, I don't see how to obtain it.
Here's the heuristic I would use: just look at the matrix relating terms to topics and, for each sentence, add up the topic contributions of its terms. This ignores information from the other sentences in the document, but it should be a reasonable approximation. Consult the method "get_term_topics" of the LDA object to obtain these contributions (a sketch follows).
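Here is a rough sketch of that heuristic with gensim's LdaModel; the corpus, parameters, and the decision to skip out-of-vocabulary tokens are all illustrative assumptions:

```python
# Sketch: score a sentence by summing get_term_topics() contributions
# of its terms, then take the argmax topic.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "pet"], ["stock", "market", "trade"],
         ["dog", "bark"], ["trade", "price"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

def sentence_topic(tokens, model, dictionary):
    scores = np.zeros(model.num_topics)
    for tok in tokens:
        if tok in dictionary.token2id:              # skip out-of-vocab terms
            for topic_id, p in model.get_term_topics(
                    dictionary.token2id[tok], minimum_probability=0.0):
                scores[topic_id] += p               # sum term contributions
    return scores.argmax()

print(sentence_topic(["dog", "pet"], lda, dictionary))
```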
Is LDA on sentences equivalent to LDA on documents? The answer here is no. In deciding which topic each word of the corpus comes from, LDA inference algorithms borrow information from the other words in the same document through a parameter (denoted by $\theta$ in the original LDA paper) that gives the topic prevalence of each document. Therefore, running LDA with each sentence treated as its own document will give a different result, since sentences won't "borrow strength" from one another. I would conjecture that you will get a somewhat similar result, but it won't be the same. Further, standard LDA inference algorithms have difficulty with short documents (such as tweets, which are sometimes aggregated to form longer documents; see e.g. this article), so you may see some degradation in the quality of the results.
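If you want to see the difference empirically, one quick sketch is to fit gensim's LdaModel twice on the same words, once with documents as the unit and once with sentences, and compare the fitted topics (the texts here are invented for illustration):

```python
# The two corpora give LDA different per-document theta's to estimate,
# so the fitted topics generally differ.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["cat", "dog", "pet", "stock", "market"],
        ["trade", "price", "stock", "dog", "bark"]]
sents = [["cat", "dog", "pet"], ["stock", "market"],
         ["trade", "price", "stock"], ["dog", "bark"]]

for name, texts in [("documents", docs), ("sentences", sents)]:
    d = Dictionary(texts)
    bow = [d.doc2bow(t) for t in texts]
    lda = LdaModel(bow, id2word=d, num_topics=2, random_state=0)
    print(name, lda.print_topics())
```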
Best Answer
This is an example. You first need to copy matutils.py and utils.py from gensim, and the directory layout should look like the picture below.
The code below should go in doc_similar.py. Then just move your data file into the data directory and change fname in the function main.
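The actual code block does not appear in this copy of the answer, so here is only a minimal sketch of what such a doc_similar.py might look like, assuming one whitespace-tokenized document per line and a tf-idf cosine-similarity index; the original script may have differed substantially:

```python
# doc_similar.py -- a reconstruction under stated assumptions, not the
# original answerer's code.
from gensim import corpora, models, similarities

def doc_similar(fname):
    # Assumption: one document per line, tokens separated by whitespace.
    with open(fname, encoding="utf-8") as f:
        texts = [line.split() for line in f if line.strip()]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(
        tfidf[corpus], num_features=len(dictionary))

    # Similarity of the first document against all others, as an example.
    sims = index[tfidf[corpus[0]]]
    return sorted(enumerate(sims), key=lambda x: -x[1])

def main():
    fname = "data/your_data_file.txt"   # change fname here, as described
    print(doc_similar(fname)[:10])

if __name__ == "__main__":
    main()
```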