Solved – How to measure the similarity of two datasets (matrices) of different lengths

distributions, kolmogorov-smirnov test, machine learning, similarities, statistical significance

There are related questions already, but I can't find a good method for measuring the similarity between two datasets that are represented by matrices with different numbers of rows. For instance, the first dataset is sensor data with the features x, y, z, gyro, and acc, and has 1000 records. The second dataset has the same features but 1500 records. How do I compute the similarity between the two?

I've tried dynamic time warping (DTW), but I'm not sure it fits: it is mostly used for time series, and my dataset doesn't contain any temporal information. It also doesn't output a score in [0, 1], so I'm not sure how to scale it.
I also looked at the Kolmogorov-Smirnov test, but it only compares one feature (column) at a time, even though the two samples may differ in size. I thought of measuring the distance for each column separately and summing the results, but I haven't tried that yet.
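
For reference, the per-column idea could be sketched roughly as follows. This is only a minimal sketch, assuming both datasets are NumPy arrays with the same columns; the function names `per_feature_ks` and `ks_similarity` are made up for illustration, and averaging (rather than summing) the KS statistics is my own choice so that the result stays in [0, 1].

```python
import numpy as np
from scipy.stats import ks_2samp


def per_feature_ks(a, b):
    """KS statistic for each column; a and b may have different row counts."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape[1] != b.shape[1]:
        raise ValueError("both datasets need the same number of features")
    return np.array(
        [ks_2samp(a[:, j], b[:, j]).statistic for j in range(a.shape[1])]
    )


def ks_similarity(a, b):
    """Hypothetical score: 1 minus the mean KS statistic, so it lies in [0, 1]."""
    return 1.0 - per_feature_ks(a, b).mean()


# Synthetic stand-ins for the sensor data (assumption: 5 numeric features).
rng = np.random.default_rng(0)
first = rng.normal(size=(1000, 5))
second = rng.normal(size=(1500, 5))            # same distribution, more rows
shifted = rng.normal(loc=2.0, size=(1500, 5))  # every feature shifted

print(ks_similarity(first, second))   # expected to be close to 1
print(ks_similarity(first, shifted))  # expected to be clearly lower
```

Note that this only compares the marginal distribution of each feature, so two datasets whose columns match individually but whose correlations between features differ would still look similar.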

Best Answer

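I think your problem is related to domain adaptation, where the literature uses the term "discrepancy" for the difference between two domains. Check Ben-David's papers on discrepancy. There is a simplified measure defined as the classification error (or accuracy) of a classifier trained to discriminate samples of one dataset from samples of the other: if the classifier does no better than chance, the datasets are hard to tell apart, and the more accurately it separates them, the more they differ.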
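
A rough sketch of that classifier-based measure, assuming scikit-learn is available. The choice of logistic regression, the use of balanced accuracy (so that chance level is 0.5 even with 1000 vs. 1500 rows), and the rescaling to a [0, 1] similarity are all my own illustrative assumptions, not something taken from the papers; `domain_classifier_score` and `score_to_similarity` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def domain_classifier_score(a, b, folds=5):
    """Cross-validated balanced accuracy of a classifier that tries to
    tell rows of a (label 0) from rows of b (label 1)."""
    X = np.vstack([a, b])
    y = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=folds, scoring="balanced_accuracy")
    return scores.mean()


def score_to_similarity(balanced_acc):
    """Illustrative rescaling: 0.5 (chance) -> 1.0, 1.0 (separable) -> 0.0."""
    return float(np.clip(1.0 - 2.0 * (balanced_acc - 0.5), 0.0, 1.0))


# Synthetic stand-ins for the two sensor datasets (assumption: 5 features).
rng = np.random.default_rng(0)
first = rng.normal(size=(1000, 5))
second = rng.normal(size=(1500, 5))            # same distribution
shifted = rng.normal(loc=2.0, size=(1500, 5))  # different distribution

acc_same = domain_classifier_score(first, second)
acc_diff = domain_classifier_score(first, shifted)
print(acc_same, score_to_similarity(acc_same))  # near chance, high similarity
print(acc_diff, score_to_similarity(acc_diff))  # near 1.0, low similarity
```

Unlike a per-column test, this looks at all features jointly, but it can only detect differences the chosen classifier is able to pick up; a linear model misses some differences that a nonlinear model (for example a random forest) could find.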