Solved – Can we use Bag of Visual Words to compute similarity between images directly

clustering, computer vision, image processing, information retrieval

I'm implementing a Content Based Image Retrieval application (CBIR).

I've read about the Bag of Features (BoF) model, and it's often used as an intermediate step in other applications. For example, the generated histograms can be fed to an SVM for classification.

Since the produced vectors suffer from the curse of dimensionality, it can become expensive to compute some kind of distance between the query histogram and all the dataset histograms. For this reason, techniques like LSH have been used for datasets with hundreds of thousands (or millions) of images. As explained here, KD-trees are of little use in this context because of the histograms' high dimensionality: their performance is not going to be better than a linear scan.

However, my system contains (let's say) 50k images, so computing the distances directly should not be prohibitive.

I have two questions:

  1. Is this approach reasonable? To recap: run the classic BoF pipeline, then compute the distance between the query histogram and each dataset histogram; the image with the smallest distance is returned as the most similar one (a minimal sketch of what I mean appears after this list).
  2. Which distance should I use? I've heard that the chi-squared distance is a popular choice. What about Euclidean distance, cosine similarity, etc.? Which one is best in this context?
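To make question 1 concrete, here is roughly what I have in mind (just a sketch; the variable names and the chi-squared definition are my own assumptions, and the histograms are assumed to be precomputed and stored as rows of a NumPy array):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-squared distance between two BoF histograms;
    # eps avoids division by zero on empty bins.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def most_similar(query_hist, dataset_hists):
    # Linear scan: compute the distance from the query to every
    # dataset histogram and return the index of the closest one.
    distances = np.array([chi2_distance(query_hist, h) for h in dataset_hists])
    best = int(np.argmin(distances))
    return best, distances[best]

# dataset_hists: (50_000, vocabulary_size) array of BoF histograms
# query_hist:    (vocabulary_size,) histogram of the query image
```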

Best Answer

The approach is reasonable as long as the bag-of-features representation properly captures your desired concept of similarity/distance between images. For example, it has the property that local features could be spatially re-arranged within an image and the representation wouldn't change. That could be good or bad, depending on your perspective.

Similar reasoning applies to the choice of distance metric. There is no absolute best; it depends entirely on what you want 'similar' and 'distant' to mean in the context of your application. For example, cosine distance measures the relative angle between vectors, but is invariant to magnitude. This would mean that distance between images is determined by the relative frequency of each feature, but not the absolute count. This is a popular choice for measuring distance between bag-of-word models of text documents, because relative word frequencies can better capture the meaning of text documents (e.g. a longer document might contain more occurrences of each word, but this doesn't affect the meaning). Whether cosine distance makes sense for your image retrieval system depends on whether you want your system to have that property.
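To make this concrete, here is a small illustration (my own, with ad hoc function names) of how Euclidean, cosine, and chi-squared distances behave on two histograms that have the same relative frequencies but different absolute counts:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def chi2(a, b, eps=1e-10):
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

h1 = np.array([4.0, 1.0, 3.0])   # raw BoF counts for one image
h2 = 2.0 * h1                    # same relative frequencies, twice as many features

print(euclidean(h1, h2))         # > 0: sensitive to absolute counts
print(cosine_distance(h1, h2))   # ~ 0: invariant to magnitude
print(chi2(h1, h2))              # > 0 unless histograms are normalized first
```

If you L1-normalize the histograms before comparing them, Euclidean and chi-squared distances also become insensitive to the total feature count, which blurs some of this distinction.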

Your goal is to return the image in your database that's closest to the query image. This is called nearest neighbor search. It isn't necessary to compute the distance between the query and every single image in your database; more efficient procedures exist. The general idea is that distances between images in the database can be exploited, along with properties of the distance metric (in particular the triangle inequality). If you know the distance from the query image to some database image, you can infer that certain other database images can't possibly be the nearest neighbor, so many unnecessary distance computations can be eliminated and the search can be performed in sublinear time. You mentioned locality sensitive hashing, so it sounds like you're aware of this issue. Another popular class of methods uses tree data structures (e.g. KD-trees, ball trees, cover trees) to accelerate the search. Searching for "nearest neighbor search" will return many papers on the topic, and ready-to-use code is available for a number of methods.
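As a rough illustration (not a prescription), with 50k histograms something like scikit-learn's BallTree can be built once and queried repeatedly. Note that ball trees also degrade in very high dimensions, so you would have to benchmark this against a plain linear scan on your own data:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Placeholder data: 50k BoF histograms over a 500-word vocabulary.
dataset_hists = np.random.rand(50_000, 500)
query_hist = np.random.rand(1, 500)          # query() expects a 2-D array

tree = BallTree(dataset_hists, metric='euclidean')  # built once, reused for every query
dist, idx = tree.query(query_hist, k=5)             # 5 nearest neighbours
print(idx[0])  # indices of the most similar images in the database
```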
