Similarity and relevance are also frequently confused, but they are not the same thing. Companies doing topic clustering, document filtering, and similar applications often say that their algorithms work by grouping relevant documents together. What they actually mean is that the algorithms group similar documents together. Two (or more) documents are never relevant to each other. They may be similar to each other, but they are only ever relevant to a user’s information need. If there is no user information need, there is no relevance.
The cluster hypothesis in information retrieval says that documents that are similar to each other have a high likelihood of being relevant to the same information need. Similarity is therefore evidence about relevance, not a substitute for it: documents by themselves are never relevant to each other, because relevance is defined only in terms of a user’s information need.
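The distinction can be made concrete with a small sketch. It is only an illustration under stated assumptions: the three toy documents, the two information needs, and the relevance judgments are all hypothetical, and bag-of-words cosine similarity stands in for whatever similarity measure a real system would use. The point is in the function signatures: similarity takes two documents, while relevance takes a document and an information need.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Similarity is a property of a pair of documents; no user is involved."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy collection (assumption, not from any real corpus).
docs = {
    "d1": "jaguar speed in the rainforest",
    "d2": "jaguar hunting in the rainforest",
    "d3": "jaguar car top speed",
}

# Hypothetical relevance judgments. Note that they are keyed by the
# information need: without a need there is nothing to look up.
judgments = {
    "big cat habitat": {"d1", "d2"},
    "sports car performance": {"d3"},
}


def is_relevant(doc_id: str, information_need: str) -> bool:
    """Relevance is a property of a (document, information need) pair."""
    return doc_id in judgments.get(information_need, set())


if __name__ == "__main__":
    # Similarity can be computed with no user in sight.
    print("sim(d1, d2):", cosine_similarity(docs["d1"], docs["d2"]))
    print("sim(d1, d3):", cosine_similarity(docs["d1"], docs["d3"]))
    # Relevance can only be asked about once a need is given.
    for need in judgments:
        print(need, "->", [d for d in docs if is_relevant(d, need)])
```

In this toy data, d1 and d2 are more similar to each other than d1 is to d3, and they also happen to be relevant to the same need, which is the pattern the cluster hypothesis predicts. Nothing in the similarity function guarantees that outcome; it is a statistical tendency, not a definition.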