Chia-Hui Chang and
Department of Computer Science and Information Engineering
National Taiwan University, Taipei 106, Taiwan
Keywords: query expansion, relevance feedback, document clustering
To overcome these problems, researchers have focused on automatic query expansion to help users formulate what information they really need. Another line of research is relevance feedback, in which the user judges the relevance of retrieved documents to resolve ambiguity in the query. In fact, these two techniques complement each other. However, the relevance feedback mechanisms studied so far, whether based on words or on documents, each have their own deficiencies. Word feedback has an upper bound on its performance in lexical-semantic expansion [Voo94], and document feedback is sometimes too tiring for users.
In this paper, we propose conceptual feedback together with a joint mechanism for query expansion. This continues our previous work based on clustering [CH97]. The idea is to organize the initial documents retrieved by the original query into conceptual groups so that the user can get a quick overview of what the query actually retrieves. Under this design philosophy, we choose document clustering as our first step toward conceptual feedback. The hypothesis is that documents that are similar to each other are more likely to concern the same topic than documents that are less similar.
Indeed, providing concept-based results together with interactive feedback has attracted many researchers in the past two years. Hearst and Pedersen [HP96] apply the dynamic browsing paradigm of Scatter/Gather, which clusters documents into topically coherent groups, to conventional similarity search as a way to navigate the retrieved documents. Static clustering of the database contents has also been exploited by Anick and Vaithyanathan [AV97]. They discuss the cognitive load required to assess the content of the clusters from the key terms and introduce natural language processing techniques to extract noun phrases for describing cluster contents.
In this paper, we address how concept-based feedback can be achieved in a personalized Web search assistant by integrating existing search engines with techniques for query expansion and relevance feedback. We focus on the mechanisms of keyword extraction for both cluster digesting and query expansion. Furthermore, the personalized Web search assistant can be enhanced by automatic discovery agents that search for more information based on the recorded query history.
As mentioned earlier, relevance feedback has long been suggested as a solution for query modification. Rocchio describes an elegant approach and shows how the optimal vector space query can be derived using vector addition and subtraction given the relevant and non-relevant documents [Roc71]. The probabilistic model proposed by Robertson and Sparck Jones shows how to adjust individual term weights based on the distribution of the terms in the relevant and non-relevant document sets [RS76].
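As a rough illustration of the vector addition and subtraction involved, the following is a minimal sketch of a Rocchio-style update over sparse term-weight vectors. The function names, the dictionary representation, the default values of `alpha`, `beta`, and `gamma`, and the dropping of negative weights are assumptions made for illustration; they are not taken from [Roc71] or from this paper.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts term -> weight)."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return alpha*query + beta*mean(relevant) - gamma*mean(non_relevant).

    The coefficient defaults are illustrative assumptions. Terms whose final
    weight is not positive are dropped (also an assumption of this sketch).
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for term, weight in centroid(relevant).items():
        new_query[term] += beta * weight
    for term, weight in centroid(non_relevant).items():
        new_query[term] -= gamma * weight
    return {term: weight for term, weight in new_query.items() if weight > 0}
```

With a query vector and two small sets of judged documents, `rocchio` moves the query toward terms that are common in the relevant set and away from terms that are common in the non-relevant set.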
Now, given the cluster or concept as the feedback unit, we would expect an approach that joins these two models. The intuitive idea is to digest each cluster into a document vector so that the query can be modified by Rocchio's algorithm [Roc71]. Thus, the problem becomes how keywords can be extracted as cluster digests and how the term weighting of the probabilistic model can be adjusted for this purpose.
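One simple way to picture a cluster digest is as a truncated centroid: average the term weights of the documents in the cluster and keep only the highest-weighted terms. The sketch below assumes this reading; the name `cluster_digest`, the `top_k` cutoff, and the use of a plain average are illustrative assumptions, and the paper's own digesting mechanism may differ.

```python
from collections import defaultdict


def cluster_digest(cluster_docs, top_k=10):
    """Summarize a cluster as a single pseudo-document vector.

    cluster_docs: list of sparse vectors (dicts term -> weight).
    Returns the top_k terms of the averaged vector (hypothetical cutoff).
    """
    if not cluster_docs:
        return {}
    total = defaultdict(float)
    for doc in cluster_docs:
        for term, weight in doc.items():
            total[term] += weight
    averaged = {term: weight / len(cluster_docs) for term, weight in total.items()}
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(averaged.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:top_k])
```

Digests of clusters marked relevant or non-relevant could then stand in for individual documents in a Rocchio-style query update.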
The basic idea of feature selection for a concept is to highlight those words that have high frequency with respect to some contrast concept. Past research has applied Robertson and Sparck Jones' term weighting to query expansion [Har92], treating the top 10-30 documents as relevant and all other documents in the corpus as non-relevant. For keyword extraction from a cluster, a simple application is to divide the initial documents into "belonging" and "not belonging" with respect to the cluster.
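The "belonging" versus "not belonging" split can be sketched as follows, using the commonly cited smoothed form of the Robertson and Sparck Jones relevance weight (with 0.5 added to each count). The function names and the set-of-terms document representation are assumptions for illustration, and the exact variant used in this work is not stated here.

```python
import math


def rsj_weight(r, n, R, N):
    """Smoothed Robertson/Sparck Jones relevance weight.

    r: documents in the cluster that contain the term
    n: documents overall that contain the term
    R: number of documents in the cluster ("belonging")
    N: number of initial documents in total
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


def cluster_term_weights(docs, cluster_ids):
    """Weight every term, treating cluster_ids as the 'belonging' set.

    docs: dict doc_id -> set of terms; cluster_ids: set of doc ids.
    """
    N = len(docs)
    R = len(cluster_ids)
    vocabulary = set().union(*docs.values()) if docs else set()
    weights = {}
    for term in vocabulary:
        n = sum(1 for terms in docs.values() if term in terms)
        r = sum(1 for doc_id in cluster_ids if term in docs[doc_id])
        weights[term] = rsj_weight(r, n, R, N)
    return weights
```

Terms concentrated in the cluster receive positive weights, while terms that occur mainly outside it receive negative weights.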
However, we find that direct application of the probabilistic weighting has some problems. Since the number of documents in a cluster is small (about 10), the weighting is useless for words that appear only in the cluster. Hence, the weighting must be modified for this scenario. At the time of writing, the best performance is obtained with a rewritten form of "cue validity" combined with the majority principle for keyword selection.
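The rewritten form of cue validity is not spelled out in this text, so the following is only a sketch under stated assumptions: cue validity is taken in its textbook sense as an estimate of P(cluster | term), and the "majority principle" is read as requiring that a term occur in more than half of the cluster's documents. The name `select_keywords` and the `min_validity` threshold are hypothetical.

```python
def select_keywords(docs, cluster_ids, min_validity=0.5):
    """Pick cluster keywords by cue validity plus a majority test.

    docs: dict doc_id -> set of terms; cluster_ids: set of doc ids.
    Assumed definitions (not taken from the paper):
      cue validity = (cluster docs with term) / (all docs with term)
      majority     = term occurs in more than half of the cluster docs
    """
    cluster_size = len(cluster_ids)
    vocabulary = set()
    for doc_id in cluster_ids:
        vocabulary |= docs[doc_id]

    scores = {}
    for term in vocabulary:
        in_cluster = sum(1 for doc_id in cluster_ids if term in docs[doc_id])
        overall = sum(1 for terms in docs.values() if term in terms)
        validity = in_cluster / overall  # overall >= in_cluster >= 1 here
        if in_cluster * 2 > cluster_size and validity >= min_validity:
            scores[term] = validity
    # Highest validity first; ties broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))
```

Under this reading, a word that appears in a single cluster document is rejected by the majority test even though its cue validity may be high, which is one way to avoid promoting rare words in a cluster of about 10 documents.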
In this paper, we focus on the integration of query expansion and relevance feedback. Employing conceptual feedback with query expansion based on the two models is a new approach in information retrieval. By expanding a query, we can not only increase the number of relevant documents retrieved but also rank the candidate documents better. At the same time, the constructed summaries of the queries serve as a descriptive profile of the user's information need. Thus, automatic information discovery can be carried out by generating new queries and filtering through the existing evidence.
There are a number of advantages to this Web assistant. First, it speeds up browsing by dividing the initial results into groups of similar documents, so that relevance feedback can be given as a dichotomy of relevant and non-relevant clusters. Second, the creation of a search agenda through a genetic algorithm provides autonomy and helps carry out the discovery automatically.