Appear in the 6th International World Wide Web Conference
Apr. 7-11, 1997, Santa Clara, California USA

A Multi-Engine Search Tool with Clustering*

Chia-Hui Chang and Ching-Chi Hsu
Department of Computer Science and Information Engineering
National Taiwan University, Taipei 106, Taiwan
{chia, CCHsu}@ails1.csie.ntu.edu.tw

Abstract

The dozens of existing search tools and the keyword-based search model have become the main issues of accessing the ever growing WWW. Various ranking algorithms, which are used to evaluate the relevance of documents to the query, have turn out to be impractical. This is because the information given by the user is too few to give good estimation. In this paper, we propose a new idea of searching under the multi-engine search architecture to overcome the problems. This includes clustering of the search results and extraction of co-occurrence keywords which with the user's feedback better refine the query in the searching process. Besides, our system also provides the construction of the concept space to gradually customize the search tool to fit the usage for the user at the same time.

Keywords: WWW, information extraction, clustering, customizable, user feedback
____________
*This project is partially supported by Institute for Information Industry, Taiwan.

1. Introduction

As it has been arguing that the standard Web search services are far from ideal, many researchers are seeking for a better way to tackle the ever growing WWW. Technologies include software agents, multi-engine search, and distributed cooperating, etc. The intelligent information retrieval agents expect to extract concerning information for users by building an agent theory with features like autonomous, self-learning, planning, customizable, and communication [EW94][HL95]. And the multi-engine search utilize the various search engines with the consideration that none of the search services is broad enough to cover the whole World Wide Web [SE95][Dre95]. Also, research into the distribution of indexing load as in Harvest system [BDH+94] and information retrieval protocols and standards like Z39.50 [KG94] also try to share the indexed data in a tightly-organized corporation among servers.

Each of the applications has its speciality to make up the two main insufficiency of current search tools.

First, many search tools rely on robots to discover information published on-line, but the contents of WWW are too massive to be indexed in a single server. Even if we do so, the single server will be overloaded to serve queries from all over the world, and the network delay and traffic will increase remarkably.
Second, the inverted index database used by most of the current search tools, though efficient and appropriate for keyword-based search model, is not enough to measure the relevance of documents to a specific query for users. This is because most users have little knowledge about the indexed keywords in the database and even the domain they are searching. Thus, little information is provided in the query sentences and with the few information given by users, the relevant evaluation may lose their effects.

To solve the first problem, we advocate the deployment of specialized search engines based on geographic locations to distribute the indexing load and the construction of an integrated gateway to these databases are used to provide a way of integration (and possibly incorporation). While the idea is not new, we can see that with more and more searchable tools getting online and the low overlapping of the search results from existing search tools, we do benefit from the adoption of mult-engine search architecture. And with experience from MetaCrawler [SE95] and SavvySearch [Dre95], the reasonable response shows the possibility of actual use in practice. More than that, releasing ourselves from the burden of indexing the web, we can then focus on the post-processing of the retrieved data of the second problem.

Comparing to the first problem, researches that deal with the second problem are not as promising as the first one. Most search tools still use keywords to specify queries. However, according to Pinkerton's experiment [Pin94], the average number of keywords specified in each query is only 1.5, which definitely leads to thousands of "related" documents. Since some keywords are manifold (e.g., "Java" as a computer language and as a kind of coffee), irrelevant documents are included in the search results and reduce the precision. This situation is further worsen for those keywords that encompass too broad of the search space (e.g., "computer"). Though various ranking algorithms are used to evaluate the relevance of documents, it turns out to be meaningless. To address this problem, category hierarchy of the database of the search engine are provided to narrow down the search space (as in Yahoo and Infoseek). However they demand a lot of man force in the categorization and users are easily confused through the big classification. In fact, the lack of a scalable, customizable references to provide synonyms or related topics for users become the main issue of the problem.

In this paper, we present a new search process to overcome the problems mentioned above: a provision of related words through the extraction of search result, the feedback of users interests to refine the query and a classification of the search results. The insight is that the information about the query that are not specified in the user's input may be embedded in the results of the first query. If we can extract related keywords regarding this query, users can then easily identify the domains relating to their searching and give more information as the searching going on. On the other hand, instead of having users browsing through each singular hypertext, classifying the search results into clusters gives first impression about the hypertexts. This process spotlights on the filtering and grouping of data and saves users from the work of infinite browsing. Therefore, through the feedback of the user in the searching process, a corresponding concept space to the user's knowledge domain is constructed. And the information is reused for related queries in the later process.

2. System Architecture

The implementation of the system is based on the multi-engine search architecture. Under this architecture, the system receives a user's query as a list of keywords and redirect the query to other search engines (which includes AltaVista, Infoseek, Lycos, WebCrawler, Magellan etc.). While collecting data from the internet, the system collates the data and presents the results of clustering to inform users of the domain they are searching for. The user can then refine the query to direct the searching on what they want precisely. The complete process of the system can be broken down into following steps, and shown in Fig. 1 for reference:

1. Collect the object-attribute data matrix.

2. Compute the resemblance and classify the documents.

3. Extract related words as reference.

4. Display the clustering results to the user and get the user's feedback.

5. Refine the query with user feedback.

6. Repeat step 1-5

In the following sections, we will examine each component of the system in detail.

2.1 Obtaining Data Matrix

For queries new to the system, they are directed to out-side search engines with the display form set to "detail" format (in AltaVista) or " summary" mode (in WebCrawler). The returned results show where in the internet the related documents exist and the terms that appear in the description are taken as the attributes of the documents. And if the time is available, the original contents of the documents are retrieved from internet. Then with characteristics of hypertexts, words from the page titles, headings, anchor hypertexts, list items, italic and bold face forms are extracted as the representatives of the documents' vectors.

On the other hand, to manipulate various output form of the search engines, the repeat pattern of HTML tags is discovered to find the individual URLs and related information in the search results. For example, when a new search engine is added, the system generate a general query, say "WWW", and examine through the tags of the search results to find the repeat patterns to get the respective URLs. Then, more queries are generated to test the correctness of the pattern and also to find the way to retrieve the subsequent results from the search engine.

2.2 Initial Assignment of Documents

In addition to the extraction of keywords from headings or high-lighted texts to be attributes of each document, the hyperlinks in the documents also provide useful information in the clustering. From the experiment, it shows that from an average of 100 URLs, over 30 documents come from the same server (Fig. 2). This is because most of the search engines index at least one to two level deep of an WWW server after the discovery. And since most of the WWW servers focus specifically on some topics, more than one document are usually indexed in the inverted database and retrieved together during the keywords matching. Thus, when someone use search tools to find out where in the internet the related information hides, the option to group documents from the same IP address (server) together will be useful.

Another useful feature of the hypertexts based WWW information browsing are the hyperlinks, which are intended to be used to link documents that are content-related. In our experiments, when the actual content of an URL is retrieved from the internet, the hyperlinks between the documents (refer to as cross links in the following context) are as high as 20% (Fig. 3). These existing cross links between documents, given with human-identification of association, provide an good initial classification of hypertexts at the start, especially for common terms used in different domains.

The anchor information is also used in WISE [YL96] system as similarity measure in ranking algorithms as well as grouping of linked documents. In our /system, we use them as initial assignment of hypertexts to clusters. And despite the overlap of the two optional grouping, up to 40% of the documents can be grouped without resemblance computation between documents. The grouped cluster is then represented by terms that appear in the documents of that cluster.

2.3 Resemblance Computation

The actual meaning of the data matrix represents the position or vector of each object under the vector model. To compute the resemblance coefficient between two objects, (where objects are represented by a vector of attributes, or terms here), various functions such as the Euclidean distance and the cosine value of the object vectors are the commonly used ones in the clustering literature [Cor71]. And in the field of information retrieval, the TFxIDF, which gives weights to attributes of objects, becomes the mostly used similarity measures. In this paper, we propose an extension of TFxIDF measure to include learned information from a series of queries. For a new query q, the similarity measure between cluster c(i) and c(j) are then represented as R(c(i), c(j)):

where:

TF(i,t) is the relative frequency of term t in cluster c(i),
DF(t,q) is the document frequency of term t in the whole collection of query q,
QF(t) is the frequency of term t appeared in queries so far.

The point is to give higher weights to terms which occur frequently in documents of the query domain, and to those that are unique to other queries. Noted that the weighting of terms DF(t,q)/QF(t) is also adapted to filter related keywords in a query through the users feedback. In which we dynamically re-weight the terms to inflect their significance.

Readers can see here that the resemblance computations are not between the query and documents, such as to give an sorted results of the thousands of documents, but are between documents to give the references for the clustering. The reason is that with average 1.5 query keywords, the ranking of documents to the query can not distinguish well the relevance for the user. On the contrary, resemblance computations between documents with the attached terms do give good indication in the clustering process. And the presentation of classified results shows the advantage of making the query domain more organized.

2.4 Hierarchical Clustering with Iterative Approach

In the practical operation, due to the availability of data from the internet, we need an clustering method differing from the classical ones. [FO95] With contents coming from the internet on the fly, an iterative approach must be designed to give partial results to the user during the searching process. Given the total number of documents to be got and the number of phase to complete the job (denoted by N and P, respectively), the whole process we construct proceeds as follows: at each phase k, k from 1 to P, once N/P documents are collected, the clustering procedure is activated to accomplish the classification (reference to Fig. 4).

At the beginning of the clustering procedure, each new retrieved document is assigned to a proper existing cluster or a new cluster in cluster set C(0) under the principle of initial assignment of documents. Then, the similarity measures between clusters in cluster set C(i) are computed to group clusters with the highest |C(i)|*r similarity measures to construct an hierarchical structure (where r is the grouping ratio discussed below).

There are two points to be noted here. In the processing of iterative clustering, we first want to make sure that each of the documents is compared to others to make sure that the later documents with higher similarity measure are grouped as well. Second, the computation of resemblance between clusters in cluster set C(i) are followed to group related clusters to form cluster set C(i+1). This corresponds to the computation of similarity between centroids or centers of clusters and give better measures of the resemblance. As for the ratio of grouping, r, it is chosen from 0.4 to 0.6 to give a better division of the whole category. In our experience, two to three major groups are generated to show the main topics of the domain, while others are considered as minor groups, and can not be furthur classified due to the shortage of data attributes.

2.5 User Interface and Feedback

For each query, the system shows the clustering results and highly weighted terms as references to the query domain. The user can then feedback interested topics to refine the weighting of terms and the query (Fig. 5). The interface also shows that documents of a cluster are grouped because some common terms appear in the documents. This helps the user easily identify what the cluster is about without browsing through all the contents inside.

The feedback of the selected topics also help to customize the knowledge base, since there is a percentage of negligible words, like people names or mistyped words that are not common to other queries and are weighted high in the weighting. But with the user's feedback, we can identify what are the terms to be stressed and what are the ones that are irrelevant to the query through the re-weighting procedure in supervised mode: for each term t in the user's feedback, DF(t,q) is doubled to enforce the significance of the term in the query q. And for each term t not in the feedback, QF(t) is doubled by the same way to reduce its weight. In this way, feedback of the same keywords can be repeated to strengthen their significance.

While people may ask why not use a thesaurus like Roget's [Rog] as references for the user, they can soon discover that the information provided by the thesaurus are more like the synonyms and is not suitable for using here. And the extraction of information from the query results, a kind of post-processing, give better indication.

As for the elimination of mistyped words, the spell tool on UNIX is used to pick out the words not included in the dictionary. With only one exception that words spelled in uppercase are kept to avoid the case of proprietary keywords. Though not all errors are deleted in this way, most mistyped words can be kicked out in the feedback phase of the user.

Fig. 5 The User Interface**

3. Customizable Concept Space Construction

The clustering results are a good idea, but it will show their significance only when the weighting of terms are well-assigned and each document (object) is well-coupled with its terms (attributes). On the other hand, since clustering of the documents only show syntactical resemblance, while semantic meaning is given through the interpretation of humans. Thus the goal at this moment is find a mechanism to classify each document to a proper category of human knowledge.

Let's consider the case of library system where knowledge is classified into hierarchical categories by human. Even in the system of World Wide Web, two popular servers have provided the human-classified categories -- Yahoo and Infoseek. These well classified hierarchies give us an initial prototype of knowledge. What we need to do is to find out the good attributes (or corresponding terms) of each category. Then at the classifying process of a document, we simply compute the similarity coefficient between the categories and the document and choose one with the highest resemblance. The former part is what we called the supervised training process and the second part is the unsupervised automatic classification.

3.1 Representing the categories

To represent each category at the end node in the hierarchy, the documents that have been classified to the category are compared to retrieve the common terms to be the attributes for the category. And for those non-end node categories, the representatives are the combination of their sub-categories. We regard this process of comparing and retrieving as the training process for categories construction and also for term weights assignments. Thus the concept space is a category-based hierarchical structure derived from the human classification of knowledge like library system or Yahoo categories. (However, the user can rearrange the classification through the interface of Java Applet.)

3.2 Query Processing and the Query Profile

At the classification phase of each query, we first search the category hierarchy for the appearance of keywords in the query. If some categories are matched, the documents in the search results are classified using resemblance computation. If not, the sub-categories of the root are chosen as the target categories to be assigned . In this way, each document is associated with the corresponding category.

While the user decides the categories that he is interested in, the documents are filtered. Then the definition of the query is associated with the chosen categories and are kept in the "query profile" for the user. This feedback on the classification results not only refine the query but also define the searching categories for subsequent search.

3.3 Customize the Category Hierarchy

In the practical usage, each system requires different classification of the knowledge domain. As far as the user is considered, the user does not need a complex hierarchy like the classification of library system. Thus a customizable category hierarchy is provided with a Java Applet user interface and updated in a directory-based "category profile" for each user.

4. Conclusions and Future Work

In this paper, we have proposed a new search process to overcome the issues in the current WWW search tools. This post-processing and clustering idea are the main contribution of our system. In this way, major topics in the query domain are highlighted to illustrate the general outlook of the query. And the construction of the concept space together with the user's feedback will gradually customize the system to fit the usage for the user.

The clustering idea, besides the application in the multi-engine search architecture, can be applied to any singular search tool as well. In fact we have applied it to AltaVista and Magellan respectively. However, the multi-engine search architecture has the advantage of increasing the availability and recall. As we can see from the experiments that the average overlapping of hypertexts from five search databases (i.e. AltaVista, Infoseek Lycos, WebCrawler, Magellan) is below 10% (Fig. 6). And the temporary denial or long delay of one search engine can be easily supplanted by other ones, showing the advantage of this architecture.

The current version of the system is implemented using the standard Common Gateway Interface (CGI) and can be accessed through any WWW browser. However, it is better to say this service working as an personal assistance for the user rather than a service to the public. Since the storage subsystem is trainable with the users feedback, it is best to have a client version of this system. Thus, we are working on an client version of the system as the job working on.

The recording of the "query profile" and the "category profile" are prepared not only for the ordinary search but also used as a reference for the filtering of any on-line magazines. That is, the server can highlight the out links of a homepage according to the profiles to help the user through the browsing of the homepage. In addition, the system can periodically retrieve related information from new published data based on the analysis of the two profiles. These additional functions can be achieved with the technology of data mining which is a new addressed topics in the recent research.

**The prototype of our system is also available at http://ails6.csie.ntu.edu.tw/~chia/air.html

References

[BDH+94] C.M. Bowman, P.B. Danzig, D.R. Hardy, M.F. Schwartz, "Scalable Internet Resource Discovery: Research Problems and Approaches," Comm. of ACM, Vol. 37, No. 8, pp.98-107, 1994
[BDH+94] C. Mic Bowman, P.B. Danzig, D.R. Hardy, U. Manber and M.F. Schwartz, "The Harvest Information Discovery and Access System", Techical Report at U. Colorado.
[Cor71] R.M. Cormack, "A Review of Classification," Journal of the Royal Statistical Society, Series A. 134, pp. 321-353
[Dre95] Daniel Dreilinger, "Integrating Heterogeneous WWW Search Engines", May 1995. Ftp://132.239.54.5/savvy/report.ps.gz
[E95] Learning to understand information on the Internet. (IJCAI-95).
[EW94] O. Etzioni, D. Weld, "A Softbot-Based Interface to the Internet," Comm. of ACM, Vol. 37, No. 7, July 1994, pp.72-76 http://www.cs.washington.edu/research/softbots
[FO95] C. Faloutsos and D.W. Oard, "A Survey of Information Retrieval and Filtering Methods," University of Maryland, Technical Reports CS-TR-3514
[HL95] T.W.C. Huibers and B. van Linder, "Formalising Intelligent Information Retrieval Agents," Technical Report, Utrecht University.
[KG94] Q. Kong and J. Gottschalk, "Information Retrieval Standard and Applications," presented at BISCM'94 - Beijing. http://www.dstc.edu.au/RDU/reports
[Pin94] B. Pinkerton, "Finding What People Want: Experiences with the WebCrawler," Second International WWW Conference '94, July 1994, Chicago, USA, Oct. 1994. http://info.webcrawler.com/bp/WWW94.html
[Rij79] C.J. van Rijsbergen, Information Retrieval. Butterworths, London, England, 1979, 2nd edition.
[Rom84] H. C. Romesburg, Cluster Analysis for Researchers. Wadsworth, Belmont, California, 1984.
[SE95] E. Selberg, O. Etzioni, "Multi-Engine Search And Comparison Using The MetaCrawler," Proceedings of the Fouth World Wide Web Conference '95, Boston USA. Dec. 1995.
[YL96] B. Yuwono and Dik L. Lee, "WISE: A World Wide Web Resource Database System," To appear in IEEE Trans. on Knowledge and Data Engineering Special Issue on Digital Library.

URLs

[AltaVista] http://www.altavista.digital.com
[Magellan] http://searcher.mckinley.com
[Infoseek] http://ultra.infoseek.com
[Lycos] http://lycos.cs.cmu.edu.tw
[WebCrawler] http://webcrawler.com
[Roget's] Roget's Thesaurus: http://www2.thesaurus.com/thesaurus
[WISE] http://www.cs.ust.hk/cgi-bin/IndexServer
[MetaCrawler] http://www.metacrawler.com