Direct multidisplay for web document repositories

Gu, Zhong
Electrical and Computer Engineering

As the popularity of the Internet grows, so does the amount of information available on it. Search engines are important tools that help people retrieve information of interest from this huge number of documents. However, currently used search engines return long lists of URLs. Users have to click each URL to download the actual document and check its content, then click the back button to try other URLs if the answer is not found in the current document. This process is both labor intensive and time consuming. Multibrowser, a program that addresses this problem, is presented in this thesis. Multibrowser combines the advantages of multidisplay and direct display to present a more efficient user-computer interface. First, the system downloads the actual documents according to the list of URLs returned by a standard search engine and saves the documents on the local disk. Second, the system converts the documents into n-gram vectors and clusters them into three groups according to those vectors. Third, each document is assigned a color according to its position relative to the cluster centroids, and each paragraph is linked to other paragraphs with similar content. Last, the documents are presented to the user using Multidisplay, where each document has a corresponding color bar. Users can look through the content of several documents at the same time and judge the similarity among them from the color bars; they can then retrieve the paragraphs most similar to a given paragraph by clicking its "find similar" link. In addition, an investigation of hash tables for our text processing shows that the hash table size has to be chosen very carefully to avoid an undesirably large collision probability. Some good values are suggested based on our experiments.
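
The vectorize, cluster, and color steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the use of character trigrams, the CRC32 hash, the table size of 1021, plain k-means seeded with the first three vectors, and the inverse-distance RGB mapping are all assumptions chosen for illustration.

```python
import math
import zlib


def ngram_vector(text, n=3, size=1021):
    """Hash the character n-grams of `text` into a count vector of length `size`.

    Assumption: `size` is an arbitrary prime picked for this sketch, not a
    value recommended by the thesis. The result is L2-normalized.
    """
    vec = [0.0] * size
    cleaned = " ".join(text.lower().split())  # normalize case and whitespace
    for i in range(len(cleaned) - n + 1):
        gram = cleaned[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % size] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k=3, iterations=20):
    """Plain k-means, seeded deterministically with the first k vectors.

    Requires len(vectors) >= k. Returns (centroids, labels).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def color_for(vector, centroids):
    """Map closeness to three centroids onto an (R, G, B) triple.

    Assumption: inverse-distance weighting, one color channel per centroid.
    """
    weights = [1.0 / (1e-9 + math.sqrt(sq_dist(vector, c))) for c in centroids]
    total = sum(weights)
    return tuple(round(255 * w / total) for w in weights)
```

With at least three downloaded documents in a list `docs`, one would call `vectors = [ngram_vector(d) for d in docs]`, then `centroids, labels = kmeans(vectors)`, and finally `color_for(v, centroids)` for each vector to obtain its color bar. Because n-grams are hashed into a fixed-size table, distinct n-grams can collide in the same slot, which is why the choice of table size matters.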

2001