Building a web document modelling tool
Abstract
Web document modelling plays an important role in many fields, including data mining and information retrieval, with applications in search engines, web rating systems, and web recommendation systems. Web documents vary widely in content and representation, so a common model is needed for information to be extracted effectively and efficiently. This study aims to create an efficient web document modelling scheme from which systems such as ranking systems, text classification systems, and web recommendation systems can benefit. The scheme is based on the vector space model (VSM) with TF-IDF weighting. Cosine similarity is used to measure the similarity between documents, and WordNet is used to help interpret queries. Two modelling schemes are tested: one that does not use WordNet, and one that uses WordNet for query expansion. The results are evaluated using precision computed over the top 20 search results. [1] Precision is around 80% for simple queries and lower for more complex ones. The WordNet-based method works better with long queries. Search time ranges from 0.2 to 0.5 seconds.
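To make the pipeline described above concrete, the following is a minimal sketch of TF-IDF weighting, cosine-similarity ranking, optional query expansion, and precision over the top k results. It is an illustration under stated assumptions, not the implementation used in this study: the tokenizer is a naive whitespace split, the weight is raw term frequency times log(N/df) (the exact TF-IDF variant is not specified here), and the hand-written `synonyms` dictionary is a hypothetical stand-in for a WordNet lookup. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption: real preprocessing
    # such as stemming or stop-word removal is not modelled here).
    return text.lower().split()


def build_idf(docs_tokens):
    # Inverse document frequency: log(N / df) for each term in the corpus.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unknown to the corpus are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(tokens, synonyms):
    # Append synonyms of each query term. The synonyms dict is a
    # hypothetical stand-in for a WordNet synset lookup.
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(synonyms.get(t, []))
    return expanded


def search(query, docs, synonyms=None, k=20):
    # Rank documents by cosine similarity to the query and return the
    # indices of the top-k documents with a non-zero score.
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(t, idf) for t in docs_tokens]
    q_tokens = tokenize(query)
    if synonyms:
        q_tokens = expand_query(q_tokens, synonyms)
    q_vec = tfidf_vector(q_tokens, idf)
    scored = [(cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0.0]


def precision_at_k(ranked, relevant, k=20):
    # Fraction of the top-k positions filled by relevant documents.
    # Dividing by k (not by the number returned) penalises short result lists.
    hits = sum(1 for i in ranked[:k] if i in relevant)
    return hits / k


docs = [
    "cheap car rental deals",
    "automobile repair and service",
    "fresh fruit market prices",
    "car insurance quotes",
]
plain = search("car", docs)
expanded = search("car", docs, synonyms={"car": ["automobile"]})
print(plain, expanded)
```

In this toy corpus the unexpanded query cannot reach the document that only mentions "automobile", whereas the expanded query can, which is the effect query expansion is intended to have. Whether expansion helps or hurts precision on a real collection depends on the queries, as the results reported above indicate.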