Keyword extraction for unstructured documents
Abstract
Unstructured documents are free-form documents without a set structure, such as contracts, letters, articles, or memos. Keyword extraction is the automatic identification of keywords, the important words that describe the contents of a document. For unstructured documents, keyword extraction helps users search and classify any collection of documents, especially large ones.
At present, research on keyword extraction focuses only on text documents and relies on different approaches such as statistical, linguistic, or semantic analysis. These approaches produce relatively accurate results. However, used separately, they cannot fully exploit one another's advantages (the weight of each section or each document, linguistic features, or semantic features). As a result, keyword identification does not achieve high precision.
In this research, a text mining approach is proposed to better extract keywords from any unstructured document. Within this framework, an XML parser and several text mining and NLP (Natural Language Processing) techniques are used to preprocess documents and resolve their linguistic problems so that all keywords of the documents can be extracted. A method for ranking candidate keywords according to their importance is then presented. The application developed from this approach is also described in this thesis.
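As a rough illustration of the pipeline outlined above (parse, preprocess, rank), the following minimal Python sketch scores words by section-weighted frequency. It is not the method developed in this thesis: the tag names `title` and `body`, the `SECTION_WEIGHTS` values, the `STOPWORDS` list, and the `rank_keywords` function are all assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stopword list and section weights; not taken from the thesis.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "this", "by"}
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0}


def rank_keywords(xml_text, top_n=5):
    """Rank candidate keywords by section-weighted term frequency."""
    root = ET.fromstring(xml_text)
    scores = Counter()
    # Each direct child of the root is treated as one section.
    for section in root:
        weight = SECTION_WEIGHTS.get(section.tag, 1.0)
        text = "".join(section.itertext()).lower()
        # Simple preprocessing: alphabetic tokens, no stopwords, length > 2.
        for token in re.findall(r"[a-z]+", text):
            if token not in STOPWORDS and len(token) > 2:
                scores[token] += weight
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


sample = """<document>
<title>Lease contract</title>
<body>This contract is signed by the tenant. The tenant pays the lease monthly.</body>
</document>"""

print(rank_keywords(sample, top_n=3))
# [('contract', 4.0), ('lease', 4.0), ('tenant', 2.0)]
```

In this sketch a word in the title counts three times as much as a word in the body, which is one simple way to combine section weight with a statistical score. Linguistic or semantic features would require additional steps that are not shown here.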