Building a drug dictionary
Abstract
The aim of this thesis is to build a drug dictionary that, whenever a user enters
a keyword from a drug name or drug usage, presents two types of results: relevant
drugs found by applying Text Classification, Text Clustering and the Vector Space
Model; and keyword matches retrieved from the database with SQL statements.
First, the drug information in the database is divided into k groups by Text
Clustering, based on the similarity between objects. Next, SQL queries return all
records that match the input keyword. The system also calculates and presents the
dominant group among the results of this database search. The dominant group is
the group that occurs most often among those results.
Then, in the Text Classification step, the K-Nearest Neighbor algorithm finds the
K most similar drugs in the dominant group. As a result, users receive both the
keyword matches from the database search and the relevant drugs found by Text
Mining. TF-IDF is the method used to convert every sentence of drug information
into a Vector Space Model representation, in which each element of a vector is a
weight. The similarity between two vectors, which serves as the distance measure,
is calculated with Cosine Similarity. Data pre-processing before mining, another
important step, is also described in detail. Finally, the tools and resources
used to build the dictionary are introduced.
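
The pipeline summarized above can be illustrated with a minimal sketch. It is
not the thesis implementation: the toy records, the pre-assigned cluster labels
(standing in for the Text Clustering output), the substring match (standing in
for the SQL query) and every function name are assumptions made only for
illustration. The TF-IDF variant used here (relative term frequency times
log(N/df) + 1) is one common choice among several.

```python
import math
from collections import Counter

# Hypothetical toy data: (drug name, usage text, cluster label).
# The labels stand in for the groups produced by the Text Clustering step.
DRUGS = [
    ("Paracetamol", "relieves mild pain and reduces fever", 0),
    ("Ibuprofen", "relieves pain fever and inflammation", 0),
    ("Aspirin", "relieves pain and reduces fever and inflammation", 0),
    ("Amoxicillin", "treats bacterial infection", 1),
    ("Cetirizine", "relieves allergy symptoms", 2),
]


def tokenize(text):
    """Very simple pre-processing: lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency for every term seen in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector (term -> weight); unknown terms are ignored."""
    tokens = [t for t in tokenize(text) if t in idf]
    if not tokens:
        return {}
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def keyword_matches(keyword):
    """Stand-in for the SQL keyword search (substring match on both fields)."""
    kw = keyword.lower()
    return [i for i, (name, usage, _) in enumerate(DRUGS)
            if kw in name.lower() or kw in usage.lower()]


def dominant_group(match_indices):
    """Cluster label that occurs most often among the keyword matches."""
    counts = Counter(DRUGS[i][2] for i in match_indices)
    return counts.most_common(1)[0][0] if counts else None


def search(keyword, k=2):
    """Return (keyword matches, K most similar drugs in the dominant group)."""
    matches = keyword_matches(keyword)
    group = dominant_group(matches)
    idf = build_idf([usage for _, usage, _ in DRUGS])
    query = tfidf(keyword, idf)
    candidates = [i for i, drug in enumerate(DRUGS) if drug[2] == group]
    ranked = sorted(candidates,
                    key=lambda i: cosine(query, tfidf(DRUGS[i][1], idf)),
                    reverse=True)
    return ([DRUGS[i][0] for i in matches],
            [DRUGS[i][0] for i in ranked[:k]])


if __name__ == "__main__":
    matched, relevant = search("fever", k=2)
    print("Keyword matches:", matched)
    print("Relevant drugs: ", relevant)
```

In this sketch the query itself is turned into a TF-IDF vector and compared
with the drugs of the dominant group; how the query point for K-Nearest
Neighbor is actually chosen is not stated in the abstract, so this is an
assumption.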