dc.description.abstract | High-utility itemset (HUI) mining has long been an active research
topic in data mining, with applications in areas such as e-commerce, streaming analysis, and
bioinformatics. HUI mining can be considered a generalization of the
frequent itemset mining (FIM) problem, which focuses on extracting items that frequently
appear together (frequent itemsets) in a transactional database. Traditional FIM
algorithms consider only the number of times a given itemset appears in a database and
neglect other valuable information associated with its items, such as quantities or unit
profits. As a result, they often report itemsets that generate low profits
but appear in the database often enough to qualify as frequent itemsets, and they can
overlook less frequent high-utility itemsets (HUIs).
Although HUI mining addresses this limitation, choosing an appropriate minimum utility
threshold remains one of its biggest drawbacks. To address
this problem, top-k HUI mining (topKHUIM) was introduced. This approach requires only a value k,
the number of highest-utility itemsets the user wants to find, and forgoes the need for a
user-defined utility threshold. Given k, the algorithm not only determines an initial utility
threshold but also updates the threshold at each stage of the mining process. However, just like most
traditional HUI mining algorithms, the currently available top-k HUI mining algorithms still
have drawbacks when dealing with massive datasets.
In this thesis, I propose a modified version of the existing TKHUIM algorithm that
includes several additions and optimizations so that it can be used to mine HUIs in massive
datasets. These modifications include transaction merging for the projected transactions in
partitions, external sorting for the corrected input database, and the use of plentiful hard drive
storage to hold the partitions instead of relying on fast but limited main memory.
A later chapter of this thesis compares the results of the proposed top-k HUI mining
algorithm with those of the existing ones. | en_US |