Processing missing values for rule induction algorithms
Abstract
Missing values are very common in data acquisition. When handling data sets that
contain missing values, classification methods encounter difficulties in their learning process.
Several pre-processing techniques, such as data replacement or data imputation, have
been introduced to remove missing values before the data are processed by classification
methods. However, when the percentage of missing values in the data set rises
(sometimes to 60-70%), such pre-processing techniques can no longer be applied successfully.
This thesis introduces new versions of the CN2 and Rules6 algorithms, which represent
the Separate and Conquer method in rule induction. The new versions are able to
handle missing values directly and to process data sets with large percentages of
missing values. Tested on benchmark data sets from UCI, the new algorithms
achieved better performance than the popular methods Decision Table and C4.5
in directly handling missing values. These new algorithms also achieved significant
results when compared with one common pre-processing method for data sets with
missing values: filling each missing value with the mode or mean of its attribute.
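
For illustration, the mode/mean filling baseline mentioned above might be sketched as follows. This is a minimal sketch and not the implementation used in the thesis: the function name impute_mode_mean, the use of None as the missing-value marker, and the rule "mean for numeric attributes, mode for all others" are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumed marker for a missing value (hypothetical choice)


def impute_mode_mean(rows):
    """Return a copy of rows with each missing cell filled per attribute.

    Numeric attributes are filled with the mean of their observed values,
    all other attributes with the mode (most frequent observed value).
    """
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        if not observed:
            # Nothing to estimate from; the attribute stays missing.
            fills.append(MISSING)
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in observed):
            fills.append(mean(observed))
        else:
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fills[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


# Small example: one numeric attribute and one nominal attribute.
data = [[1.0, "a"], [None, "a"], [3.0, None], [2.0, "b"]]
filled = impute_mode_mean(data)
```

Because every missing cell of an attribute receives the same fill value, little real information remains once most of an attribute is missing, which is consistent with the limitation at high missing-value percentages noted above.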