Journal of Emerging Trends in Engineering and Applied Sciences: Research Areas

Abstract:
Data classification is an important task in the KDD (knowledge discovery in databases) process, and it has several potential applications. The performance of a classifier depends strongly on the data set used for learning. In practice, a data set may contain noisy or redundant data items and a large number of features, many of which may not be relevant to the objective function at hand. Such noisy data may degrade the accuracy and performance of classification models. Dealing with missing values during data pre-processing is therefore an important step in building an effective and efficient classifier. It is a process by which missing values are replaced by suitable values according to an objective function, or by which noisy data are filtered out. It leads to better performance of the classification models in terms of their predictive or descriptive accuracy, reduced computing time needed to build the models because they learn faster, and better understanding of the models. In this paper, the effect of missing values on data classification is studied. A comparative analysis of data classification accuracy in different scenarios is presented. Several search techniques for feature selection are considered in the study and applied to pre-process the data set. The predictive performances of popular classifiers are compared quantitatively. After analysing the experimental results, the paper establishes the general concept that replacing missing values improves classification accuracy. The purpose of this research is to maintain the highest possible classification accuracy in the presence of missing values.

Keywords: data mining, feature selection, missing values, knowledge discovery in databases.
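
As an illustration of the missing-value replacement step described in the abstract, the following is a minimal sketch in Python. The abstract does not state which replacement strategy the paper uses, so column-mean imputation is assumed here purely as an example; the function name impute_column_means and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def impute_column_means(rows):
    """Return a copy of rows in which each None is replaced by its column mean.

    Assumption: every feature is numeric and None marks a missing value.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values falls back to 0.0 (an arbitrary choice).
        col_means.append(mean(observed) if observed else 0.0)
    # Build new rows so the input data set is left unchanged.
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data set: two numeric features, None marks a missing value.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]
filled = impute_column_means(data)
```

The filled data set could then be passed to any classifier, and its accuracy compared with the accuracy obtained when incomplete rows are simply discarded.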
