TY - JOUR
T1 - A survey on missing data in machine learning
AU - Emmanuel, Tlamelo
AU - Maupong, Thabiso
AU - Mpoeleng, Dimane
AU - Semong, Thabo
AU - Mphago, Banyatsang
AU - Tabona, Oteng
N1 - Funding Information:
This work received a grant from the Botswana International University of Science and Technology.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
AB - Machine learning has been a cornerstone of analysing and extracting information from data, and missing values are a problem often encountered along the way. Missing values are commonly classified by mechanism as missing completely at random, missing at random, or missing not at random. They may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. The literature contains several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations and the kind of data they are most suitable for. We propose and evaluate two methods: k nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris dataset and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and k nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
UR - http://www.scopus.com/inward/record.url?scp=85117892894&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117892894&partnerID=8YFLogxK
U2 - 10.1186/s40537-021-00516-9
DO - 10.1186/s40537-021-00516-9
M3 - Article
AN - SCOPUS:85117892894
SN - 2196-1115
VL - 8
JO - Journal of Big Data
JF - Journal of Big Data
IS - 1
M1 - 140
ER -