TY - JOUR
T1 - Dimension Reduction and Classifier-Based Feature Selection for Oversampled Gene Expression Data and Cancer Classification
AU - Petinrin, Olutomilayo Olayemi
AU - Saeed, Faisal
AU - Salim, Naomie
AU - Toseef, Muhammad
AU - Liu, Zhe
AU - Muyide, Ibukun Omotayo
N1 - Publisher Copyright: © 2023 by the authors.
PY - 2023/6/27
Y1 - 2023/6/27
N2 - Gene expression data typically have a large number of features, some of which are irrelevant or redundant. In some cases, however, all features, despite being numerous, are highly important and contribute to the data analysis. Similarly, gene expression data sometimes have few instances and a high rate of imbalance among the classes. This can limit the exposure of a classification model to instances of different categories, thereby affecting the performance of the model. In this study, we proposed a cancer detection approach that combined data preprocessing techniques, such as oversampling and feature selection, with classification models. The study used SVMSMOTE to oversample the six examined datasets. Further, we examined different techniques for feature selection, using dimension reduction methods and classifier-based feature ranking and selection. We trained six machine learning algorithms using repeated 5-fold cross-validation on different microarray datasets. The performance of the algorithms differed depending on the dataset and the feature reduction technique used.
KW - cancer classification
KW - gene expression
KW - machine learning
KW - microarray data
KW - sampling methods
UR - https://www.open-access.bcu.ac.uk/14552/
U2 - 10.3390/pr11071940
DO - 10.3390/pr11071940
M3 - Article
AN - SCOPUS:85166178221
SN - 2227-9717
VL - 11
SP - 1
JO - Processes
JF - Processes
IS - 7
M1 - 1940
ER -