Title: Classification of class-imbalanced medical data using data resampling and propensity score matching
Abstract:
Medical data are often class-imbalanced, which poses a challenge for classification in disease diagnosis, and researchers have frequently reported performance degradation caused by this imbalance. Because medical data combine information collected through various methods, including demographics, environmental factors, diagnostic tests, and subjective responses, addressing class imbalance is a crucial part of understanding the data's characteristics and underlying patterns. In this study, we applied Propensity Score Matching (PSM) to handle class imbalance at the data level while reflecting the unique characteristics of medical data. Classification was performed using six under-sampling methods, three over-sampling methods, two hybrid-sampling methods, and PSM on the ADNI data, the thyroid disease database, and the heart disease health indicators data, with imbalance ratios of 11.3, 12.0, and 9.6, respectively. We used support vector machine, logistic regression, XGBoost, and random forest as classification models and compared their AUC, AUPR, F1 score, and Matthews Correlation Coefficient (MCC). To confirm the effect of PSM, we also compared the means and frequencies of variables between classes using independent-samples t-tests and chi-square tests. Classification performance improved when PSM was applied to variables that did not differ between classes, but it deteriorated when a variable that differed between classes was included. PSM can therefore help build appropriate models that reflect the data distribution in classification and prediction studies on class-imbalanced medical data. Our study provides insight into finding techniques suited to the characteristics of medical data and extends existing epidemiological research by combining its techniques with machine learning.
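To make the idea of PSM as a data-level resampling step more concrete, the following is a minimal sketch and not the implementation used in the study. It assumes that the class label plays the role of the "treatment", that the propensity score is estimated with a plain logistic regression, and that matching is greedy 1:1 nearest-neighbour without replacement. The function names `fit_propensity` and `psm_undersample` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_propensity(X, y, lr=0.1, n_iter=2000):
    """Estimate P(y = 1 | X) with logistic regression fitted by gradient descent."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize covariates
    Xb = np.hstack([np.ones((len(Xs), 1)), Xs])          # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)                # gradient of the log-loss
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def psm_undersample(X, y):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    Assumption: class 1 is the minority class. Each minority sample is paired
    with the closest still-unmatched majority sample (no replacement).
    Returns the sorted row indices of all retained samples.
    """
    score = fit_propensity(X, y)
    minority = np.flatnonzero(y == 1)
    available = list(np.flatnonzero(y == 0))
    keep = []
    for i in minority:
        if not available:
            break
        dist = np.abs(score[available] - score[i])
        j = int(np.argmin(dist))
        keep.extend([i, available.pop(j)])
    return np.sort(np.array(keep))

# Synthetic, purely illustrative data with an imbalance ratio of 10.
rng = np.random.default_rng(0)
n_major, n_minor = 500, 50
X = np.vstack([rng.normal(0.0, 1.0, (n_major, 3)),
               rng.normal(0.5, 1.0, (n_minor, 3))])
y = np.concatenate([np.zeros(n_major), np.ones(n_minor)])

idx = psm_undersample(X, y)
X_bal, y_bal = X[idx], y[idx]
print("imbalance ratio before:", n_major / n_minor)
print("imbalance ratio after: ", (y_bal == 0).sum() / (y_bal == 1).sum())
```

In this sketch the matched subset `X_bal`, `y_bal` would then be passed to a classifier in place of the full training data. Which covariates enter the propensity model is a design choice; the findings reported above suggest it matters whether those variables differ between classes.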
Audience Take Away:
- The approach used in this study leverages the data distribution to improve model accuracy in classification and prediction studies on imbalanced medical data.
- The study offers ideas for finding techniques suited to the characteristics of medical data.
- Applying techniques established in epidemiology to machine learning can improve the performance of prediction models.