Machine Learning in Cardiac Diseases - Comparison of Performance of Selected Algorithms

Open Access
- Author:
- Lee, Beom Ki
- Graduate Program:
- Industrial Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- November 12, 2021
- Committee Members:
- Steven James Landry, Program Head/Chair
Soundar Kumara, Thesis Advisor/Co-Advisor
Paul Griffin, Committee Member
- Keywords:
- Machine Learning
Healthcare
Classification
Supervised Learning
Prediction
Data Analysis
Data Science
Cardiovascular
- Abstract:
- Until recent years, researchers and practitioners in healthcare faced stumbling blocks in their efforts to fully utilize and benefit from data, which was difficult and expensive to obtain and filter. These hardships of data collection and storage have largely been eased by advances in both medical and data analytics technologies. With an abundance of newly available data and analytical tools, the question has become which tools are most effective and appropriate for different medical applications. Machine learning research has made significant strides in recent years across numerous domains, including healthcare, finance, transportation, manufacturing, and advertising. These innovations turn large, assorted blocks of data into information from which useful insights can be derived, enabling data-driven decision making. Across these domains, machine learning algorithms now contribute an additional layer of valuable data analysis capability. The purpose of this research is to analyze three cardiac condition/disease-related datasets using five supervised machine learning classification algorithms: logistic regression, decision tree, random forest, K-nearest neighbor, and multilayer perceptron. The chosen datasets vary in size, time period, attributes, conditions, and source in order to avoid drawing conclusions from a single dataset, thus minimizing the bias potentially introduced by small-dataset uniformity. Each dataset was analyzed with the complete set of attributes and with a reduced set, allowing performance comparison within the same algorithm. Model performance is evaluated by prediction accuracy, measured as the area under the receiver operating characteristic curve (AUC), and by model training computational time. The comparative results revealed the superiority of the random forest, K-nearest neighbor, and multilayer perceptron classifiers in prediction accuracy, with computational time as the trade-off. The growth of computational time with dataset size was most noticeable in the random forest and K-nearest neighbor models; this can be addressed by using reduced models with subsets of variables. The multilayer perceptron showed a steadier increase in computational time when analyzing the larger dataset than the random forest and K-nearest neighbor models. Furthermore, the reduced models remained competitive with their full-model counterparts in both prediction accuracy and computational time. Lastly, while the logistic regression and decision tree models consistently delivered inferior prediction accuracy, their low computational cost suggests they could serve as baseline models.
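
The abstract's comparison workflow can be illustrated with a minimal sketch: the five named classifiers are trained on a binary cardiac-outcome dataset, and each is scored on AUC and wall-clock training time. This is not the thesis code; it assumes scikit-learn and substitutes a synthetic dataset for the cardiac datasets described above, with illustrative hyperparameters.

```python
# Minimal sketch (assumed setup, not the author's implementation): compare the five
# classifiers from the abstract on AUC and training time using a synthetic stand-in
# for a cardiac dataset.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder for a cardiac condition/disease dataset with a binary outcome.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "k_nearest_neighbor": KNeighborsClassifier(n_neighbors=15),
    "multilayer_perceptron": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)                 # training computational time measured here
    train_time = time.perf_counter() - start
    scores = model.predict_proba(X_test)[:, 1]  # class-1 probabilities for the ROC curve
    auc = roc_auc_score(y_test, scores)         # prediction accuracy as AUC
    print(f"{name:22s}  AUC={auc:.3f}  train_time={train_time:.2f}s")
```

The "reduced model" comparison described in the abstract would follow the same loop after selecting a subset of attributes (for example, by feature importance or domain knowledge) before fitting each model.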