Insights on the use of Machine Learning to Predict Retention of Career Soldiers in the United States Army

Open Access
- Author:
- Garcia, Miguel
- Graduate Program:
- Data Analytics
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 14, 2022
- Committee Members:
- Colin Neill, Program Head/Chair
Partha Mukherjee, Thesis Advisor/Co-Advisor
Guanghua Qiu, Committee Member
Youakim Badr, Thesis Advisor/Co-Advisor
- Keywords:
- Artificial Intelligence
Machine Learning
AI
Deep Learning
Decision Tree
Logistic Regression
Random Forest
XGBoost
Artificial Neural Network
Deep Neural Network
U.S. Army
Military
Attrition
Retention
Person-Event Data Environment
PDE
Department of Defense
DoD
Army Analytics Group
AAG
Naive Bayes
Active Duty
Enlistment
Commissioned Officers
Soldiers
Area Under the Curve
AUC
Data Analytics
Retirement
MEDPROS
PHA
MEPCOM
DTMS
ATMS
ROC Curve
Variable Importance
Preprocessing
Data Quality
- Abstract:
- In the face of cyclic pressures to scale military forces up or down in response to changes in political power and the corresponding surge or withdrawal of forces from theaters of operation, the Department of Defense (DoD) stands to benefit from any technique that can provide insight into the prediction of retention versus attrition of U.S. Army soldiers. Artificial Intelligence (AI) is the presumptive solution and should serve as a major resource, but the military decision-making process has yet to fully capitalize on it, even though steps have been taken to lay the groundwork for large-scale implementation. Opportunities therefore abound for novel applications of recent advances in data analytics and neural learning, alone or in combination, to extract insight from large volumes of data.

Supported by the Army Analytics Group (AAG), this thesis uses the Person-Event Data Environment (PDE) to aggregate 55 distinct datasets within a secure, cloud-computing environment in order to build and compare the efficacy of various modeling techniques in R for predicting career-long retention of soldiers, both at time of entry and after three years of service: Decision Trees (rpart), Logistic Regression, Naïve Bayes, C5.0 Decision Trees, Random Forests, XGBoost, simple Artificial Neural Networks (nnet), and Deep Neural Networks (neuralnet), the last of which constitutes the military's first foray into Deep Learning in retention research. In the field of AI applied to predicting U.S. Army attrition, this research furthers the state of the art in breadth: the number of datasets accessed, the number of Active Duty soldier records considered within the time window, the number and variety of techniques implemented in a single study, the length of the prediction horizon (a full career rather than a single-term enlistment), and the consideration of both commissioned officers and enlisted soldiers in a single work.

The results show both the potential and the limitations of the available demographic, medical, deployment, and training data for predicting whether soldiers will serve a full career until retirement. With cohorts broken down by fiscal year and by officer versus enlisted, the relative performance of the techniques varied notably in Area Under the Curve (AUC): at most prediction points for both enlisted soldiers and officers (though less consistently for the latter), Logistic Regression proved the most effective, but XGBoost and Artificial Neural Network models could surpass it when hyperparameters were tuned for a specific cohort. When more variables became available after three years of service, ensemble techniques such as XGBoost and Random Forests sometimes outperformed Logistic Regression, as in the case of the commissioned-officer datasets with ancillary features added. While the performance increase after three years is marginal, the relative predictive strength on enlisted datasets at the time of enlistment is encouraging for future implementation by military recruiting. Ultimately, although Logistic Regression is consistent, at least relative to the other models, in predicting career-long attrition and retention from the data in the PDE, it does not yet produce AUC reliable enough to confidently justify changes to DoD policy or real-world recruiting and retention decisions.
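As an illustration of the kind of model comparison described above, the sketch below fits a few of the named techniques in R on a small synthetic dataset and scores them by AUC on a held-out set. The data frame, feature names (e.g. `age_at_entry`, `afqt_score`), and hyperparameters are hypothetical stand-ins, not the PDE variables or the thesis's actual pipeline.

```r
# Minimal sketch only: synthetic data stands in for PDE-derived features.
library(rpart)     # Decision Trees
library(nnet)      # simple Artificial Neural Network
library(xgboost)   # gradient-boosted trees
library(pROC)      # ROC curves and AUC

set.seed(42)
n    <- 1000
age  <- rnorm(n, 21, 3)
afqt <- rnorm(n, 60, 15)
dep  <- rpois(n, 1)
soldiers <- data.frame(
  age_at_entry = age,
  afqt_score   = afqt,
  deployments  = dep,
  retained     = factor(rbinom(n, 1, plogis(-5 + 0.05 * afqt + 0.08 * age)),
                        labels = c("no", "yes"))
)

idx   <- sample(n, 0.7 * n)
train <- soldiers[idx, ]
test  <- soldiers[-idx, ]

# Logistic Regression
glm_fit  <- glm(retained ~ ., data = train, family = binomial)
glm_prob <- predict(glm_fit, test, type = "response")

# Decision Tree (rpart)
tree_fit  <- rpart(retained ~ ., data = train, method = "class")
tree_prob <- predict(tree_fit, test, type = "prob")[, "yes"]

# XGBoost (requires a numeric design matrix and 0/1 labels)
X_train  <- model.matrix(retained ~ . - 1, data = train)
X_test   <- model.matrix(retained ~ . - 1, data = test)
dtrain   <- xgb.DMatrix(X_train, label = as.numeric(train$retained) - 1)
xgb_fit  <- xgb.train(params = list(objective = "binary:logistic"),
                      data = dtrain, nrounds = 50, verbose = 0)
xgb_prob <- predict(xgb_fit, xgb.DMatrix(X_test))

# Simple Artificial Neural Network (nnet)
nn_fit  <- nnet(retained ~ ., data = train, size = 5, decay = 1e-3,
                maxit = 300, trace = FALSE)
nn_prob <- predict(nn_fit, test, type = "raw")

# Compare models by Area Under the ROC Curve on the held-out cohort
sapply(list(LogisticRegression = glm_prob, rpart = tree_prob,
            XGBoost = xgb_prob, nnet = nn_prob),
       function(p) as.numeric(pROC::auc(test$retained, as.numeric(p))))
```

The same pattern (fit on a training cohort, score a held-out cohort, compare AUC) generalizes to the other techniques named in the abstract, such as Naïve Bayes, C5.0, and Random Forests.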
That said, Logistic Regression can be surpassed by even the most basic neural networks when hyperparameters are tuned for specific cohorts. Although this research was limited in its application of Deep Learning by temporary technical constraints, the findings suggest that libraries such as Keras and TensorFlow may be capable of revolutionizing this topic as they have many other fields. In sum, this thesis extends the proven validity of established machine learning techniques from first-term, enlisted-only predictions to full-career predictions for enlisted soldiers and officers alike, while establishing a new precedent for the viability of neural networks and Deep Learning in the study of military retention and attrition.
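For the Deep Neural Network side, a minimal sketch using the neuralnet package named in the abstract might look like the following; the synthetic data, feature names, layer sizes, and stopping criteria are hypothetical illustrations rather than the architecture or variables used in the thesis.

```r
# Minimal sketch of a multi-hidden-layer network with neuralnet; all data,
# feature names, and hyperparameters here are hypothetical placeholders.
library(neuralnet)

set.seed(7)
n <- 1000
# Synthetic, pre-scaled stand-ins for PDE-derived predictors
dat <- data.frame(
  age_at_entry = as.numeric(scale(rnorm(n, 21, 3))),
  afqt_score   = as.numeric(scale(rnorm(n, 60, 15))),
  deployments  = as.numeric(scale(rpois(n, 1)))
)
# 1 = served a full career to retirement, 0 = attrited (synthetic outcome)
dat$retained <- rbinom(n, 1, plogis(-1.5 + 0.8 * dat$afqt_score +
                                      0.4 * dat$age_at_entry))

train <- dat[1:700, ]
test  <- dat[701:1000, ]

# Two hidden layers (8 and 4 units) stand in for a "deep" architecture
dnn <- neuralnet(
  retained ~ age_at_entry + afqt_score + deployments,
  data          = train,
  hidden        = c(8, 4),
  linear.output = FALSE,   # logistic output node, so predictions are probabilities
  threshold     = 0.05,    # looser stopping criterion to keep the sketch quick
  stepmax       = 1e6
)

# Predicted retention probabilities for the held-out cohort
probs <- compute(dnn, test[, c("age_at_entry", "afqt_score",
                               "deployments")])$net.result
head(probs)
```

The resulting probabilities can be evaluated by AUC in the same way as the shallower models above, which is how the techniques in this study were compared.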