Predicting employee attrition using machine learning approaches

Authors

  • Luong Tien Vinh
  • Phan Thi Ngan

DOI:

https://doi.org/10.61591/jslhu.20.717

Keywords:

Employee Exploratory Data Analysis; IBM HR Employee Attrition; Support Vector Machine; Logistic Regression; Decision Tree Classifier; Extra Trees Classifier.

Abstract

Employee attrition poses a critical challenge to organizations in terms of both financial cost and operational continuity, with the average replacement cost per hire estimated at USD 4,129 and a reported attrition rate of 57.3% in 2021. This study applies machine learning techniques to predict employee attrition and identify its primary organizational drivers. Four supervised learning models were evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree Classifier (DTC), and Extra Trees Classifier (ETC). The optimized ETC achieved the highest prediction accuracy of 93%, surpassing existing state-of-the-art methods. An Employee Exploratory Data Analysis (EEDA) revealed that monthly income, hourly rate, job level, and age are key factors influencing attrition. These findings highlight the effectiveness of AI-driven approaches in workforce analytics and provide actionable insights for organizational leaders aiming to improve retention through data-informed strategies.
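
Illustrative sketch (not the authors' published code): the snippet below shows how the four classifiers named in the abstract could be evaluated on the public IBM HR Employee Attrition dataset with scikit-learn. The file name, preprocessing choices, train/test split, and hyperparameters are assumptions made for illustration, not details reported in the paper.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the dataset (file name assumed from the common Kaggle release of the IBM HR data).
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = df.drop(columns=["Attrition"])

# One-hot encode categorical columns, standardize numeric ones.
categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# The four models compared in the study; hyperparameters here are illustrative defaults,
# not the optimized settings reported in the paper.
models = {
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "DTC": DecisionTreeClassifier(random_state=42),
    "ETC": ExtraTreesClassifier(n_estimators=300, random_state=42),
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, model in models.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")

Feature importances from the fitted Extra Trees model (clf.named_steps["model"].feature_importances_) can then be inspected against the drivers the study reports, such as monthly income, hourly rate, job level, and age.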

Download

Published

22-05-2025

How to Cite

Luong Tien Vinh, & Phan Thi Ngan. (2025). Predicting employee attrition using machine learning approaches. Tạp Chí Khoa học Lạc Hồng, 1(20), 69–75. https://doi.org/10.61591/jslhu.20.717