IMBALANCED DATA CLASSIFICATION USING RANDOM FOREST WITH WARD CLUSTERING

Authors

  • Vo Thi Ngoc Ha
  • Nguyen Thanh Son
  • Dang Dang Khoa
  • Le Phuong Long
  • Phan Thi Ngan

DOI:

https://doi.org/10.61591/jslhu.22.726

Keywords:

Imbalanced Data, Random Forest Algorithm, Balanced Random Forest, Classification Technique

Abstract

This study introduces a Modified Balanced Random Forest algorithm to improve classification performance on imbalanced datasets. The proposed method extends Balanced Random Forest by applying a clustering-based undersampling strategy during each bootstrap iteration. Four clustering methods were evaluated: K-Means, Spectral Clustering, Agglomerative Clustering, and Ward Hierarchical Clustering. Of these, Ward Hierarchical Clustering achieved the best performance. Experimental results show that the proposed method outperforms standard Random Forest and Balanced Random Forest, reaching a true positive rate of 93.42 percent, a true negative rate of 93.60 percent, and an area under the curve (AUC) of 93.51 percent, while also reducing processing time. These results confirm the effectiveness of the proposed approach for imbalanced data classification.
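
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of what a Ward-based undersampling step for one bootstrap iteration might look like. The function names (`ward_clusters`, `balanced_bootstrap`), the choice of one cluster per minority sample, and the rule of keeping one randomly chosen majority point per cluster are all assumptions made for illustration, not details taken from the paper. The clustering is a naive pure-Python Ward linkage, intended for small examples only.

```python
import random

def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward linkage, merged down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the pair whose union increases total variance the least.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

def balanced_bootstrap(minority, majority, rng):
    """Build one balanced training sample (assumed scheme, see lead-in):
    bootstrap the minority class, then keep one random majority point
    from each of len(minority) Ward clusters."""
    boot_min = [rng.choice(minority) for _ in minority]
    clusters = ward_clusters(majority, len(minority))
    kept_maj = [rng.choice(cluster) for cluster in clusters]
    X = boot_min + kept_maj
    y = [1] * len(boot_min) + [0] * len(kept_maj)
    return X, y

# Toy data: 3 minority points and 9 majority points in three tight groups.
minority = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
majority = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (3.0, 0.0), (3.1, 0.1), (2.9, 0.2),
    (0.0, 3.0), (0.1, 3.1), (0.2, 2.9),
]

X, y = balanced_bootstrap(minority, majority, random.Random(0))
print(len(X), y.count(1), y.count(0))
```

In a full forest, each tree would be trained on its own sample drawn this way. For realistic data sizes, a library implementation such as scikit-learn's `AgglomerativeClustering` with `linkage="ward"` would be a more practical choice than this quadratic-pair search.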

Published

2025-09-30

How to Cite

Vo Thi Ngoc Ha, Nguyen Thanh Son, Dang Dang Khoa, Le Phuong Long, & Phan Thi Ngan. (2025). IMBALANCED DATA CLASSIFICATION USING RANDOM FOREST WITH WARD CLUSTERING. Journal of Science Lac Hong University, 1(22), 47–51. https://doi.org/10.61591/jslhu.22.726
