Application of resampling algorithms in imbalanced geochemical data classification (Case study: Geochemical data of the Qayen 1:100,000 sheet)

Document Type: Research Article

Author

Dept. of Mining Engineering, Birjand University of Technology, Birjand, Iran

Abstract

Geochemical data are inherently imbalanced: samples in the low-grade (background) class are numerous, whereas samples in the high-grade (anomaly) class are few. Classifying such a dataset produces a biased model that is less likely to assign new samples to the minority class and that has lower accuracy and precision. This paper introduces oversampling (e.g., SMOTE and ADASYN), undersampling (e.g., RUS and OSS), and hybrid-sampling (e.g., SMOTE-Tomek and ADASYN-CNN) algorithms for data balancing. The performance of these algorithms on the stream sediment geochemical data of the Qayen sheet is then evaluated with the SVM and ANN classification methods. The results show that data balancing can significantly increase confusion-matrix metrics such as accuracy, sensitivity, specificity, precision, F-score, F-value, G-mean, and AUC by 10 to 50 percent, and reduce the error metric by about 10 percent. In terms of performance, the oversampling algorithms rank first, followed by the hybrid-sampling and then the undersampling algorithms. Geochemical anomaly maps modeled with the balancing algorithms show that these models can increase the extent of geochemical anomalies in the study area and establish a good overlap between these anomalies and mineralized rock units. In this respect, the oversampling algorithms (SMOTE and ADASYN), followed by the hybrid-sampling algorithm (ADASYN-CNN), perform best. Therefore, this paper proposes applying data balancing algorithms before classifying exploration data, with oversampling algorithms as the first choice and hybrid-sampling algorithms as the second.
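To make the balancing idea concrete, the following is a minimal sketch, in plain Python, of two of the resampling families named in the abstract: SMOTE-style oversampling (synthetic minority points interpolated between a minority sample and one of its nearest minority neighbours) and random undersampling (RUS). It also computes a few confusion-matrix metrics. This is not the authors' implementation. The function names, the parameter `k`, the toy two-element data, and the confusion-matrix counts are all hypothetical and chosen only for illustration; they are not the Qayen data or results.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours (SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Random undersampling (RUS): keep a random subset of the majority class."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


def confusion_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, precision and G-mean from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical toy data (NOT the Qayen samples): two element concentrations per sample.
data_rng = random.Random(42)
background = [(data_rng.gauss(10, 2), data_rng.gauss(50, 5)) for _ in range(40)]
anomaly = [(30.0, 90.0), (32.0, 95.0), (28.0, 88.0), (35.0, 92.0), (31.0, 97.0)]

# Oversampling: grow the anomaly class until it matches the background class.
synthetic = smote_oversample(anomaly, n_new=len(background) - len(anomaly))
oversampled_anomaly = anomaly + synthetic

# Undersampling: shrink the background class to the size of the anomaly class.
undersampled_background = random_undersample(background, n_keep=len(anomaly))

print(len(background), len(oversampled_anomaly), len(undersampled_background))
# Hypothetical confusion-matrix counts, used only to show the metric formulas.
print(confusion_metrics(tp=8, fn=2, fp=5, tn=85))
```

In this sketch, oversampling keeps every background sample and adds interpolated anomaly samples, whereas undersampling discards background samples. A hybrid scheme would combine an oversampling step with a cleaning or undersampling step; that combination is not shown here.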
