International Journal of Scientific & Technology Research


IJSTR >> Volume 9 - Issue 4, April 2020 Edition


Website: http://www.ijstr.org

ISSN 2277-8616

Improve Class Prediction By Balancing Class Distribution For Diabetes Dataset




Mohammad Al Khaldy, Mohammad Alauthman, Majed S. Al-Sanea and Ghassan Samara



Imbalanced class; Resampling; Random Forest; Naive Bayes; Bagging.



Applying machine-learning algorithms to clinical data poses several challenges. One of them is class imbalance, which can lead to suboptimal classifier performance. The purpose of this article is to evaluate the influence of class imbalance on classification performance for several classification methods. In addition, we randomly resample the data, both with and without replacement, to see how balancing the data can improve the performance of the classification techniques. The experiments show that resampling the imbalanced classes with replacement yields a considerable boost in classification effectiveness for most of the learning algorithms used; the Naive Bayes algorithm, however, did not improve after resampling.
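
The two resampling modes mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure or tooling, which the abstract does not specify. The function name `balance_by_resampling`, the toy data, and the choice of target size (largest class when drawing with replacement, smallest class when drawing without) are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(rows, labels, with_replacement=True, seed=0):
    """Return a class-balanced copy of (rows, labels).

    Hypothetical helper, for illustration only.
    with_replacement=True : each class is drawn (with repeats allowed)
                            up to the size of the largest class.
    with_replacement=False: each class is drawn (no repeats) down to
                            the size of the smallest class.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if with_replacement else min(sizes)

    new_rows, new_labels = [], []
    for label, members in by_class.items():
        if with_replacement:
            drawn = rng.choices(members, k=target)  # repeats allowed
        else:
            drawn = rng.sample(members, k=target)   # no repeats
        new_rows.extend(drawn)
        new_labels.extend([label] * target)
    return new_rows, new_labels


# Toy imbalanced data (assumed): 8 examples of class 0, 2 of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = balance_by_resampling(rows, labels, with_replacement=True)
under_rows, under_labels = balance_by_resampling(rows, labels, with_replacement=False)

print(Counter(over_labels))   # class counts after drawing with replacement
print(Counter(under_labels))  # class counts after drawing without replacement
```

Drawing with replacement keeps the dataset at least as large as the original but duplicates minority examples, whereas drawing without replacement discards majority examples. In a real evaluation, resampling would normally be applied to the training split only, so that duplicated examples do not leak into the test data.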


