International Journal of Scientific & Technology Research



IJSTR >> Volume 8 - Issue 10, October 2019 Edition


Website: http://www.ijstr.org

ISSN 2277-8616

Machine Learning Technique For Enhancing Classification Performance In Data Summarization Using Rough Set And Genetic Algorithm




Merlinda Wibowo, Fiftin Noviyanto, Sarina Sulaiman, Siti Mariyam Shamsuddin



Machine Learning, Prediction, Data Summarization, Rough Set, Genetic Algorithm, Hybrid Technique.



The volume of data grows rapidly and increases significantly every day. It comes from diverse resources and services, which together produce large volumes of data that must be managed, reused, and analyzed. These heterogeneous sources of information pose important challenges for model calibration, because the data are often imprecise, uncertain, ambiguous, and incomplete. Such data also require large storage, and their volume makes analytical, processing, and retrieval operations difficult and highly time-consuming. One solution is to summarize the data so that they need less storage and much less time to process and retrieve. Data summarization techniques therefore aim to produce summaries of the best possible quality. In this study, Rough Set (RS) is proposed to obtain accurate, effective, and appropriate summaries. Although RS can extract decision rules effectively from given datasets, two steps, data discretization and finding reducts, are required before decision rules can be generated from the values. Both steps are known to be NP-hard problems and are related to the dimensionality reduction problem. To solve both problems, a Genetic Algorithm (GA) is applied to search for the discretization cut points and the reducts in order to discover the optimal rules. Moreover, reducing and transforming the data may shorten the running time while allowing the system to obtain more generalized results and improve predictive accuracy. This study therefore proposes a hybrid RS-GA approach that compensates for these weaknesses of RS and addresses the shortcomings of existing data summarization methods.
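To make the reduct-search idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a tiny, already discretized decision table with made-up values, a simple bit-mask encoding of attribute subsets, and a hypothetical fitness that rewards a high rough-set dependency degree first and a small subset second. The names `ROWS`, `dependency`, `fitness`, and `ga_reduct`, as well as the GA parameters, are illustrative assumptions.

```python
import random
from collections import defaultdict

# Toy decision table (hypothetical values, not taken from the paper):
# three discretized condition attributes (a0, a1, a2) and one decision.
ROWS = [
    ((0, 0, 1), "low"),
    ((0, 1, 1), "low"),
    ((1, 0, 0), "high"),
    ((1, 1, 0), "high"),
    ((0, 0, 0), "low"),
    ((1, 1, 1), "high"),
]
N_ATTRS = 3


def dependency(mask):
    """Rough-set dependency degree of the decision on the masked attributes.

    Rows that agree on all selected attributes form one equivalence class;
    a class belongs to the positive region if all its rows share a decision.
    """
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for cond, dec in ROWS:
        key = tuple(v for v, keep in zip(cond, mask) if keep)
        decisions[key].add(dec)
        counts[key] += 1
    positive = sum(counts[k] for k, d in decisions.items() if len(d) == 1)
    return positive / len(ROWS)


def fitness(mask):
    """Assumed fitness: dependency first, then a small bonus for fewer attributes."""
    return dependency(mask) + 0.1 * (N_ATTRS - sum(mask)) / N_ATTRS


def ga_reduct(pop_size=20, generations=100, p_mut=0.2, seed=0):
    """Search attribute subsets with a basic elitist genetic algorithm."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(N_ATTRS))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # truncation selection
        children = [pop[0]]                 # elitism: keep the best mask
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, N_ATTRS - 1)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = tuple(b ^ 1 if rng.random() < p_mut else b for b in child)
            children.append(child)
        pop = children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = ga_reduct()
    print("selected attributes:", [i for i, b in enumerate(best) if b])
    print("dependency degree:", dependency(best))
```

In this toy table the decision depends on `a0` alone, so the single-attribute subset `{a0}` is the only minimal reduct. The same bit-mask encoding could in principle be extended with genes for candidate cut points, which is one way to picture a GA searching discretization and reducts together.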
To assess the efficiency of the proposed work, the classification accuracy of the hybrid RS-GA approach is compared with that of other machine learning (ML) methods: Rough Set (RS), Naïve Bayes (NB), J48, Random Tree (RT), and Projective Adaptive Resonance Theory (PART). The findings show that the RS-GA approach achieved the highest prediction accuracy, 99.95%, and produced the lowest error on API values from Malaysia and Singapore compared with the other ML methods. These results indicate that RS-GA performed best among the methods compared.
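The comparison described above amounts to scoring each method's predictions against the same ground truth and ranking the methods. The sketch below illustrates this with placeholder labels and two hypothetical methods; the data are invented for illustration and do not reproduce the paper's results, and the error is assumed here to be the simple misclassification rate, which the abstract does not specify.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder ground truth and predictions (illustrative only).
y_true = ["good", "moderate", "unhealthy", "good",
          "moderate", "good", "unhealthy", "moderate"]
predictions = {
    "method_A": ["good", "moderate", "unhealthy", "good",
                 "moderate", "good", "unhealthy", "good"],      # 1 error
    "method_B": ["good", "good", "unhealthy", "good",
                 "moderate", "moderate", "unhealthy", "moderate"],  # 2 errors
}

# Score every method, then rank from most to least accurate.
scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    acc = scores[name]
    print(f"{name}: accuracy={acc:.3f}, error rate={1 - acc:.3f}")
```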


