International Journal of Scientific & Technology Research

IJSTR >> Volume 4 - Issue 4, April 2015 Edition

Website: http://www.ijstr.org

ISSN 2277-8616

Big Data Clustering Using Genetic Algorithm On Hadoop Mapreduce



Nivranshu Hans, Sana Mahajan, SN Omkar



Index Terms: Big Data, Clustering, Davies-Bouldin Index, Distributed processing, Hadoop MapReduce, Heuristics, Parallel Genetic Algorithm.



Abstract: Cluster analysis groups similar objects together and is one of the most important data mining methods. However, it performs poorly on big data because of its high time complexity. In such scenarios, parallelization is a better approach. MapReduce is a popular programming model that enables parallel processing in a distributed environment. However, most clustering algorithms are not “naturally parallelizable”; those based on Genetic Algorithms (GAs), for instance, are inherently sequential. This paper introduces a technique for parallelizing GA-based clustering by extending Hadoop MapReduce. It also presents an analysis of the proposed approach that evaluates its performance gains relative to a sequential algorithm, using a large real-life data set.
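The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how GA fitness evaluation for clustering can be phrased as a map step and a reduce step. It assumes that each individual encodes a list of cluster centroids and that fitness is the Davies-Bouldin index named in the index terms (lower is better). The function names `davies_bouldin`, `map_fitness`, `reduce_best` and `best_individual` are hypothetical, and the phases are simulated locally with Python's built-in `map` and `functools.reduce` rather than run on Hadoop.

```python
import math
from functools import reduce


def davies_bouldin(points, centroids):
    """Davies-Bouldin index of the partition induced by `centroids` (lower is better)."""
    k = len(centroids)
    sums = [0.0] * k
    counts = [0] * k
    # Assign every point to its nearest centroid and accumulate distances.
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        i = dists.index(min(dists))
        sums[i] += dists[i]
        counts[i] += 1
    # Scatter S_i: mean distance of a cluster's points to its centroid.
    scatter = [s / n if n else 0.0 for s, n in zip(sums, counts)]
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            separation = math.dist(centroids[i], centroids[j])
            if separation == 0.0:
                return math.inf  # coincident centroids: treat as worst case
            worst = max(worst, (scatter[i] + scatter[j]) / separation)
        total += worst
    return total / k


def map_fitness(individual, points):
    """Map step (assumed): score one individual independently of the others."""
    return (davies_bouldin(points, individual), individual)


def reduce_best(a, b):
    """Reduce step (assumed): keep the (score, individual) pair with the lower index."""
    return a if a[0] <= b[0] else b


def best_individual(population, points):
    """Local stand-in for one MapReduce round over the whole population."""
    scored = map(lambda ind: map_fitness(ind, points), population)
    return reduce(reduce_best, scored)


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    population = [
        [(5.0, 0.0), (5.0, 1.0)],   # centroids that split the data poorly
        [(0.0, 0.5), (10.0, 0.5)],  # centroids that match the two groups
    ]
    score, centroids = best_individual(population, points)
    print(score, centroids)
```

Because each individual is scored independently, the map step is the part that distributes naturally across nodes. Selection, crossover and mutation, which depend on the whole population, are the sequential part the abstract refers to and are not shown here.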


