Big Data Clustering Using Genetic Algorithm On Hadoop Mapreduce
Nivranshu Hans, Sana Mahajan, SN Omkar
Index Terms: Big Data, Clustering, Davies-Bouldin Index, Distributed processing, Hadoop MapReduce, Heuristics, Parallel Genetic Algorithm.
Abstract: Cluster analysis groups similar objects together and is one of the most important data mining methods. However, it performs poorly on big data because of its high time complexity. In such scenarios, parallelization is a better approach. MapReduce is a popular programming model that enables parallel processing in a distributed environment. However, most clustering algorithms, for instance those based on Genetic Algorithms (GA), are not “naturally parallelizable” because of their sequential nature. This paper introduces a technique to parallelize GA-based clustering by extending Hadoop MapReduce. An analysis of the proposed approach is presented to evaluate its performance gains with respect to a sequential algorithm. The analysis is based on a real-life large data set.
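The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes that each GA chromosome encodes a list of candidate cluster centers and that fitness is the Davies-Bouldin index named in the index terms (lower is better). The function names `davies_bouldin`, `map_phase`, and `reduce_phase` are hypothetical, and plain Python functions stand in for Hadoop mappers and reducers.

```python
import math


def davies_bouldin(points, centers):
    """Davies-Bouldin index of points assigned to their nearest center.

    Assumes at least two distinct centers; lower values mean tighter,
    better separated clusters.
    """
    k = len(centers)
    members = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
        members[nearest].append(p)

    # Scatter: mean distance from each cluster's members to its center
    # (0.0 for an empty cluster).
    scatter = [
        sum(math.dist(p, centers[i]) for p in members[i]) / len(members[i])
        if members[i] else 0.0
        for i in range(k)
    ]

    # For each cluster, take the worst (largest) similarity ratio to any other.
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k


def map_phase(points, population):
    """Stand-in for mappers: score every chromosome independently."""
    return [(davies_bouldin(points, chrom), chrom) for chrom in population]


def reduce_phase(scored):
    """Stand-in for a reducer: keep the chromosome with the lowest index."""
    return min(scored, key=lambda pair: pair[0])


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
population = [
    [(0, 0.5), (10, 0.5)],  # centers sitting on the two obvious groups
    [(5, 0), (5, 1)],       # centers that split the data poorly
]
best_fitness, best_chromosome = reduce_phase(map_phase(points, population))
```

Fitness evaluation is the step that parallelizes most easily because each chromosome is scored independently. Selection, crossover, and mutation need a view of the whole population, which is the sequential part the abstract refers to.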