International Journal of Scientific & Technology Research


IJSTR >> Volume 9 - Issue 4, April 2020 Edition

Website: http://www.ijstr.org

ISSN 2277-8616

Big Data Clustering: A Comparative Study On Various Clustering Algorithms


G Ashok Kumar



Big Data, Clustering, Data Cleaning, Dimensionality Reduction, Analysis, Volume, Velocity, Variety and Dynamic.



Analysts characterize big data by its volume, velocity, and variety. Big data analysis extracts intelligence from an extremely wide variety of dynamic and complex data. Data cleaning is an essential step in big data analytics because it eases prediction, decision making, and clustering with data-organizing tools. Clustering groups similar data from a population data set so that the data points in the same group are more similar to one another than to the data points in other groups. Big data clustering helps researchers perform dimensionality reduction in complex problems, design spam filters, identify fraudulent or criminal behavior, analyze documents, classify network traffic, and support marketing and sales analysis. This paper analyzes prominent big data clustering techniques for classifying data points of different levels of complexity.
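
As an illustration of the grouping idea described above, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. K-means is used here only as a familiar example; the abstract does not name the algorithms the paper compares. The function name `kmeans`, the toy `points` list, and the choice to seed the centroids with the first `k` points are assumptions made for this sketch, not details taken from the paper.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on a list of equal-length tuples.

    Returns (centroids, labels). Centroids are seeded with the first k
    points, which is a simplification chosen to keep the sketch deterministic.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points should end up sharing one label and the last three the other, which is the "more similar within a group than across groups" property in its simplest form. A sketch like this holds every point in memory and rescans all of them on each iteration, so it is not suited to data at big data scale.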


