Big Data Clustering: A Comparative Study On Various Clustering Algorithms
AUTHOR(S)
G Ashok Kumar
KEYWORDS
Big Data, Clustering, Data Cleaning, Dimensionality Reduction, Analysis, Volume, Velocity, Variety and Dynamic.
ABSTRACT
Analysts characterize big data by its volume, velocity, and variety. Big data analysis extracts intelligence from an extremely wide variety of dynamic and complex data. Data cleaning is an essential step in big data analytics because it makes prediction, decision making, and clustering with data-organizing tools easier. Clustering groups similar data points from a population data set so that the points in the same group are more similar to one another than to the points in other groups. Big data clustering helps researchers perform dimensionality reduction in complex problems, design spam filters, identify fraudulent or criminal behavior, analyze documents, classify network traffic, and support marketing and sales analysis. This paper analyzes prominent big data clustering techniques for classifying data points of differing levels of complexity.
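As a rough illustration of the grouping idea described in the abstract, the following is a minimal sketch of k-means (Lloyd's algorithm), one of the partitioning techniques that surveys of this kind usually cover. It is not taken from the paper: the function name `kmeans`, the toy points, and the hand-picked starting centroids are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, initial_centroids, iterations=20):
    """Illustrative k-means sketch: returns (labels, centroids).

    `points` and `initial_centroids` are lists of equal-length tuples.
    The starting centroids are supplied by the caller (an assumption
    of this sketch; real implementations choose them automatically).
    """
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical toy data: two visibly separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[-1]])
print(labels)
```

On this toy data the first three points end up in one group and the last three in the other, so every point lies closer to its own group's centroid than to the other one. That is the similarity property the abstract describes, shown at a scale far below what big data techniques must handle.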