URL Mining Using Agglomerative Clustering Algorithm
Chinmay R. Deshmukh, R. R. Shelke
Index Terms: Agglomerative Clustering Algorithm, URL Mining, Re-ranking, Query Log Analysis.
Abstract: The tremendous growth of the Web has motivated the application of data mining techniques to web logs, and data mining on the World Wide Web is an important and active area of research. Web log mining is the analysis of web log files, which record sequences of web pages. Web mining is broadly classified into web content mining, web usage mining and web structure mining. Web usage mining is a technique for discovering usage patterns from Web data in order to understand and better serve the needs of Web-based applications. URL mining is a subclass of Web mining that investigates the details of a Uniform Resource Locator (URL), and it can be advantageous in the fields of security and protection. This paper introduces a technique for mining a collection of user transactions with an Internet search engine to discover clusters of similar queries and similar URLs. The information we exploit is "clickthrough data": each record consists of a user's query to a search engine along with the URL that the user selected from among the candidates offered by the search engine. By viewing this dataset as a bipartite graph, with the vertices on one side corresponding to queries and those on the other side to URLs, one can apply an agglomerative clustering algorithm to the graph's vertices to identify related queries and URLs.
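The bipartite-graph clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state a similarity measure or a stopping rule, so the sketch assumes Jaccard overlap of neighbor sets, an illustrative `threshold` of 0.5, and alternating merges between the query side and the URL side. All function names, queries and URLs below are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_best_pair(side, other, threshold):
    """Merge the most similar pair of clusters on `side`.

    `side` and `other` map each cluster (a frozenset of labels) to the set
    of clusters it is linked to on the opposite side of the bipartite graph.
    Returns True if a merge happened.
    """
    best, best_pair = 0.0, None
    # Sort the clusters so that ties are broken deterministically.
    for u, v in combinations(sorted(side, key=sorted), 2):
        s = jaccard(side[u], side[v])
        if s > best:
            best, best_pair = s, (u, v)
    if best_pair is None or best < threshold:
        return False
    u, v = best_pair
    merged = u | v
    neighbors = side.pop(u) | side.pop(v)
    side[merged] = neighbors
    # Redirect edges on the opposite side to the merged cluster.
    for n in neighbors:
        other[n] -= {u, v}
        other[n].add(merged)
    return True


def cluster_clickthrough(records, threshold=0.5):
    """Cluster queries and URLs from (query, clicked_url) records."""
    queries, urls = {}, {}
    for q, u in records:
        qk, uk = frozenset([q]), frozenset([u])
        queries.setdefault(qk, set()).add(uk)
        urls.setdefault(uk, set()).add(qk)
    merged = True
    while merged:
        # Alternate: one merge among queries, then one among URLs.
        merged_q = merge_best_pair(queries, urls, threshold)
        merged_u = merge_best_pair(urls, queries, threshold)
        merged = merged_q or merged_u
    return sorted(map(sorted, queries)), sorted(map(sorted, urls))


# Hypothetical clickthrough records: (query, URL the user selected).
records = [
    ("cheap flights", "travel.example/flights"),
    ("cheap flights", "fares.example"),
    ("airline tickets", "travel.example/flights"),
    ("airline tickets", "fares.example"),
    ("python tutorial", "docs.example/python"),
    ("learn python", "docs.example/python"),
]
query_clusters, url_clusters = cluster_clickthrough(records)
print(query_clusters)
print(url_clusters)
```

On this toy input the two flight queries end up in one cluster and the two Python queries in another, even though no query text is compared: similarity comes only from shared clicks. The pairwise search is quadratic in the number of vertices per merge, so a real query log would likely need an index over shared neighbors.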