A Survey On Various Web Template Detection And Extraction Methods

Neethu Mary Varghese, Tenny Thomas Soman



Index Terms: Cluster, Homogeneous web page, Heterogeneous web page, Page-level detection, Search engine, Site-level detection, Template Detection, Template Extraction.



Abstract: In today’s digital world, the World Wide Web is relied on extensively as a source of information. Users increasingly depend on web-based search engines to provide accurate results on a wide range of topics that interest them. The search engines, in turn, parse the vast repository of web pages in search of relevant information. However, the majority of web portals are built using web templates, which are designed to provide a consistent look and feel to end users. The presence of these templates can influence search results, leading to inaccurate results being delivered to users. Therefore, to improve the accuracy and reliability of search results, it is essential to identify web templates and separate them from the actual content. A wide range of approaches is commonly employed to achieve this, and this paper surveys the various approaches to template detection and extraction that can be applied across homogeneous as well as heterogeneous web pages.



